FinePDFs: The Hidden Treasure of 3 Trillion Tokens Liberated from PDFs
The HuggingFace team has released the largest PDF dataset to date, covering 475 million documents across 1,733 languages, with an average text length twice that of web data. How does this long-overlooked data source break through the data wall for large model training?
As web data nears exhaustion, we've finally turned our attention to the long-standing "no-go zone" of training data: PDFs.

### The Underestimated Data Goldmine
Everyone knows PDFs are hard to process: complex formats, high extraction costs, and inconsistent OCR accuracy. Yet these same obstacles are why PDFs remain one of the last largely untapped sources of high-quality data: 90% of high-value content such as legal documents, academic papers, and technical manuals is locked in PDFs. The largest existing CC-PDF corpus barely scratches the surface of the PDF resources in CommonCrawl.
### A Two-Phase Extraction Strategy
We built a tiered processing pipeline:
- **Text-extractable PDFs**: Processed with Docling (cost: $, decent quality)
- **Scanned PDFs and complex layouts**: Handled by RolmOCR (cost: $$, excellent quality)
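To make the tiering concrete, here is a minimal sketch of how a router might decide between the two paths. It is not the actual FinePDFs routing logic: the `PageStats` schema, the thresholds, and the `choose_route` function are all hypothetical, and the sketch assumes per-page statistics have already been measured by some upstream step.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    """Per-page measurements taken before extraction (hypothetical schema)."""
    embedded_chars: int    # characters available from the PDF text layer
    image_coverage: float  # fraction of the page area covered by images, 0.0 to 1.0


# Hypothetical thresholds; in practice these would be tuned on a labelled sample.
MIN_CHARS_PER_PAGE = 200
MAX_IMAGE_COVERAGE = 0.5
MIN_TEXT_PAGE_FRACTION = 0.8


def choose_route(pages: List[PageStats]) -> str:
    """Return "text_extraction" for PDFs with a usable text layer, else "ocr"."""
    if not pages:
        # Nothing to measure: fall back to the more expensive, more robust path.
        return "ocr"
    text_pages = sum(
        1
        for p in pages
        if p.embedded_chars >= MIN_CHARS_PER_PAGE
        and p.image_coverage <= MAX_IMAGE_COVERAGE
    )
    if text_pages / len(pages) >= MIN_TEXT_PAGE_FRACTION:
        return "text_extraction"  # cheap path (the "$" tier)
    return "ocr"                  # expensive path (the "$$" tier)
```

The point of the design is cost control: the expensive OCR tier only sees documents that the cheap tier cannot handle well.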

### Counterintuitive Findings
After LM filtering and deduplication, the final 3 trillion tokens revealed two key traits:
1. Average document length is twice that of web data
2. When mixed with HTML corpora, it achieves SOTA across multiple benchmarks
The irony? This lightly filtered dataset's quality rivals that of rigorously cleaned web datasets such as FW-EDU and DCLM, the gold standard we once swore by.
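The post does not say which deduplication method was used, so the following is only an illustration of the simplest variant: exact deduplication by content hash. The `normalize` and `dedup_exact` helpers are hypothetical names, and a production pipeline would likely add near-duplicate detection on top.

```python
import hashlib
import re
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial layout differences hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedup_exact(docs: Iterable[str]) -> Iterator[str]:
    """Yield the first occurrence of each document, compared by normalized-content hash."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc
```

Because only digests are stored, memory grows with the number of unique documents rather than with their total length.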
### An Oasis in the Data Desert
Among the 1,733 languages covered, 66 languages exceed 1 billion tokens. Notable highlights:
- Legal documents make up an unexpected 32%
- Mathematical formulas in academic papers remain intact
- 17% of documents feature multilingual mixed layouts
### Open-Source and Limitations
The dataset is open-sourced under ODC-By 1.0, but note:
- No NSFW filtering (inherent to PDFs)
- OCR error rates range from 3%-7%
- Tabular content may be misaligned
This may not be the perfect solution, but it’s the most practical attempt to break through the data wall. While others obsess over web data, we chose to crack the tough nut that is PDFs first.
[Download the dataset](https://huggingface.co/datasets/HuggingFaceFW/finepdfs)
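For a quick look without downloading everything, the dataset can be streamed with the `datasets` library. This is a sketch under assumptions: the subset name `eng_Latn` and the `text` field are assumed from the conventions of other FineWeb-family datasets and should be checked against the dataset card. The `iter_long_documents` helper is hypothetical and works on any iterable of dict records.

```python
from typing import Any, Dict, Iterable, Iterator


def iter_long_documents(
    records: Iterable[Dict[str, Any]],
    min_chars: int,
    text_field: str = "text",  # assumed field name; verify on the dataset card
) -> Iterator[Dict[str, Any]]:
    """Yield only the records whose text is at least min_chars characters long."""
    for record in records:
        text = record.get(text_field) or ""
        if len(text) >= min_chars:
            yield record


def stream_finepdfs(subset: str = "eng_Latn"):
    """Open one language subset in streaming mode (subset name is an assumption)."""
    # Imported lazily so the helper above can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceFW/finepdfs",
        name=subset,
        split="train",
        streaming=True,
    )
```

A typical use would be `itertools.islice(iter_long_documents(stream_finepdfs(), 10_000), 3)` to peek at the first three documents of at least 10,000 characters.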
Published: 2025-09-07 15:02