
As web data nears exhaustion, we’ve finally turned our attention to that well-known 'data forbidden zone': PDFs.

![PDF processing flowchart](https://wink.run/image?url=https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FG0Oby1mXYAALARy%3Fformat%3Djpg%26name%3Dlarge)

### The Underestimated Data Goldmine

Everyone knows PDFs are hard to process: complex formats, high extraction costs, and inconsistent OCR accuracy. Yet these very traits make them the last high-quality data source: 90% of high-value content, such as legal documents, academic papers, and technical manuals, is locked in PDFs. Even the largest existing CC-PDF corpus barely scratches the surface of the PDF resources in Common Crawl.

### A Two-Phase Extraction Strategy

We built a tiered processing pipeline:

- **Text-extractable PDFs**: Processed with Docling (cost: $, decent quality)

- **Scanned PDFs and complex layouts**: Handled by RolmOCR (cost: $$, excellent quality)
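The routing decision between the two tiers can be sketched roughly as below. This is only an illustration: the source does not say how documents are assigned to each tier, so the `PdfInfo` record, the `choose_route` function, and the 200-characters-per-page threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PdfInfo:
    """Hypothetical summary of one PDF, gathered by a cheap first pass."""
    name: str
    page_count: int
    embedded_text_chars: int  # characters recoverable from the embedded text layer


# Assumed threshold, not taken from the source: below this average,
# the text layer is treated as missing or unusable.
MIN_CHARS_PER_PAGE = 200


def choose_route(pdf: PdfInfo, min_chars_per_page: int = MIN_CHARS_PER_PAGE) -> str:
    """Return "text_extraction" for the cheap tier or "ocr" for the expensive tier."""
    if pdf.page_count <= 0:
        # Nothing to measure, so fall back to the more robust tier.
        return "ocr"
    avg_chars = pdf.embedded_text_chars / pdf.page_count
    return "text_extraction" if avg_chars >= min_chars_per_page else "ocr"
```

For example, a 10-page document with 30,000 embedded characters would go to the cheap tier, while a 10-page scan with only 50 embedded characters would go to OCR.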

![Data statistics chart](https://wink.run/image?url=https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FG0Ob-KSW0AAZhkK%3Fformat%3Djpg%26name%3Dlarge)

### Counterintuitive Findings

After language-model-based filtering and deduplication, the final 3-trillion-token dataset shows two key traits:

1. Average document length is twice that of web data

2. When mixed with HTML corpora, it achieves SOTA across multiple benchmarks

The irony? The quality of this lightly filtered dataset rivals that of the rigorously cleaned FineWeb-Edu and DCLM web datasets, the gold standard we once swore by.
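The deduplication step mentioned in this section is not described in detail. As a minimal sketch of the general idea, the snippet below performs exact deduplication by hashing whitespace- and case-normalized text. The actual pipeline may well use a different method (near-duplicate detection such as MinHash, for instance); `normalize` and `dedupe` are illustrative names only.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase, so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe(docs):
    """Keep the first occurrence of each document, compared by normalized content."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

Exact hashing only catches documents that are identical after normalization; copies with small OCR differences would need a fuzzy method.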

### An Oasis in the Data Desert

Among the 1,733 languages covered, 66 exceed 1 billion tokens each. Notable highlights:

- Legal documents make up an unexpected 32%

- Mathematical formulas in academic papers remain intact

- 17% of documents feature multilingual mixed layouts

### Open-Source and Limitations

The dataset is open-sourced under ODC-By 1.0, but note:

- No NSFW filtering (inherent to PDFs)

- OCR error rates range from 3% to 7%

- Tabular content may be misaligned
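The source does not say how the OCR error rate is measured. One common convention is the character error rate (CER): the edit distance between the OCR output and a reference transcription, divided by the reference length. The sketch below assumes that definition; `edit_distance` and `character_error_rate` are illustrative helpers, not part of any released tooling.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if characters match
            ))
        prev = cur
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, a 3% to 7% rate would correspond to roughly 3 to 7 character edits per 100 reference characters.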

This may not be the perfect solution, but it’s the most practical attempt to break through the data wall. While others obsess over web data, we chose to crack the tough nut that is PDFs first.

[Download the dataset](https://huggingface.co/datasets/HuggingFaceFW/finepdfs)
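Because the full dataset is very large, streaming a single language subset is usually more practical than downloading everything. The sketch below uses the Hugging Face `datasets` library. The config name `eng_Latn` and the `text` field are assumptions about the dataset layout and should be checked against the dataset card; `preview` and `stream_finepdfs` are illustrative helpers.

```python
from itertools import islice


def preview(records, n=3, text_key="text", width=80):
    """Return the first `width` characters of the first `n` records."""
    return [str(r.get(text_key, ""))[:width] for r in islice(records, n)]


def stream_finepdfs(config="eng_Latn"):
    """Open one language subset in streaming mode (needs network access).

    The config name is an assumption; see the dataset card for the real list.
    """
    from datasets import load_dataset  # pip install datasets

    return load_dataset(
        "HuggingFaceFW/finepdfs",
        name=config,
        split="train",
        streaming=True,
    )


# Usage (requires network access and the `datasets` package):
#   for snippet in preview(stream_finepdfs()):
#       print(snippet)
```

Streaming avoids materializing the corpus on disk, which matters at the multi-trillion-token scale.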

Wink Pings

FinePDFs: The Hidden Treasure of 3 Trillion Tokens Liberated from PDFs