DeepSeek-OCR Tested: Finding the Sweet Spot Between Compression Ratio and Accuracy
Hands-on testing shows DeepSeek-OCR maintains about 97% accuracy at 10x visual-token compression, but accuracy drops off sharply beyond 12x.
When I first saw DeepSeek-OCR claim to render long documents as images and then "optically compress" them with a visual encoder, my immediate reaction was: can this really work, and how stable is it? So I pulled the open-source model from Hugging Face and started testing.

The setup was surprisingly smooth. Several resolution presets cover most needs: Tiny mode (512×512) is great for quick browsing; Base mode (1024×1024) serves as the workhorse for daily use; and for ultra-dense pages like newspapers or academic PDFs, you can switch to Gundam mode.
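For reference, loading the model follows the usual Hugging Face pattern. The sketch below mirrors the style of the model card's custom `infer()` entry point (exposed via `trust_remote_code`); the exact argument names, preset values, and prompt strings may differ by release, so treat it as an outline rather than the definitive API.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-OCR"

# trust_remote_code is required because the model ships its own inference code
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

# Base mode: 1024x1024 input, ~256 visual tokens per page
result = model.infer(
    tokenizer,
    prompt="<image>\nFree OCR. ",      # plain-text extraction prompt (assumed wording)
    image_file="magazine_page.png",
    output_path="./ocr_out",
    base_size=1024,                    # preset resolution (Tiny=512, Base=1024)
    image_size=1024,
    crop_mode=False,                   # enable for Gundam-style dynamic tiling
    save_results=True,
)
```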
I tested several key metrics:
- For magazine pages at 1024×1024 resolution, DeepEncoder produced only 256 visual tokens without running out of VRAM during inference
- In the public OmniDocBench comparison, the "Small" mode with 100 tokens outperformed GOT-OCR2.0 with 256 tokens
- Gundam mode used fewer than 800 tokens yet surpassed the MinerU2.0 pipeline, which requires about 7,000 tokens
This directly demonstrates the "less is more" effect.
Based on my own use and feedback from other users:
- At up to 10x compression, OCR accuracy stays around 97%
- At 10-12x compression, it holds around 90%
- Beyond 20x compression, accuracy drops sharply to roughly 60%

For well-formatted documents (such as long-form tech articles), Free OCR typically finishes in just over 20 seconds (about 24 seconds in my test). Grounding mode takes longer, close to a minute (about 58 seconds), but it outputs Markdown that is very convenient to copy and paste.
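To put those ratios in context: the compression ratio here is simply the number of text tokens a page would need divided by the number of visual tokens the encoder emits. A back-of-the-envelope check, with illustrative (not measured) numbers:

```python
# Hypothetical page: ~2,500 text tokens of content encoded as 256 visual tokens (Base mode).
text_tokens = 2500      # what a text-only LLM would ingest for this page (assumed figure)
vision_tokens = 256     # DeepEncoder output for one 1024x1024 page

ratio = text_tokens / vision_tokens
print(f"compression ratio ~= {ratio:.1f}x")   # ~= 9.8x, inside the ~97% accuracy zone
```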
My workflow has two steps: first, I use Free OCR to quickly confirm content, and if I need to archive or process further, I run the Grounding version to export Markdown. Tables are directly converted to HTML, and even chemical formulas can be transformed into SMILES format, which is particularly useful for academic PDFs.
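In code, that two-pass workflow is just two prompts against the same model and tokenizer loaded above. The prompt strings follow the patterns shown on the model card ("Free OCR" for plain text, a grounding prompt for Markdown) and the helpers assume `infer()` returns the decoded text; verify both against the version you pull.

```python
def quick_look(image_file: str) -> str:
    """Pass 1: fast plain-text dump to confirm the content is what I expect."""
    return model.infer(
        tokenizer,
        prompt="<image>\nFree OCR. ",
        image_file=image_file,
        output_path="./ocr_out",
        base_size=1024, image_size=1024, crop_mode=False,
    )

def archive_markdown(image_file: str) -> str:
    """Pass 2: slower grounding run that emits Markdown (tables come out as HTML)."""
    return model.infer(
        tokenizer,
        prompt="<image>\n<|grounding|>Convert the document to markdown. ",
        image_file=image_file,
        output_path="./ocr_out",
        base_size=1024, image_size=1024, crop_mode=False,
        save_results=True,
    )
```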

A couple of things to note: don't be too aggressive with the compression ratio; staying within 10x is the sweet spot. Also, this is not yet an instruction-tuned chat model, so if you want to use it as a multimodal assistant, some prompt engineering is still needed.
Some users have reported trouble on edge cases such as rotated text or low-quality scans. Most visual encoders do well on clean documents but struggle with pages that were photocopied over and over in the 90s and have gone blurry. It's also worth watching where compression artifacts first appear for specific content types, such as small fonts or dense tables.
Testing environment: an RTX 4090 running PyTorch, with VRAM usage staying within reasonable limits. Time-wise, context (prefill) processing takes most of the run, while actual generation is relatively short.
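If you want to reproduce the timing and VRAM numbers, a simple wrapper around the helpers above is enough; `torch.cuda.max_memory_allocated()` reports the process-wide peak, which is what matters for fitting on a single card.

```python
import time
import torch

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()

_ = quick_look("magazine_page.png")   # or archive_markdown(...) for the slower grounding path

elapsed = time.perf_counter() - start
peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(f"wall time: {elapsed:.1f} s, peak VRAM: {peak_gib:.2f} GiB")
```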
Published: 2025-10-22 15:43