OCR and Document AI Leaderboard 2026

Rankings of AI models on OCR and document understanding benchmarks - OCRBench, DocVQA, InfographicVQA, ChartQA, TextVQA, and MMMU-Pro. Covers GPT-4.1 Vision, Claude 4 Sonnet/Opus, Gemini 2.5 Pro, Qwen2.5-VL, InternVL3, Mistral OCR, and more.


Every RAG pipeline starts with the same unglamorous problem: getting text off a document. PDFs with multi-column layouts, scanned invoices with handwritten annotations, financial tables embedded in slides, scientific papers where equations break OCR engines - these are the inputs that real production systems have to process, and the quality of what comes out sets a ceiling on everything downstream. You can run the best embedding model and the best retrieval layer in the world, but if your text extraction is garbage, your retrieved context will be garbage too. See our guide to RAG systems for context on why extraction quality matters so much at the pipeline level.

This leaderboard focuses specifically on document and text extraction capability. It's a narrower cut than our broader vision-language benchmarks leaderboard and multimodal benchmarks leaderboard, which cover scientific chart reasoning, video understanding, and perceptual tasks. Here we're focused on the part of the stack that handles scanned PDFs, forms, structured documents, and text-heavy images - the workhorses of enterprise document processing.

TL;DR

  • Qwen2.5-VL 72B leads open-source OCRBench at 877/1000 and DocVQA at 96.1% - the open-source alternative to frontier APIs for high-volume document pipelines
  • GPT-4.1 Vision and Gemini 2.5 Pro lead on DocVQA among hosted models, both clearing 93%
  • Tesseract and PaddleOCR remain useful baselines for high-volume local workloads where VLM latency/cost is prohibitive, but they lose badly on handwriting, complex layouts, and multilingual content
  • Mistral OCR (specialized pipeline tool) and LlamaParse offer strong accuracy on PDFs but lack the broad benchmark coverage to rank fairly against general-purpose VLMs

The Benchmarks Explained

OCRBench and OCRBench v2

OCRBench was designed specifically to evaluate text recognition capability in vision-language models - filling a gap that general visual reasoning benchmarks don't address. The original benchmark covers 29 OCR task types across 5,000 question-answer pairs. OCRBench v2 substantially expanded the scope to 10,000 human-verified QA pairs across 31 scenario types, including handwritten text, mathematical notation, multilingual content, and degraded documents. Scores are reported as integers out of 1000. As of early 2026, top models score in the 850-900 range - this benchmark still has headroom.

One important note: OCRBench and OCRBench v2 were developed by the same team at SCUT and WeChat AI, and Chinese-language OCR is better represented in them than in most Western benchmarks. Models with strong Chinese text recognition (InternVL3, Qwen2.5-VL) benefit from this. English-only pipelines should weight DocVQA and InfographicVQA more heavily.
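To make that weighting concrete, here's a small sketch of how you might fold per-benchmark scores into one figure for an English-only pipeline. The weights are illustrative choices, not an official methodology; OCRBench is rescaled from /1000 to a 0-100 scale so everything is comparable.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average over the benchmarks a model actually reports.

    `scores` maps benchmark name -> score on a 0-100 scale (OCRBench
    rescaled from /1000); missing benchmarks are skipped, so models with
    partial coverage are compared only on what they publish.
    """
    total = sum(weights[b] * s for b, s in scores.items() if b in weights)
    norm = sum(w for b, w in weights.items() if b in scores)
    return total / norm if norm else 0.0

# English-only weighting: DocVQA and InfographicVQA dominate, OCRBench demoted.
english_weights = {"OCRBench": 0.1, "DocVQA": 0.4, "InfoVQA": 0.3, "ChartQA": 0.2}

# Qwen2.5-VL 72B, from the rankings table (OCRBench 877/1000 -> 87.7).
qwen = {"OCRBench": 87.7, "DocVQA": 96.1, "InfoVQA": 84.5, "ChartQA": 89.8}
print(f"{weighted_score(qwen, english_weights):.1f}")
```

Swap the weights to match your document mix: a pipeline heavy on charts and slide decks would raise ChartQA's and InfoVQA's shares.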

DocVQA

DocVQA is the most widely used document understanding benchmark. It uses scanned or photographed documents - forms, receipts, contracts, invoices, research papers, memos - paired with free-form questions about their content. The challenge is handling varied fonts, degraded image quality, complex layouts, and fields that wrap around visual elements. Most frontier models now clear 92-96%; the benchmark is reaching saturation for top-tier models but remains an excellent signal for mid-tier and open-source models where meaningful gaps persist.

InfographicVQA

InfographicVQA is harder than DocVQA specifically because infographics combine text, charts, icons, arrows, and non-linear reading order in ways that require parsing layout as well as text content. A model that reads scanned forms well can still struggle here because the visual structure carries semantic meaning. Top scores hover around 60-75% for frontier models, and the benchmark continues to differentiate performance better than DocVQA at the top tier.

ChartQA

ChartQA tests reasoning about charts: bar charts, line graphs, pie charts, and scatter plots. Questions require both reading the chart correctly and performing inference (e.g., "which month had the third-highest sales?"). Most frontier models now clear 85-92%. It's less useful for differentiating the top tier but still valuable for identifying weaker models and tracking open-source progress.

TextVQA

TextVQA evaluates reading text in natural scene images - signs, menus, labels, book covers, business cards, whiteboards. This is different from document OCR: the text appears in real-world photography at varying angles, in varying lighting, sometimes partially occluded. Top models score 80-85%. TextVQA is a particularly useful test for models deployed in consumer contexts where document quality isn't controlled.

MMMU-Pro (Vision subset)

MMMU-Pro is primarily an academic visual reasoning benchmark, but its Vision subset - which removes textual answer cues and forces models to reason from images alone - is relevant to document AI because many of its questions involve figures, tables, and diagrams from academic papers. The 10-option answer format makes random guessing nearly useless, and scores range from 55-82% for frontier models. We include it as a proxy for structured visual reasoning rather than pure text extraction.


Rankings Table

Scores are the best publicly reported results as of April 2026, sourced from provider technical reports, model cards, and independent evaluators. "Not reported" means no verified public score was found at time of writing - this does not imply the model performs poorly, only that the data isn't published. Where evaluation configurations vary (e.g., resolution settings, prompt formats), the higher verified result is listed.

| Rank | Model | Type | OCRBench | DocVQA | InfoVQA | ChartQA | TextVQA | MMMU-Pro (V) | Notes |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen2.5-VL 72B | Local/API | 877 | 96.1% | 84.5% | 89.8% | 84.9% | 70.2% | Open-weight; top open-source across all OCR benchmarks |
| 2 | Gemini 2.5 Pro | Hosted | Not reported | 93.8% | 87.7% | 90.0% | Not reported | 75.2% | Leads InfoVQA and ChartQA among APIs |
| 3 | GPT-4.1 Vision | Hosted | Not reported | 93.4% | 85.2% | 88.3% | 82.4% | 72.5% | Strong across document tasks; widely deployed |
| 4 | Claude 4 Opus | Hosted | Not reported | 91.5% | 81.8% | 86.2% | 80.9% | 77.3% | Leads MMMU-Pro Vision; best for reasoning-heavy docs |
| 5 | InternVL3 76B | Local/API | 855 | 95.4% | 83.1% | 89.7% | 82.7% | Not reported | Close second to Qwen2.5-VL on OCRBench |
| 6 | Claude 4 Sonnet | Hosted | Not reported | 89.2% | 78.3% | 84.6% | 79.1% | 70.4% | Cost-efficient mid-tier option |
| 7 | DeepSeek-VL 2 (27B MoE) | Local/API | 809 | 92.8% | 79.2% | 85.3% | 80.3% | 62.4% | Efficient MoE architecture; strong DocVQA for model size |
| 8 | GPT-5 | Hosted | Not reported | Not reported | Not reported | Not reported | Not reported | 82.0% | Benchmark coverage limited at time of writing |
| 9 | Kimi K2-VL | Hosted | Not reported | 90.1% | 77.4% | 83.8% | 78.6% | Not reported | Moonshot AI; competitive mid-tier |
| 10 | Mistral OCR | Hosted (API) | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | Specialized pipeline; not tested on standard VQA benchmarks |
| 11 | LlamaParse | Hosted (API) | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | Document-to-markdown pipeline; no VQA benchmark data |
| 12 | Nougat (0.1.0-base) | Local | Not reported | 56.2% | Not reported | Not reported | Not reported | Not reported | Meta; specialized for scientific PDFs only |
| 13 | MinerU | Local | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | PDF parsing pipeline; benchmarks not standardized |
| 14 | PaddleOCR (v4) | Local | 561 | Not reported | Not reported | Not reported | Not reported | Not reported | Strong on printed text; limited on complex layouts |
| 15 | Tesseract (5.x) | Local | 404 | Not reported | Not reported | Not reported | Not reported | Not reported | Baseline; no visual reasoning, text-only extraction |

Key Takeaways

Open-Source Has Closed the Gap on DocVQA - But Not Everywhere

The most significant shift since 2024 is that Qwen2.5-VL 72B and InternVL3 76B have effectively closed the gap with hosted frontier APIs on DocVQA. Qwen2.5-VL 72B scores 96.1% and InternVL3 scores 95.4% - higher than GPT-4.1 Vision (93.4%) and Gemini 2.5 Pro (93.8%) on this benchmark. For organizations running document processing pipelines at scale with sensitive data, self-hosting Qwen2.5-VL is now a genuinely competitive alternative to sending every document to a third-party API.

The picture on OCRBench is similar: Qwen2.5-VL's 877 and InternVL3's 855 top the table, and no hosted frontier model has published a verified score. These are not close calls.

Where the gap remains is on InfographicVQA - Gemini 2.5 Pro leads at 87.7% against Qwen2.5-VL's 84.5% - and on MMMU-Pro Vision, where Claude 4 Opus (77.3%) and Gemini 2.5 Pro (75.2%) both beat Qwen2.5-VL (70.2%). Complex layout reasoning and multi-discipline academic reasoning still favor frontier hosted models.

Gemini 2.5 Pro Dominates InfoVQA and ChartQA

For infographic understanding specifically, Gemini 2.5 Pro is the leader at 87.7% on InfographicVQA and 90.0% on ChartQA. Infographics combine non-linear text layout, embedded charts, icons with semantic meaning, and arrow-linked annotations - a genuinely hard multi-modal problem. Google's natively multimodal training approach shows here. If your document processing pipeline handles marketing materials, annual reports, slide decks, or any document type where visual layout carries meaning beyond the text, Gemini 2.5 Pro is the strongest hosted option. See our best AI PDF tools comparison for how these models perform inside production PDF tools built on top of their APIs.

Specialized Tools: Pipeline Products vs. Models

Mistral OCR, LlamaParse, MinerU, and Docling sit in a different category from the VLMs above. They're document parsing pipelines - they take a PDF and produce structured output (Markdown, JSON, HTML) rather than answering benchmark questions about images. Because they're optimized for a specific output format, they don't submit to standard VQA benchmarks and can't be ranked in the same table.

That said, they have real advantages. Mistral OCR handles multi-column layouts and embedded tables in ways that generic VLMs sometimes mangle. LlamaParse preserves section structure and heading hierarchy, which matters enormously for downstream RAG chunking. MinerU and Docling are open-source alternatives that produce clean Markdown from PDFs. The right choice depends on whether you need structured extraction (use a pipeline tool) or visual question answering and reasoning over the document (use a VLM). Often you need both - a pipeline tool for extraction, a VLM for downstream analysis. Check our RAG benchmarks leaderboard for how extraction quality propagates into retrieval accuracy.
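The heading-hierarchy point is worth making concrete: a chunker that carries the heading breadcrumb into each chunk preserves the section context a retriever needs. Below is a minimal, illustrative sketch over Markdown output from any of these pipeline tools - the chunking policy and breadcrumb format are assumptions of this example, not any tool's actual behavior.

```python
import re

def chunk_markdown(md: str, max_chars: int = 1200) -> list[str]:
    """Split Markdown into chunks, prefixing each chunk with the breadcrumb
    of headings above it so retrieved chunks keep their section context."""
    trail: dict[int, str] = {}   # heading level -> current heading text
    chunks: list[str] = []
    buf: list[str] = []

    def flush() -> None:
        body = "\n".join(buf).strip()
        buf.clear()
        if not body:
            return
        crumbs = " > ".join(trail[level] for level in sorted(trail))
        chunks.append(f"[{crumbs}]\n{body}" if crumbs else body)

    for line in md.splitlines():
        heading = re.match(r"^(#{1,6})\s+(.*)", line)
        if heading:
            flush()
            level = len(heading.group(1))
            trail[level] = heading.group(2).strip()
            for deeper in [k for k in trail if k > level]:
                del trail[deeper]   # a new H2 invalidates old H3s, etc.
        else:
            buf.append(line)
            if sum(len(x) for x in buf) > max_chars:
                flush()
    flush()
    return chunks

sample = "# Annual Report\n## Revenue\nSales grew.\n## Costs\nCosts fell."
for chunk in chunk_markdown(sample):
    print(chunk)
```

The breadcrumb prefix (`[Annual Report > Revenue]`) is what makes a chunk like "Sales grew." retrievable for queries about the report's revenue section.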

The Cost Calculation for High-Volume Pipelines

At high document volume, the economics shift sharply. Running 100,000 invoices per month through GPT-4.1 Vision at API pricing is expensive in a way that running the same volume through self-hosted Qwen2.5-VL 72B on owned hardware is not. If Qwen2.5-VL matches or beats the hosted API on your benchmark task (DocVQA-style extraction), the open-source route is worth the engineering investment. The 72B model requires 4x A100s or equivalent to run at practical throughput, which is a non-trivial infrastructure cost but amortizes quickly at scale.
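The break-even arithmetic is easy to sketch. Every number below is a placeholder assumption - substitute your own API pricing, per-document token counts (visual tokens for high-resolution scans can dominate), hardware price, and amortization schedule.

```python
def monthly_api_cost(docs: int, tokens_per_doc: int, usd_per_mtok: float) -> float:
    """Hosted-API cost: total tokens (visual + output) at a blended rate."""
    return docs * tokens_per_doc * usd_per_mtok / 1_000_000

def monthly_selfhost_cost(hardware_usd: float, amortize_months: int,
                          power_and_hosting_usd: float) -> float:
    """Owned hardware amortized linearly, plus fixed monthly running costs."""
    return hardware_usd / amortize_months + power_and_hosting_usd

# Placeholder figures: 100k invoices/mo, ~8k tokens each at a blended
# $8/Mtok, vs. ~$80k of GPUs amortized over 36 months plus $500/mo to run.
api = monthly_api_cost(docs=100_000, tokens_per_doc=8_000, usd_per_mtok=8.0)
selfhost = monthly_selfhost_cost(hardware_usd=80_000, amortize_months=36,
                                 power_and_hosting_usd=500)
print(f"API: ${api:,.0f}/mo  self-hosted: ${selfhost:,.0f}/mo")
```

Under these assumptions self-hosting wins well before the hardware is paid off; at a tenth of the volume, the API is cheaper and carries no engineering overhead. Run the numbers on your own distribution before committing.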

For low-volume, high-quality-requirement use cases, the hosted frontier models are still the path of least resistance.


[Image: a document scanner processing papers. Caption: OCR quality sets a ceiling on RAG pipeline accuracy - poor text extraction at the ingestion stage compounds through retrieval and generation. Source: pexels.com]


Model Notes

Qwen2.5-VL 72B

Alibaba's Qwen2.5-VL series is the open-source benchmark leader across OCRBench (877), DocVQA (96.1%), and ChartQA (89.8%). The 72B model uses a native dynamic resolution mechanism that adjusts the number of visual tokens based on image size - critical for long-document PDFs where resolution variation is extreme. The team explicitly targeted document understanding in training, and it shows. Available on HuggingFace at Qwen/Qwen2.5-VL-72B-Instruct. Also available via Alibaba Cloud API.
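The dynamic-resolution idea can be sketched numerically: scale the image so its area falls within a fixed pixel budget, snap the sides to a cell grid, and count cells as visual tokens. The 28-pixel cell and the pixel bounds below echo the mechanism described above, but the rounding and constants here are simplified assumptions for illustration, not the model's exact preprocessing.

```python
import math

def visual_token_estimate(width: int, height: int, cell: int = 28,
                          min_pixels: int = 256 * 28 * 28,
                          max_pixels: int = 1280 * 28 * 28) -> int:
    """Estimate visual tokens under a dynamic-resolution scheme: rescale the
    image area into [min_pixels, max_pixels], snap both sides to multiples
    of `cell`, then count cell-sized patches."""
    area = width * height
    scale = 1.0
    if area > max_pixels:
        scale = math.sqrt(max_pixels / area)   # shrink oversized inputs
    elif area < min_pixels:
        scale = math.sqrt(min_pixels / area)   # upscale tiny inputs
    w = max(cell, round(width * scale / cell) * cell)
    h = max(cell, round(height * scale / cell) * cell)
    return (w // cell) * (h // cell)

# A 500x500 crop fits the budget as-is; an A4 page scanned at ~150 DPI
# (1240x1754) gets downscaled to stay near the token ceiling.
print(visual_token_estimate(500, 500), visual_token_estimate(1240, 1754))
```

The practical takeaway is that token count (and therefore cost and latency) scales with scanned resolution up to the ceiling - worth knowing before feeding 600 DPI scans into a per-token-billed pipeline.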

InternVL3 76B

InternVL3 from OpenGVLab is the strongest alternative to Qwen2.5-VL in the open-source document AI space. Its 855 on OCRBench and 95.4% on DocVQA are both near-frontier. The training approach combines RLHF with preference optimization specifically for document and chart understanding tasks. InternVL3 also has stronger support for less common scripts and multilingual documents than Qwen2.5-VL.

Claude 4 Opus and Sonnet

Anthropic's Claude 4 Opus leads MMMU-Pro Vision in this table at 77.3%, making it the strongest choice when document analysis requires complex multi-step reasoning - interpreting a contract with conditional clauses, extracting information from a research paper that requires domain knowledge, analyzing an annual report for specific financial relationships. For simpler extraction tasks (reading form fields, extracting invoice line items), Claude 4 Sonnet is sufficient and cheaper.

DeepSeek-VL 2 (27B MoE)

DeepSeek-VL 2 is a 27B Mixture-of-Experts model that achieves 809 on OCRBench and 92.8% on DocVQA - competitive with models several times its active parameter count. Its efficiency makes it attractive for deployments where GPU memory is constrained. The MoE architecture means only ~4.5B parameters are active per token, allowing fast inference on hardware that would struggle with a dense 70B model.

Nougat

Meta's Nougat is a specialized model for converting academic PDFs to structured Markdown, trained specifically on scientific papers. It performs well on LaTeX-heavy documents with equations and numbered references. Its DocVQA score (56.2%) reflects that it's not designed for general document understanding - it's a narrow tool optimized for scientific publications. For arXiv papers, preprints, and academic PDF processing, Nougat remains a practical local option despite its age.

PaddleOCR and Tesseract

Both are included as cost/performance baselines. PaddleOCR v4 is a strong traditional OCR engine for printed text, particularly in Chinese and other CJK scripts. Tesseract 5.x is the canonical open-source OCR baseline that most comparisons use as a floor. Neither does visual reasoning - they produce text only, with no understanding of layout structure, tables, or context. OCRBench scores of 561 and 404 respectively confirm the large capability gap versus VLMs on complex document types.


Methodology

All VQA scores are sourced from official provider technical reports, model cards on HuggingFace, or peer-reviewed benchmark papers. Where multiple evaluations exist, the highest independently verified score is used. I did not run custom evaluations for this article.

OCRBench scores for traditional tools (PaddleOCR, Tesseract) are sourced from the OCRBench v2 paper, which evaluated a range of tools beyond VLMs as baselines.

Models with partial benchmark coverage are still listed if they have notable deployment significance or strong results on the benchmarks they do report. The "Not reported" cells reflect genuinely unavailable data, not poor performance.


Caveats and Edge Cases

Handwriting

Most benchmark scores reported above are for printed text. Handwriting recognition is a separate capability where performance drops significantly across all models. DocVQA includes some handwritten content but it's not the majority. If handwriting is central to your use case, look specifically for handwriting evaluation data - OCRBench v2 includes handwriting scenarios that surface this gap.

Low-Resolution and Degraded Scans

At 72 DPI (typical for faxed documents or low-quality scans), even strong models degrade noticeably. The benchmarks mostly use cleaner document images. Production pipelines that receive faxes, photos of documents, or documents scanned at low DPI should validate on their actual input distribution rather than relying on published benchmark scores.
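A cheap pipeline guard is to estimate effective DPI from pixel dimensions before sending a document to a model, and route low-resolution inputs to re-scanning or human review. The full-page assumption and the 150 DPI floor below are illustrative choices, not a published cutoff.

```python
def effective_dpi(pixel_width: int, pixel_height: int,
                  page_w_in: float = 8.5, page_h_in: float = 11.0) -> float:
    """Effective scan DPI, assuming the image covers a full page of the
    given physical size (US Letter by default)."""
    return min(pixel_width / page_w_in, pixel_height / page_h_in)

def needs_review(pixel_width: int, pixel_height: int,
                 floor_dpi: float = 150.0) -> bool:
    """Flag documents below an illustrative quality floor before OCR."""
    return effective_dpi(pixel_width, pixel_height) < floor_dpi

# A 612x792 image of a Letter page is 72 DPI - the degraded-fax regime;
# 2550x3300 is a clean 300 DPI scan.
print(needs_review(612, 792), needs_review(2550, 3300))
```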

Non-Latin Scripts

Arabic, Devanagari, Thai, and other non-Latin scripts are underrepresented in DocVQA and InfographicVQA. InternVL3 and Qwen2.5-VL have explicit multilingual training and perform better here. For non-Latin script use cases, prioritize OCRBench scores (which include multilingual scenarios) and supplement with language-specific evaluations.

Table Extraction

Tables are one of the hardest document elements to extract correctly. Cell merging, nested headers, borderless tables, and rotated tables all cause failures in models that otherwise perform well on running text. None of the standard benchmarks above specifically isolate table extraction quality. LlamaParse and Docling have dedicated table parsing logic that often outperforms generic VLMs on this specific sub-task.

Mathematical Equations

Equations in PDFs are frequently stored as images even in "digital" PDFs (not scanned). Standard VLMs read them with varying accuracy; specialized tools like Mathpix (proprietary OCR for equations) or Nougat (for LaTeX-heavy academic PDFs) perform better on this narrow task. If your pipeline processes academic papers or technical documents with dense math, benchmark specifically on equation extraction.


FAQ

Which model is best for processing invoices and receipts?

Qwen2.5-VL 72B and InternVL3 76B both clear 95% on DocVQA and can be self-hosted. For hosted API options, Gemini 2.5 Pro and GPT-4.1 Vision both clear 93%. For high-volume pipelines with sensitive financial data, the open-source models offer comparable accuracy without sending data to third-party APIs.

Is Mistral OCR better than using a VLM directly?

Mistral OCR is a document pipeline that returns structured Markdown rather than answering VQA questions. For use cases where you need clean, structured text output (feeding a RAG pipeline, extracting named fields), a pipeline tool may produce better structured output than a VLM even if the VLM scores higher on VQA benchmarks. They're solving slightly different problems. See our best AI PDF tools comparison for a deeper look at document pipeline tools.

Why isn't Docling in the ranking table?

Docling is an open-source document parsing library (IBM) that produces structured output from PDFs - it's not a VLM and doesn't take image inputs for QA. Like LlamaParse and MinerU, it doesn't have published VQA benchmark scores. It's mentioned in the text as a pipeline tool option but doesn't belong in a VLM benchmark table.

How does OCRBench differ from DocVQA?

OCRBench specifically focuses on text recognition tasks - can the model read the text in the image correctly? DocVQA adds a question-answering layer on top - not just "read the text" but "understand the document well enough to answer questions about it." OCRBench is a better proxy for raw text extraction quality. DocVQA tests the full pipeline of read-then-reason.

Will these rankings change quickly?

Yes. Document AI is one of the more actively developed capability areas right now. Qwen2.5-VL's dominance was established in late 2024 and early 2025; it could be overtaken by InternVL or a new architecture within a year. Check OCRBench v2 and DocVQA leaderboards quarterly for updates.



Last verified: April 19, 2026

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.