Best AI for Document Understanding - March 2026
Claude Opus 4.6 posts the strongest combined DocVQA (96.1%) and ChartQA (93.4%) scores, while open-weight Qwen2.5-VL-72B tops DocVQA outright at 96.4%, making the best PDF analysis model a question of budget and deployment.

TL;DR
- Claude Opus 4.6 combines 96.1% on DocVQA with 93.4% on ChartQA, the strongest overall results for PDF and document analysis
- Qwen2.5-VL-72B scores 96.4% on DocVQA as an open-weight alternative you can self-host, beating every proprietary model on that specific benchmark
- Rankings rely on DocVQA (form/report extraction), ChartQA (visual data reasoning), and OCRBench (raw text recognition) - three benchmarks testing different document skills
The best AI model for document understanding in March 2026 depends on whether you need general-purpose PDF analysis or specialized extraction. For broad document Q&A across forms, reports, scanned pages, and charts, Claude Opus 4.6 posts the strongest combined scores: 96.1% on DocVQA and 93.4% on ChartQA. If you need an open-weight model you can deploy on your own infrastructure, Qwen2.5-VL-72B edges ahead on DocVQA alone at 96.4%.
The practical gap between the top five models has narrowed to 1-3 percentage points on most benchmarks. Choosing the right one now comes down to pricing, context window size, and whether your documents involve mostly text, tables, charts, or handwritten content.
Rankings Table
| Rank | Model | Provider | DocVQA | ChartQA | Price (Input/Output) | Verdict |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 96.1% | 93.4% | $5/$25 | Best combined score across document benchmarks |
| 2 | Qwen2.5-VL-72B | Alibaba | 96.4% | 88.3% | Open-weight | Top DocVQA score, self-hostable |
| 3 | GPT-5.4 | OpenAI | 95.0% | 92.5% | $2.50/$15 | Strong all-rounder with 1M token context |
| 4 | Qwen2.5-VL-7B | Alibaba | 95.7% | 87.3% | Open-weight | Remarkable accuracy for a 7B model |
| 5 | Claude 3.5 Sonnet | Anthropic | 95.2% | - | $3/$15 | Reliable mid-tier option |
| 6 | Gemini 3.1 Pro | Google | 92.0% | 90.0% | $2/$12 | Best value among proprietary models |
| 7 | Qwen2.5-VL-32B | Alibaba | 94.8% | - | Open-weight | Middle ground between 7B and 72B |
| 8 | Llama 4 Maverick | Meta | 94.4% | - | Open-weight | Competitive open-source entry from Meta |
| 9 | DeepSeek VL2 | DeepSeek | 93.3% | 86.0% | Open-weight | MoE efficiency with only 4.5B active params |
| 10 | Nova Pro | Amazon | 93.5% | - | ~$0.80/$3.20 | Budget-friendly AWS-native option |
Detailed Analysis
Claude Opus 4.6 - The Complete Package
Opus 4.6 doesn't win DocVQA outright (Qwen2.5-VL-72B technically edges it by 0.3 points), but it posts the highest combined performance across document benchmarks when you factor in ChartQA at 93.4%. That ChartQA lead matters because real-world document workflows rarely involve pure text extraction. Financial reports mix prose with bar charts. Research papers embed tables inside figures. Opus 4.6 handles all of these in a single pass.
Anthropic's PDF processing pipeline converts each page into an image and pairs it with extracted text, giving the model both visual layout cues and raw content to work with. Each page consumes roughly 1,500 to 3,000 tokens depending on density. With a 1M token context window, that means you can feed it around 300 to 600 pages in a single request.
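The page-budget arithmetic above is easy to sanity-check. A minimal sketch, using the 1,500-3,000 tokens-per-page range quoted above; the reply-token reserve is an assumption, not a documented figure:

```python
def pages_per_request(context_tokens=1_000_000,
                      tokens_per_page=(1_500, 3_000),
                      reply_budget=8_000):
    """Estimate how many PDF pages fit in one request.

    Returns a (dense_pages, sparse_pages) pair: the worst case assumes
    every page costs the high end of the token range, the best case the
    low end. reply_budget tokens are reserved for the model's answer
    (an assumed figure).
    """
    usable = context_tokens - reply_budget
    low_density, high_density = tokens_per_page
    return usable // high_density, usable // low_density
```

With the defaults this lands at roughly 330 to 661 pages, in line with the 300-600 page figure above.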
The tradeoff is cost. At $5/$25 per million tokens, Opus 4.6 is the most expensive model on this list. For teams processing thousands of documents daily, the bill adds up. Consider Claude Sonnet 4.6 for high-volume extraction where near-flagship accuracy at 40% lower cost makes better economic sense.
Structured document analysis involves extracting data from forms, invoices, and reports - tasks where AI models now routinely exceed 93% accuracy.
GPT-5.4 - The Developer-Friendly Choice
OpenAI's GPT-5.4 scores 95.0% on DocVQA and 92.5% on ChartQA, placing it in a tight cluster with the Anthropic and Alibaba models. What sets it apart is the developer tooling around it. OpenAI's cookbook includes a dedicated guide on document and multimodal understanding that walks through four key configuration levers: image detail settings, verbosity controls, reasoning effort, and Code Interpreter for multi-pass inspection.
GPT-5.4 can natively ingest PDFs with integrated OCR for scanned pages, a meaningful improvement over earlier GPT-4 models that required users to convert pages to images first. Its 1.05M token context window is competitive with Opus 4.6. At $2.50/$15, it costs half as much on input tokens.
For developers building document processing pipelines, GPT-5.4's detail="original" mode is worth noting. It preserves full-resolution page images for handwritten content, small labels, and dense tables where downscaling would lose critical information.
Qwen2.5-VL-72B - Open-Weight Leader
The Qwen2.5-VL series from Alibaba's Qwen team controls the open-weight document understanding space. The 72B-parameter flagship posts 96.4% on DocVQA, the highest score on that benchmark from any model we track. The 7B variant hits 95.7%, which is remarkable for a model small enough to run on a single consumer GPU with quantization.
For teams that need to self-host due to data privacy requirements, regulatory constraints, or sheer volume economics, the Qwen2.5-VL family is the clear first choice. Running the 7B model on your own hardware costs approximately $0.09 per 1,000 pages with a consumer GPU, a fraction of any API-based alternative.
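To make the volume-economics comparison concrete, here is a rough per-1,000-page API cost model. The tokens-per-page figures are assumptions, and real bills also depend on prompt overhead and answer length:

```python
def api_cost_per_1k_pages(price_in, price_out,
                          tokens_in_per_page=2_000,
                          tokens_out_per_page=300):
    """Rough API cost (USD) to process 1,000 pages, given $/1M-token
    input and output prices. Per-page token counts are assumed averages,
    not measurements."""
    per_page = (tokens_in_per_page * price_in
                + tokens_out_per_page * price_out) / 1_000_000
    return 1_000 * per_page

# e.g. Claude Opus 4.6 at $5/$25 per million tokens comes to about
# $17.50 per 1,000 pages, versus the ~$0.09 self-hosted figure above.
```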
The weakness is ChartQA. At 87.3% for the 7B model and 88.3% for the 72B, the Qwen series trails proprietary models by 4-6 points on visual data reasoning. If your documents are chart-heavy, Claude Opus 4.6 or GPT-5.4 will handle them more reliably.
Gemini 3.1 Pro - Best Value Proprietary
Gemini 3.1 Pro sits at 92.0% on DocVQA and 90.0% on ChartQA. Those numbers trail the leaders by 3-4 points, but at $2/$12 per million tokens, it delivers roughly 96% of GPT-5.4's document accuracy at 80% of the cost. Its 1M token context window can process entire document collections in a single request.
Where Gemini 3.1 Pro especially shines is multimodal reasoning across mixed-format inputs. Google's model handles text, images, video, and audio within the same context, making it effective for workflows that combine document analysis with other media types. For enterprise teams already on Google Cloud, the Vertex AI integration reduces friction.
DeepSeek VL2 - Efficiency Through MoE
DeepSeek VL2 takes a different approach with its Mixture-of-Experts architecture. Only 4.5 billion parameters activate per forward pass despite the model's total size, which keeps inference costs low while maintaining 93.3% on DocVQA and 86.0% on ChartQA.
The model excels at OCR-heavy tasks: text recognition, form field extraction, and structured data parsing from invoices and receipts. The DeepSeek-VL2-Tiny variant (1.0B active parameters) reaches 88.9% on DocVQA and 81.0% on ChartQA, making it viable for edge deployment on resource-constrained hardware.
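To illustrate why only a fraction of a Mixture-of-Experts model's parameters activate per token, here is a minimal top-k router in the standard style (softmax renormalized over the selected experts). DeepSeek VL2's actual routing differs in detail, so treat this purely as a sketch:

```python
import math

def top_k_route(expert_logits, k=2):
    """Select the k highest-scoring experts for a token and renormalize
    their gate weights with a softmax over just those k.

    Returns [(expert_index, weight), ...]. Only the chosen experts run
    their forward pass, which is how active parameters stay far below
    the model's total parameter count.
    """
    chosen = sorted(range(len(expert_logits)),
                    key=lambda i: expert_logits[i], reverse=True)[:k]
    exps = [math.exp(expert_logits[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]
```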
Chart understanding remains one of the harder document analysis tasks, requiring models to interpret visual data representations and perform calculations from graphs.
Specialized OCR: Mistral OCR 3
General-purpose multimodal models handle most document tasks well, but dedicated OCR systems still lead on specific extraction challenges. Mistral OCR 3, released in December 2025, claims a 74% win rate over its predecessor across forms, scanned documents, complex tables, and handwriting.
The numbers on table extraction are striking: 96.6% accuracy versus AWS Textract's 84.8% on the same structured table dataset. For handwriting recognition, Mistral OCR 3 hits 88.9% compared to Azure Document Intelligence's 78.2%. At $2 per 1,000 pages, it sits between the open-weight models (effectively free on your hardware) and the general-purpose APIs.
If your pipeline is primarily OCR and structured data extraction rather than document Q&A, Mistral OCR 3 deserves a spot in your evaluation. It outputs HTML table tags that preserve merged cells, headers, and column hierarchies, which downstream systems can parse more reliably than free-text extraction.
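Downstream code can exploit that HTML structure by expanding merged cells into a dense grid. A stdlib sketch, assuming well-formed `<table>`/`<tr>`/`<td rowspan colspan>` markup rather than any specific Mistral OCR 3 output format:

```python
from html.parser import HTMLParser

class TableGrid(HTMLParser):
    """Expand an HTML table with rowspan/colspan into a dense 2-D grid,
    repeating merged-cell text in every position it covers."""

    def __init__(self):
        super().__init__()
        self.grid = []       # finished rows of cell text
        self._pending = {}   # (row, col) -> text carried down by rowspan
        self._row = -1
        self._col = 0
        self._cell = None    # text buffer while inside <td>/<th>
        self._rspan = self._cspan = 1

    def _flush_pending(self):
        # emit cells carried into this row by a rowspan from above
        while (self._row, self._col) in self._pending:
            self.grid[self._row].append(self._pending.pop((self._row, self._col)))
            self._col += 1

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "tr":
            self._row += 1
            self._col = 0
            self.grid.append([])
        elif tag in ("td", "th"):
            self._flush_pending()
            self._cell = []
            self._rspan = int(a.get("rowspan", 1))
            self._cspan = int(a.get("colspan", 1))

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = "".join(self._cell).strip()
            for dc in range(self._cspan):          # fill this row
                self.grid[self._row].append(text)
                for dr in range(1, self._rspan):   # remember for rows below
                    self._pending[(self._row + dr, self._col + dc)] = text
            self._col += self._cspan
            self._cell = None
        elif tag == "tr":
            self._flush_pending()
```

Feeding it a table where "Item" spans two rows and "Q1" spans two columns yields two equal-width rows, which is far easier for downstream systems to consume than free-text extraction.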
Methodology
Rankings in this table rely on three primary benchmarks:
DocVQA (Document Visual Question Answering) tests models on 50,000 questions across 12,000+ document images sourced from the UCSF Industry Documents Library. Documents include letters, memos, reports, and forms with printed, typewritten, and handwritten content. The evaluation metric is Average Normalized Levenshtein Similarity (ANLS). Human performance sits at 94.36%, a threshold that several models now exceed.
ChartQA measures a model's ability to extract data from and reason about charts and graphs. This requires interpreting bar charts, line graphs, pie charts, and scatter plots, then performing calculations or comparisons based on visual information. It's a harder test than pure text extraction because spatial reasoning and numerical inference both factor in.
OCRBench evaluates raw text recognition accuracy across document types including printed text, handwriting, and scene text. While narrower in scope than DocVQA, it provides a useful baseline for pure extraction quality.
A model can lead DocVQA while trailing on ChartQA because these benchmarks test overlapping but distinct skills. DocVQA rewards text extraction and simple reasoning. ChartQA rewards visual data interpretation. Neither perfectly predicts real-world performance on your specific document types, which is why we recommend benchmarking against your own data before committing.
Historical Progression
- March 2025 - GPT-4o held the DocVQA lead at roughly 92.8%, with Claude 3.5 Sonnet close behind.
- July 2025 - Qwen2-VL-72B pushed DocVQA past 94%, establishing open-weight models as serious competitors in document understanding.
- October 2025 - GPT-5.0 introduced native PDF ingestion with integrated OCR, removing the image conversion step that slowed earlier workflows.
- December 2025 - Mistral OCR 3 launched with 96.6% table extraction accuracy, challenging the assumption that general-purpose models are always better than specialists.
- February 2026 - Claude Opus 4.6 and Qwen2.5-VL-72B both crossed 96% on DocVQA, surpassing human-level performance on the benchmark.
The rate of improvement on DocVQA has slowed. Going from 88% to 93% took about eight months. Going from 93% to 96% took another eight months but yielded diminishing real-world impact. At this point, the benchmark is approaching saturation, and practical differences between top models come down to handling edge cases like low-quality scans, unusual layouts, and multilingual documents.
FAQ
What's the best model for analyzing scanned PDFs?
Claude Opus 4.6 handles scanned pages well at 96.1% DocVQA. For dedicated OCR, Mistral OCR 3 leads on handwriting (88.9%) and tables (96.6%) at lower cost.
Is open-source competitive for document understanding?
Yes. Qwen2.5-VL-72B scores 96.4% on DocVQA, beating every proprietary model on that benchmark. The 7B variant hits 95.7% and runs on consumer hardware.
How often do document understanding rankings change?
Major shifts happen 2-3 times per year with new model families. Incremental updates every 6-8 weeks. DocVQA scores above 95% have been stable since late 2025.
What's the cheapest option that still works well?
Qwen2.5-VL-7B is free and open-weight at 95.7% DocVQA. Among APIs, Amazon Nova Pro at roughly $0.80/$3.20 per million tokens offers 93.5%.
Can AI handle multi-page document analysis?
Yes. GPT-5.4 and Claude Opus 4.6 both support 1M+ token contexts, enough for 300-600 PDF pages. Gemini 3.1 Pro also supports 1M tokens with native PDF input.
Should I use a general model or dedicated OCR?
General models (Opus 4.6, GPT-5.4) are better for Q&A and reasoning over documents. Dedicated OCR (Mistral OCR 3) wins on structured extraction like tables and forms.
Sources:
- DocVQA Benchmark Leaderboard - LLM Stats
- Claude PDF Support Documentation
- OpenAI - Getting the Most out of GPT-5.4 for Document Understanding
- Gemini 3.1 Pro Model Card - Google DeepMind
- Qwen2.5-VL Technical Report - Alibaba
- Mistral OCR 3 - VentureBeat
- Comparative Analysis of AI OCR Models - IntuitionLabs
- DocVQA Official
✓ Last verified March 26, 2026
