Vision-Language Benchmarks: Image Reasoning Ranked

Rankings of AI models on the key visual reasoning benchmarks - MMMU, MathVista, ChartQA, DocVQA, OCRBench, AI2D, CharXiv, and more - focused on image and document understanding.

If you've read our multimodal benchmarks leaderboard, you've already seen the summary view - top models, aggregate rankings, how MMMU-Pro and Video-MMMU stack up. This leaderboard covers narrower but harder ground: image reasoning and document understanding specifically, where the tasks are interpreting charts from scientific papers, reading scanned invoices, solving math problems from geometry diagrams, and recognizing structure in complex visual layouts. Audio, video, and general multimodal capability sit elsewhere. The question here is whether a model can actually think about what it sees in a still image.

The benchmarks in this space have grown substantially more demanding since 2023. The original MMMU was largely saturated by mid-2025. ChartQA and DocVQA, while still useful, are now cleared by most frontier models above 90%. The interesting signal in 2026 comes from MMMU-Pro (harder exam questions, 10-option answers), CharXiv-R (realistic scientific charts from arXiv papers), and OCRBench v2 (bilingual text recognition with 31 task scenarios). These are the benchmarks where gaps between models still tell you something.

TL;DR

  • Claude Opus 4.7 leads CharXiv-R at 91.0% - the strongest chart reasoning result from any currently available model
  • Gemini 3.1 Pro tops MMMU-Pro at 82%, with strong document and chart scores across the board
  • Open-source picks: Qwen3-VL and InternVL3 both clear 95% on DocVQA, and the Qwen family leads open models on RealWorldQA and BLINK

The Benchmarks Explained

MMMU and MMMU-Pro

MMMU (Massive Multi-discipline Multimodal Understanding) contains 11,500 college-level problems across 30 subjects, each requiring understanding of images such as figures, charts, diagrams, and photographs. The original version is no longer useful for frontier model differentiation - top models score above 80% and gaps have compressed to noise.

MMMU-Pro matters more now. It expanded the answer options from 4 to 10 (cutting the value of random guessing), removed questions answerable without reading the image, and added harder multi-step problems. Where top models scored around 75% on the original, they score 65-82% on MMMU-Pro. The floor rose but the ceiling spread back out, which is exactly what you want from an evaluation benchmark.
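A quick back-of-the-envelope calculation shows why the option count matters. The snippet below is our own illustration, not part of the benchmark tooling: it computes the score a model gets if it genuinely knows some fraction of the answers and guesses uniformly on the rest.

```python
# Illustrative only: the expected benchmark score when a model truly knows a
# fraction of the questions and guesses uniformly at random on the rest.
def expected_score(known_fraction: float, num_options: int) -> float:
    return known_fraction + (1 - known_fraction) / num_options

# Assume a model genuinely solves 60% of the questions.
for options in (4, 10):
    print(f"{options} options: {expected_score(0.60, options):.1%}")
# 4 options:  70.0%  -> guessing inflates the score by 10 points
# 10 options: 64.0%  -> the same guessing adds only 4 points
```

Ten options shrink the guessing bonus enough that the reported score tracks real capability much more closely.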

MathVista

MathVista tests mathematical reasoning in visual contexts: bar charts with partial data, geometry diagrams with unlabeled angles, function plots that require inference, statistical tables where you need to calculate something that isn't written. It combines 28 existing visual math datasets with three new ones. The test isn't just vision - it requires the model to correctly read the visual element and then perform non-trivial math.

ChartQA and CharXiv

ChartQA asks questions about charts drawn in fairly clean presentation styles. It's scored at 90%+ by most frontier models today, which means it's best used to identify weak performers rather than rank the top tier.

CharXiv is meaningfully harder. The charts come from actual arXiv papers - the kind with dual y-axes, inset subplots, color-coded multi-series data, unconventional labeling, and statistical notation. CharXiv-R specifically assesses reasoning questions that require synthesizing information across the full chart, not just reading off a number. Human performance sits at 80.5%. Top models reached that range only in late 2025.

DocVQA

DocVQA uses scanned or photographed documents - forms, invoices, contracts, research papers with tables - and asks free-form questions about their content. The challenge is handling poor image quality, varied fonts, complex layouts, and text that wraps around form fields. It tests a capability with direct enterprise value: automated document processing.

OCRBench v2

OCRBench v2 is a sizable upgrade over the original, expanding to 10,000 human-verified question-answer pairs across 31 scenario types, including handwritten text, mathematical notation, multilingual content, and degraded documents. As of Q1 2026, the English-track leader is Seed1.6-vision at 62.2% - a sign that this remains a hard benchmark with real headroom.

AI2D

AI2D tests scientific diagram comprehension - the kind of figures found in textbooks and biology papers showing cell structures, geological processes, and physical systems. Most frontier models now clear 89-94%, which puts it in the same category as ChartQA for differentiation purposes: useful for screening models in lower capability tiers rather than ranking the top ones.

BLINK

BLINK reformats 14 classic computer vision tasks - relative depth, visual correspondence, spatial reasoning, multi-view geometry - into 3,807 multiple-choice questions. These are perceptual tasks that humans solve almost instantly but models struggle with. The benchmark's name is literal: the tasks should take humans a blink. When BLINK launched, GPT-4V scored 51% while humans averaged 95%. The gap has since closed substantially, but BLINK remains a good test of genuine perceptual grounding rather than pattern matching from training data.

RealWorldQA

Released by xAI alongside Grok-1.5 Vision, RealWorldQA uses over 700 images taken from vehicles and other real-world outdoor scenes, each paired with a spatial reasoning question. It tests the kind of situational awareness relevant to robotics and autonomous systems rather than academic exam knowledge. The question set is small enough that statistical noise is a concern, but it adds a practical dimension that most other benchmarks lack.


Rankings Table

Scores reflect the best publicly reported results as of April 2026. Where a model has multiple evaluation configurations (e.g., with vs. without extended thinking), the higher result is listed. Scores marked with a dash were not publicly reported at time of writing.

| Rank | Model | Provider | MMMU-Pro | MathVista | CharXiv-R | DocVQA | ChartQA |
|------|-------|----------|----------|-----------|-----------|--------|---------|
| 1 | Gemini 3.1 Pro | Google DeepMind | 82% | 75% | - | 92% | 90% |
| 2 | GPT-5.4 | OpenAI | 81% | 78.4% | - | 95% | 92.5% |
| 3 | Claude Opus 4.7 | Anthropic | - | - | 91.0% | - | - |
| 4 | Gemini 3 Pro | Google DeepMind | 81% | 82.3% | 81.4% | 89% | 89.1% |
| 5 | Claude Opus 4.6 | Anthropic | 77.3% | - | 77.4% | - | - |
| 6 | GPT-5.2 | OpenAI | 79.5% | - | 82.1% | - | - |
| 7 | Qwen3.6 Plus | Alibaba | 78.8% | - | 81.5% | - | - |
| 8 | Qwen3-VL (72B) | Alibaba | 69.3% | 85.8% | - | 96.5% | - |
| 9 | InternVL3-78B | OpenGVLab | - | 79.0% | - | 95.4% | 89.7% |
| 10 | Llama 4 Maverick | Meta | - | 73.7% | - | 94.4% | 90.0% |

A second table covers the perception, diagram, and OCR benchmarks, where open-source models are most competitive:

| Rank | Model | Provider | BLINK | RealWorldQA | AI2D | OCRBench (of 1,000) |
|------|-------|----------|-------|-------------|------|---------------------|
| 1 | Qwen3 VL 235B | Alibaba | 70.7% | 85.1% | 89.7% | - |
| 2 | Qwen3.6 Plus | Alibaba | - | 85.4% | 94.4% | - |
| 3 | InternVL3-78B | OpenGVLab | - | 78.0% | 89.7% | 906 |
| 4 | Qwen3 VL 32B | Alibaba | 67.3% | 79.0% | 89.5% | - |
| 5 | Claude 3.5 Sonnet | Anthropic | - | - | 94.7% | - |
| 6 | Llama 4 Maverick | Meta | - | - | - | - |
| 7 | Phi-4 Multimodal | Microsoft | 61.3% | - | - | - |

[Image: close-up of a business graph on paper - source: pexels.com] Chart understanding is one of the harder visual reasoning tasks - models must identify axes, read legends, and reason about relationships that aren't explicitly stated.


Key Takeaways

Gemini 3.1 Pro Leads on MMMU-Pro, GPT-5.4 Leads on Documents

Gemini 3.1 Pro scores 82% on MMMU-Pro, the highest confirmed score on that benchmark from any currently available model. Its 92% on DocVQA and 90% on ChartQA are competitive but not dominant - GPT-5.4 scores 95% and 92.5% respectively on those two. The split makes sense architecturally. Google trained Gemini as a natively multimodal model from the beginning, which advantages it on broad academic visual reasoning. GPT-5.4 applies a stronger language model to extracted document representations, which advantages it on structured text extraction from forms and invoices.

For practitioners: Gemini 3.1 Pro is the better pick when the visual content is complex or diagram-heavy. GPT-5.4 is the better pick for production document processing pipelines where DocVQA-style extraction is the core task. You can read our Gemini 3.1 Pro review and GPT-5.4 review for hands-on testing beyond benchmark scores.

Claude Opus 4.7 Wins Scientific Charts

Claude Opus 4.7 holds the top CharXiv-R score at 91.0%, 2.6 points above Muse Spark and well ahead of other frontier models. This result is directly tied to the resolution upgrade introduced with Opus 4.7 - maximum image input increased from 1,568px (1.15MP) to 2,576px (3.75MP), roughly 3.3x more pixels. Scientific charts from arXiv papers often pack dense information into small visual areas, and the resolution upgrade makes a concrete difference on exactly those cases.
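The arithmetic behind that 3.3x figure is easy to check. The sketch below assumes a simple resize policy - scale down until both a long-edge cap and a total-pixel cap are satisfied - which is our illustration of how such limits typically work, not documented provider preprocessing:

```python
def effective_size(width: int, height: int, max_edge: int, max_pixels: float):
    """Scale an image down (never up) until both the long-edge cap and the
    total-pixel cap are satisfied. The policy itself is assumed for
    illustration; the provider's real preprocessing may differ."""
    scale = min(
        1.0,
        max_edge / max(width, height),
        (max_pixels / (width * height)) ** 0.5,
    )
    return int(width * scale), int(height * scale)

# A dense arXiv figure rendered at roughly 300 DPI.
w, h = 3300, 2400

for label, max_edge, max_pixels in (
    ("old limit (1,568px / 1.15MP)", 1568, 1.15e6),
    ("new limit (2,576px / 3.75MP)", 2576, 3.75e6),
):
    ew, eh = effective_size(w, h, max_edge, max_pixels)
    print(f"{label}: {ew}x{eh} ≈ {ew * eh / 1e6:.2f} MP")
# old limit (1,568px / 1.15MP): 1257x914  ≈ 1.15 MP
# new limit (2,576px / 3.75MP): 2270x1651 ≈ 3.75 MP  -> about 3.3x more pixels
```

Dense multi-panel figures are exactly the case where the old cap forced aggressive downscaling, so the extra pixel budget goes directly to axis labels and legend text.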

The result also stands apart because CharXiv-R is scored independently from provider benchmarking pipelines. It uses charts from real papers rather than constructed datasets, making contamination less likely. For research and data analysis workflows where scientific figures are central, Opus 4.7's lead on this benchmark translates to practical value.

Claude Opus 4.7's 91.0% on CharXiv-R is the highest chart reasoning score from any generally available model - and it's directly attributable to a 3.3x resolution increase.

Qwen Dominates Open-Source Visual Reasoning

Alibaba's Qwen VL family has established clear leadership among open-source vision-language models. Qwen3 VL 235B tops the BLINK leaderboard at 70.7%, and the family holds the two best RealWorldQA results (85.4% for Qwen3.6 Plus, 85.1% for the 235B model). The 72B Qwen3-VL scores 85.8% on MathVista and 96.5% on DocVQA - the latter the highest DocVQA score in this comparison, ahead of every closed-source frontier model including GPT-5.4. Qwen3.6 Plus posts 94.4% on AI2D, within a few tenths of a point of the best score in our table.

This is a reversal from where things stood 18 months ago, when open-source multimodal models lagged frontier proprietary models by 15-20 points on most benchmarks. On DocVQA the gap has closed entirely for InternVL3 and flipped in Qwen3-VL's favor. On MMMU-Pro, a meaningful gap remains - Qwen3-VL scores 69.3% against the 81-82% from Gemini 3.1 Pro and GPT-5.4. Broad multi-discipline academic reasoning still favors proprietary models. Specialized document and chart tasks don't.

See our Qwen 3 review for context on how the broader Qwen model family performs outside visual benchmarks.

BLINK remains the most humbling benchmark for visual reasoning. Qwen3 VL 235B leads at 70.7%, while a human would score around 95% on the same set of tasks. The gap isn't about knowledge or reasoning sophistication - it's about perceptual grounding. Tasks that require comparing relative depths in a photograph, tracing visual correspondence between two views of an object, or identifying which 3D object a 2D shadow belongs to still trip up current models at a rate that should give pause to anyone building systems where physical-world perception is safety-critical.

Worth noting: a 2026 paper showed BLINK scores are sensitive to design choices like visual marker size and style, which can reorder model rankings. The absolute scores should be read with some skepticism.

MathVista: Qwen-VL Beats the Frontier

The MathVista result is striking enough to call out directly. Qwen3-VL 72B scores 85.8% on MathVista, higher than any frontier proprietary model in this comparison. Gemini 3 Pro (82.3%) and GPT-5.4 (78.4%) trail it. Mathematical visual reasoning - reading geometry figures, interpreting function graphs, extracting data from statistical tables - is a domain where the Qwen team has invested heavily. InternVL3-78B (79.0%) also beats GPT-5.4 on this specific benchmark.

For applications involving quantitative analysis from images - financial charts, scientific figures, engineering diagrams - the assumption that proprietary frontier models are always the best choice doesn't hold. The open-source alternatives are worth assessing on your actual task distribution.

[Image: close-up of a smartphone camera lens - source: pexels.com] Increasing maximum image input resolution was a key decision for Opus 4.7 - from 1.15MP to 3.75MP - directly improving performance on dense visual content.


Practical Guidance

Best for Enterprise Document Processing

GPT-5.4 is the clear choice if your pipeline mostly processes invoices, contracts, forms, and receipts. Its 95% on DocVQA and 92.5% on ChartQA combined with strong language model capability for following complex extraction instructions makes it the most reliable option for production document workflows. The pricing isn't cheap, but document processing use cases usually justify it on value recovered per document.

Qwen3-VL 72B is the open-source alternative worth evaluating seriously. At 96.5% on DocVQA it matches or beats every proprietary model, and it can be self-hosted on infrastructure you control - which matters for document workflows involving sensitive data.
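For teams evaluating that self-hosted route, a minimal sketch of the common pattern - serving the model behind an OpenAI-compatible endpoint (vLLM and similar inference servers expose one) and sending a scanned document for structured extraction - looks roughly like this. The base URL, model identifier, and prompt are placeholders, not a tested configuration:

```python
import base64
from openai import OpenAI  # pip install openai

# Point the client at a locally hosted, OpenAI-compatible server (vLLM, etc.).
# Base URL, API key, and model identifier below are placeholders/assumptions.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local-no-key")

with open("invoice_scan.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-72B-Instruct",  # hypothetical model ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract the invoice number, total amount, and due date. "
                     "Return JSON with keys invoice_number, total, due_date."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    temperature=0,  # deterministic output makes downstream validation easier
)
print(response.choices[0].message.content)
```

Pinning temperature to 0 and requesting a fixed JSON shape makes failures easy to catch with schema validation before they reach downstream systems.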

Best for Scientific Research Assistance

Claude Opus 4.7 takes the top spot for scientific figure analysis. If your application needs to extract insights from arXiv-style charts, interpret experimental results, or process multi-panel research figures, the CharXiv-R result at 91.0% is the most relevant signal available. The extended resolution also helps with poster presentations and conference slides where content density is high.

Best for Mathematical and Technical Reasoning from Images

Qwen3-VL 72B scores highest on MathVista (85.8%) among all models with confirmed scores. For geometry problems, quantitative chart analysis, or any task requiring mathematical inference from a visual input, this is the model to test first. InternVL3-78B (79.0%) is a strong second and has a more mature deployment ecosystem.

Budget Option

Llama 4 Maverick from Meta scores 73.7% on MathVista, 94.4% on DocVQA, and 90% on ChartQA - numbers that would have been frontier-competitive a year ago at far higher prices. It's open-weight and deployable via standard inference infrastructure. Our Llama 4 Maverick review covers performance in more depth.


Methodology Caveats

Two issues specific to visual reasoning benchmarks are worth acknowledging. First, several of these benchmarks - especially DocVQA and ChartQA - are now in a range where small differences in evaluation configuration (prompt format, few-shot examples, extraction parsing) can swing results by 1-3 points. Numbers from different evaluators shouldn't be compared directly. The scores in this table are sourced from provider-reported results where available and independent evaluators where not, which introduces some inconsistency.
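To make the extraction-parsing point concrete, here is a deliberately exaggerated toy example (not any benchmark's actual scoring code) showing how strict exact-match grading and a light normalization pass can score the same model outputs very differently. On a full benchmark the same effect shows up as the 1-3 point swings described above:

```python
import re

# Toy model outputs vs. gold answers. Deliberately exaggerated - real DocVQA/
# ChartQA harnesses each apply their own (and differing) normalization rules.
predictions = ["The total is $1,250", "42%", "approx. 3.5", "Paris"]
gold        = ["1250",                "42",  "3.5",         "paris"]

def strict(pred: str, ans: str) -> bool:
    # Exact string match, no cleanup.
    return pred.strip() == ans.strip()

def normalized(pred: str, ans: str) -> bool:
    # Lowercase, strip currency/percent symbols and commas, drop filler words,
    # then check whether the gold answer survives as a token.
    clean = re.sub(r"[,$%]", "", pred.lower())
    clean = re.sub(r"\b(the|total|is|approx\.?)\b", "", clean)
    return ans.lower().strip() in clean.split()

for scorer in (strict, normalized):
    acc = sum(scorer(p, a) for p, a in zip(predictions, gold)) / len(gold)
    print(f"{scorer.__name__}: {acc:.0%}")
# strict:     0%   - every answer is marked wrong under exact match
# normalized: 100% - the same outputs all pass after light cleanup
```

Which of these two graders a leaderboard uses is rarely stated up front, which is exactly why cross-evaluator comparisons need care.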

Second, MMMU contamination is a genuine concern. The original MMMU questions are widely available online, and models trained after 2024 on broad web datasets will have seen some fraction of them. MMMU-Pro and CharXiv use harder-to-replicate content, but no benchmark is immune. This is why the field is developing new evaluations like OCRBench v2, which used human-verified pairs not published before the benchmark's release.

For more context on how to read AI benchmark data critically, see our guide to understanding AI benchmarks.


FAQ

Which model is best for reading scanned documents?

Qwen3-VL 72B posts the highest DocVQA score in this comparison at 96.5% and is the best option for sensitive or self-hosted workflows. Among proprietary models, GPT-5.4 leads at 95%, making it the top managed-API choice for scanned document extraction.

Is MMMU still useful for comparing models in 2026?

MMMU-Pro is the more useful variant. The original MMMU is too saturated above 80% to differentiate frontier models. MMMU-Pro, with 10-option questions and harder visual reasoning requirements, still provides meaningful signal.

Why does Qwen-VL beat frontier models on some benchmarks?

Qwen3-VL has been optimized specifically for document and mathematical visual tasks with targeted training data. On MathVista (85.8%) and DocVQA (96.5%) it outperforms all closed-source models. On broader academic multi-discipline benchmarks like MMMU-Pro, proprietary frontier models maintain a lead.

How is CharXiv different from ChartQA?

ChartQA uses presentation-style charts and is now cleared at 90%+ by most models. CharXiv uses charts from actual arXiv scientific papers, which are significantly denser and less standardized. CharXiv-R specifically tests multi-step reasoning across full charts rather than single-fact lookup questions.

Why do models still struggle with BLINK?

BLINK tests core visual perception - depth estimation, visual correspondence, multi-view reasoning, and similar tasks humans solve instantly. Models still score far below human performance (70% vs. 95%), making BLINK one of the few benchmarks where frontier models clearly fail at something humans find trivial.

How often should I re-check these rankings?

Visual reasoning is one of the faster-moving benchmark categories right now. Claude Opus 4.7's resolution upgrade and Qwen3-VL's document improvements both shipped in early 2026. Check this leaderboard quarterly - major capability jumps are still happening at multi-month intervals.


✓ Last verified April 17, 2026

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.