Multimodal AI Leaderboard: Vision, Video, and Beyond
Rankings of the best multimodal AI models for image understanding, video analysis, and visual reasoning, covering MMMU-Pro, Video-MMMU, and more.

The era of text-only AI models is over. Every major frontier model now supports image input, and the best ones understand video, diagrams, charts, screenshots, and handwritten notes with remarkable accuracy. Multimodal capability has gone from a novelty feature to a core requirement, and the benchmarks measuring it have evolved accordingly. This leaderboard covers the state of the art in visual AI understanding.
The Benchmarks Explained
MMMU-Pro (Massive Multi-discipline Multimodal Understanding - Pro)
MMMU-Pro is the successor to the original MMMU benchmark, designed to be significantly harder and more resistant to shortcuts. It consists of 3,460 questions spanning academic disciplines from art and engineering to science and medicine, each requiring understanding of complex images such as diagrams, charts, photographs, and technical figures. Questions demand genuine visual reasoning, not just image description.
The "Pro" version introduced three key improvements over the original: augmented candidate options (10 choices instead of 4, reducing the impact of guessing), a vision-only input setting (removing textual cues that let models bypass the image), and a stricter evaluation protocol. Where the original MMMU saw top models scoring above 70%, MMMU-Pro brought those scores back down to earth.
Video-MMMU
Video-MMMU extends the multimodal evaluation paradigm to video understanding. It contains questions that require watching and comprehending video content across academic and professional domains. Models must process temporal information, understand sequences of events, and integrate visual information across multiple frames. This is substantially harder than single-image understanding because it requires tracking context over time.
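The benchmark itself does not prescribe how video is ingested, but a common pattern when feeding video to a multimodal model is uniform frame sampling. The sketch below assumes OpenCV (opencv-python) is installed; the file path and frame count are placeholders.

```python
import cv2  # pip install opencv-python

def sample_frames(video_path: str, num_frames: int = 16):
    """Uniformly sample frames from a video, returned as RGB arrays."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame_bgr = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

# The sampled frames are then passed to the model alongside the question text;
# frontier APIs increasingly accept raw video files directly, making this step optional.
frames = sample_frames("lecture.mp4", num_frames=16)  # placeholder path
```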
Other Key Benchmarks
MathVista tests mathematical reasoning from visual inputs like graphs, geometry diagrams, and data visualizations. ChartQA evaluates the ability to extract and reason about data from charts. DocVQA measures document understanding from scanned or photographed documents.
Multimodal Rankings
| Rank | Model | Provider | MMMU-Pro | Video-MMMU | MathVista | ChartQA |
|---|---|---|---|---|---|---|
| 1 | Gemini 3 Pro | Google DeepMind | 81.0% | 87.6% | 82.3% | 89.1% |
| 2 | GPT-5.2 Pro | OpenAI | 78.5% | 82.1% | 80.8% | 87.5% |
| 3 | Claude Opus 4.6 | Anthropic | 76.8% | 78.5% | 79.2% | 86.8% |
| 4 | Gemini 2.5 Pro | Google DeepMind | 75.2% | 80.3% | 77.5% | 85.6% |
| 5 | Grok 4 Heavy | xAI | 73.6% | 74.2% | 76.1% | 83.2% |
| 6 | GPT-5.2 | OpenAI | 72.8% | 76.8% | 74.5% | 84.1% |
| 7 | Qwen 3.5 VL | Alibaba | 71.5% | 72.1% | 73.8% | 82.5% |
| 8 | Claude Opus 4.5 | Anthropic | 70.2% | 71.6% | 72.4% | 81.3% |
| 9 | DeepSeek VL 3 | DeepSeek | 68.4% | 65.8% | 70.1% | 78.2% |
| 10 | Llama 4 Maverick VL | Meta | 64.2% | 60.5% | 66.3% | 74.8% |
Analysis
Gemini 3 Pro Dominates Vision
Google DeepMind's Gemini 3 Pro is the clear leader in multimodal understanding, topping every benchmark in our table. Its 81% on MMMU-Pro represents a major leap, and its 87.6% on Video-MMMU is particularly impressive given how challenging video understanding is. This dominance reflects Google's foundational advantage: Gemini was designed as a multimodal model from the ground up, trained on images, video, and text jointly rather than having vision bolted on after the fact.
The practical implications are significant. For applications involving document processing, visual data analysis, video understanding, or any task where images are central to the workflow, Gemini 3 Pro offers a meaningful edge over competitors.
The Video Understanding Gap
Video-MMMU scores reveal an interesting stratification. The three leaders on this benchmark (Gemini 3 Pro, GPT-5.2 Pro, and Gemini 2.5 Pro) all score above 80%, while the rest of the field sits between roughly 60% and 79%. Video understanding requires processing hundreds or thousands of frames and maintaining context across them, a capability that demands both architectural support and massive computational resources during training. Models that were primarily trained for text and single-image understanding struggle to keep up.
Open-Source Multimodal Models Lag Behind
The gap between proprietary and open-source multimodal models is larger than in text-only benchmarks. Llama 4 Maverick VL and DeepSeek VL 3, while capable, trail the best proprietary models by 13-17 percentage points on MMMU-Pro. Multimodal training requires enormous datasets of high-quality image-text pairs, and proprietary labs have significant advantages in data curation and computational resources for this purpose.
Why MMMU-Pro Replaced MMMU
The original MMMU benchmark, released in late 2023, quickly became saturated. By mid-2025, top models were scoring above 75%, and researchers discovered that some questions could be answered correctly without even looking at the image. MMMU-Pro addresses these flaws by expanding answer options from 4 to 10 (cutting the chance baseline from 25% to 10%), filtering out questions that text-only models could answer, and adding a vision-only setting in which the question itself is embedded in the image.
The result is a much harder benchmark. Models that scored 75% on MMMU typically score 15-20 percentage points lower on MMMU-Pro, giving the community a clearer picture of actual visual understanding capability.
Practical Applications
Multimodal capability unlocks applications that pure text models cannot touch:
- Document processing: Extracting data from invoices, receipts, contracts, and forms (sketched at the end of this section)
- Medical imaging: Assisting with interpretation of X-rays, pathology slides, and medical charts
- Data analysis: Understanding charts, graphs, and dashboards without manual data extraction
- Accessibility: Describing images and video content for visually impaired users
- Education: Explaining diagrams, solving geometry problems, and interpreting scientific figures
- Quality assurance: Inspecting products, detecting defects, and verifying designs
For these use cases, benchmark performance translates directly into practical value, and the current generation of multimodal models has reached a level of competence that makes real-world deployment viable for many of them.
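To make the document-processing use case concrete, here is a minimal sketch of invoice extraction using the OpenAI Python SDK's image input. The model name, file path, and prompt are placeholders, and the equivalent pattern exists in the Gemini and Anthropic SDKs.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_invoice_fields(image_path: str, model: str = "gpt-4o") -> str:
    """Ask a multimodal model to pull structured fields out of an invoice image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,  # placeholder; substitute whichever multimodal model you deploy
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract the vendor, date, line items, and total as JSON."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(extract_invoice_fields("invoice.png"))  # placeholder path
```

In production you would typically constrain the output with a JSON schema and validate the extracted totals, but the core call is no more involved than this.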