MMLU-Pro Leaderboard: Graduate-Level Knowledge Rankings
Complete MMLU-Pro benchmark rankings measuring graduate-level knowledge across 14 subjects, with roughly 12,000 questions and 10 answer options per question.

MMLU-Pro is the definitive benchmark for measuring how well an AI model understands and reasons about expert-level knowledge across a wide range of academic disciplines. It is the successor to the original MMLU (Massive Multitask Language Understanding), which became the most widely cited AI benchmark of the GPT-4 era but eventually grew too easy to meaningfully differentiate top models. MMLU-Pro was built to fix that and remains one of the most informative single benchmarks available.
What Makes MMLU-Pro Different from MMLU
The original MMLU, released in 2020, consisted of 14,042 multiple-choice questions across 57 subjects with 4 answer options each. It was groundbreaking at the time, but by early 2025, frontier models were scoring above 90%, and the benchmark was losing its ability to separate the best from the very best.
MMLU-Pro addresses these limitations in three important ways:
More answer options. Each question has 10 answer choices instead of 4, which drops the random-guessing baseline from 25% to 10% and makes elimination strategies much less effective (see the short sketch below). Models need genuine understanding, not just the ability to rule out obviously wrong answers.
Harder questions. The 12,000+ questions in MMLU-Pro are filtered to be more challenging, requiring multi-step reasoning rather than simple recall. Many questions combine concepts from different parts of a subject, demanding integration of knowledge rather than isolated fact retrieval.
14 focused subjects. Rather than spreading thin across 57 subjects, MMLU-Pro concentrates on 14 disciplines where question quality can be more carefully controlled: biology, business, chemistry, computer science, economics, engineering, health, history, law, math, philosophy, physics, psychology, plus a catch-all "other" category.
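To make the guessing arithmetic concrete, here is a minimal Python sketch that computes the random-guessing baseline for 4 versus 10 options and a simple chance-adjusted accuracy. The chance adjustment is our own illustration, not part of official MMLU-Pro scoring, which reports plain accuracy.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy from uniform random guessing."""
    return 1.0 / num_options

def chance_adjusted(score: float, num_options: int) -> float:
    """Rescale raw accuracy so random guessing maps to 0.0 and a perfect score to 1.0."""
    floor = chance_accuracy(num_options)
    return (score - floor) / (1.0 - floor)

# Original MMLU: 4 options  -> 25% guessing floor
# MMLU-Pro:     10 options  -> 10% guessing floor
print(chance_accuracy(4), chance_accuracy(10))   # 0.25 0.1

# The same raw score means more when the floor is lower.
print(round(chance_adjusted(0.90, 4), 3))   # 0.867 above-chance share on 4-option questions
print(round(chance_adjusted(0.90, 10), 3))  # 0.889 above-chance share on 10-option questions
```

The takeaway: a 90% raw score sits further above chance on 10-option questions than on 4-option ones, which is part of why MMLU-Pro separates strong models more cleanly.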
MMLU-Pro Rankings
| Rank | Model | Provider | MMLU-Pro (Overall) | Best Subject | Worst Subject |
|---|---|---|---|---|---|
| 1 | Gemini 3 Pro | Google DeepMind | 89.8% | Physics (93.1%) | Law (84.2%) |
| 2 | Claude Opus 4.5 Reasoning | Anthropic | 89.5% | Math (94.2%) | History (83.8%) |
| 3 | GPT-5.2 Pro | OpenAI | 88.7% | Chemistry (92.5%) | Philosophy (83.5%) |
| 4 | Claude Opus 4.6 | Anthropic | 88.2% | CS (93.8%) | Economics (82.9%) |
| 5 | Gemini 2.5 Pro | Google DeepMind | 87.5% | Physics (91.8%) | Law (82.1%) |
| 6 | Claude Opus 4.5 | Anthropic | 87.1% | CS (92.1%) | History (82.5%) |
| 7 | Grok 4 Heavy | xAI | 86.4% | Physics (91.2%) | Law (80.5%) |
| 8 | GPT-5.2 | OpenAI | 86.3% | Math (90.5%) | History (81.3%) |
| 9 | DeepSeek V3.2-Speciale | DeepSeek | 85.9% | Math (92.8%) | Philosophy (79.2%) |
| 10 | Qwen 3.5 | Alibaba | 84.6% | Engineering (89.5%) | Philosophy (78.1%) |
Subject Breakdown: Where Models Excel and Struggle
The subject-level data reveals fascinating patterns about how different models encode and reason about knowledge.
STEM Subjects
Math, physics, chemistry, and computer science are where the highest scores appear. This makes sense: these subjects have well-defined answers and verifiable reasoning chains, which suit the step-by-step problem solving that current models are trained to produce. The top models now score above 90% in their best STEM subjects, approaching the performance of subject-matter experts.
| Model | Math | Physics | Chemistry | CS | Engineering |
|---|---|---|---|---|---|
| Gemini 3 Pro | 92.3% | 93.1% | 91.5% | 91.2% | 90.8% |
| Claude Opus 4.5 Reasoning | 94.2% | 91.8% | 90.1% | 90.5% | 89.2% |
| GPT-5.2 Pro | 91.8% | 91.5% | 92.5% | 89.8% | 89.5% |
| DeepSeek V3.2-Speciale | 92.8% | 88.5% | 87.2% | 88.1% | 86.3% |
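For readers who want a single STEM number per model, the short sketch below averages the rows of the table above. The unweighted mean is our own convenience metric; MMLU-Pro's published overall score is computed across all questions, so subjects with more questions carry more weight.

```python
# STEM subject scores (percent) copied from the table above:
# math, physics, chemistry, CS, engineering.
stem_scores = {
    "Gemini 3 Pro":              [92.3, 93.1, 91.5, 91.2, 90.8],
    "Claude Opus 4.5 Reasoning": [94.2, 91.8, 90.1, 90.5, 89.2],
    "GPT-5.2 Pro":               [91.8, 91.5, 92.5, 89.8, 89.5],
    "DeepSeek V3.2-Speciale":    [92.8, 88.5, 87.2, 88.1, 86.3],
}

# Unweighted mean across the five STEM subjects for each model.
for model, scores in stem_scores.items():
    print(f"{model}: {sum(scores) / len(scores):.1f}% STEM average")
```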
Humanities and Social Sciences
History, philosophy, law, and economics consistently produce the lowest scores across all models. These subjects often require nuanced interpretation, understanding of context and precedent, and reasoning about ambiguous or contested concepts. The gap between STEM and humanities performance is typically 8-12 percentage points for top models.
This gap matters because it reveals a systematic limitation: current language models are better at formal, well-defined reasoning than at the kind of interpretive, context-dependent thinking that humanities demand.
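A quick sanity check on that range, using only the best/worst figures already listed in the rankings table: for every top model the best subject is STEM and the worst is a humanities or social-science subject, so the best-minus-worst spread is a rough proxy for the gap. A minimal sketch:

```python
# Best/worst subject scores from the rankings table above.
best_worst = {
    "Gemini 3 Pro":              (93.1, 84.2),  # Physics vs Law
    "Claude Opus 4.5 Reasoning": (94.2, 83.8),  # Math vs History
    "GPT-5.2 Pro":               (92.5, 83.5),  # Chemistry vs Philosophy
    "Claude Opus 4.6":           (93.8, 82.9),  # CS vs Economics
}

for model, (best, worst) in best_worst.items():
    print(f"{model}: {best - worst:.1f} point spread")
```

The spreads come out between roughly 9 and 11 points for the top four models, consistent with the 8-12 point range above.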
Why MMLU-Pro Matters
MMLU-Pro serves as the closest thing to a comprehensive IQ test for AI models. While no single benchmark can capture everything, MMLU-Pro's breadth across 14 subjects and the difficulty of its questions make it uniquely informative.
For researchers, MMLU-Pro provides a standardized way to measure knowledge and reasoning improvements across model generations. The jump from GPT-4's roughly 72% to GPT-5.2 Pro's 88.7% on the same benchmark provides a concrete measure of progress.
For practitioners, MMLU-Pro scores correlate well with real-world performance on knowledge-intensive tasks. If you need a model to assist with legal research, medical diagnosis support, financial analysis, or scientific literature review, MMLU-Pro gives you a reasonable first approximation of which model will perform best.
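As a concrete example of that first approximation, the sketch below picks the highest-scoring model for a given subject from a per-subject score table. The scores are the STEM figures quoted earlier in this article, and the helper is purely illustrative; a real shortlist would also weigh cost, latency, and domain-specific evaluations.

```python
# Per-subject scores (percent) for a few models, taken from the STEM table earlier.
# Extend this dict with law, health, or economics columns as needed for your task.
subject_scores = {
    "Gemini 3 Pro":              {"math": 92.3, "physics": 93.1, "chemistry": 91.5, "cs": 91.2},
    "Claude Opus 4.5 Reasoning": {"math": 94.2, "physics": 91.8, "chemistry": 90.1, "cs": 90.5},
    "GPT-5.2 Pro":               {"math": 91.8, "physics": 91.5, "chemistry": 92.5, "cs": 89.8},
}

def best_model_for(subject: str) -> tuple[str, float]:
    """Return the model with the highest published score on the given subject."""
    return max(((m, s[subject]) for m, s in subject_scores.items()), key=lambda x: x[1])

print(best_model_for("chemistry"))  # GPT-5.2 Pro leads chemistry in this table
print(best_model_for("math"))       # Claude Opus 4.5 Reasoning leads math
```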
For the broader AI community, MMLU-Pro serves as a calibration point. When someone says a model has "graduate-level knowledge," MMLU-Pro provides the evidence. The benchmark's questions genuinely require the kind of reasoning expected of a graduate student, with multi-step problems that demand integrating concepts from across a discipline.
The Saturation Question
Are we approaching saturation on MMLU-Pro the way we did on the original MMLU? With the top model at 89.8%, there is still meaningful headroom. The 10 answer options and multi-step reasoning requirements make it substantially harder to reach the ceiling. That said, if progress continues at the current pace, we may see 95%+ scores by late 2026, at which point the community will need yet another successor benchmark.
The pattern is clear: as AI models improve, benchmarks must evolve to keep pace. MMLU gave way to MMLU-Pro, and eventually MMLU-Pro will give way to something harder still. For now, it remains one of the most useful single numbers for comparing language model capability.