Reasoning Benchmarks: GPQA, AIME, and Humanity's Last Exam

Rankings of AI models on the hardest reasoning benchmarks available: GPQA Diamond, AIME competition math, and the notoriously difficult Humanity's Last Exam.

Reasoning is the capability that separates a sophisticated autocomplete engine from a genuinely intelligent system. The hardest reasoning benchmarks are designed to resist memorization and pattern matching, forcing models to actually think through novel problems. This leaderboard covers three of the most challenging: GPQA Diamond (expert-level science), AIME (competition mathematics), and Humanity's Last Exam (the hardest AI benchmark ever created).

The Benchmarks Explained

GPQA Diamond

GPQA (Graduate-Level Google-Proof Q&A) Diamond is a set of 198 extremely difficult multiple-choice questions written by domain experts in physics, chemistry, and biology. The "Google-Proof" in the name is literal: even expert humans with PhDs in the relevant field and unrestricted internet access achieve only about 81% accuracy. Non-expert humans with internet access score around 34%, not far above the 25% random-chance baseline on these four-option questions.

These are not trivia questions. Each one requires multi-step expert reasoning. A chemistry question might require applying thermodynamic principles to a novel molecular system. A physics question might demand deriving a result from first principles under unusual boundary conditions.
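
For readers who want to see what these accuracy numbers actually measure, here is a minimal sketch of how a GPQA-Diamond-style run is typically scored: plain accuracy over the 198 four-option questions, compared against the human baselines quoted above. The data structures and function names are illustrative, not any particular evaluation harness's API.

```python
from dataclasses import dataclass

@dataclass
class MCQuestion:
    prompt: str
    options: list[str]        # GPQA questions have exactly four options
    answer_index: int         # index of the correct option

def score_multiple_choice(questions: list[MCQuestion], predictions: list[int]) -> float:
    """Fraction of questions where the predicted option index matches the key."""
    correct = sum(1 for q, p in zip(questions, predictions) if p == q.answer_index)
    return correct / len(questions)

# Reference points quoted in the text above.
RANDOM_BASELINE = 1 / 4        # 25% chance on four-option questions
EXPERT_BASELINE = 0.81         # PhD experts with internet access
NON_EXPERT_BASELINE = 0.34     # skilled non-experts with internet access
```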

AIME (American Invitational Mathematics Examination)

AIME is a prestigious math competition for high school students who score in roughly the top 5% on the AMC 12. Each exam has 15 problems to be solved in three hours, and every answer is an integer from 0 to 999, so unlike multiple-choice tests there is effectively no opportunity to guess. Problems demand creative mathematical reasoning, and the hardest can take a talented human contestant 20 to 40 minutes each.

AIME tests exactly the kind of multi-step mathematical reasoning that has historically been hardest for AI: setting up equations from word problems, finding clever substitutions, applying theorems in non-obvious ways, and chaining together multiple reasoning steps without error.
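
Scoring AIME-style evals is simple to sketch: pull an integer in the 0-999 range out of the model's output and grade by exact match, with no partial credit. The extraction regex and function names below are illustrative assumptions, not any official grader.

```python
import re

def extract_aime_answer(model_output: str) -> int | None:
    """Pull the last standalone 1-3 digit integer from the model's output, if any."""
    matches = re.findall(r"\b\d{1,3}\b", model_output)
    return int(matches[-1]) if matches else None

def score_aime(gold_answers: list[int], outputs: list[str]) -> float:
    """Exact-match accuracy: a wrong setup anywhere in the chain means zero credit."""
    correct = sum(
        1 for gold, out in zip(gold_answers, outputs)
        if extract_aime_answer(out) == gold
    )
    return correct / len(gold_answers)
```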

Humanity's Last Exam (HLE)

HLE was designed with one goal: create a benchmark that no AI system could score well on. It contains 3,000 questions contributed by over 1,000 subject-matter experts from 500+ institutions, spanning everything from advanced mathematics to obscure corners of the humanities to cutting-edge scientific research. Questions are intentionally designed to require deep expertise and novel reasoning, with answers that cannot simply be retrieved from training data.

When HLE launched, the best AI models scored in the single digits. Any progress on this benchmark represents genuine advancement in reasoning capability.
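
The two HLE columns in the table below come from its mix of question formats. A rough sketch of how an overall score and an exact-match-only score can be computed, assuming a simple split into multiple-choice and open-ended items (the field names and string-match grading rule here are assumptions, not HLE's official grader):

```python
from dataclasses import dataclass

@dataclass
class HLEItem:
    question_type: str   # "multiple_choice" or "exact_match"
    gold: str
    prediction: str

def hle_scores(items: list[HLEItem]) -> dict[str, float]:
    def accuracy(subset: list[HLEItem]) -> float:
        if not subset:
            return 0.0
        return sum(i.prediction.strip() == i.gold.strip() for i in subset) / len(subset)

    # "Overall" averages over every question; "Exact Match" looks only at the
    # open-ended subset, where there are no options to eliminate.
    open_ended = [i for i in items if i.question_type == "exact_match"]
    return {
        "overall": accuracy(items),
        "exact_match": accuracy(open_ended),
    }
```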

Reasoning Rankings

| Rank | Model | Provider | GPQA Diamond | AIME 2025 | HLE (Overall) | HLE (Exact Match) |
|------|-------|----------|--------------|-----------|---------------|-------------------|
| 1 | GPT-5.2 Pro | OpenAI | 93.2% | 100% | 26.3% | 18.2% |
| 2 | Grok 4 Heavy | xAI | 88.9% | 96.7% | 50.0% | 10.0% |
| 3 | Claude Opus 4.6 | Anthropic | 89.0% | 93.3% | 32.1% | 20.5% |
| 4 | GPT-5.2 | OpenAI | 88.0% | 86.7% | 21.5% | 14.8% |
| 5 | Gemini 3 Pro | Google DeepMind | 87.5% | 90.0% | 24.8% | 15.1% |
| 6 | Claude Opus 4.5 Reasoning | Anthropic | 86.5% | 86.7% | 19.2% | 12.4% |
| 7 | DeepSeek V3.2-Speciale | DeepSeek | 85.3% | 96.7% | 18.5% | 11.8% |
| 8 | Qwen 3.5 | Alibaba | 82.1% | 83.3% | 14.2% | 9.1% |
| 9 | Grok 4 | xAI | 84.2% | 80.0% | 16.8% | 10.5% |
| 10 | Llama 4 Maverick | Meta | 78.5% | 66.7% | 8.3% | 5.2% |
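
For convenience, here is the same leaderboard expressed as plain data, which makes it easy to re-rank by whichever metric matters for your use case. Only a few rows are reproduced for brevity; the numbers are copied from the table above, and the structure itself is just a sketch.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    model: str
    provider: str
    gpqa_diamond: float
    aime_2025: float
    hle_overall: float
    hle_exact: float

LEADERBOARD = [
    Entry("GPT-5.2 Pro", "OpenAI", 93.2, 100.0, 26.3, 18.2),
    Entry("Grok 4 Heavy", "xAI", 88.9, 96.7, 50.0, 10.0),
    Entry("Claude Opus 4.6", "Anthropic", 89.0, 93.3, 32.1, 20.5),
    Entry("Gemini 3 Pro", "Google DeepMind", 87.5, 90.0, 24.8, 15.1),
]

# Re-rank by whichever benchmark matches your needs, e.g. HLE exact match:
for e in sorted(LEADERBOARD, key=lambda e: e.hle_exact, reverse=True):
    print(f"{e.model:<18} {e.hle_exact:.1f}%")
```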

Analysis

GPT-5.2 Pro: The Reasoning King

GPT-5.2 Pro's 93.2% on GPQA Diamond is a remarkable achievement. This means the model now outperforms human PhD experts who have access to the internet. Its perfect score on AIME 2025 (solving all 30 problems correctly across both AIME I and II) puts it at a level that would qualify for the International Mathematical Olympiad team in many countries. OpenAI's investment in chain-of-thought reasoning and extended computation at inference time has clearly paid dividends.

Grok 4 Heavy: The HLE Surprise

The most striking number on this table is Grok 4 Heavy's 50% overall score on Humanity's Last Exam. When this benchmark was released, experts predicted it would take years before any model reached 50%. xAI's approach, which involves extensive self-play and reasoning chain verification, appears to have made a breakthrough on the kinds of cross-domain expert questions that HLE emphasizes. However, its exact match rate of 10% on open-ended questions suggests that much of this performance comes from the multiple-choice subset.
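
A back-of-envelope calculation shows why. If the overall HLE number were a simple question-weighted average of a multiple-choice subset and an open-ended subset (the real split and weighting are not given here, so the 50/50 figure below is purely hypothetical), a 10% exact-match rate would force the implied multiple-choice accuracy very high:

```python
def implied_mc_accuracy(overall: float, exact_match: float, open_ended_share: float) -> float:
    """Solve overall = mc_share * acc_mc + open_share * acc_open for acc_mc."""
    mc_share = 1.0 - open_ended_share
    return (overall - open_ended_share * exact_match) / mc_share

# Hypothetical 50/50 split: the multiple-choice subset would need ~90% accuracy.
print(implied_mc_accuracy(overall=0.50, exact_match=0.10, open_ended_share=0.5))  # 0.9
```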

Claude Opus 4.6: Balanced and Strong

Claude Opus 4.6 posts the highest HLE exact match score at 20.5%, meaning it gives the most accurate precise answers on open-ended questions. This is arguably the purest measure of reasoning, since there are no answer options to eliminate. Its balanced performance across all three benchmarks reflects broad reasoning capability rather than specialization on any single test type.

The HLE Gap Is Still Enormous

Even the best models solve only about a quarter to half of Humanity's Last Exam. This benchmark remains the clearest reminder that current AI systems, despite their impressive capabilities, still have significant limitations in deep reasoning. The gap between GPQA Diamond scores (where top models exceed 88%) and HLE scores (where even the leader only just reaches 50%) shows how much harder truly novel, cross-domain expert reasoning is compared to within-domain expert reasoning.

Why Reasoning Benchmarks Matter

Reasoning ability underpins virtually every high-value application of AI. A model that reasons well can serve as a research assistant, a tutor, a strategic advisor, or an autonomous agent. The progression from models that could barely solve algebra to models that ace competition mathematics in under two years represents one of the fastest capability gains in the history of technology.

For practitioners, these benchmarks help identify which models to trust for high-stakes analytical tasks. If your application requires reliable scientific reasoning, you want a model that scores well on GPQA Diamond. If it requires creative mathematical problem-solving, AIME scores are your guide. And if you need a model that can handle genuinely novel challenges outside its training distribution, HLE performance is the most informative signal available.
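
As a toy illustration of that advice, the snippet below picks a model by weighting whichever benchmark matches the task. The scores are copied from the leaderboard above, while the task-to-weight mapping is an assumption for illustration, not a recommendation.

```python
SCORES = {
    "GPT-5.2 Pro":     {"gpqa": 93.2, "aime": 100.0, "hle": 26.3},
    "Grok 4 Heavy":    {"gpqa": 88.9, "aime": 96.7,  "hle": 50.0},
    "Claude Opus 4.6": {"gpqa": 89.0, "aime": 93.3,  "hle": 32.1},
}

# Hypothetical mapping from task type to benchmark weights.
TASK_WEIGHTS = {
    "scientific_reasoning": {"gpqa": 1.0, "aime": 0.0, "hle": 0.0},
    "competition_math":     {"gpqa": 0.0, "aime": 1.0, "hle": 0.0},
    "novel_problems":       {"gpqa": 0.0, "aime": 0.0, "hle": 1.0},
}

def best_model_for(task: str) -> str:
    """Return the model with the highest benchmark score weighted for the task."""
    weights = TASK_WEIGHTS[task]
    return max(
        SCORES,
        key=lambda model: sum(weights[k] * SCORES[model][k] for k in weights),
    )

print(best_model_for("novel_problems"))  # "Grok 4 Heavy"
```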

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.