Best AI Models for Math Reasoning - March 2026
GPT-5.2 and Claude Opus 4.6 both score 100% on AIME 2025, while Gemini 3.1 Pro leads GPQA Diamond at 94.3% for PhD-level scientific reasoning.

TL;DR
- GPT-5.2 and Claude Opus 4.6 both hit 100% on AIME 2025, saturating the benchmark completely
- Gemini 3.1 Pro leads GPQA Diamond at 94.3%, the best score on PhD-level science questions, at just $2/$12 per million tokens
- Open-weight models like Kimi K2.5 (96.1% AIME) and DeepSeek R1 (97.3% MATH-500) compete within striking distance of closed-source leaders
Math reasoning has become the most saturated AI benchmark category in 2026. Both GPT-5.2 and Claude Opus 4.6 reached perfect 100% scores on AIME 2025, the 30-problem competition math exam that once separated frontier models by double digits. The real differentiation now happens on harder benchmarks: Gemini 3.1 Pro leads GPQA Diamond (PhD-level science) at 94.3%, and Gemini 3 Deep Think pushes Humanity's Last Exam to 41%.
For practical math applications, the choice comes down to price. Gemini 3.1 Pro delivers 91.2% on AIME and 94.3% on GPQA Diamond at $2/$12 per million tokens, cheaper on output than GPT-5.2 ($1.75/$14) and far cheaper than Opus 4.6 ($5/$25).
Rankings Table
| Rank | Model | Provider | AIME 2025 | GPQA Diamond | MATH-500 | Price (Input/Output) | Verdict |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5.2 | OpenAI | 100% | 92.4% | 98% | $1.75/$14 | First to hit perfect AIME, strong across all math |
| 2 | Claude Opus 4.6 | Anthropic | 100% | 91.3% | 93% | $5/$25 | Matches GPT-5.2 on AIME, trails on MATH-500 |
| 3 | Gemini 3.1 Pro | Google | 91.2% | 94.3% | - | $2/$12 | GPQA Diamond leader, best price-to-performance |
| 4 | Kimi K2.5 | Moonshot AI | 96.1% | 87.6% | 98% | $1/$5 | Open-weight surprise, matches GPT-5.2 on MATH-500 |
| 5 | Gemini 3 Pro | Google | 95% | 91.9% | - | $1.25/$10 | Previous gen still competitive on competition math |
| 6 | Grok 4 | xAI | 88-95% | 88% | - | $3/$15 | Strong but scores vary by evaluation source |
| 7 | DeepSeek R1 | DeepSeek | 79.8% | - | 97.3% | $0.55/$2.19 | Best open-source MATH-500, weaker on AIME |
| 8 | Claude Opus 4.5 | Anthropic | 84.5% | 88% | - | $5/$25 | Previous gen, superseded by 4.6 |
| 9 | Qwen 3.5 | Alibaba | - | 87.6% | - | $0.50/$2 | Budget open-weight option with strong GPQA |
| 10 | GPT-5.4 | OpenAI | - | 93% | - | $2.50/$20 | Newest OpenAI model, limited math data available |
Detailed Analysis
GPT-5.2 - The First Perfect AIME Score
OpenAI's GPT-5.2 made history in December 2025 by becoming the first model to solve all 30 AIME 2025 problems correctly. That 100% score, combined with 98% on MATH-500 and 92.4% on GPQA Diamond, makes it the most consistently strong math model across benchmark types.
The "xhigh" reasoning effort setting pushes AIME performance to 99-100% but costs substantially more in compute time and tokens. For most users, the default setting delivers 95%+ accuracy on competition-level problems, which is more than sufficient for real applications.
At $1.75/$14 per million tokens, GPT-5.2 is priced lower than Claude Opus 4.6 but higher than Gemini 3.1 Pro. The premium buys you the highest combined score across AIME, MATH-500, and GPQA Diamond. No other model matches its breadth of math performance.
Claude Opus 4.6 - Matching the Top on Competition Math
Claude Opus 4.6 also reached 100% on AIME 2025 with tool access enabled, tying GPT-5.2 for the top competition math score. Its 91.3% on GPQA Diamond and 93% on MATH-500 show strong reasoning across domains.
Where Opus 4.6 separates itself is in explanatory quality. The model produces step-by-step derivations that read like textbook solutions, making it especially useful in educational settings. For pure benchmark performance, GPT-5.2 leads it by five points on MATH-500 (98% vs 93%). For applications where understanding the reasoning chain matters as much as the answer, Opus 4.6 has an advantage.
The $5/$25 pricing makes it the most expensive option in the top five. Teams running high-volume math processing should consider Gemini 3.1 Pro or Kimi K2.5 as cost-effective alternatives.
Gemini 3.1 Pro - The GPQA Diamond Champion
Google's Gemini 3.1 Pro holds the highest GPQA Diamond score at 94.3%, a benchmark of PhD-level science questions on which domain experts score around 70%. That 24-point lead over human experts is the widest gap any model has achieved on this benchmark.
On competition math, Gemini 3.1 Pro scores 91.2% on AIME 2025, behind the perfect scores from GPT-5.2 and Opus 4.6 but still strong enough to place in the top percentile of human test-takers. Google's Gemini 3 Deep Think variant pushes Humanity's Last Exam from 37.5% to 41%, the highest score any model has reached on questions designed to resist AI solution.
At $2/$12, Gemini 3.1 Pro costs 52% less than Opus 4.6 on output tokens ($12 vs $25) and 60% less on input, while delivering the top GPQA Diamond score. For teams focused on scientific reasoning and graduate-level problem solving, it's the clear value pick.
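To put the per-token rates in concrete terms, here's a minimal Python sketch that prices a hypothetical batch workload using the rates from the rankings table above. The token volumes are illustrative placeholders, not measured values.

```python
# Estimate workload cost from per-million-token rates.
# Prices come from the rankings table; token volumes are
# hypothetical placeholders for illustration only.

PRICES = {  # (input $/M tokens, output $/M tokens)
    "GPT-5.2": (1.75, 14.00),
    "Claude Opus 4.6": (5.00, 25.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
    "Kimi K2.5": (1.00, 5.00),
    "DeepSeek R1": (0.55, 2.19),
}

def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a workload for a given model."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Hypothetical batch: 10,000 problems, ~1K input and ~4K output tokens each.
for model in PRICES:
    cost = workload_cost(model, 10_000 * 1_000, 10_000 * 4_000)
    print(f"{model:18s} ${cost:,.2f}")
```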
Kimi K2.5 - Open Weights, Closed Gap
Moonshot AI's Kimi K2.5 scores 96.1% on AIME 2025 and 98% on MATH-500, matching GPT-5.2 on the latter benchmark. As an open-weight model, it can be self-hosted and fine-tuned, offering a flexibility that closed-source models can't provide.
The K2.5 Thinking mode extends reasoning chains for complex problems, using a budget of up to 96K tokens for multi-step derivations. AIME scores were averaged over 32 runs, which provides a more statistically reliable estimate than single-pass evaluations.
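For context on what averaging buys, here's a minimal sketch of the avg@32 estimator: score each independent run over the 30 problems, then report the mean with a standard error. The per-run solve counts below are simulated placeholders, not Kimi's actual run data.

```python
import statistics

# Simulated per-run solve counts out of 30 AIME problems.
# Placeholder values for illustration; a real evaluation would record
# one count per independent sampling run.
run_solved = [29, 28, 29, 30, 28, 29, 29, 28] * 4   # 32 runs

scores = [n / 30 for n in run_solved]
mean = statistics.mean(scores)
stderr = statistics.stdev(scores) / len(scores) ** 0.5

print(f"avg@{len(scores)}: {mean:.3f} ± {stderr:.3f} (1 SE)")
```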
At $1/$5 per million tokens through the API, Kimi K2.5 represents the best value proposition in this category for teams that don't need GPQA Diamond performance. Its 87.6% on GPQA Diamond is roughly 7 points behind Gemini 3.1 Pro, a gap that matters for PhD-level science but rarely affects practical math applications.
DeepSeek R1 - The MATH-500 Specialist
DeepSeek R1 achieved 97.3% on MATH-500, putting it in the same tier as GPT-5.2 and Kimi K2.5 on this benchmark. Built using pure reinforcement learning for reasoning, R1 shows that open-source models can match proprietary ones on structured math problem-solving.
The catch is AIME 2025 performance. At 79.8%, R1 trails the perfect scores by roughly 20 percentage points on competition math, where problems demand more creative mathematical insight. The distilled 32B variant scores 72.6% on AIME, still usable but not competitive with frontier models. At $0.55/$2.19 per million tokens, it remains the cheapest strong math model available.
Methodology
Three benchmarks drive this ranking:
AIME 2025 consists of 30 competition-level math problems from the American Invitational Mathematics Examination. Problems require integer answers between 000 and 999, testing algebra, geometry, number theory, and combinatorics; a minimal grading sketch follows these benchmark descriptions. With multiple models now scoring 95-100%, AIME is approaching saturation as a differentiating benchmark.
GPQA Diamond contains 198 PhD-level science questions designed to stump non-domain experts; even domain experts score only around 70%. It tests deeper scientific reasoning than AIME and remains unsaturated, with the top score at 94.3%.
MATH-500 is a 500-problem subset of the MATH dataset covering competition-level mathematics across six domains. It provides a broader evaluation than AIME but shows similar saturation at the top.
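To make the AIME answer format concrete, here's the grading sketch referenced above: answers are exact integers from 000 to 999, so scoring reduces to parsing and exact match with no partial credit. The predictions and answer key are hypothetical.

```python
def grade_aime(predicted: str, answer: int) -> bool:
    """Exact-match grading for an AIME problem.

    AIME answers are integers from 000 to 999, so there is no partial
    credit: a prediction is correct only if it parses to the exact value.
    """
    try:
        value = int(predicted.strip())
    except ValueError:
        return False
    return 0 <= value <= 999 and value == answer

# 30 problems per exam; score is the fraction answered exactly.
predictions = {1: "204", 2: "73", 3: "1000"}   # hypothetical model outputs
answer_key = {1: 204, 2: 73, 3: 100}           # hypothetical answer key
score = sum(grade_aime(p, answer_key[i]) for i, p in predictions.items())
print(f"{score}/{len(predictions)} correct")
```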
A critical caveat: AIME 2025 scores should be interpreted carefully. Some models (like GPT-5.2) report averages over many runs, while others report best-of-N results. Grok 4's score ranges from 88% to 95% depending on the evaluation methodology, which illustrates how reporting differences can distort rankings.
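The divergence between the two reporting conventions is easy to demonstrate in simulation: when the per-problem solve rate is below 100%, best-of-N climbs toward a perfect score as N grows, while the averaged estimate stays near the true rate. The solve probability below is a hypothetical placeholder, not any model's measured rate.

```python
import random

random.seed(0)
P_SOLVE = 0.85   # hypothetical per-problem solve probability
PROBLEMS, RUNS = 30, 32

# Simulate 32 independent runs over a 30-problem exam.
runs = [[random.random() < P_SOLVE for _ in range(PROBLEMS)] for _ in range(RUNS)]

# avg@N: mean accuracy across runs (what averaged reporting measures).
avg_n = sum(sum(run) for run in runs) / (PROBLEMS * RUNS)

# best-of-N: a problem counts if ANY run solved it (inflates the score).
best_n = sum(any(run[i] for run in runs) for i in range(PROBLEMS)) / PROBLEMS

print(f"avg@{RUNS}:     {avg_n:.1%}")   # stays near 85%
print(f"best-of-{RUNS}: {best_n:.1%}")  # approaches 100%
```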
Humanity's Last Exam, where the top score stands at 41%, is emerging as the next frontier for differentiating math reasoning capability. We'll feature it more prominently as more models report scores.
Historical Progression
March 2025 - OpenAI's o1 led AIME at roughly 83%. GPQA Diamond top score sat around 78%. Math reasoning was a clear differentiator between model tiers.
July 2025 - Claude Opus 4.0 and Gemini 2.5 Pro pushed AIME past 90%. The race to perfect scores began.
October 2025 - Grok 4 launched claiming #1 math performance. Independent evaluations showed 88-90% on AIME, strong but not the claimed 95%.
December 2025 - GPT-5.2 reached the first verified 100% AIME score. Kimi K2.5 and Gemini 3 Pro crossed 95%.
March 2026 - AIME 2025 is effectively saturated at the top. GPQA Diamond and Humanity's Last Exam become the new frontier for measuring math reasoning.
The velocity of improvement from March 2025 (83%) to December 2025 (100%) represents a 17-point gain in nine months. That pace can't continue on a capped benchmark. Harder evaluations like Humanity's Last Exam (top score: 41%) and ARC-AGI-2 (top score: 77.1%) now carry the weight of differentiation.
FAQ
What's the cheapest model that's still good at math?
DeepSeek R1 at $0.55/$2.19 per million tokens scores 97.3% on MATH-500. For competition math, Kimi K2.5 at $1/$5 hits 96.1% on AIME 2025.
Is open-source competitive for math reasoning?
Yes. Kimi K2.5 (96.1% AIME, 98% MATH-500) and DeepSeek R1 (97.3% MATH-500) match or beat several proprietary models. The gap is wider on GPQA Diamond, where proprietary models lead by 5-7 points.
How often do math reasoning rankings change?
AIME rankings have stabilized since multiple models hit 95-100%. GPQA Diamond rankings shift with each major model release, roughly every 6-8 weeks.
Do these benchmarks reflect real-world math ability?
AIME and MATH-500 test competition-style problems with clean solutions. Real-world math often involves messy data, ambiguous constraints, and domain-specific knowledge that these benchmarks don't capture.
Which model explains its reasoning best?
Claude Opus 4.6 produces the most detailed step-by-step explanations. GPT-5.2's chain-of-thought is also strong. Gemini models tend toward more concise reasoning traces.
Sources:
- Artificial Analysis - AIME 2025 Benchmark Leaderboard
- Artificial Analysis - GPQA Diamond Benchmark Leaderboard
- Artificial Analysis - MATH-500 Benchmark Leaderboard
- MathArena - Math Competition Evaluations
- Epoch AI - Evaluating Grok 4's Math Capabilities
- DeepSeek R1 - GitHub
- Google DeepMind - Gemini 3.1 Pro
✓ Last verified March 11, 2026
