Best AI Models for Math Reasoning - March 2026

GPT-5.2 and Claude Opus 4.6 both score 100% on AIME 2025, while Gemini 3.1 Pro leads GPQA Diamond at 94.3% for PhD-level scientific reasoning.


TL;DR

  • GPT-5.2 and Claude Opus 4.6 both hit 100% on AIME 2025, saturating the benchmark completely
  • Gemini 3.1 Pro leads GPQA Diamond at 94.3%, the best score on PhD-level science questions, at just $2/$12 per million tokens
  • Open-weight models like Kimi K2.5 (96.1% AIME) and DeepSeek R1 (97.3% MATH-500) compete within striking distance of closed-source leaders

Math reasoning has become the most saturated AI benchmark category in 2026. Both GPT-5.2 and Claude Opus 4.6 reached perfect 100% scores on AIME 2025, the 30-problem competition math exam that once separated frontier models by double digits. The real differentiation now happens on harder benchmarks: Gemini 3.1 Pro leads GPQA Diamond (PhD-level science) at 94.3%, and Gemini 3 Deep Think pushes Humanity's Last Exam to 41%.

For practical math applications, the choice comes down to price. Gemini 3.1 Pro delivers 91.2% on AIME and 94.3% on GPQA Diamond at $2/$12 per million tokens, undercutting GPT-5.2's $1.75/$14 and Opus 4.6's $5/$25.
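The price comparison above is easy to sanity-check with a short script. The per-million-token rates are the ones quoted in this article; the workload size (10M input, 40M output tokens) is an illustrative assumption, chosen because reasoning models emit far more output than input on math problems.

```python
# Estimate the cost of a math-heavy workload for each model.
# Prices are USD per million tokens (input, output) as quoted in this article;
# the token volumes below are illustrative assumptions, not measurements.

PRICES = {
    "GPT-5.2":         (1.75, 14.00),
    "Claude Opus 4.6": (5.00, 25.00),
    "Gemini 3.1 Pro":  (2.00, 12.00),
    "Kimi K2.5":       (1.00, 5.00),
}

def workload_cost(model, input_tokens, output_tokens):
    """Return the total USD cost for a given token volume."""
    inp, out = PRICES[model]
    return (input_tokens / 1e6) * inp + (output_tokens / 1e6) * out

# Example workload: 10M input tokens, 40M output tokens.
for model in PRICES:
    print(f"{model}: ${workload_cost(model, 10e6, 40e6):,.2f}")
```

On this output-heavy mix, Gemini 3.1 Pro ($500) comes in below GPT-5.2 ($577.50) despite its higher input rate, which is why the input/output split matters more than either number alone.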

Rankings Table

| Rank | Model | Provider | AIME 2025 | GPQA Diamond | MATH-500 | Price (Input/Output) | Verdict |
|------|-------|----------|-----------|--------------|----------|----------------------|---------|
| 1 | GPT-5.2 | OpenAI | 100% | 92.4% | 98% | $1.75/$14 | First to hit perfect AIME, strong across all math |
| 2 | Claude Opus 4.6 | Anthropic | 100% | 91.3% | 93% | $5/$25 | Matches GPT-5.2 on AIME, trails on MATH-500 |
| 3 | Gemini 3.1 Pro | Google | 91.2% | 94.3% | - | $2/$12 | GPQA Diamond leader, best price-to-performance |
| 4 | Kimi K2.5 | Moonshot AI | 96.1% | 87.6% | 98% | $1/$5 | Open-weight surprise, matches GPT-5.2 on MATH-500 |
| 5 | Gemini 3 Pro | Google | 95% | 91.9% | - | $1.25/$10 | Previous gen still competitive on competition math |
| 6 | Grok 4 | xAI | 88-95% | 88% | - | $3/$15 | Strong but scores vary by evaluation source |
| 7 | DeepSeek R1 | DeepSeek | 79.8% | - | 97.3% | $0.55/$2.19 | Best open-source MATH-500, weaker on AIME |
| 8 | Claude Opus 4.5 | Anthropic | 84.5% | 88% | - | $5/$25 | Previous gen, superseded by 4.6 |
| 9 | Qwen 3.5 | Alibaba | - | 87.6% | - | $0.50/$2 | Budget open-weight option with strong GPQA |
| 10 | GPT-5.4 | OpenAI | - | 93% | - | $2.50/$20 | Newest OpenAI model, limited math data available |

Detailed Analysis

GPT-5.2 - The First Perfect AIME Score

OpenAI's GPT-5.2 made history in December 2025 by becoming the first model to solve all 30 AIME 2025 problems correctly. That 100% score, combined with 98% on MATH-500 and 92.4% on GPQA Diamond, makes it the most consistently strong math model across benchmark types.

The "xhigh" reasoning effort setting pushes AIME performance to 99-100% but costs substantially more in compute time and tokens. For most users, the default setting delivers 95%+ accuracy on competition-level problems, which is more than sufficient for real applications.

At $1.75/$14 per million tokens, GPT-5.2 undercuts Claude Opus 4.6 and roughly matches Gemini 3.1 Pro (cheaper on input, pricier on output). What that buys is the highest combined score across AIME, MATH-500, and GPQA Diamond; no other model matches its breadth of math performance.

Claude Opus 4.6 - Matching the Top on Competition Math

Claude Opus 4.6 also reached 100% on AIME 2025 with tool access enabled, tying GPT-5.2 for the top competition math score. Its 91.3% on GPQA Diamond and 93% on MATH-500 show strong reasoning across domains.

Where Opus 4.6 separates itself is in explanatory quality. The model produces step-by-step derivations that read like textbook solutions, making it especially useful in educational settings. For pure benchmark performance, GPT-5.2 edges it on MATH-500 (98% vs 93%). For applications where understanding the reasoning chain matters as much as the answer, Opus 4.6 has an advantage.

The $5/$25 pricing makes it the most expensive option in the top five. Teams running high-volume math processing should consider Gemini 3.1 Pro or Kimi K2.5 as cost-effective alternatives.

Gemini 3.1 Pro - The GPQA Diamond Champion

Google's Gemini 3.1 Pro holds the highest GPQA Diamond score at 94.3%, a benchmark that tests PhD-level science and math questions where domain experts score around 70%. That 24-point lead over human experts is the widest gap any model has achieved on this benchmark.

On competition math, Gemini 3.1 Pro scores 91.2% on AIME 2025, behind the perfect scores from GPT-5.2 and Opus 4.6 but still strong enough to place in the top percentile of human test-takers. Its sibling, Gemini 3 Deep Think, pushes Humanity's Last Exam from 37.5% to 41%, the highest score any model has reached on questions designed to resist AI solution.

At $2/$12, Gemini 3.1 Pro costs 52% less than Opus 4.6 on output tokens while delivering the top GPQA Diamond score. For teams that focus on scientific reasoning and graduate-level problem solving, it's the clear value pick.

Kimi K2.5 - Open Weights, Closed Gap

Moonshot AI's Kimi K2.5 scores 96.1% on AIME 2025 and 98% on MATH-500, matching GPT-5.2 on the latter benchmark. As an open-weight model, it can be self-hosted and fine-tuned, offering a flexibility that closed-source models can't provide.

The K2.5 Thinking mode extends reasoning chains for complex problems, using a budget of up to 96K tokens for multi-step derivations. AIME scores were averaged over 32 runs, which provides a more statistically reliable estimate than single-pass evaluations.
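Averaging over 32 runs matters because a single AIME pass is only 30 problems, so one run is a noisy estimate. A minimal sketch of the standard error this buys; the per-run accuracies below are invented for illustration, not Moonshot's published data:

```python
import statistics

# Hypothetical per-run AIME accuracies for 32 independent runs
# (illustrative numbers only, not Moonshot's actual results).
runs = [0.966, 0.933, 0.966, 1.0, 0.933, 0.966, 0.966, 1.0] * 4  # 32 runs

mean = statistics.mean(runs)
# Standard error of the mean shrinks with sqrt(number of runs).
stderr = statistics.stdev(runs) / len(runs) ** 0.5

print(f"mean accuracy: {mean:.3f} +/- {stderr:.3f} (1 SE over {len(runs)} runs)")
```

With 32 runs the standard error drops well under one percentage point, which is why averaged scores are more trustworthy than single-pass numbers.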

At $1/$5 per million tokens through the API, Kimi K2.5 represents the best value proposition in this category for teams that don't need GPQA Diamond performance. Its 87.6% on GPQA Diamond is roughly 7 points behind Gemini 3.1 Pro, a gap that matters for PhD-level science but rarely affects practical math applications.

DeepSeek R1 - The MATH-500 Specialist

DeepSeek R1 achieved 97.3% on MATH-500, putting it in the same tier as GPT-5.2 and Kimi K2.5 on this benchmark. Built using pure reinforcement learning for reasoning, R1 shows that open-source models can match proprietary ones on structured math problem-solving.

The catch is AIME 2025 performance. At 79.8%, R1 trails the leaders by 15-20 percentage points on competition math, where problems require more creative mathematical insight. The distilled 32B variant scores 72.6% on AIME, still usable but not competitive with frontier models. At $0.55/$2.19 per million tokens, it's the cheapest strong math model available.

Methodology

Three benchmarks drive this ranking:

AIME 2025 consists of 30 competition-level math problems from the American Invitational Mathematics Examination. Problems require integer answers between 000 and 999, testing algebra, geometry, number theory, and combinatorics. With multiple models now scoring 95-100%, AIME is approaching saturation as a differentiating benchmark.
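Because every AIME answer is an integer from 000 to 999, automated grading reduces to exact integer matching. A minimal grader sketch; the answer-extraction convention (take the last standalone 1-3 digit integer) is an illustrative assumption, not any lab's published harness:

```python
import re

def parse_aime_answer(text):
    """Extract a candidate AIME answer (an integer 0-999) from a response.
    Convention (an assumption for illustration): take the last standalone
    1-3 digit integer in the text; return None if there is none."""
    matches = re.findall(r"\b\d{1,3}\b", text)
    return int(matches[-1]) if matches else None

def grade(response, correct):
    """True iff the extracted answer exactly matches the official integer."""
    answer = parse_aime_answer(response)
    return answer is not None and answer == correct

print(grade("The sum telescopes, so the answer is 204.", 204))  # True
print(grade("I believe the answer is 73.", 204))                # False
```

Exact-match grading is part of why AIME saturates cleanly: there is no partial credit and no judge-model ambiguity.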

GPQA Diamond contains 198 PhD-level science questions written so that non-domain experts fail them, while domain experts score around 70%. It tests deeper scientific reasoning than AIME and remains unsaturated, with the top score at 94.3%.

MATH-500 is a 500-problem subset of the MATH dataset covering competition-level mathematics across six domains. It provides a broader evaluation than AIME but shows similar saturation at the top.

A critical caveat: AIME 2025 scores should be interpreted carefully. Some models (like GPT-5.2) report averages over many runs, while others report best-of-N results. Grok 4's score ranges from 88% to 95% depending on the evaluation methodology, which shows how reporting differences can distort rankings.
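The reporting difference behind that caveat is easy to demonstrate: the same per-run scores yield very different headline numbers under average-of-N versus best-of-N. The run scores below are invented for illustration (chosen to land in Grok 4's reported 88-95% range):

```python
# One hypothetical set of per-run AIME scores, two reporting conventions.
runs = [0.83, 0.87, 0.90, 0.93, 0.87, 0.90, 0.83, 0.95]

avg_of_n = sum(runs) / len(runs)   # what averaged reporting would publish
best_of_n = max(runs)              # what best-of-N reporting would publish

print(f"average-of-{len(runs)}: {avg_of_n:.1%}")   # 88.5%
print(f"best-of-{len(runs)}:    {best_of_n:.1%}")  # 95.0%
```

A 6.5-point spread from the identical underlying runs, which is roughly the spread in Grok 4's published numbers.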

Humanity's Last Exam, where the top score stands at 41%, is emerging as the next frontier for differentiating math reasoning capability. We'll add it more prominently as more models report scores.

Historical Progression

  • March 2025 - OpenAI's o1 led AIME at roughly 83%. GPQA Diamond top score sat around 78%. Math reasoning was a clear differentiator between model tiers.

  • July 2025 - Claude Opus 4.0 and Gemini 2.5 Pro pushed AIME past 90%. The race to perfect scores began.

  • October 2025 - Grok 4 launched claiming #1 math performance. Independent evaluations showed 88-90% on AIME, strong but not the claimed 95%.

  • December 2025 - GPT-5.2 reached the first verified 100% AIME score. Kimi K2.5 and Gemini 3 Pro crossed 95%.

  • March 2026 - AIME 2025 is effectively saturated at the top. GPQA Diamond and Humanity's Last Exam become the new frontier for measuring math reasoning.

The velocity of improvement from March 2025 (83%) to December 2025 (100%) represents a 17-point gain in nine months. That pace can't continue on a capped benchmark. Harder evaluations like Humanity's Last Exam (top score: 41%) and ARC-AGI-2 (top score: 77.1%) now carry the weight of differentiation.

FAQ

What's the cheapest model that's still good at math?

DeepSeek R1 at $0.55/$2.19 per million tokens scores 97.3% on MATH-500. For competition math, Kimi K2.5 at $1/$5 hits 96.1% on AIME 2025.

Is open-source competitive for math reasoning?

Yes. Kimi K2.5 (96.1% AIME, 98% MATH-500) and DeepSeek R1 (97.3% MATH-500) match or beat several proprietary models. The gap is wider on GPQA Diamond, where proprietary models lead by 5-7 points.

How often do math reasoning rankings change?

AIME rankings have stabilized since multiple models hit 95-100%. GPQA Diamond rankings shift with each major model release, roughly every 6-8 weeks.

Do these benchmarks reflect real-world math ability?

AIME and MATH-500 test competition-style problems with clean solutions. Real-world math often involves messy data, ambiguous constraints, and domain-specific knowledge that these benchmarks don't capture.

Which model explains its reasoning best?

Claude Opus 4.6 produces the most detailed step-by-step explanations. GPT-5.2's chain-of-thought is also strong. Gemini models tend toward more concise reasoning traces.


Last verified March 11, 2026
About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.