Best AI Models for Math Reasoning - March 2026
GPT-5.2 and Claude Opus 4.6 both score 100% on AIME 2025, while Gemini 3.1 Pro leads GPQA Diamond at 94.3% for PhD-level scientific reasoning.

TL;DR
- GPT-5.2 and Claude Opus 4.6 both hit 100% on AIME 2025, saturating the benchmark completely
- Gemini 3.1 Pro leads GPQA Diamond at 94.3%, the best score on PhD-level science questions, at just $2/$12 per million tokens
- Open-weight models like Kimi K2.5 (96.1% AIME) and DeepSeek R1 (97.3% MATH-500) compete within striking distance of closed-source leaders
Math reasoning has become the most saturated AI benchmark category in 2026. Both GPT-5.2 and Claude Opus 4.6 reached perfect 100% scores on AIME 2025, the 30-problem competition math exam that once separated frontier models by double digits. The real differentiation now happens on harder benchmarks: Gemini 3.1 Pro leads GPQA Diamond (PhD-level science) at 94.3%, and Gemini 3 Deep Think pushes Humanity's Last Exam to 41%.
For practical math applications, the choice comes down to price. Gemini 3.1 Pro delivers 91.2% on AIME and 94.3% on GPQA Diamond at $2/$12 per million tokens, cheaper on output than GPT-5.2 ($1.75/$14) and far cheaper than Opus 4.6 ($5/$25).
Rankings Table
| Rank | Model | Provider | AIME 2025 | GPQA Diamond | MATH-500 | Price (Input/Output) | Verdict |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5.2 | OpenAI | 100% | 92.4% | 98% | $1.75/$14 | First to hit perfect AIME, strong across all math |
| 2 | Claude Opus 4.6 | Anthropic | 100% | 91.3% | 93% | $5/$25 | Matches GPT-5.2 on AIME, trails on MATH-500 |
| 3 | Gemini 3.1 Pro | Google | 91.2% | 94.3% | - | $2/$12 | GPQA Diamond leader, best price-to-performance |
| 4 | Kimi K2.5 | Moonshot AI | 96.1% | 87.6% | 98% | $1/$5 | Open-weight surprise, matches GPT-5.2 on MATH-500 |
| 5 | Gemini 3 Pro | Google | 95% | 91.9% | - | $1.25/$10 | Previous gen still competitive on competition math |
| 6 | Grok 4 | xAI | 88-95% | 88% | - | $3/$15 | Strong but scores vary by evaluation source |
| 7 | DeepSeek R1 | DeepSeek | 79.8% | - | 97.3% | $0.55/$2.19 | Best open-source MATH-500, weaker on AIME |
| 8 | Claude Opus 4.5 | Anthropic | 84.5% | 88% | - | $5/$25 | Previous gen, superseded by 4.6 |
| 9 | Qwen 3.5 | Alibaba | - | 87.6% | - | $0.50/$2 | Budget open-weight option with strong GPQA |
| 10 | GPT-5.4 | OpenAI | - | 93% | - | $2.50/$20 | Newest OpenAI model, limited math data available |
Detailed Analysis
GPT-5.2 - The First Perfect AIME Score
OpenAI's GPT-5.2 made history in December 2025 by becoming the first model to solve all 30 AIME 2025 problems correctly. That 100% score, combined with 98% on MATH-500 and 92.4% on GPQA Diamond, makes it the most consistently strong math model across benchmark types.
The "xhigh" reasoning effort setting pushes AIME performance to 99-100% but costs substantially more in compute time and tokens. For most users, the default setting delivers 95%+ accuracy on competition-level problems, which is more than sufficient for real applications.
At $1.75/$14 per million tokens, GPT-5.2 is priced lower than Claude Opus 4.6 but higher than Gemini 3.1 Pro. The premium buys you the highest combined score across AIME, MATH-500, and GPQA Diamond. No other model matches its breadth of math performance.
Claude Opus 4.6 - Matching the Top on Competition Math
Claude Opus 4.6 also reached 100% on AIME 2025 with tool access enabled, tying GPT-5.2 for the top competition math score. Its 91.3% on GPQA Diamond and 93% on MATH-500 show strong reasoning across domains.
Where Opus 4.6 separates itself is in explanatory quality. The model produces step-by-step derivations that read like textbook solutions, making it especially useful in educational settings. For pure benchmark performance, GPT-5.2 leads it by five points on MATH-500 (98% vs 93%). For applications where understanding the reasoning chain matters as much as the answer, Opus 4.6 has an advantage.
The $5/$25 pricing makes it the most expensive option in the top five. Teams running high-volume math processing should consider Gemini 3.1 Pro or Kimi K2.5 as cost-effective alternatives.
Gemini 3.1 Pro - The GPQA Diamond Champion
Google's Gemini 3.1 Pro holds the highest GPQA Diamond score at 94.3%, a benchmark of PhD-level science questions on which domain experts score around 70%. That 24-point lead over human experts is the widest gap any model has achieved on this benchmark.
On competition math, Gemini 3.1 Pro scores 91.2% on AIME 2025, behind the perfect scores from GPT-5.2 and Opus 4.6 but still strong enough to place in the top percentile of human test-takers. Google's Gemini 3 Deep Think variant pushes Humanity's Last Exam from 37.5% to 41%, the highest score any model has reached on questions designed to resist AI solution.
At $2/$12, Gemini 3.1 Pro costs 52% less than Opus 4.6 on output tokens ($12 vs $25) and 60% less on input, while delivering the top GPQA Diamond score. For teams focused on scientific reasoning and graduate-level problem solving, it's the clear value pick.
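To put the per-token rates in concrete terms, here's a minimal Python sketch that prices a hypothetical batch workload using the rates from the rankings table above. The token volumes are illustrative placeholders, not measured values.

```python
# Estimate workload cost from per-million-token rates.
# Prices come from the rankings table; token volumes are
# hypothetical placeholders for illustration only.

PRICES = {  # (input $/M tokens, output $/M tokens)
    "GPT-5.2": (1.75, 14.00),
    "Claude Opus 4.6": (5.00, 25.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
    "Kimi K2.5": (1.00, 5.00),
    "DeepSeek R1": (0.55, 2.19),
}

def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a workload for a given model."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Hypothetical batch: 10,000 problems, ~1K input and ~4K output tokens each.
for model in PRICES:
    cost = workload_cost(model, 10_000 * 1_000, 10_000 * 4_000)
    print(f"{model:18s} ${cost:,.2f}")
```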
Kimi K2.5 - Open Weights, Closed Gap
Moonshot AI's Kimi K2.5 scores 96.1% on AIME 2025 and 98% on MATH-500, matching GPT-5.2 on the latter benchmark. As an open-weight model, it can be self-hosted and fine-tuned, offering a flexibility that closed-source models can't provide.
The K2.5 Thinking mode extends reasoning chains for complex problems, using a budget of up to 96K tokens for multi-step derivations. AIME scores were averaged over 32 runs, which provides a more statistically reliable estimate than single-pass evaluations.
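For context on what averaging buys, here's a minimal sketch of the avg@32 estimator: score each independent run over the 30 problems, then report the mean with a standard error. The per-run solve counts below are simulated placeholders, not Kimi's actual run data.

```python
import statistics

# Simulated per-run solve counts out of 30 AIME problems.
# Placeholder values for illustration; a real evaluation would record
# one count per independent sampling run.
run_solved = [29, 28, 29, 30, 28, 29, 29, 28] * 4   # 32 runs

scores = [n / 30 for n in run_solved]
mean = statistics.mean(scores)
stderr = statistics.stdev(scores) / len(scores) ** 0.5

print(f"avg@{len(scores)}: {mean:.3f} ± {stderr:.3f} (1 SE)")
```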
At $1/$5 per million tokens through the API, Kimi K2.5 represents the best value proposition in this category for teams that don't need GPQA Diamond performance. Its 87.6% on GPQA Diamond is roughly 7 points behind Gemini 3.1 Pro, a gap that matters for PhD-level science but rarely affects practical math applications.
DeepSeek R1 - The MATH-500 Specialist
DeepSeek R1 achieved 97.3% on MATH-500, putting it in the same tier as GPT-5.2 and Kimi K2.5 on this benchmark. Built using pure reinforcement learning for reasoning, R1 shows that open-source models can match proprietary ones on structured math problem-solving.
The catch is AIME 2025 performance. At 79.8%, R1 trails the perfect scores by roughly 20 percentage points on competition math, where problems demand more creative mathematical insight. The distilled 32B variant scores 72.6% on AIME, still usable but not competitive with frontier models. At $0.55/$2.19 per million tokens, it remains the cheapest strong math model available.
Methodology
Three benchmarks drive this ranking:
AIME 2025 consists of 30 competition-level math problems from the American Invitational Mathematics Examination. Problems require integer answers between 000 and 999, testing algebra, geometry, number theory, and combinatorics; a minimal grading sketch follows these benchmark descriptions. With multiple models now scoring 95-100%, AIME is approaching saturation as a differentiating benchmark.
GPQA Diamond contains 198 PhD-level science questions designed to stump non-domain experts; even domain experts score only around 70%. It tests deeper scientific reasoning than AIME and remains unsaturated, with the top score at 94.3%.
MATH-500 is a 500-problem subset of the MATH dataset covering competition-level mathematics across six domains. It provides a broader evaluation than AIME but shows similar saturation at the top.
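To make the AIME answer format concrete, here's the grading sketch referenced above: answers are exact integers from 000 to 999, so scoring reduces to parsing and exact match with no partial credit. The predictions and answer key are hypothetical.

```python
def grade_aime(predicted: str, answer: int) -> bool:
    """Exact-match grading for an AIME problem.

    AIME answers are integers from 000 to 999, so there is no partial
    credit: a prediction is correct only if it parses to the exact value.
    """
    try:
        value = int(predicted.strip())
    except ValueError:
        return False
    return 0 <= value <= 999 and value == answer

# 30 problems per exam; score is the fraction answered exactly.
predictions = {1: "204", 2: "73", 3: "1000"}   # hypothetical model outputs
answer_key = {1: 204, 2: 73, 3: 100}           # hypothetical answer key
score = sum(grade_aime(p, answer_key[i]) for i, p in predictions.items())
print(f"{score}/{len(predictions)} correct")
```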
A critical caveat: AIME 2025 scores should be interpreted carefully. Some models (like GPT-5.2) report averages over many runs, while others report best-of-N results. Grok 4's score ranges from 88% to 95% depending on the evaluation methodology, which illustrates how reporting differences can distort rankings.
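The divergence between the two reporting conventions is easy to demonstrate in simulation: when the per-problem solve rate is below 100%, best-of-N climbs toward a perfect score as N grows, while the averaged estimate stays near the true rate. The solve probability below is a hypothetical placeholder, not any model's measured rate.

```python
import random

random.seed(0)
P_SOLVE = 0.85   # hypothetical per-problem solve probability
PROBLEMS, RUNS = 30, 32

# Simulate 32 independent runs over a 30-problem exam.
runs = [[random.random() < P_SOLVE for _ in range(PROBLEMS)] for _ in range(RUNS)]

# avg@N: mean accuracy across runs (what averaged reporting measures).
avg_n = sum(sum(run) for run in runs) / (PROBLEMS * RUNS)

# best-of-N: a problem counts if ANY run solved it (inflates the score).
best_n = sum(any(run[i] for run in runs) for i in range(PROBLEMS)) / PROBLEMS

print(f"avg@{RUNS}:     {avg_n:.1%}")   # stays near 85%
print(f"best-of-{RUNS}: {best_n:.1%}")  # approaches 100%
```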
Humanity's Last Exam, where the top score stands at 41%, is emerging as the next frontier for differentiating math reasoning capability. We'll feature it more prominently as more models report scores.
Historical Progression
March 2025 - OpenAI's o1 led AIME at roughly 83%. GPQA Diamond top score sat around 78%. Math reasoning was a clear differentiator between model tiers.
July 2025 - Claude Opus 4.0 and Gemini 2.5 Pro pushed AIME past 90%. The race to perfect scores began.
October 2025 - Grok 4 launched claiming #1 math performance. Independent evaluations showed 88-90% on AIME, strong but not the claimed 95%.
December 2025 - GPT-5.2 reached the first verified 100% AIME score. Kimi K2.5 and Gemini 3 Pro crossed 95%.
March 2026 - AIME 2025 is effectively saturated at the top. GPQA Diamond and Humanity's Last Exam become the new frontier for measuring math reasoning.
The velocity of improvement from March 2025 (83%) to December 2025 (100%) represents a 17-point gain in nine months. That pace can't continue on a capped benchmark. Harder evaluations like Humanity's Last Exam (top score: 41%) and ARC-AGI-2 (top score: 77.1%) now carry the weight of differentiation.
FAQ
What's the cheapest model that's still good at math?
DeepSeek R1 at $0.55/$2.19 per million tokens scores 97.3% on MATH-500. For competition math, Kimi K2.5 at $1/$5 hits 96.1% on AIME 2025.
Is open-source competitive for math reasoning?
Yes. Kimi K2.5 (96.1% AIME, 98% MATH-500) and DeepSeek R1 (97.3% MATH-500) match or beat several proprietary models. The gap is wider on GPQA Diamond, where proprietary models lead by 5-7 points.
How often do math reasoning rankings change?
AIME rankings have stabilized since multiple models hit 95-100%. GPQA Diamond rankings shift with each major model release, roughly every 6-8 weeks.
Do these benchmarks reflect real-world math ability?
AIME and MATH-500 test competition-style problems with clean solutions. Real-world math often involves messy data, ambiguous constraints, and domain-specific knowledge that these benchmarks don't capture.
Which model explains its reasoning best?
Claude Opus 4.6 produces the most detailed step-by-step explanations. GPT-5.2's chain-of-thought is also strong. Gemini models tend toward more concise reasoning traces.
Sources:
- Artificial Analysis - AIME 2025 Benchmark Leaderboard
- Artificial Analysis - GPQA Diamond Benchmark Leaderboard
- Artificial Analysis - MATH-500 Benchmark Leaderboard
- MathArena - Math Competition Evaluations
- Epoch AI - Evaluating Grok 4's Math Capabilities
- DeepSeek R1 - GitHub
- Google DeepMind - Gemini 3.1 Pro
✓ Last verified March 11, 2026
