Best AI Models for Math Reasoning - April 2026
Gemini 3.1 Pro leads GPQA Diamond at 94.1% and HLE at 44.7% as AIME 2025 saturates; Claude Opus 4.7 and Kimi K2.6 join the top tier in April 2026.

TL;DR
- Gemini 3.1 Pro leads every unsaturated math benchmark: GPQA Diamond (94.1%), HLE (44.7%), and ARC-AGI-2 (77.1%)
- AIME 2025 is dead as a ranking tool - five models score 98%+ now; AIME 2026 is the new competition math test
- Claude Opus 4.7 (April 16) claims 94.2% GPQA Diamond pending independent verification; Kimi K2.6 (April 20) hits 96.4% on AIME 2026 as the open-weight leader
The benchmark picture for math reasoning changed substantially in April 2026. AIME 2025, the standard that defined the race for the past year, is now saturated: GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6, and GPT-5.2 all score 98% or higher, making it nearly useless for distinguishing tier-1 models. The new competition math test is AIME 2026, where GPT-5.4 leads at roughly 99%, with Gemini 3.1 Pro and Claude Opus 4.6 close behind at 98.1% and 98.2%.
On the harder benchmarks that still separate models - GPQA Diamond and Humanity's Last Exam - Gemini 3.1 Pro holds the top position with 94.1% and 44.7% respectively. Two models entered the top tier this month: Claude Opus 4.7 (April 16) claims 94.2% on GPQA Diamond per Anthropic's own evaluations, and Kimi K2.6 (April 20) scores 96.4% on AIME 2026 as the strongest open-weight math option currently available.
Rankings Table
| Rank | Model | Provider | AIME 2026 | GPQA Diamond | HLE | Price (Input) | Verdict |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | Google | 98.1% | 94.1% | 44.7% | $2/M | Leads every unsaturated benchmark; best math value |
| 2 | GPT-5.4 | OpenAI | ~99% | 92.8% | 41.6% | $2.50/M | AIME 2026 leader; trails Gemini on PhD-level science |
| 3 | Claude Opus 4.7 | Anthropic | N/V | 94.2%* | N/V | $5/M | GPQA matches Gemini per Anthropic; AIME scores pending |
| 4 | GPT-5.2 | OpenAI | ~96% | 93.2% | - | $1.75/M | Previous leader; GPQA still edges newer GPT-5.4 |
| 5 | Claude Opus 4.6 | Anthropic | ~98.2% | 91.3% | - | $5/M | Strong AIME 2026; superseded by Opus 4.7 |
| 6 | Kimi K2.6 | Moonshot AI | 96.4% | ~91% | 54.0%† | $0.95/M | Top open-weight math model by a clear margin |
| 7 | Qwen 3.5 | Alibaba | 91.3% | 88.4% | - | $0.50/M | Solid budget math model with decent PhD-level coverage |
| 8 | GLM-5.1 | Zhipu AI | 95.3% | 86.2% | - | $0.95/M | Open-weight with strong AIME; GPQA is the weak point |
| 9 | DeepSeek R1 | DeepSeek | ~81% | 71.5% | - | $0.55/M | Cheapest reliable option; MATH-500 specialist at 97.3% |
| 10 | Grok 4.20 | xAI | N/A | N/A | - | $2/M | Competitive on Arena; math benchmarks not independently verified |
*Anthropic-claimed; Artificial Analysis evaluation of Opus 4.7 not yet published as of April 22.
†HLE with tools enabled; text-only HLE for top models sits under 45%.
For live scores, see Math Olympiad AI Leaderboard and Reasoning Benchmarks Leaderboard.
Competition-level math problems require extended multi-step derivations - the kind that once separated model tiers by 15+ points and now separate them by 2.
Detailed Analysis
Gemini 3.1 Pro - The Comprehensive Math Leader
Google's Gemini 3.1 Pro holds the top combined math position in April 2026. On GPQA Diamond - 198 PhD-level science questions designed to be unsolvable through web search - it scores 94.1% according to Artificial Analysis's independent evaluation. That's more than 24 points above the human expert baseline of 69.7%. On Humanity's Last Exam, it reaches 44.7% on text-only evaluation, the highest published score among publicly available models. Add a 77.1% score on ARC-AGI-2 (last updated April 21 by the ARC Prize leaderboard), and no other model holds that combination across all three hard-reasoning benchmarks simultaneously.
AIME 2026 adds to the picture: 98.13% on Vals AI's independent run. This puts Gemini at or near the top of competition math as well, within statistical noise of GPT-5.4's ~99%.
At $2/$12 per million tokens, Gemini 3.1 Pro delivers the highest combined math performance at 40% of Claude Opus 4.7's $5/M input price.
For research teams and developers focused on hard scientific and mathematical reasoning, that price-to-performance ratio is compelling. The overall LLM rankings for April 2026 show Gemini near the top across all capabilities, not just math.
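How the price gap translates per request depends on workload shape. A minimal sketch of the arithmetic, using the input/output prices listed in this article and an assumed 20k-input / 5k-output request profile (the token counts are illustrative assumptions, not measured workloads):

```python
# Per-request cost from the $/M-token prices listed in the rankings table.
# The 20k/5k token split is an illustrative assumption, not a measurement.

def request_cost(in_price: float, out_price: float,
                 in_tokens: int = 20_000, out_tokens: int = 5_000) -> float:
    """Dollar cost of one request given $/M-token input and output prices."""
    return in_price * in_tokens / 1e6 + out_price * out_tokens / 1e6

prices = {                     # ($/M input, $/M output) as listed in the article
    "Gemini 3.1 Pro": (2.00, 12.00),
    "GPT-5.4":        (2.50, 15.00),
    "Kimi K2.6":      (0.95, 4.00),
}

for model, (p_in, p_out) in prices.items():
    print(f"{model}: ${request_cost(p_in, p_out):.3f} per request")
```

Under this profile Gemini comes in at $0.100 per request against $0.125 for GPT-5.4 and $0.039 for Kimi K2.6; output-heavy workloads shift the ratios toward the models with cheaper output tokens.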
One caveat worth noting: Artificial Analysis assessed the "Gemini 3.1 Pro Preview" checkpoint from late February 2026. Whether the standard API endpoint routes to the same checkpoint isn't disclosed, and score drift between preview and production variants is common.
GPT-5.4 - AIME 2026 Champion
GPT-5.4 leads on competition math specifically. Its AIME 2026 score sits between 98.7% and 99.2% depending on the evaluator - normal variance on a 30-problem test. OpenAI's own benchmark publication shows 92.8% on GPQA Diamond (Artificial Analysis's independent run shows 92.0%, reflecting different reasoning effort settings). HLE comes in at 41.6% text-only.
The strongest practical case for GPT-5.4 on math isn't the raw benchmark scores. It's breadth: the model scores competitively on competition math, PhD-level reasoning, and abstract logic while also offering a 1M-token context window and native computer use for automated math workflows. Teams building agentic systems that solve problems end-to-end - running code, verifying answers, iterating on solutions - get capabilities that AIME scores don't measure.
At $2.50/$15 per million tokens, it's priced between Gemini 3.1 Pro and Claude Opus 4.7.
Claude Opus 4.7 - Pending the Full Picture
Claude Opus 4.7 launched April 16 with Anthropic claiming 94.2% on GPQA Diamond, a 1.8-point improvement over Opus 4.6 that would put it in a virtual tie with Gemini 3.1 Pro. Anthropic also reports a +2.9 percentage point gain on HLE over Opus 4.6, though the absolute baseline wasn't published.
The math-specific benchmarks - AIME 2026, MATH-500 - haven't appeared in Anthropic's announcements or independent evaluations as of April 22. Launch coverage focused on coding: +13% on GitHub AI's 93-task evaluation, 3x more resolved tasks on Rakuten-SWE-Bench, 70% on CursorBench versus 58% for Opus 4.6. That's not an indication of weak math performance. It's an absence of data.
Opus 4.7 is ranked #3 based on the GPQA Diamond claim. That ranking will be confirmed or revised once Artificial Analysis completes its evaluation.
Kimi K2.6 - Best Open-Weight Math Model
Kimi K2.6, released April 20, is the strongest open-weight option for math reasoning on AIME 2026. MathArena's evaluation places it at 96.4%, ahead of predecessor Kimi K2.5 at 95.83% and GLM-5.1 at 95.3%. On GPQA Diamond, K2.6 trails the proprietary frontier by roughly 2-3 points (around 91%), consistent with the model card's description of pure reasoning as the soft spot.
The HLE score of 54.0 with tools enabled runs notably higher than the text-only Gemini and GPT-5.4 figures, but this comparison requires care. Tool-augmented evaluation differs fundamentally from text-only, and these come from different evaluators. Treat the 54.0 as a tool-use figure, not a drop-in replacement for the text-only HLE column.
The practical appeal is price and availability: $0.95/$4 per million tokens with open weights on HuggingFace for teams that can run inference locally.
MATH-500 and AIME use structured problems with integer answers; GPQA Diamond and HLE probe whether models can reason through truly novel questions.
Methodology
This update replaces AIME 2025 with AIME 2026 as the primary competition math benchmark. AIME 2025 is saturated at the top - five models score 98% or higher - making it useless for ranking tier-1 systems. AIME 2026 uses the freshly released 2026 American Invitational Mathematics Examination problems, which weren't in training data for any current model.
Three benchmarks drive the April 2026 rankings:
AIME 2026 - 30 competition math problems covering algebra, geometry, number theory, and combinatorics. MathArena evaluates models 4 times per problem and reports the average. Vals AI and other independent evaluators use pass@1 or pass@k methodologies, which is why scores vary by 1-3 points across sources - treat anything within 2 points as a statistical tie.
GPQA Diamond - 198 PhD-level science questions designed to be "Google-proof." Domain experts score 69.7%. No model publicly beats 95% as of April 2026. Primary independent source: Artificial Analysis.
HLE (Humanity's Last Exam) - 2,500 questions across math, science, and humanities. Text-only leaderboard sits under 45% for all models. Tool-augmented scores run significantly higher but aren't comparable to text-only results and shouldn't be mixed in the same ranking column.
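The variance caveats above follow directly from the small sample sizes. A sketch of the arithmetic, assuming each problem attempt is an independent Bernoulli trial (a simplification - real problems vary in difficulty):

```python
import math

def score_stderr(p: float, n_problems: int, n_runs: int = 1) -> float:
    """Binomial standard error of an accuracy score, in percentage points,
    assuming independent attempts (a simplifying assumption)."""
    return 100 * math.sqrt(p * (1 - p) / (n_problems * n_runs))

# AIME 2026: 30 problems. A single run at ~96% accuracy:
single = score_stderr(0.96, 30)       # ~3.6 points of noise
# MathArena's 4-runs-per-problem average:
averaged = score_stderr(0.96, 30, 4)  # ~1.8 points of noise

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples drawn from n attempts with c correct succeeds."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

A ~1.8-point standard error on a 4-run average is why anything within 2 points is treated as a statistical tie; single-sample pass@1 scores carry roughly double that noise, which accounts for the 1-3 point spread across evaluators.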
MATH-500 is no longer a useful differentiator at the top: most frontier models score 97-99%+. It's retained for the DeepSeek R1 entry because it's that model's strongest benchmark.
Historical Progression
March 2025 - OpenAI o1 led AIME at roughly 83%. GPQA Diamond top score sat around 78%. Clear separation existed between model tiers on both benchmarks.
July 2025 - Claude Opus 4.0 and Gemini 2.5 Pro pushed AIME past 90%. The race to reach 100% began in earnest.
October 2025 - Grok 4 launched claiming top math performance. Independent evaluations showed 88-90% on AIME 2025, not the claimed 95%.
December 2025 - GPT-5.2 became the first model to score 100% on AIME 2025. Kimi K2.5 and Gemini 3 Pro crossed 95%.
March 2026 - AIME 2025 saturated at the top. GPQA Diamond and HLE took over as differentiators. GPT-5.4 launched with AIME 2026 competitiveness.
April 2026 - AIME 2026 becomes the primary competition math benchmark. Gemini 3.1 Pro leads across GPQA Diamond, HLE, and ARC-AGI-2. Claude Opus 4.7 claims GPQA parity with Gemini pending independent verification. Kimi K2.6 takes the open-weight math crown.
The step from 78% GPQA Diamond in March 2025 to 94% in April 2026 happened in 13 months. With three models now clustered between 91% and 94%, GPQA is on the same saturation track AIME 2025 just left. The scientific reasoning leaderboard will show when the next benchmark needs to step in.
FAQ
What's the cheapest model that's still good at math?
Qwen 3.5 at $0.50/M input scores 91.3% on AIME 2026 and 88.4% on GPQA Diamond. For MATH-500 heavy workloads, DeepSeek R1 at $0.55/M hits 97.3% at the lowest cost of any reliable math model.
Is open-source competitive for math reasoning?
Yes, more than before. Kimi K2.6 scores 96.4% on AIME 2026 with open weights available for self-hosting, leaving a gap of roughly two points to the proprietary frontier. GPQA Diamond still shows a 3-4 point gap between the best open-weight models and the closed-source leaders.
How often do math reasoning rankings change?
AIME 2026 rankings have already shifted twice since March 2026. Major model releases from any of the five or six labs with frontier math capability will reshuffle the top five within weeks. GPQA Diamond rankings are more stable - they move when a new model generation ships, not when minor updates drop.
Which benchmark replaced AIME 2025?
AIME 2026 for competition math, and Humanity's Last Exam for frontier reasoning. AIME 2025 is saturated with five models at 98%+. HLE remains under 45% for the best text-only models, giving it years of runway before the same saturation problem hits.
Is Claude Opus 4.7 better than Gemini 3.1 Pro for math?
Anthropic claims 94.2% on GPQA Diamond for Opus 4.7, which would match Gemini's 94.1%. AIME 2026 and HLE scores aren't published as of April 22. The honest answer is that the benchmarks that would settle this question don't exist in the public record yet - check back once Artificial Analysis completes its evaluation.
Sources:
- Artificial Analysis - GPQA Diamond Benchmark Leaderboard
- Artificial Analysis - Humanity's Last Exam Leaderboard
- Vals AI - AIME Benchmark
- MathArena - AIME 2026
- MathArena AIME 2026 Dataset - Hugging Face
- ARC Prize Leaderboard
- BenchLM - AIME 2026
- Vellum AI - Claude Opus 4.7 Benchmarks
- LLM Stats - AIME 2026 Leaderboard
- Price Per Token - GPQA Leaderboard
✓ Last verified April 22, 2026
