Math Olympiad AI Leaderboard - March 2026 Rankings

Rankings of AI models on competition mathematics benchmarks including AIME 2025, IMO, MathArena, and FrontierMath, measuring the cutting edge of mathematical reasoning.

Competition mathematics remains one of the purest tests of reasoning ability. No shortcuts, no memorization tricks, no ambiguity in the answers. A model either solves the problem correctly or it doesn't. Since I last updated this leaderboard in January, the landscape has shifted dramatically. Multiple models now score a perfect 100% on AIME 2025, Gemini Deep Think earned an official gold medal at IMO 2025, and FrontierMath scores have jumped from single digits to around 40%. The benchmarks we track are being saturated faster than anyone expected.

This leaderboard ranks models on four major competition math benchmarks: AIME 2025, IMO-level problems, MathArena (a composite across multiple recent competitions), and FrontierMath.

TL;DR

  • Gemini 3.1 Pro leads MathArena's composite ranking at 91.1%, dominating across multiple competition formats
  • AIME 2025 is effectively saturated - GPT-5.2 Thinking, Claude Opus 4.6, Grok 4 Heavy, and Gemini 3 Pro all hit 100% with tools
  • Best value pick: DeepSeek V3.2-Speciale scores 96-97% on AIME at a fraction of the cost of closed-source frontier models

The Benchmarks Explained

AIME (American Invitational Mathematics Examination)

AIME is the second round of the American mathematics competition pipeline, taken by roughly the top 5% of AMC 12 scorers. Each exam has 15 problems with integer answers from 000 to 999. Problems cover algebra, combinatorics, geometry, and number theory, and require creative insight rather than rote computation. A high AIME score (roughly 10+), combined with a strong AMC score, typically qualifies a student for the USA Mathematical Olympiad (USAMO), placing them among the top few hundred math students in the country.
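
To make the scoring format concrete, here is a minimal sketch of how an AIME-style answer sheet is graded: exact integer matching, one point per problem, no partial credit. The function names and the fixed 10-point threshold are illustrative only, not part of any official grading tool.

```python
# Minimal sketch of AIME-style grading: each of the 15 problems has a single
# integer answer in [0, 999], scored 1 point for an exact match, 0 otherwise.
# Names and the qualification threshold here are illustrative assumptions.

def grade_aime(submitted: list[int], answer_key: list[int]) -> int:
    """Return the number of exactly matching integer answers (0-15)."""
    assert len(submitted) == len(answer_key) == 15
    return sum(
        1
        for given, correct in zip(submitted, answer_key)
        if 0 <= given <= 999 and given == correct
    )

def in_usamo_range(score: int, threshold: int = 10) -> bool:
    """A score of roughly 10+ has historically been in USAMO-qualifying range."""
    return score >= threshold
```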

We assess models on AIME 2025 (combining AIME I and AIME II, 30 problems total). Most frontier models now score above 90%, pushing this benchmark toward saturation.

IMO (International Mathematical Olympiad)

IMO is the most prestigious mathematics competition in the world. Six problems are given over two days, each requiring a complete proof rather than just a numerical answer. Problems demand deep mathematical insight and the ability to construct rigorous arguments. An IMO gold medal represents roughly the top 8% of participants, who are themselves the best young mathematicians from 100+ countries.

In July 2025, Google DeepMind's Gemini Deep Think became the first end-to-end language model to officially achieve gold-medal standard at IMO, solving five of six problems and scoring 35 out of 42 points within the competition time limit.

MathArena

MathArena is an independent evaluation platform run by ETH Zurich researchers that tests models on the latest math competition problems. Unlike single-exam benchmarks, MathArena aggregates performance across multiple recent competitions - including AIME, HMMT, AMC, and others - producing a composite "expected performance" score. This makes it harder for models to game through data contamination on any single exam, and gives a more solid picture of mathematical reasoning ability.
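
As a rough illustration of what a composite score like this looks like, the sketch below pools a model's results across several competitions and reports overall accuracy weighted by problem count. This is not MathArena's actual methodology (their "expected performance" metric is their own); the competition names and numbers are made up for the example.

```python
# Illustrative composite score in the spirit of an aggregated leaderboard:
# average accuracy across several recent competitions, weighted by how many
# problems each one contributes. NOT MathArena's real formula, just the idea.

def composite_score(per_competition: dict[str, tuple[int, int]]) -> float:
    """per_competition maps competition name -> (problems_solved, problems_total)."""
    solved = sum(s for s, _ in per_competition.values())
    total = sum(t for _, t in per_competition.values())
    return solved / total

example = {
    "AIME 2025 I": (14, 15),
    "AIME 2025 II": (15, 15),
    "HMMT Feb 2025": (27, 30),
}
print(f"composite: {composite_score(example):.1%}")  # 93.3% on these made-up numbers
```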

FrontierMath

FrontierMath, developed by Epoch AI, is a benchmark of hundreds of original, exceptionally challenging problems crafted by expert mathematicians. Problems cover most major branches of modern mathematics - from computationally intensive number theory to abstract algebraic geometry. Solving a typical problem requires multiple hours of effort from a researcher in the relevant field. Top models now solve around 40% of Tier 1-3 problems, up from about 2% when the benchmark launched in late 2024.

[Image: mathematical equations on a blackboard] Competition math benchmarks test deep reasoning - not just pattern matching - making them among the most meaningful measures of AI intelligence.

Math Olympiad Rankings - March 2026

| Rank | Model | Provider | AIME 2025 (no tools) | MathArena Overall | IMO 2025 | FrontierMath (T1-3) |
|------|-------|----------|----------------------|-------------------|----------|---------------------|
| 1 | Gemini 3.1 Pro | Google DeepMind | 97.0% | 91.1% | Gold | - |
| 2 | GPT-5.2 Thinking | OpenAI | 100% (30/30) | 83.8% | Gold | 40.3% |
| 3 | Claude Opus 4.6 | Anthropic | ~100% | - | - | - |
| 4 | Gemini 3 Pro | Google DeepMind | 95.0% | 80.1% | Gold (Deep Think) | - |
| 5 | DeepSeek V3.2-Speciale | DeepSeek | 96.0-97.5% | 76.0% | Gold | - |
| 6 | Kimi K2.5 Thinking | Moonshot | 96.1% | ~80% | - | - |
| 7 | Step 3.5 Flash | StepFun | 97.3% | 78.1% | - | - |
| 8 | Grok 4 Heavy | xAI | 95.0% | - | - | ~14% |
| 9 | GLM-5 | Zhipu AI | 84.0% | 77.1% | - | - |
| 10 | Qwen 3.5-397B | Alibaba | 81.4-91.3% | 75.8% | - | - |
| 11 | GPT-5.2 (standard) | OpenAI | 95.0% | 83.8% | Silver | - |
| 12 | Gemini 3 Flash | Google DeepMind | - | 79.8% | - | - |
| 13 | Llama 4 Maverick | Meta | ~70% | - | No medal | - |
Rankings weigh the MathArena composite, AIME 2025, and IMO results together; no single benchmark determines the order. Dashes indicate scores not publicly reported or independently verified. Ranges indicate conflicting reports across evaluation sources.

Important caveats: AIME 2025 scores vary across evaluation sources depending on methodology (pass@1 vs. consensus, with or without tools, temperature settings). MathArena scores are independently verified; self-reported scores from model providers should be treated with appropriate skepticism. Grok 4 Heavy lacks API access, so independent benchmarking is limited.
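
The pass@1 vs. consensus distinction in the caveats above is worth making concrete, since it alone can swing an AIME score by several points. The sketch below scores a single problem from several sampled answers both ways; the sample values are hypothetical.

```python
# Two scoring modes for one problem sampled multiple times. pass@1 averages
# per-sample correctness; consensus (majority vote) picks the most common
# answer first, then checks it once. Hypothetical data, purely illustrative.
from collections import Counter

samples = [204, 204, 197, 204, 81]   # five sampled answers to one problem
correct = 204

pass_at_1 = sum(a == correct for a in samples) / len(samples)   # 0.60
consensus_answer, _ = Counter(samples).most_common(1)[0]        # 204
consensus = float(consensus_answer == correct)                  # 1.0

print(f"pass@1 = {pass_at_1:.2f}, consensus = {consensus:.0f}")
```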

Analysis

Google Takes the MathArena Crown

The biggest story since January is Gemini 3.1 Pro seizing the top spot on MathArena's composite leaderboard at 91.1%, a significant margin over GPT-5.2 Thinking at 83.8%. With Gemini 3 Pro at 80.1% and Gemini 3 Flash at 79.8% close behind, Google places multiple models near the top of the board.

What makes MathArena particularly credible is that it aggregates across multiple recent competitions, making single-exam contamination less of a factor. Gemini 3.1 Pro's dominance here suggests truly superior mathematical reasoning rather than benchmark-specific optimization.

Google's math prowess extends to formal proof construction too. Gemini Deep Think officially earned gold at IMO 2025 in July, solving five of six problems with rigorous proofs - all within the 4.5-hour time limit. This was a major step up from the previous year's silver-medal performance using AlphaProof and AlphaGeometry.

[Image: an AI model reasoning over complex mathematical problems] Mathematical reasoning has become a key battleground for frontier AI labs, with multiple models now solving problems that stump most human competitors.

AIME 2025 Is Saturated

When I first published this leaderboard in January, a perfect AIME score was the headline. Now it's the norm. GPT-5.2 Thinking, Claude Opus 4.6, Gemini 3 Pro, and Grok 4 Heavy all achieve or approach 100% on AIME 2025 under various configurations. The benchmark that once separated frontier models from the pack has become a checkbox.

This doesn't mean AIME is worthless - it still filters out weaker models effectively (Llama 4 Maverick sits around 70%, for instance). But for distinguishing between top-tier reasoning models, we need harder tests. MathArena's composite approach and FrontierMath's research-level problems fill that gap.

The Chinese Model Surge

One of the most notable shifts since January is the emergence of multiple Chinese models as serious math contenders. Kimi K2.5 Thinking from Moonshot reached 96.1% on AIME 2025 and roughly 80% on MathArena's composite, putting it within striking distance of the frontier labs. StepFun's Step 3.5 Flash scored an extraordinary 97.3% on AIME 2025 - among the highest no-tools scores on the entire board - with only 196 billion parameters. GLM-5 from Zhipu AI scored 84% on AIME and 77.1% on MathArena's composite.

Of the top 60 models on MathArena, 26 come from Chinese organizations - 43% of the leaderboard. This is a structural shift. The open-source ecosystem is delivering competition-level math reasoning at a fraction of frontier costs.

DeepSeek V3.2-Speciale: The Value King

DeepSeek V3.2-Speciale continues to punch far above its price bracket. Its 96-97.5% on AIME 2025 and gold-medal-level performance on IMO 2025 benchmark problems (35/42 in December 2025) make it the strongest open-weight math reasoning model available. At roughly one-thirtieth the token cost of GPT-5.2 Pro, the math-per-dollar ratio is staggering.

The caveat is that DeepSeek's IMO gold came months after the actual competition, on benchmark problems rather than live evaluation. Still, the raw capability is there.

FrontierMath: The New Frontier

With AIME approaching saturation, FrontierMath has emerged as the benchmark that actually separates models. GPT-5.2 leads at 40.3% on Tiers 1-3 - a massive jump from the ~2% that top models scored when the benchmark launched in late 2024. But 40% still means the majority of research-level math problems remain unsolved. Grok 4 sits at roughly 14%, showing that AIME performance doesn't automatically translate to research-grade mathematical reasoning.

FrontierMath Tier 4, which contains the hardest problems (think multi-day research efforts for human mathematicians), remains almost completely unsolved. Grok 4 managed just 1 out of 48 Tier 4 problems. This benchmark has years of headroom before saturation becomes a concern.

[Image: a competition podium representing the model rankings] The podium keeps reshuffling as new models and updates arrive at an accelerating pace.

The Historical Trajectory

The pace of improvement on math benchmarks continues to accelerate:

| Year | Best AIME Score | Best MATH-500 | IMO Performance | FrontierMath (T1-3) |
|------|-----------------|---------------|-----------------|---------------------|
| 2023 | ~30% | ~75% | Below threshold | N/A |
| 2024 | ~70% | ~90% | Bronze level | ~2% |
| 2025 | ~95% | ~96% | Silver → Gold | ~25% |
| 2026 | 100% | ~97%+ | Gold (official) | ~40% |

AIME went from challenging to saturated in roughly 18 months. FrontierMath is following a steeper trajectory than anyone predicted - Epoch AI now estimates it could saturate within two years.

Practical Guidance

For maximum mathematical accuracy: Gemini 3.1 Pro leads on composite benchmarks and is priced competitively at $2.00/$12.00 per million input/output tokens. If you need reliable math reasoning across a range of difficulty levels, it's the current best pick.

For research-grade problems: GPT-5.2 Thinking with its 40.3% on FrontierMath is the strongest option for genuinely hard mathematical tasks. It's more expensive ($1.75/$14.00 per million tokens for the base model, markedly more for Pro/extended thinking), but the reasoning depth justifies the cost for demanding applications.

For budget-conscious use: DeepSeek V3.2-Speciale offers 96%+ AIME performance and IMO gold-level capability at a fraction of closed-source pricing. As an open-weight model, you can also self-host it to remove per-token costs completely.

For coding-adjacent math: Claude Opus 4.6 combines near-perfect AIME scores with industry-leading coding benchmarks (80.8% on SWE-bench). If your math problems involve implementation - algorithm design, numerical methods, scientific computing - the combination of mathematical and coding reasoning makes it a strong all-rounder, despite being the most expensive option at $5.00/$25.00 per million tokens.

For open-source deployment: Qwen 3.5-397B (up to 91.3% on AIME 2025, depending on the evaluation source) and the surprisingly strong Step 3.5 Flash (97.3% on AIME 2025 at just 196B parameters) give you competition-level math locally. If you're running open-source models on your own hardware, these are worth evaluating.
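
To put the pricing figures above in perspective, here is a back-of-the-envelope cost comparison per evaluated problem, using the per-million-token prices quoted in this section. The token counts are assumptions for illustration only; reasoning models vary enormously in how many output tokens they spend per problem.

```python
# Rough cost per evaluated problem, using the per-million-token prices quoted
# in this article. Token counts per problem are assumed for illustration.

PRICES = {  # (input $/M tokens, output $/M tokens)
    "Gemini 3.1 Pro":  (2.00, 12.00),
    "GPT-5.2 (base)":  (1.75, 14.00),
    "Claude Opus 4.6": (5.00, 25.00),
}

INPUT_TOKENS = 1_000     # assumed problem statement + prompt
OUTPUT_TOKENS = 20_000   # assumed reasoning trace + final answer

for model, (p_in, p_out) in PRICES.items():
    cost = INPUT_TOKENS / 1e6 * p_in + OUTPUT_TOKENS / 1e6 * p_out
    print(f"{model:16s} ~${cost:.3f} per problem")
```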

What Comes Next

AIME 2025 and MATH-500 are effectively solved benchmarks for frontier models. The action has moved to FrontierMath, where the best models solve 40% of problems and the hardest tier remains nearly untouched. The AIME 2026 competition problems are already available for evaluation, but no major model has published scores yet - expect those results to land in the coming weeks.

The next true milestone will be an AI system contributing a novel mathematical proof to an open problem. Gemini Deep Think's gold at IMO 2025 shows the proof-construction machinery is getting sophisticated, but olympiad problems are, by definition, problems with known solutions. Solving something nobody has solved before is an entirely different challenge.

For now, if you're choosing a model for mathematical reasoning, the data points to a clear hierarchy: Gemini 3.1 Pro for broad competition math, GPT-5.2 for the hardest problems, and DeepSeek V3.2-Speciale for the best value.

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.