Math Olympiad AI Leaderboard - March 2026 Rankings

Rankings of AI models on competition mathematics benchmarks including AIME 2025, IMO, MathArena, and FrontierMath, measuring the cutting edge of mathematical reasoning.

Competition mathematics remains one of the purest tests of reasoning ability. No shortcuts, no memorization tricks, no ambiguity in the answers. A model either solves the problem correctly or it doesn't. Since I last updated this leaderboard in January, the landscape has shifted dramatically. Multiple models now score a perfect 100% on AIME 2025, Gemini Deep Think earned an official gold medal at IMO 2025, and FrontierMath scores have jumped from single digits to around 40%. The benchmarks we track are being saturated faster than anyone expected.

This leaderboard ranks models on four major competition math benchmarks: AIME 2025, IMO-level problems, MathArena (a composite across multiple recent competitions), and FrontierMath.

TL;DR

  • Gemini 3.1 Pro leads MathArena's composite ranking at 91.1%, dominating across multiple competition formats
  • AIME 2025 is effectively saturated - GPT-5.2 Thinking, Claude Opus 4.6, Grok 4 Heavy, and Gemini 3 Pro all hit 100% with tools
  • Best value pick: DeepSeek V3.2-Speciale scores 96-97% on AIME at a fraction of the cost of closed-source frontier models

The Benchmarks Explained

AIME (American Invitational Mathematics Examination)

AIME is the second round of the American mathematics competition pipeline, taken by roughly the top 5% of AMC 12 scorers. Each exam has 15 problems with integer answers from 000 to 999. Problems cover algebra, combinatorics, geometry, and number theory, and require creative insight rather than rote computation. A high AIME score (roughly 10+), combined with a strong AMC score, typically qualifies a student for the USA Mathematical Olympiad (USAMO), placing them among the top few hundred math students in the country.
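
To make the scoring format concrete, here is a minimal sketch of how an AIME-style answer sheet is graded: exact integer matching, one point per problem, no partial credit. The function names and the fixed 10-point threshold are illustrative only, not part of any official grading tool.

```python
# Minimal sketch of AIME-style grading: each of the 15 problems has a single
# integer answer in [0, 999], scored 1 point for an exact match, 0 otherwise.
# Names and the qualification threshold here are illustrative assumptions.

def grade_aime(submitted: list[int], answer_key: list[int]) -> int:
    """Return the number of exactly matching integer answers (0-15)."""
    assert len(submitted) == len(answer_key) == 15
    return sum(
        1
        for given, correct in zip(submitted, answer_key)
        if 0 <= given <= 999 and given == correct
    )

def in_usamo_range(score: int, threshold: int = 10) -> bool:
    """A score of roughly 10+ has historically been in USAMO-qualifying range."""
    return score >= threshold
```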

We assess models on AIME 2025 (combining AIME I and AIME II, 30 problems total). Most frontier models now score above 90%, pushing this benchmark toward saturation.

IMO (International Mathematical Olympiad)

IMO is the most prestigious mathematics competition in the world. Six problems are given over two days, each requiring a complete proof rather than just a numerical answer. Problems demand deep mathematical insight and the ability to construct rigorous arguments. An IMO gold medal represents roughly the top 8% of participants, who are themselves the best young mathematicians from 100+ countries.

In July 2025, Google DeepMind's Gemini Deep Think became the first end-to-end language model to officially achieve gold-medal standard at IMO, solving five of six problems and scoring 35 out of 42 points within the competition time limit.

MathArena

MathArena is an independent evaluation platform run by ETH Zurich researchers that tests models on the latest math competition problems. Unlike single-exam benchmarks, MathArena aggregates performance across multiple recent competitions - including AIME, HMMT, AMC, and others - producing a composite "expected performance" score. This makes it harder for models to game through data contamination on any single exam, and gives a more solid picture of mathematical reasoning ability.
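
As a rough illustration of what a composite score like this looks like, the sketch below pools a model's results across several competitions and reports overall accuracy weighted by problem count. This is not MathArena's actual methodology (their "expected performance" metric is their own); the competition names and numbers are made up for the example.

```python
# Illustrative composite score in the spirit of an aggregated leaderboard:
# average accuracy across several recent competitions, weighted by how many
# problems each one contributes. NOT MathArena's real formula, just the idea.

def composite_score(per_competition: dict[str, tuple[int, int]]) -> float:
    """per_competition maps competition name -> (problems_solved, problems_total)."""
    solved = sum(s for s, _ in per_competition.values())
    total = sum(t for _, t in per_competition.values())
    return solved / total

example = {
    "AIME 2025 I": (14, 15),
    "AIME 2025 II": (15, 15),
    "HMMT Feb 2025": (27, 30),
}
print(f"composite: {composite_score(example):.1%}")  # 93.3% on these made-up numbers
```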

FrontierMath

FrontierMath, developed by Epoch AI, is a benchmark of hundreds of original, exceptionally challenging problems crafted by expert mathematicians. Problems cover most major branches of modern mathematics - from computationally intensive number theory to abstract algebraic geometry. Solving a typical problem requires multiple hours of effort from a researcher in the relevant field. Top models now solve around 40% of Tier 1-3 problems, up from about 2% when the benchmark launched in late 2024.

[Image: mathematical equations on a blackboard] Competition math benchmarks test deep reasoning - not just pattern matching - making them among the most meaningful measures of AI intelligence.

Math Olympiad Rankings - March 2026

| Rank | Model | Provider | AIME 2025 (no tools) | MathArena Overall | IMO 2025 | FrontierMath (T1-3) |
|------|-------|----------|----------------------|-------------------|----------|---------------------|
| 1 | Gemini 3.1 Pro | Google DeepMind | 97.0% | 91.1% | Gold | - |
| 2 | GPT-5.2 Thinking | OpenAI | 100% (30/30) | 83.8% | Gold | 40.3% |
| 3 | Claude Opus 4.6 | Anthropic | ~100% | - | - | - |
| 4 | Gemini 3 Pro | Google DeepMind | 95.0% | 80.1% | Gold (Deep Think) | - |
| 5 | DeepSeek V3.2-Speciale | DeepSeek | 96.0-97.5% | 76.0% | Gold | - |
| 6 | Kimi K2.5 Thinking | Moonshot | 96.1% | ~80% | - | - |
| 7 | Step 3.5 Flash | StepFun | 97.3% | 78.1% | - | - |
| 8 | Grok 4 Heavy | xAI | 95.0% | - | - | ~14% |
| 9 | GLM-5 | Zhipu AI | 84.0% | 77.1% | - | - |
| 10 | Qwen 3.5-397B | Alibaba | 81.4-91.3% | 75.8% | - | - |
| 11 | GPT-5.2 (standard) | OpenAI | 95.0% | 83.8% | Silver | - |
| 12 | Gemini 3 Flash | Google DeepMind | - | 79.8% | - | - |
| 13 | Llama 4 Maverick | Meta | ~70% | - | No medal | - |
Rankings weigh the MathArena composite, AIME 2025, and IMO results together; no single benchmark determines the order. Dashes indicate scores not publicly reported or independently verified. Ranges indicate conflicting reports across evaluation sources.

Important caveats: AIME 2025 scores vary across evaluation sources depending on methodology (pass@1 vs. consensus, with or without tools, temperature settings). MathArena scores are independently verified; self-reported scores from model providers should be treated with appropriate skepticism. Grok 4 Heavy lacks API access, so independent benchmarking is limited.
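
The pass@1 vs. consensus distinction in the caveats above is worth making concrete, since it alone can swing an AIME score by several points. The sketch below scores a single problem from several sampled answers both ways; the sample values are hypothetical.

```python
# Two scoring modes for one problem sampled multiple times. pass@1 averages
# per-sample correctness; consensus (majority vote) picks the most common
# answer first, then checks it once. Hypothetical data, purely illustrative.
from collections import Counter

samples = [204, 204, 197, 204, 81]   # five sampled answers to one problem
correct = 204

pass_at_1 = sum(a == correct for a in samples) / len(samples)   # 0.60
consensus_answer, _ = Counter(samples).most_common(1)[0]        # 204
consensus = float(consensus_answer == correct)                  # 1.0

print(f"pass@1 = {pass_at_1:.2f}, consensus = {consensus:.0f}")
```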

Analysis

Google Takes the MathArena Crown

The biggest story since January is Gemini 3.1 Pro seizing the top spot on MathArena's composite leaderboard at 91.1%, a significant margin over GPT-5.2 Thinking at 83.8%. With Gemini 3 Pro at 80.1% and Gemini 3 Flash at 79.8% close behind, Google places multiple models near the top of the board.

What makes MathArena particularly credible is that it aggregates across multiple recent competitions, making single-exam contamination less of a factor. Gemini 3.1 Pro's dominance here suggests truly superior mathematical reasoning rather than benchmark-specific optimization.

Google's math prowess extends to formal proof construction too. Gemini Deep Think officially earned gold at IMO 2025 in July, solving five of six problems with rigorous proofs - all within the 4.5-hour time limit. This was a major step up from the previous year's silver-medal performance using AlphaProof and AlphaGeometry.

[Image: an AI model reasoning over complex mathematical problems] Mathematical reasoning has become a key battleground for frontier AI labs, with multiple models now solving problems that stump most human competitors.

AIME 2025 Is Saturated

When I first published this leaderboard in January, a perfect AIME score was the headline. Now it's the norm. GPT-5.2 Thinking, Claude Opus 4.6, Gemini 3 Pro, and Grok 4 Heavy all achieve or approach 100% on AIME 2025 under various configurations. The benchmark that once separated frontier models from the pack has become a checkbox.

This doesn't mean AIME is worthless - it still filters out weaker models effectively (Llama 4 Maverick sits around 70%, for instance). But for distinguishing between top-tier reasoning models, we need harder tests. MathArena's composite approach and FrontierMath's research-level problems fill that gap.

The Chinese Model Surge

One of the most notable shifts since January is the emergence of multiple Chinese models as serious math contenders. Kimi K2.5 Thinking from Moonshot reached 96.1% on AIME 2025 and roughly 80% on MathArena's composite, putting it within striking distance of the frontier labs. StepFun's Step 3.5 Flash scored an extraordinary 97.3% on AIME 2025 - among the highest no-tools scores on the entire board - with only 196 billion parameters. GLM-5 from Zhipu AI scored 84% on AIME and 77.1% on MathArena's composite.

Of the top 60 models on MathArena, 26 come from Chinese organizations - 43% of the leaderboard. This is a structural shift. The open-source ecosystem is delivering competition-level math reasoning at a fraction of frontier costs.

DeepSeek V3.2-Speciale: The Value King

DeepSeek V3.2-Speciale continues to punch far above its price bracket. Its 96-97.5% on AIME 2025 and gold-medal-level performance on IMO 2025 benchmark problems (35/42 in December 2025) make it the strongest open-weight math reasoning model available. At roughly one-thirtieth the token cost of GPT-5.2 Pro, the math-per-dollar ratio is staggering.

The caveat is that DeepSeek's IMO gold came months after the actual competition, on benchmark problems rather than live evaluation. Still, the raw capability is there.

FrontierMath: The New Frontier

With AIME approaching saturation, FrontierMath has emerged as the benchmark that actually separates models. GPT-5.2 leads at 40.3% on Tiers 1-3 - a massive jump from the ~2% that top models scored when the benchmark launched in late 2024. But 40% still means the majority of research-level math problems remain unsolved. Grok 4 sits at roughly 14%, showing that AIME performance doesn't automatically translate to research-grade mathematical reasoning.

FrontierMath Tier 4, which contains the hardest problems (think multi-day research efforts for human mathematicians), remains almost completely unsolved. Grok 4 managed just 1 out of 48 Tier 4 problems. This benchmark has years of headroom before saturation becomes a concern.

[Image: a competition podium representing the model rankings] The podium keeps reshuffling as new models and updates arrive at an accelerating pace.

The Historical Trajectory

The pace of improvement on math benchmarks continues to accelerate:

| Year | Best AIME Score | Best MATH-500 | IMO Performance | FrontierMath (T1-3) |
|------|-----------------|---------------|-----------------|---------------------|
| 2023 | ~30% | ~75% | Below threshold | N/A |
| 2024 | ~70% | ~90% | Bronze level | ~2% |
| 2025 | ~95% | ~96% | Silver → Gold | ~25% |
| 2026 | 100% | ~97%+ | Gold (official) | ~40% |

AIME went from challenging to saturated in roughly 18 months. FrontierMath is following a steeper trajectory than anyone predicted - Epoch AI now estimates it could saturate within two years.

Practical Guidance

For maximum mathematical accuracy: Gemini 3.1 Pro leads on composite benchmarks and is priced competitively at $2.00/$12.00 per million input/output tokens. If you need reliable math reasoning across a range of difficulty levels, it's the current best pick.

For research-grade problems: GPT-5.2 Thinking with its 40.3% on FrontierMath is the strongest option for genuinely hard mathematical tasks. It's more expensive ($1.75/$14.00 per million tokens for the base model, markedly more for Pro/extended thinking), but the reasoning depth justifies the cost for demanding applications.

For budget-conscious use: DeepSeek V3.2-Speciale offers 96%+ AIME performance and IMO gold-level capability at a fraction of closed-source pricing. As an open-weight model, you can also self-host it to remove per-token costs completely.

For coding-adjacent math: Claude Opus 4.6 combines near-perfect AIME scores with industry-leading coding benchmarks (80.8% on SWE-bench). If your math problems involve implementation - algorithm design, numerical methods, scientific computing - the combination of mathematical and coding reasoning makes it a strong all-rounder, despite being the most expensive option at $5.00/$25.00 per million tokens.

For open-source deployment: Qwen 3.5-397B (up to 91.3% on AIME 2025, depending on the evaluation source) and the surprisingly strong Step 3.5 Flash (97.3% on AIME 2025 at just 196B parameters) give you competition-level math locally. If you're running open-source models on your own hardware, these are worth evaluating.
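
To put the pricing figures above in perspective, here is a back-of-the-envelope cost comparison per evaluated problem, using the per-million-token prices quoted in this section. The token counts are assumptions for illustration only; reasoning models vary enormously in how many output tokens they spend per problem.

```python
# Rough cost per evaluated problem, using the per-million-token prices quoted
# in this article. Token counts per problem are assumed for illustration.

PRICES = {  # (input $/M tokens, output $/M tokens)
    "Gemini 3.1 Pro":  (2.00, 12.00),
    "GPT-5.2 (base)":  (1.75, 14.00),
    "Claude Opus 4.6": (5.00, 25.00),
}

INPUT_TOKENS = 1_000     # assumed problem statement + prompt
OUTPUT_TOKENS = 20_000   # assumed reasoning trace + final answer

for model, (p_in, p_out) in PRICES.items():
    cost = INPUT_TOKENS / 1e6 * p_in + OUTPUT_TOKENS / 1e6 * p_out
    print(f"{model:16s} ~${cost:.3f} per problem")
```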

What Comes Next

AIME 2025 and MATH-500 are effectively solved benchmarks for frontier models. The action has moved to FrontierMath, where the best models solve 40% of problems and the hardest tier remains nearly untouched. The AIME 2026 competition problems are already available for evaluation, but no major model has published scores yet - expect those results to land in the coming weeks.

The next true milestone will be an AI system contributing a novel mathematical proof to an open problem. Gemini Deep Think's gold at IMO 2025 shows the proof-construction machinery is getting sophisticated, but olympiad problems are, by definition, problems with known solutions. Solving something nobody has solved before is an entirely different challenge.

For now, if you're choosing a model for mathematical reasoning, the data points to a clear hierarchy: Gemini 3.1 Pro for broad competition math, GPT-5.2 for the hardest problems, and DeepSeek V3.2-Speciale for the best value.

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.