SU-01

SU-01 is a 30B-A3B MoE reasoning model from Shanghai AI Lab that achieves gold-medal performance on IMO 2025, USAMO 2026, and IPhO 2024/2025 using a three-stage training recipe and test-time scaling.

SU-01 is a compact Mixture-of-Experts reasoning model from Shanghai AI Laboratory and collaborators at Tsinghua, Peking University, CUHK, and SJTU. The 30B-A3B architecture activates only 3B parameters per forward pass, yet matches or beats models with several times its active parameter count on competition mathematics and physics. With test-time scaling enabled, SU-01 scored 35 points on IMO 2025 (exactly the gold cutoff) and 35 points on USAMO 2026 (well above the 25-point gold cutoff), along with gold-level scores on IPhO 2024 and IPhO 2025.

TL;DR

  • Gold-medal performance on IMO 2025, USAMO 2026, and IPhO 2024/2025 using test-time scaling (TTS)
  • 30B total / 3B active parameters (MoE), Apache 2.0, fully open weights on HuggingFace
  • Leads similar-size models like Qwen3.6-35B-A3B and Nemotron-Cascade-2 on olympiad-level math benchmarks such as AMO-Bench and AIME

Overview

The paper's full title is "Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling," and that framing is deliberate. SU-01's training recipe doesn't introduce a radically new architecture - it starts from P1-30B-A3B, itself a fine-tune of a Qwen3-30B-A3B backbone, and applies a disciplined three-stage post-training procedure. The result is a model that sustains stable reasoning chains exceeding 100K tokens while scoring at the top of public olympiad leaderboards for its size class.

SU-01 is trained and open-sourced by the Simplified-Reasoning organization, a collaboration anchored at Shanghai AI Laboratory. The arXiv preprint (2605.13301) was submitted May 13, 2026, and the weights are already live on HuggingFace under Apache 2.0 - meaning you can run, fine-tune, and deploy commercially, subject only to the license's standard attribution and notice requirements.

What makes SU-01 interesting from an engineering standpoint isn't the benchmark numbers alone. It's that the team achieved gold-medal results on four separate olympiad competitions using a 3B-active-parameter model trained on 8 GPUs. The USAMO 2026 score of 35/42 matched the highest total among the 340 human competitors in that competition - a data point the authors are careful to present without overclaiming, since the testing conditions differ from official competition protocols.


Key Specifications

| Specification | Details |
|---|---|
| Provider | Shanghai AI Laboratory / Simplified-Reasoning |
| Model Family | SU (Simple and Unified) |
| Parameters | 30B total, ~3B active (Mixture of Experts) |
| Backbone | P1-30B-A3B (built on Qwen3-30B-A3B) |
| Context Window | 100K+ tokens for reasoning trajectories |
| Input Price | Not applicable (open weights) |
| Output Price | Not applicable (open weights) |
| Release Date | 2026-05-13 |
| License | Apache 2.0 |
| HuggingFace | Simplified-Reasoning/SU-01 |

Benchmark Performance

SU-01's headline numbers come from olympiad competitions graded against the official gold cutoffs. Direct generation means a single sampling pass; TTS (test-time scaling) uses a produce-verify-revise loop that iterates until a proof-checker approves or a compute budget is exhausted.

| Benchmark | SU-01 Direct | SU-01 + TTS | Gold Cutoff |
|---|---|---|---|
| IMO 2025 (pts/42) | 21 | 35 | 35 |
| USAMO 2026 (pts/42) | 15 | 35 | 25 |
| IPhO 2024 (pts) | 23.5 | 25.3 | 20.8 |
| IPhO 2025 (pts) | 20.3 | 21.7 | 19.7 |
| IMO-ProofBench | 57.6% | 70.2% | - |
| AIME 2025 | 94.6% | - | - |
| AIME 2026 | 93.3% | - | - |

Direct SU-01 already clears both IPhO gold thresholds without any test-time scaling. On IMO and USAMO, the direct scores clear bronze but fall short of gold; TTS closes that gap completely.

Comparison against similar-size models on answer-verifiable benchmarks:

| Model | AnswerBench | AMO-Bench | AIME 25/26 Avg |
|---|---|---|---|
| SU-01 | 77.5% | 59.8% | 93.9% |
| Qwen3.6-35B-A3B | 78.0% | 58.8% | 92.7% |
| Nemotron-Cascade-2-30B-A3B | 80.5% | 40.8% | 92.1% |
| GLM-4.7-Flash | 73.8% | 53.8% | 89.8% |
| Gemma-4-31B | 74.0% | 39.3% | 90.1% |

SU-01 leads on AMO-Bench (competition math) and on the AIME 25/26 average, and trails Qwen3.6-35B-A3B by half a point on AnswerBench (77.5% vs 78.0%). The proof-oriented IMO-ProofBench number (70.2% with TTS) brings it close to Gemini 3.1 Pro Thinking (72.6%), a much larger model. On FrontierScience-Research - OpenAI's cross-domain science benchmark - SU-01 scores 11.7% overall (Physics: 10%, Chemistry: 10%, Biology: 15%), showing the reasoning transfer is real but limited at research-paper level.

See the math olympiad AI leaderboard for broader rankings across competitions.

Key Capabilities

Three-Stage Post-Training

The training pipeline is the core of SU-01. Stage 1 applies SFT on 338K trajectories capped at 8K tokens, using a reverse-perplexity curriculum: examples are ordered so each epoch starts with the most perplexity-mismatched samples. This reshapes proof-search behavior without losing the starting model's capabilities.
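
The paper's exact curriculum definition isn't reproduced here, but the mechanics are easy to sketch. A minimal Python sketch, assuming "perplexity-mismatched" means highest perplexity under the starting checkpoint and that per-token log-probs have been precomputed (`perplexity` and `reverse_perplexity_order` are illustrative names, not the authors' code):

```python
import math

def perplexity(token_logprobs):
    # Per-example perplexity from the starting checkpoint's
    # token log-probabilities (natural log).
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def reverse_perplexity_order(examples, logprobs_per_example):
    # Order one epoch so the most perplexity-mismatched samples
    # (highest perplexity under the starting model) come first.
    scored = sorted(
        zip(examples, logprobs_per_example),
        key=lambda pair: perplexity(pair[1]),
        reverse=True,
    )
    return [ex for ex, _ in scored]

# Usage: epoch = reverse_perplexity_order(sft_examples, base_model_logprobs)
```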

Stage 2 is coarse RL using GSPO (Group Sequence Policy Optimization) on 8,967 verifiable prompts with outcome rewards - correct answer or not. GSPO was developed by the Qwen team to address the expert-activation volatility that makes standard GRPO unstable on MoE models; roughly 10% of experts activate differently after each gradient update in large MoE training, and GSPO's sequence-level clipping sidesteps that instability.
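
GSPO's published objective replaces per-token importance ratios with a single length-normalized ratio per sequence, and clips at that level. A minimal PyTorch sketch of that idea - a reading of the GSPO paper's objective, not SU-01's actual training code, with `gspo_loss` and its tensor layout as assumptions:

```python
import torch

def gspo_loss(logp_new, logp_old, advantages, mask, eps=0.2):
    # logp_new/logp_old: [group, seq_len] per-token log-probs under the
    # current and behavior policies; mask: [group, seq_len], 1 on response
    # tokens; advantages: [group] group-normalized outcome rewards.
    lengths = mask.sum(dim=-1).clamp(min=1)
    # Length-normalized sequence ratio:
    # s_i = exp(mean_t(logp_new - logp_old)) = (pi_new / pi_old)^(1/|y_i|)
    delta = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    s = delta.exp()
    # One clipped term per *sequence*, not per token, so a handful of
    # re-routed experts can't blow up individual token ratios.
    unclipped = s * advantages
    clipped = s.clamp(1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```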

Stage 3 is proof-level RL: verifiable rewards are replaced by a proof-quality signal from DeepSeekMath-V2, combined with self-refinement and experience replay. The entire RL training runs for about 200 steps total across both stages. The result is a model that generates long, internally consistent proofs rather than just correct final answers.
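
The paper doesn't publish the Stage 3 plumbing, but the ingredients it names - a grader-model reward, self-refinement, experience replay - suggest a loop along these lines. A hypothetical sketch where `grader.score`, the 0.8 threshold, and the buffer size are all invented for illustration:

```python
import random
from collections import deque

# Replay buffer for high-scoring proofs so later updates can revisit them.
replay = deque(maxlen=512)

def stage3_reward(grader, problem, proof):
    # Swap the binary outcome reward for a scalar proof-quality score
    # from a grader model (DeepSeekMath-V2 in the paper). The grading
    # interface is a placeholder; only the signal's source is published.
    score = grader.score(problem, proof)   # e.g. in [0, 1]
    if score > 0.8:
        replay.append((problem, proof, score))
    return score

def sample_batch(fresh_rollouts, replay_fraction=0.25):
    # Mix fresh rollouts with replayed high-quality proofs.
    k = int(len(fresh_rollouts) * replay_fraction)
    replayed = random.sample(list(replay), min(k, len(replay)))
    return fresh_rollouts + replayed
```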


Test-Time Scaling

SU-01's TTS implementation follows a produce-verify-revise pattern: the model creates a candidate solution, a verifier scores it, and the model revises on failure. This loop is where the IMO and USAMO scores jump from bronze to gold. TTS is compute-intensive - the authors don't disclose exact costs - but the fact that gold-level math reasoning is achievable on a 3B-active-parameter model at inference is remarkable for anyone thinking about deployment economics.
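
The paper describes the pattern but not an interface, so the following is a hypothetical sketch of the loop; `model.generate`, `verifier.check`, and `max_iters` are stand-ins, not SU-01's actual API:

```python
def solve_with_tts(model, verifier, problem, max_iters=8):
    # Produce: draft a candidate solution.
    solution = model.generate(problem)
    for _ in range(max_iters):
        # Verify: a proof-checker scores the candidate.
        ok, feedback = verifier.check(problem, solution)
        if ok:
            return solution
        # Revise: regenerate conditioned on the verifier's feedback.
        solution = model.generate(problem, prior=solution, critique=feedback)
    return solution  # best effort once the compute budget is exhausted
```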

Generalization Beyond Math

The model was trained on mathematics and physics data but transfers to chemistry and biology on FrontierScience-Research, scoring 10% on chemistry and 15% on biology despite no explicit training signal there. These are low absolute numbers, but they suggest the proof-search behavior learned from olympiad data has domain-general structure. The scientific reasoning leaderboard has more context on where frontier models currently sit.

Pricing and Availability

SU-01 is fully open-weight under Apache 2.0. The weights are on HuggingFace at Simplified-Reasoning/SU-01. Training code, SFT scripts, and RL scripts are on GitHub at Simplified-Reasoning/SU-01, including Docker images for reproducing the training environment.

There's no hosted API and no pricing to track. Running SU-01 locally requires hardware capable of serving a 30B BF16 model - roughly 60GB VRAM. For TTS workloads that produce 100K-token traces, expect multi-GPU setups. The model uses the qwen3_moe architecture and loads with standard transformers/vLLM tooling.
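
A minimal vLLM sketch for local inference. The repo id comes from the HuggingFace listing above; `tensor_parallel_size`, the context length, and the sampling settings are assumptions sized to the two-GPU BF16 scenario, not published defaults:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Simplified-Reasoning/SU-01",
    tensor_parallel_size=2,   # split the ~60GB of BF16 weights across 2 GPUs
    max_model_len=131072,     # room for 100K-token reasoning traces
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=32768)
out = llm.generate(["Prove that for all positive integers n, ..."], params)
print(out[0].outputs[0].text)
```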

Strengths and Weaknesses

Strengths

  • Gold-medal verified performance on four olympiad competitions, not just one
  • USAMO 2026 score matched the top human total among 340 competitors
  • Apache 2.0 license - commercial use and fine-tuning unrestricted
  • Full training recipe is open: SFT scripts, coarse RL, proof-level RL
  • GSPO resolves MoE training instability without complex workarounds
  • Direct (no TTS) IPhO scores already exceed gold cutoff

Weaknesses

  • 100K+ token reasoning traces are expensive to produce; TTS compounds this
  • FrontierScience-Research scores (11.7%) are low for a research-grade model
  • No hosted API; self-hosting requires 60GB+ VRAM
  • IMO/USAMO testing conditions differ from the official competition (open problems, different time limits)
  • No multimodal capability - text-only, which limits physics problems with diagrams
  • Proof-level RL signal depends on DeepSeekMath-V2, a dependency that matters for reproducibility
