SU-01
SU-01 is a 30B-A3B MoE reasoning model from Shanghai AI Lab that achieves gold-medal performance on IMO 2025, USAMO 2026, and IPhO 2024/2025 using a three-stage training recipe and test-time scaling.

SU-01 is a compact Mixture-of-Experts reasoning model from Shanghai AI Laboratory and collaborators at Tsinghua, Peking University, CUHK, and SJTU. The 30B-A3B architecture activates only 3B parameters per forward pass, yet matches or beats models several times its active-parameter count on competition mathematics and physics. With test-time scaling enabled, SU-01 scored 35 points on IMO 2025 (exactly the gold cutoff) and 35 points on USAMO 2026 (well above the 25-point gold cutoff), along with gold-level scores on IPhO 2024 and IPhO 2025.
TL;DR
- Gold-medal performance on IMO 2025, USAMO 2026, and IPhO 2024/2025 using test-time scaling (TTS)
- 30B total / 3B active parameters (MoE), Apache 2.0, fully open weights on HuggingFace
- Leads similar-size models such as Qwen3.6-35B-A3B and Nemotron-Cascade-2 on olympiad-level math benchmarks (AMO-Bench, AIME)
Overview
The paper's full title is "Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling," and that framing is deliberate. SU-01's training recipe doesn't introduce a radically new architecture - it starts from P1-30B-A3B, itself a fine-tune of a Qwen3-30B-A3B backbone, and applies a disciplined three-stage post-training procedure. The result is a model that sustains stable reasoning chains exceeding 100K tokens while scoring at the top of public olympiad leaderboards for its size class.
SU-01 is trained and open-sourced by the Simplified-Reasoning organization, a collaboration anchored at Shanghai AI Laboratory. The arXiv preprint (2605.13301) was submitted May 13, 2026, and the weights are already live on HuggingFace under Apache 2.0 - meaning you can run, fine-tune, and deploy commercially without restriction.
What makes SU-01 interesting from an engineering standpoint isn't the benchmark numbers alone. It's that the team achieved gold-medal results on four separate olympiad competitions using a 3B-active-parameter model trained on 8 GPUs. The USAMO 2026 score of 35/42 matched the highest total among the 340 human competitors in that competition - a data point the authors are careful to present without overclaiming, since the testing conditions differ from official competition protocols.
SU-01 generates full proof-search traces of 100K+ tokens for complex mathematical derivations of the kind found in IMO and USAMO problems.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Shanghai AI Laboratory / Simplified-Reasoning |
| Model Family | SU (Simple and Unified) |
| Parameters | 30B total, ~3B active (Mixture of Experts) |
| Backbone | P1-30B-A3B (built on Qwen3-30B-A3B) |
| Context Window | 100K+ tokens for reasoning trajectories |
| Input Price | Not applicable (open weights) |
| Output Price | Not applicable (open weights) |
| Release Date | 2026-05-13 |
| License | Apache 2.0 |
| HuggingFace | Simplified-Reasoning/SU-01 |
Benchmark Performance
SU-01's headline numbers come from olympiad competitions graded against the official gold cutoffs. Direct generation means a single sampling pass; TTS (test-time scaling) uses a produce-verify-revise loop that iterates until a proof-checker approves or a compute budget is exhausted.
| Benchmark | SU-01 Direct | SU-01 + TTS | Gold Cutoff |
|---|---|---|---|
| IMO 2025 (pts/42) | 21 | 35 | 35 |
| USAMO 2026 (pts/42) | 15 | 35 | 25 |
| IPhO 2024 (pts) | 23.5 | 25.3 | 20.8 |
| IPhO 2025 (pts) | 20.3 | 21.7 | 19.7 |
| IMO-ProofBench | 57.6% | 70.2% | - |
| AIME 2025 | 94.6% | - | - |
| AIME 2026 | 93.3% | - | - |
SU-01's direct scores already clear both IPhO gold thresholds without any test-time scaling. On IMO and USAMO, the direct scores clear bronze but fall short of gold; TTS closes that gap completely.
Comparison against similar-size models on answer-verifiable benchmarks:
| Model | AnswerBench | AMO-Bench | AIME 25/26 Avg |
|---|---|---|---|
| SU-01 | 77.5% | 59.8% | 93.9% |
| Qwen3.6-35B-A3B | 78.0% | 58.8% | 92.7% |
| Nemotron-Cascade-2-30B-A3B | 80.5% | 40.8% | 92.1% |
| GLM-4.7-Flash | 73.8% | 53.8% | 89.8% |
| Gemma-4-31B | 74.0% | 39.3% | 90.1% |
SU-01 leads this group on AMO-Bench (competition math) and on the AIME 25/26 average, and trails Qwen3.6-35B-A3B by half a point on AnswerBench (77.5% vs 78.0%). The proof-oriented IMO-ProofBench number (70.2% with TTS) brings it close to Gemini 3.1 Pro Thinking (72.6%), a much larger model. On FrontierScience-Research - OpenAI's cross-domain science benchmark - SU-01 scores 11.7% overall (Physics: 10%, Chemistry: 10%, Biology: 15%), showing the reasoning transfer is real but limited at research-paper level.
See the math olympiad AI leaderboard for broader rankings across competitions.
Key Capabilities
Three-Stage Post-Training
The training pipeline is the core of SU-01. Stage 1 applies SFT on 338K trajectories capped at 8K tokens, using a reverse-perplexity curriculum: each epoch is ordered so the most perplexity-mismatched samples come first. This reshapes proof-search behavior without degrading the starting model's capabilities.
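As a rough illustration of that ordering - not the actual training code, and the field names here are assumed - the curriculum amounts to a descending sort on the base model's perplexity over each target trajectory:

```python
# Sketch of a reverse-perplexity curriculum. Assumes each SFT example
# carries the starting model's perplexity on its target trajectory;
# all names are illustrative, not from the SU-01 codebase.

def curriculum_order(examples):
    """Order one epoch so the most perplexity-mismatched samples come first."""
    return sorted(examples, key=lambda ex: ex["perplexity"], reverse=True)

dataset = [
    {"id": "a", "perplexity": 3.2},
    {"id": "b", "perplexity": 12.8},  # most mismatched -> seen first
    {"id": "c", "perplexity": 5.1},
]

epoch = curriculum_order(dataset)
print([ex["id"] for ex in epoch])  # -> ['b', 'c', 'a']
```

The intuition is that the samples the base model finds most surprising carry the most signal about the new proof-search behavior, so they lead each epoch.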
Stage 2 is coarse RL using GSPO (Group Sequence Policy Optimization) on 8,967 verifiable prompts with outcome rewards - correct answer or not. GSPO was developed by the Qwen team to address the expert-activation volatility that makes standard GRPO unstable on MoE models; roughly 10% of experts activate differently after each gradient update in large MoE training, and GSPO's sequence-level clipping sidesteps that instability.
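Per the GSPO paper, the key difference from GRPO is where the importance ratio lives: GSPO uses a length-normalized sequence-level likelihood ratio and applies the PPO-style clip once per response rather than per token. A minimal sketch (function names are illustrative):

```python
import math

def gspo_ratio(new_logprobs, old_logprobs):
    """Length-normalized sequence importance ratio used by GSPO:
    s = exp(mean(new token log-probs) - mean(old token log-probs))."""
    n = len(new_logprobs)
    return math.exp(sum(new_logprobs) / n - sum(old_logprobs) / n)

def gspo_clipped_weight(ratio, advantage, eps=0.2):
    """PPO-style clipped objective, applied at the sequence level:
    the whole response is kept or clipped as one unit, instead of
    clipping token by token as GRPO does."""
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped * advantage)

# Toy per-token log-probs under the current vs. old policy:
new_lp = [-0.5, -1.0, -0.7]
old_lp = [-0.9, -1.4, -1.1]
r = gspo_ratio(new_lp, old_lp)  # exp(0.4), about 1.49
```

Because the ratio is one number per sequence, a few experts routing differently on individual tokens no longer produces wildly different per-token ratios - which is the instability GSPO is designed to avoid.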
Stage 3 is proof-level RL: verifiable rewards are replaced by a proof-quality signal from DeepSeekMath-V2, combined with self-refinement and experience replay. The entire RL training runs for about 200 steps total across both stages. The result is a model that generates long, internally consistent proofs rather than just correct final answers.
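The paper doesn't spell out the replay mechanics, but the experience-replay idea can be illustrated as a buffer that retains high-reward proofs from earlier RL steps for reuse in later batches. Everything below (capacity, threshold, interface) is an assumption for illustration:

```python
import random

class ReplayBuffer:
    """Illustrative experience-replay buffer: keep only high-reward proofs
    and mix them back into later RL batches. Sizes and the reward threshold
    are assumptions, not values from the SU-01 paper."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.items = []  # (prompt, proof, reward) tuples

    def add(self, prompt, proof, reward, threshold=0.8):
        if reward >= threshold:               # keep only high-quality proofs
            self.items.append((prompt, proof, reward))
            self.items = self.items[-self.capacity:]  # evict oldest

    def sample(self, k):
        """Draw up to k stored proofs to blend into the next batch."""
        return random.sample(self.items, min(k, len(self.items)))

buf = ReplayBuffer(capacity=2)
buf.add("p1", "proof1", 0.9)
buf.add("p2", "proof2", 0.5)   # below threshold, dropped
buf.add("p3", "proof3", 0.95)
```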
SU-01 was trained to emulate extended proof-search and self-checking behavior, creating traces topping 100K tokens for hard olympiad problems.
Test-Time Scaling
SU-01's TTS implementation follows a produce-verify-revise pattern: the model creates a candidate solution, a verifier scores it, and the model revises on failure. This loop is where the IMO and USAMO scores jump from bronze to gold. TTS is compute-intensive - the authors don't disclose exact costs - but the fact that gold-level math reasoning is achievable on a 3B-active-parameter model at inference is remarkable for anyone thinking about deployment economics.
Generalization Beyond Math
The model was trained on mathematics and physics data but transfers to chemistry and biology on FrontierScience-Research, scoring 10% on chemistry and 15% on biology despite no explicit training signal there. These are low absolute numbers, but they suggest the proof-search behavior learned from olympiad data has domain-general structure. The scientific reasoning leaderboard has more context on where frontier models currently sit.
Pricing and Availability
SU-01 is fully open-weight under Apache 2.0. The weights are on HuggingFace at Simplified-Reasoning/SU-01. The training code is on GitHub at Simplified-Reasoning/SU-01, including SFT scripts, RL scripts, and Docker images for reproducing the training environment.
There's no hosted API and no pricing to track. Running SU-01 locally requires hardware capable of serving a 30B BF16 model - roughly 60GB VRAM. For TTS workloads that produce 100K-token traces, expect multi-GPU setups. The model uses the qwen3_moe architecture and loads with standard transformers/vLLM tooling.
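The 60GB figure is just the weight footprint in BF16; KV cache for 100K-token traces and activation memory come on top. The back-of-the-envelope arithmetic:

```python
# Weight-only VRAM estimate for serving SU-01 in BF16.
# KV cache and activations (substantial at 100K-token contexts) are extra.

params = 30e9            # 30B total parameters (all experts are resident,
                         # even though only ~3B are active per token)
bytes_per_param = 2      # bfloat16
weights_gb = params * bytes_per_param / 1e9
print(f"{weights_gb:.0f} GB")  # -> 60 GB
```

Note that MoE sparsity reduces compute per token, not memory: all 30B parameters must be loaded, which is why the active-parameter count doesn't shrink the VRAM requirement.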
Strengths and Weaknesses
Strengths
- Gold-medal verified performance on four olympiad competitions, not just one
- USAMO 2026 score matched the top human total among 340 competitors
- Apache 2.0 license - commercial use and fine-tuning unrestricted
- Full training recipe is open: SFT scripts, coarse RL, proof-level RL
- GSPO resolves MoE training instability without complex workarounds
- Direct (no TTS) IPhO scores already exceed gold cutoff
Weaknesses
- 100K+ token reasoning traces are expensive to produce; TTS compounds this
- FrontierScience-Research scores (11.7%) are low for a research-grade model
- No hosted API; self-hosting requires 60GB+ VRAM
- IMO/USAMO testing conditions differ from the official competition (open problems, different time limits)
- No multimodal capability - text-only, which limits physics problems with diagrams
- Proof-level RL signal depends on DeepSeekMath-V2, a dependency that matters for reproducibility
Related Coverage
- Math Olympiad AI Leaderboard - See where SU-01 and competitors rank across IMO, USAMO, and AIME
- Scientific Reasoning LLM Leaderboard - Broader science benchmarks beyond math
- Qwen3.6-35B-A3B - The closest same-size competitor on most benchmarks
- Reasoning Benchmarks Leaderboard - Full reasoning benchmark coverage
Sources:
- Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling (arXiv 2605.13301)
- SU-01 HTML paper
- Simplified-Reasoning/SU-01 on HuggingFace
- SU-01 Project Page
- SU-01 GitHub Repository
- GSPO: Group Sequence Policy Optimization (arXiv 2507.18071)
- P1: Mastering Physics Olympiads with Reinforcement Learning (arXiv 2511.13612)
✓ Last verified May 14, 2026
