SU-01

SU-01 is a 30B-A3B MoE reasoning model from Shanghai AI Lab that achieves gold-medal performance on IMO 2025, USAMO 2026, and IPhO 2024/2025 using a three-stage training recipe and test-time scaling.

SU-01 is a compact Mixture-of-Experts reasoning model from Shanghai AI Laboratory and collaborators at Tsinghua, Peking University, CUHK, and SJTU. The 30B-A3B architecture activates only 3B parameters per forward pass, yet matches or beats models with several times its active parameter count on competition mathematics and physics. With test-time scaling enabled, SU-01 scored 35 points on IMO 2025 (exactly the gold cutoff) and 35 points on USAMO 2026 (well above the 25-point gold cutoff), along with gold-level scores on IPhO 2024 and IPhO 2025.

TL;DR

  • Gold-medal performance on IMO 2025, USAMO 2026, and IPhO 2024/2025 using test-time scaling (TTS)
  • 30B total / 3B active parameters (MoE), Apache 2.0, fully open weights on HuggingFace
  • Leads similar-size models like Qwen3.6-35B-A3B and Nemotron-Cascade-2 on olympiad-level math benchmarks such as AMO-Bench and AIME

Overview

The paper's full title is "Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling," and that framing is deliberate. SU-01's training recipe doesn't introduce a radically new architecture - it starts from P1-30B-A3B, itself a fine-tune of a Qwen3-30B-A3B backbone, and applies a disciplined three-stage post-training procedure. The result is a model that sustains stable reasoning chains exceeding 100K tokens while scoring at the top of public olympiad leaderboards for its size class.

SU-01 is trained and open-sourced by the Simplified-Reasoning organization, a collaboration anchored at Shanghai AI Laboratory. The arXiv preprint (2605.13301) was submitted May 13, 2026, and the weights are already live on HuggingFace under Apache 2.0 - meaning you can run, fine-tune, and deploy commercially, subject only to the license's standard attribution and notice requirements.

What makes SU-01 interesting from an engineering standpoint isn't the benchmark numbers alone. It's that the team achieved gold-medal results on four separate olympiad competitions using a 3B-active-parameter model trained on 8 GPUs. The USAMO 2026 score of 35/42 matched the highest total among the 340 human competitors in that competition - a data point the authors are careful to present without overclaiming, since the testing conditions differ from official competition protocols.


Key Specifications

| Specification | Details |
|---|---|
| Provider | Shanghai AI Laboratory / Simplified-Reasoning |
| Model Family | SU (Simple and Unified) |
| Parameters | 30B total, ~3B active (Mixture of Experts) |
| Backbone | P1-30B-A3B (built on Qwen3-30B-A3B) |
| Context Window | 100K+ tokens for reasoning trajectories |
| Input Price | Not applicable (open weights) |
| Output Price | Not applicable (open weights) |
| Release Date | 2026-05-13 |
| License | Apache 2.0 |
| HuggingFace | Simplified-Reasoning/SU-01 |

Benchmark Performance

SU-01's headline numbers come from olympiad competitions graded against the official gold cutoffs. Direct generation means a single sampling pass; TTS (test-time scaling) uses a produce-verify-revise loop that iterates until a proof-checker approves or a compute budget is exhausted.

| Benchmark | SU-01 Direct | SU-01 + TTS | Gold Cutoff |
|---|---|---|---|
| IMO 2025 (pts/42) | 21 | 35 | 35 |
| USAMO 2026 (pts/42) | 15 | 35 | 25 |
| IPhO 2024 (pts) | 23.5 | 25.3 | 20.8 |
| IPhO 2025 (pts) | 20.3 | 21.7 | 19.7 |
| IMO-ProofBench | 57.6% | 70.2% | - |
| AIME 2025 | 94.6% | - | - |
| AIME 2026 | 93.3% | - | - |

Direct SU-01 already clears both IPhO gold thresholds without any test-time scaling. On IMO and USAMO, the direct scores clear bronze but fall short of gold; TTS closes that gap completely.

Comparison against similar-size models on answer-verifiable benchmarks:

| Model | AnswerBench | AMO-Bench | AIME 25/26 Avg |
|---|---|---|---|
| SU-01 | 77.5% | 59.8% | 93.9% |
| Qwen3.6-35B-A3B | 78.0% | 58.8% | 92.7% |
| Nemotron-Cascade-2-30B-A3B | 80.5% | 40.8% | 92.1% |
| GLM-4.7-Flash | 73.8% | 53.8% | 89.8% |
| Gemma-4-31B | 74.0% | 39.3% | 90.1% |

SU-01 leads on AMO-Bench (competition math) and on the AIME 25/26 average, and trails Qwen3.6-35B-A3B by half a point on AnswerBench (77.5% vs 78.0%). The proof-oriented IMO-ProofBench number (70.2% with TTS) brings it close to Gemini 3.1 Pro Thinking (72.6%), a much larger model. On FrontierScience-Research - OpenAI's cross-domain science benchmark - SU-01 scores 11.7% overall (Physics: 10%, Chemistry: 10%, Biology: 15%), showing the reasoning transfer is real but limited at research-paper level.

See the math olympiad AI leaderboard for broader rankings across competitions.

Key Capabilities

Three-Stage Post-Training

The training pipeline is the core of SU-01. Stage 1 applies SFT on 338K trajectories capped at 8K tokens, using a reverse-perplexity curriculum: examples are ordered so each epoch starts with the most perplexity-mismatched samples. This reshapes proof-search behavior without losing the starting model's capabilities.
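
The paper's exact curriculum definition isn't reproduced here, but the mechanics are easy to sketch. A minimal Python sketch, assuming "perplexity-mismatched" means highest perplexity under the starting checkpoint and that per-token log-probs have been precomputed (`perplexity` and `reverse_perplexity_order` are illustrative names, not the authors' code):

```python
import math

def perplexity(token_logprobs):
    # Per-example perplexity from the starting checkpoint's
    # token log-probabilities (natural log).
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def reverse_perplexity_order(examples, logprobs_per_example):
    # Order one epoch so the most perplexity-mismatched samples
    # (highest perplexity under the starting model) come first.
    scored = sorted(
        zip(examples, logprobs_per_example),
        key=lambda pair: perplexity(pair[1]),
        reverse=True,
    )
    return [ex for ex, _ in scored]

# Usage: epoch = reverse_perplexity_order(sft_examples, base_model_logprobs)
```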

Stage 2 is coarse RL using GSPO (Group Sequence Policy Optimization) on 8,967 verifiable prompts with outcome rewards - correct answer or not. GSPO was developed by the Qwen team to address the expert-activation volatility that makes standard GRPO unstable on MoE models; roughly 10% of experts activate differently after each gradient update in large MoE training, and GSPO's sequence-level clipping sidesteps that instability.
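
GSPO's published objective replaces per-token importance ratios with a single length-normalized ratio per sequence, and clips at that level. A minimal PyTorch sketch of that idea - a reading of the GSPO paper's objective, not SU-01's actual training code, with `gspo_loss` and its tensor layout as assumptions:

```python
import torch

def gspo_loss(logp_new, logp_old, advantages, mask, eps=0.2):
    # logp_new/logp_old: [group, seq_len] per-token log-probs under the
    # current and behavior policies; mask: [group, seq_len], 1 on response
    # tokens; advantages: [group] group-normalized outcome rewards.
    lengths = mask.sum(dim=-1).clamp(min=1)
    # Length-normalized sequence ratio:
    # s_i = exp(mean_t(logp_new - logp_old)) = (pi_new / pi_old)^(1/|y_i|)
    delta = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    s = delta.exp()
    # One clipped term per *sequence*, not per token, so a handful of
    # re-routed experts can't blow up individual token ratios.
    unclipped = s * advantages
    clipped = s.clamp(1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```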

Stage 3 is proof-level RL: verifiable rewards are replaced by a proof-quality signal from DeepSeekMath-V2, combined with self-refinement and experience replay. The entire RL training runs for about 200 steps total across both stages. The result is a model that generates long, internally consistent proofs rather than just correct final answers.
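
The paper doesn't publish the Stage 3 plumbing, but the ingredients it names - a grader-model reward, self-refinement, experience replay - suggest a loop along these lines. A hypothetical sketch where `grader.score`, the 0.8 threshold, and the buffer size are all invented for illustration:

```python
import random
from collections import deque

# Replay buffer for high-scoring proofs so later updates can revisit them.
replay = deque(maxlen=512)

def stage3_reward(grader, problem, proof):
    # Swap the binary outcome reward for a scalar proof-quality score
    # from a grader model (DeepSeekMath-V2 in the paper). The grading
    # interface is a placeholder; only the signal's source is published.
    score = grader.score(problem, proof)   # e.g. in [0, 1]
    if score > 0.8:
        replay.append((problem, proof, score))
    return score

def sample_batch(fresh_rollouts, replay_fraction=0.25):
    # Mix fresh rollouts with replayed high-quality proofs.
    k = int(len(fresh_rollouts) * replay_fraction)
    replayed = random.sample(list(replay), min(k, len(replay)))
    return fresh_rollouts + replayed
```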


Test-Time Scaling

SU-01's TTS implementation follows a produce-verify-revise pattern: the model creates a candidate solution, a verifier scores it, and the model revises on failure. This loop is where the IMO and USAMO scores jump from bronze to gold. TTS is compute-intensive - the authors don't disclose exact costs - but the fact that gold-level math reasoning is achievable on a 3B-active-parameter model at inference is remarkable for anyone thinking about deployment economics.
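
The paper describes the pattern but not an interface, so the following is a hypothetical sketch of the loop; `model.generate`, `verifier.check`, and `max_iters` are stand-ins, not SU-01's actual API:

```python
def solve_with_tts(model, verifier, problem, max_iters=8):
    # Produce: draft a candidate solution.
    solution = model.generate(problem)
    for _ in range(max_iters):
        # Verify: a proof-checker scores the candidate.
        ok, feedback = verifier.check(problem, solution)
        if ok:
            return solution
        # Revise: regenerate conditioned on the verifier's feedback.
        solution = model.generate(problem, prior=solution, critique=feedback)
    return solution  # best effort once the compute budget is exhausted
```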

Generalization Beyond Math

The model was trained on mathematics and physics data but transfers to chemistry and biology on FrontierScience-Research, scoring 10% on chemistry and 15% on biology despite no explicit training signal there. These are low absolute numbers, but they suggest the proof-search behavior learned from olympiad data has domain-general structure. The scientific reasoning leaderboard has more context on where frontier models currently sit.

Pricing and Availability

SU-01 is fully open-weight under Apache 2.0. The weights are on HuggingFace at Simplified-Reasoning/SU-01. Training code, SFT scripts, and RL scripts are on GitHub at Simplified-Reasoning/SU-01, including Docker images for reproducing the training environment.

There's no hosted API and no pricing to track. Running SU-01 locally requires hardware capable of serving a 30B BF16 model - roughly 60GB VRAM. For TTS workloads that produce 100K-token traces, expect multi-GPU setups. The model uses the qwen3_moe architecture and loads with standard transformers/vLLM tooling.
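
A minimal vLLM sketch for local inference. The repo id comes from the HuggingFace listing above; `tensor_parallel_size`, the context length, and the sampling settings are assumptions sized to the two-GPU BF16 scenario, not published defaults:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Simplified-Reasoning/SU-01",
    tensor_parallel_size=2,   # split the ~60GB of BF16 weights across 2 GPUs
    max_model_len=131072,     # room for 100K-token reasoning traces
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=32768)
out = llm.generate(["Prove that for all positive integers n, ..."], params)
print(out[0].outputs[0].text)
```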

Strengths and Weaknesses

Strengths

  • Gold-medal verified performance on four olympiad competitions, not just one
  • USAMO 2026 score matched the top human total among 340 competitors
  • Apache 2.0 license - commercial use and fine-tuning unrestricted
  • Full training recipe is open: SFT scripts, coarse RL, proof-level RL
  • GSPO resolves MoE training instability without complex workarounds
  • Direct (no TTS) IPhO scores already exceed gold cutoff

Weaknesses

  • 100K+ token reasoning traces are expensive to produce; TTS compounds this
  • FrontierScience-Research scores (11.7%) are low for a research-grade model
  • No hosted API; self-hosting requires 60GB+ VRAM
  • IMO/USAMO testing conditions differ from the official competition (open problems, different time limits)
  • No multimodal capability - text-only, which limits physics problems with diagrams
  • Proof-level RL signal depends on DeepSeekMath-V2, a dependency that matters for reproducibility
