Nemotron-Cascade 2: 30B Open MoE, One GPU, Beats 120B

NVIDIA's new Nemotron-Cascade-2-30B-A3B activates just 3B parameters per token, runs on a single RTX 4090, and outscores NVIDIA's own 120B model on coding and math benchmarks.

Nemotron-Cascade 2: 30B Open MoE, One GPU, Beats 120B

NVIDIA dropped Nemotron-Cascade-2-30B-A3B on March 20, and the efficiency numbers are worth paying attention to. The model has 30B total parameters but activates only 3B per token at inference - the same active compute as a mid-size dense model. It beats NVIDIA's own Nemotron-3-Super 120B on coding and instruction-following benchmarks while fitting into 24GB of VRAM with Q4 quantization. That's a single RTX 4090.

Key Specs

SpecValue
Total parameters30B
Active parameters / token3B
ArchitectureHybrid Mamba-Transformer MoE
Context window1M tokens
LicenseNVIDIA Open Model License
OllamaYes (24GB Q4_K_M variant)
HuggingFacenvidia/Nemotron-Cascade-2-30B-A3B
AIME 202592.4 (98.6 with TIR)
LiveCodeBench v687.2

A Different Kind of MoE

Most open Mixture-of-Experts models trade inference cost for raw benchmark scores - you run a 671B model and pay for 37B of active compute per token. Nemotron-Cascade-2 goes further. The 30B total parameter footprint is itself modest by MoE standards, and the 3B active compute budget means it sits in the same inference bracket as Qwen3.5-35B-A3B, another 3B-active model that targets the same efficiency sweet spot.

The base architecture comes from the Nemotron 3 Nano 30B-A3B family - a hybrid design that mixes Mamba-2 state-space layers with standard Transformer attention layers. Mamba layers handle long-context token processing more efficiently than full attention, which helps explain how the 1M token context window is actually usable on consumer hardware rather than just listed in a spec sheet.

How Cascade RL Works

The model was trained using what NVIDIA calls Cascade RL - sequential reinforcement learning that moves through domains in stages. Instead of training on everything simultaneously and hoping the gradient updates balance out, Cascade RL trains one domain at a time, letting each stage build on the last. NVIDIA paired this with Multi-Domain On-Policy Distillation (MOPD), which pulls supervision from the strongest available teacher model in each domain during RL training. The dataset underneath is sizable: 1.9M Python reasoning traces, 1.3M Python tool-calling samples, 816K math proofs, and roughly 500K software engineering examples split between agentic and agentless approaches. Both the SFT and RL datasets are public on HuggingFace.

The model ships with a thinking mode that wraps chain-of-thought in <think> tags and an instruct mode for direct responses. Both run from the same set of weights.

Cascade RL training overview diagram showing the sequential reinforcement learning pipeline The Cascade RL training pipeline applies sequential domain-by-domain reinforcement learning. Each stage uses the strongest available teacher model for that domain. Source: research.nvidia.com

Benchmark Numbers

The comparison that stands out most is against NVIDIA's own 120B model. Nemotron-Cascade-2 beats Nemotron-3-Super on LiveCodeBench and instruction-following while using 4x fewer active parameters. The Qwen3.5-35B-A3B comparison is arguably more meaningful for most developers since the two models have identical active compute budgets - and the gap on coding and math is wide.

BenchmarkCascade-2 30B-A3BQwen3.5-35B-A3BDifference
AIME 202592.491.9+0.5
HMMT Feb 202594.689.0+5.6
LiveCodeBench v687.274.6+12.6
IOI 2025439.3348.6+90.7
ArenaHard v283.565.4+18.1
GPQA-Diamond76.1--
MMLU-Pro79.8--
IFBench82.970.2+12.7

The IOI and AIME scores stand out. NVIDIA claims Cascade-2 reaches gold medal-level performance at IMO 2025 (35 points), IOI 2025 (439.3 points), and ICPC World Finals 2025 (10/12 problems). The model card describes it as the second open-weight model to hit gold medal thresholds across all three competitions. The first is almost certainly one of the large DeepSeek variants, though NVIDIA doesn't name it. The AIME evaluation used avg@64 with a 131K token thinking budget, which is a generous setup - single-shot performance will be lower.

Tool-Integrated Reasoning (TIR) pushes the AIME score to 98.6 and LiveCodeBench to 88.4, which requires giving the model access to a Python interpreter during reasoning. Those numbers are impressive but not directly comparable to evaluations that don't use tool access.

Nemotron-Cascade-2 benchmark performance chart comparing against leading open and closed models Benchmark comparison from the official model card. Cascade-2 holds its own across math, coding, and instruction-following against both open and proprietary models. Source: huggingface.co

Reasoning vs. Instruct Mode

The thinking mode adds latency but moves AIME from whatever non-reasoning baseline exists up to 92.4. For time-critical workloads - chat, classification, document processing - the instruct mode runs faster and skips the chain-of-thought overhead completely. Both are the same model; it's a parameter at inference time, not a separate download.

Running It Yourself

Hardware Requirements

The Q4_K_M GGUF variant weighs 24.5GB and fits a single RTX 4090. Full BF16 precision requires roughly 63GB VRAM - two H100 80GB cards or better. The quantized path handles most dev and research workloads fine; you'd only need the full-precision weights if you're running evals that are sensitive to quantization noise or building something for production serving.

FormatSizeGPU requirement
Q4_K_M GGUF24.5 GBSingle RTX 4090 or equivalent
Q5_K_M GGUF26.1 GBSingle H100 40GB+
Q8_0 GGUF33.6 GBSingle H100 80GB
BF16 (full)63.2 GB2x H100 80GB or DGX

Deployment Options

Ollama is the fastest path for local use: ollama run nemotron-cascade-2. The default Ollama variant uses Q4_K_M at 256K context. For full 1M context or custom serving, use GGUF weights directly in llama.cpp or LM Studio, or the official vLLM Docker image from NVIDIA for production deployments:

docker pull nvcr.io/nvidia/nemotron-cascade-2:latest

The model uses ChatML chat template format. Function calling, structured JSON output, and Fill-in-the-Middle are all supported. NVIDIA NIM containers are available on build.nvidia.com if you'd rather not handle serving infrastructure yourself.

NVIDIA H100 GPU hardware - the target data center GPU for production deployments of Nemotron-Cascade-2 The H100 is the minimum recommended GPU for full-precision Nemotron-Cascade-2 serving. The Q4 GGUF variant runs on consumer RTX 4090 hardware. Source: commons.wikimedia.org

What To Watch

The license isn't Apache 2.0. The NVIDIA Open Model License is permissive - it allows commercial use, modification, and redistribution - but it's not the same as Apache 2.0. Organizations with strict open-source policies should review the exact terms before launching. The training datasets, by contrast, are under a separate permissive license and are fully public on HuggingFace.

The benchmark setup matters. The AIME avg@64 methodology gives the model 64 attempts per problem and takes the best result. That's useful for understanding ceiling performance but not representative of single-query production behavior. The LiveCodeBench numbers are more directly useful since they measure pass@1.

NVIDIA's self-interest is worth keeping in mind. Nemotron-Cascade-2 runs best on NVIDIA hardware. A model that fits a single RTX 4090 in Q4 but needs two H100s in full precision creates a clear path to NVIDIA NIM contracts for teams that outgrow local inference. The open-weight release is truly useful, and the efficiency engineering is real - but NVIDIA is also building its own ecosystem here.

The training data and RL code are public, which is more than most vendors release. If the efficiency gains from Cascade RL generalize to other architectures, that's a recipe worth studying regardless of where you run it.


Sources:

Nemotron-Cascade 2: 30B Open MoE, One GPU, Beats 120B
About the author AI Infrastructure & Open Source Reporter

Sophie is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.