NVIDIA Nemotron 3 Super 120B-A12B
NVIDIA Nemotron 3 Super is a 120B-parameter open model with 12B active at inference, combining Mamba-2, LatentMoE, and Multi-Token Prediction for agentic workloads with a 1M token context window.

Nemotron 3 Super is NVIDIA's open model play for agentic AI. The pitch: 120 billion parameters of learned capacity, 12 billion active at inference, running at throughput levels that make multi-agent pipelines economically viable. The architecture is genuinely novel - a Mamba-2 and Transformer hybrid with latent Mixture-of-Experts routing and Multi-Token Prediction - and NVIDIA released not just the weights but the full training dataset (10T+ tokens) and all 15 RL environments. That level of openness from a company this size is unusual.
TL;DR
- 120B total / 12B active MoE with Mamba-2 + Transformer hybrid architecture
- 1M token context (91.75% RULER@1M), competitive with Qwen3.5-122B across benchmarks
- 5x throughput over previous Nemotron Super, NVFP4 quantization-aware training from day one
- Full open release: BF16/FP8/NVFP4 weights, 153 datasets, 15 RL environments on Hugging Face
- Self-host on 8x H100-80GB minimum; available via 15+ cloud providers
The benchmark picture is mixed in the way that matters. It beats GPT-OSS-120B on most tasks - often by wide margins - but trades wins and losses with Qwen3.5-122B-A10B. Where it genuinely leads: long-context retrieval (RULER@1M: 91.75%), math competition (HMMT Feb 2025: 93.67%), and code generation (LiveCodeBench: 81.19%). Where it falls behind: general knowledge (MMLU-Pro: 83.73 vs Qwen's 86.70), science reasoning (GPQA: 79.23 vs 86.60), and agentic coding (SWE-Bench: 60.47 vs 66.40).
The efficiency angle is the real differentiator. Running at 12B active parameters with Mamba-2 layers handling sequence processing means this model can serve agentic workloads - dozens of tool calls across long contexts - at a fraction of the cost of a dense 120B model. NVIDIA claims 5x throughput over the previous Nemotron Super and 4x faster inference on Blackwell NVFP4 vs Hopper FP8.
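As a back-of-envelope check on that efficiency claim, assume the common rule of thumb of roughly 2 FLOPs per active parameter per generated token. This ignores attention and Mamba state costs, so it is only a sketch of the dense-vs-MoE gap, not a throughput prediction:

```python
total_params = 120e9   # dense-equivalent capacity
active_params = 12e9   # parameters actually used per token

# Rough rule of thumb: decoder forward pass costs ~2 FLOPs per
# active parameter per token (attention/Mamba costs ignored).
flops_dense = 2 * total_params
flops_moe = 2 * active_params

speedup = flops_dense / flops_moe
print(speedup)  # → 10.0
```

The ~10x compute gap is why the claimed 5x end-to-end throughput gain is plausible even after routing and memory-bandwidth overheads eat part of it.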
Key Specifications
| Specification | Details |
|---|---|
| Provider | NVIDIA |
| Model Family | Nemotron 3 |
| Parameters | 120B total, 12B active (LatentMoE) |
| Architecture | Mamba-2 + Transformer Hybrid LatentMoE with MTP |
| Context Window | 1,000,000 tokens (256K default config) |
| Training Data | 15.6T tokens, 153 datasets, 20 languages, 43 programming languages |
| Pre-training Cutoff | June 2025 |
| Post-training Cutoff | February 2026 |
| Quantization | NVFP4 (native), FP8, BF16 |
| Min Hardware | 8x H100-80GB |
| License | NVIDIA Nemotron Open Model License |
| Pricing | Free trial on build.nvidia.com; self-host or third-party providers |
| Release Date | March 11, 2026 |
The 1M context window defaults to 256K in the shipped configuration due to VRAM constraints. Extending to the full million requires setting `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` and adjusting `--max-model-len`. Practical long-context use at 1M needs significant GPU memory beyond the 8x H100 minimum.
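A minimal launch sketch for the full-context override, assuming the model ships under a hypothetical Hugging Face id and is served with vLLM's OpenAI-compatible server:

```python
import os

# Hypothetical Hugging Face repo id - check the actual model card.
MODEL_ID = "nvidia/Nemotron-3-Super-120B-A12B"

# The shipped config caps context at 256K; vLLM refuses a longer
# --max-model-len unless this override is set in the environment.
os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1"

serve_cmd = [
    "vllm", "serve", MODEL_ID,
    "--tensor-parallel-size", "8",  # 8x H100-80GB minimum
    "--max-model-len", "1000000",   # raise the default 256K cap to 1M
]
# subprocess.run(serve_cmd) would launch the OpenAI-compatible server.
print(" ".join(serve_cmd))
```

Expect KV-cache and Mamba-state memory to grow with the longer limit; in practice the 1M setting wants more than the 8-GPU minimum.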
Benchmark Performance
All scores from NVIDIA's model card. Comparison models are Qwen3.5-122B-A10B and GPT-OSS-120B - the two closest open models by parameter class.
| Benchmark | Nemotron 3 Super | Qwen3.5-122B-A10B | GPT-OSS-120B |
|---|---|---|---|
| MMLU-Pro | 83.73 | 86.70 | 81.00 |
| AIME 2025 (no tools) | 90.21 | 90.36 | 92.50 |
| HMMT Feb 2025 (no tools) | 93.67 | 91.40 | 90.00 |
| HMMT Feb 2025 (with tools) | 94.73 | 89.55 | - |
| GPQA (no tools) | 79.23 | 86.60 | 80.10 |
| GPQA (with tools) | 82.70 | - | 80.09 |
| LiveCodeBench | 81.19 | 78.93 | 88.00 |
| SWE-Bench (OpenHands) | 60.47 | 66.40 | 41.90 |
| SWE-Bench Multilingual | 45.78 | - | 30.80 |
| RULER @ 256K | 96.30 | 96.74 | 52.30 |
| RULER @ 512K | 95.67 | 95.95 | 46.70 |
| RULER @ 1M | 91.75 | 91.33 | 22.30 |
| IFBench (prompt) | 72.56 | 73.77 | 68.32 |
| Arena-Hard-V2 | 73.88 | 75.15 | 90.26 |
| HLE (no tools) | 18.26 | 25.30 | 14.90 |
| HLE (with tools) | 22.82 | - | 19.00 |
| TauBench V2 (avg) | 61.15 | 74.53 | 61.00 |
| MMLU-ProX (multilingual) | 79.36 | 85.06 | 76.59 |
The RULER long-context scores tell the clearest story. At 1M tokens, Nemotron 3 Super holds 91.75% - virtually tied with Qwen3.5 at 91.33% - while GPT-OSS-120B collapses to 22.30%. The Mamba-2 layers avoid the quadratic attention scaling that kills pure Transformer performance at extreme lengths.
On agentic benchmarks, the results are more nuanced. SWE-Bench OpenHands at 60.47% is solid but roughly 6 points behind Qwen3.5. TauBench V2 averaging 61.15% matches GPT-OSS but trails Qwen3.5's 74.53% by a significant margin. The model excels at tool-augmented reasoning (HMMT improves from 93.67 to 94.73 with tools, GPQA from 79.23 to 82.70), which suggests it's better suited as a tool-calling agent than as a standalone coder.
Arena-Hard-V2 at 73.88% is a weakness - GPT-OSS-120B scores 90.26% on the same benchmark. Open-ended conversational quality isn't this model's strength.
Key Capabilities
Architecture Innovation
Three design choices define this model. Mamba-2 layers handle sequential token processing at linear cost instead of quadratic attention. LatentMoE projects tokens into a compressed latent space before routing to experts - activating four specialists for the cost of one at inference. Multi-Token Prediction uses shared-weight heads to predict multiple future tokens simultaneously, enabling native speculative decoding without a separate draft model.
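A toy numpy sketch of the latent-routing idea - compress the token into a latent space, gate over the top-4 experts there, then project back. The dimensions, random weights, and softmax-over-top-k gating are illustrative only, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_latent = 64, 16   # toy sizes; real dims are far larger
n_experts, top_k = 32, 4     # four active experts, per the description

W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
router = rng.standard_normal((d_latent, n_experts)) / np.sqrt(d_latent)
experts = rng.standard_normal((n_experts, d_latent, d_latent)) / np.sqrt(d_latent)

def latent_moe(x):
    z = x @ W_down                     # compress token into latent space
    logits = z @ router                # route *in* the latent space
    top = np.argsort(logits)[-top_k:]  # pick the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()               # softmax over selected experts only
    out = sum(g * (z @ experts[i]) for g, i in zip(gates, top))
    return out @ W_up                  # project back to model dimension

y = latent_moe(rng.standard_normal(d_model))
```

The point of the down-projection is that both routing and expert matmuls happen at `d_latent` instead of `d_model`, which is where the "four specialists for the cost of one" framing comes from.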
The NVFP4 quantization-aware training is notable because it was applied during pre-training, not bolted on afterward. This means the model learned to be accurate at 4-bit precision rather than losing accuracy from post-training compression. On Blackwell GPUs, NVFP4 runs 4x faster than FP8 on Hopper.
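To see why block-scaled 4-bit can preserve accuracy, here is a toy quantize-dequantize round trip over the E2M1 magnitude grid that FP4 elements use, with one scale per 16-value block. Real NVFP4 stores compact block scales; this sketch keeps them in full precision, so it only illustrates the rounding behavior, not the storage format:

```python
import numpy as np

# Magnitudes representable by an E2M1 (FP4) element.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_roundtrip(x, block=16):
    """Quantize then dequantize with one scale per block of values."""
    out = np.empty_like(x, dtype=float)
    for i in range(0, len(x), block):
        chunk = x[i:i + block]
        # scale so the block's largest magnitude maps to the grid max (6.0)
        scale = max(np.abs(chunk).max(), 1e-12) / FP4_GRID[-1]
        # snap each scaled magnitude to the nearest FP4 grid point
        idx = np.abs(np.abs(chunk[:, None]) / scale - FP4_GRID).argmin(axis=1)
        out[i:i + block] = np.sign(chunk) * FP4_GRID[idx] * scale
    return out
```

Because each block gets its own scale, the worst-case rounding error is bounded relative to that block's largest value rather than the whole tensor's, which is what makes 4-bit elements workable; quantization-aware pre-training then lets the weights adapt to the remaining error.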
Long-Context Performance
The 1M token context window isn't just a marketing number - the RULER benchmark scores back it up. 91.75% at 1M tokens, 95.67% at 512K, 96.30% at 256K. Degradation from 256K to 1M is under 5 percentage points. For RAG pipelines, multi-document analysis, and codebase-scale agentic tasks, this is directly useful.
Open Training Artifacts
NVIDIA published the full recipe: 10+ trillion tokens of pre-training data, all post-training datasets in the Nemotron-Post-Training-v3 collection, and all 15 RL training environments. The three-stage pipeline (pre-training with Megatron-LM, SFT with Data Designer, RL with NeMo RL/NeMo Gym) is documented in the technical report. Synthetic data was generated from frontier models including GPT-OSS-120B, DeepSeek-V3, and Qwen3-235B.
Tool Use and Agentic Workflows
The model supports native tool calling with structured outputs, configurable reasoning (on/off/low-effort via chat template), and multi-turn conversations. NVIDIA's own AI-Q research agent, built on Nemotron 3 Super, topped the DeepResearch Bench and DeepResearch Bench II leaderboards.
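A request sketch for the tool-calling path, assuming an OpenAI-compatible endpoint such as a local vLLM server. The model id and the `reasoning` chat-template kwarg name are assumptions for illustration - check the model card's chat template for the real knob names:

```python
import json

payload = {
    # Hypothetical repo id - substitute the actual one.
    "model": "nvidia/Nemotron-3-Super-120B-A12B",
    "messages": [
        {"role": "user", "content": "What's the weather in Oslo right now?"}
    ],
    # Standard OpenAI-style function tool schema.
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    # vLLM forwards extra chat-template kwargs; the key name is assumed.
    "chat_template_kwargs": {"reasoning": "low"},
}
body = json.dumps(payload)  # POST this to /v1/chat/completions
```

The response would carry a `tool_calls` entry naming `get_weather` with JSON arguments; the agent loop executes it and appends the result as a `tool` message.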
Pricing and Availability
No standardized third-party pricing exists yet - the model launched today. Current options:
| Channel | Cost | Context | Notes |
|---|---|---|---|
| build.nvidia.com | Free (trial) | 262K | Prompts logged, not for production |
| OpenRouter (NVIDIA) | Free (trial) | 262K | Prompts logged |
| Self-hosted (8x H100) | Hardware cost | Up to 1M | BF16, FP8, or NVFP4 checkpoints |
| NIM microservice | License required | Up to 1M | Enterprise deployment |
Cloud providers offering deployment: Google Cloud Vertex AI, Oracle Cloud, AWS Bedrock (forthcoming), Microsoft Azure, CoreWeave, Crusoe, Nebius, Together AI, Baseten, Cloudflare, DeepInfra, Fireworks AI, Lightning AI, Modal, FriendliAI.
For cost comparison context: self-hosting a 12B-active MoE on 8x H100s is dramatically cheaper per token than serving a dense 120B model. The efficiency advantage is the pricing story here - once providers publish rates, expect this to undercut similarly-sized dense models significantly.
Strengths
- Exceptional long-context: 91.75% RULER@1M, one of the best scores at any parameter class
- 12B active inference cost: MoE efficiency means frontier-class capability at a fraction of the compute
- Full open release: Weights, 10T+ tokens of training data, 15 RL environments - not just "open weights"
- Architecture innovation: Mamba-2 + LatentMoE + MTP is a genuine technical advance, not a scaled-up Transformer
- Native NVFP4: Quantization-aware training from pre-training means minimal accuracy loss at 4-bit
- Broad deployment: 15+ cloud and inference providers at launch
- Strong math reasoning: 93.67% HMMT, 90.21% AIME 2025, improves further with tool use
Weaknesses
- Trails Qwen3.5 on key benchmarks: MMLU-Pro, GPQA, SWE-Bench, TauBench V2 - Qwen3.5-122B wins on breadth
- Weak conversational quality: Arena-Hard-V2 at 73.88% is well below GPT-OSS-120B's 90.26%
- HLE score is low: 18.26% (22.82% with tools) trails Qwen3.5's 25.30% on the hardest reasoning benchmark
- 8x H100 minimum: Self-hosting barrier is high for smaller teams
- 1M context needs explicit config: Default ships at 256K, full context needs manual override and more GPUs
- NVIDIA-specific license: Not Apache 2.0 - patent termination clause if you litigate against NVIDIA
- No third-party pricing yet: Hard to evaluate cost-effectiveness until providers publish rates
- Limited multilingual: only 7 languages officially supported (training data spans 20) vs broader coverage from Qwen
Related Coverage
- News: NVIDIA Ships Nemotron 3 Super
- Leaderboards: Reasoning Benchmarks | Coding Benchmarks | Cost Efficiency
- Compare: Qwen 3.5 27B | Claude Opus Distilled | Grok 4
Last verified March 11, 2026
