NVIDIA Ships Nemotron 3 Super - 120B Open Model for Agents

NVIDIA releases Nemotron 3 Super, a 120B-parameter open model with only 12B active at inference, combining Mamba-2 and Transformer layers for agentic AI workloads with a 1M token context window.

NVIDIA released Nemotron 3 Super today, a 120-billion parameter open model that activates only 12 billion parameters at inference time. The model uses a hybrid architecture combining Mamba-2 state-space layers with Transformer attention layers and a latent Mixture-of-Experts (LatentMoE) routing system. Weights, training data, and fine-tuning recipes are all published on Hugging Face under the NVIDIA Nemotron Open Model License.

TL;DR

  • 120B total / 12B active parameters using Mamba-2 + Transformer hybrid with LatentMoE and Multi-Token Prediction
  • 1M token context window, 91.75% on RULER@1M vs 22.30% for GPT-OSS-120B
  • 5x throughput over previous Nemotron Super, 4x faster on Blackwell NVFP4 vs Hopper FP8
  • Fully open: weights, 10T+ tokens of datasets, 15 RL environments on Hugging Face
  • Live on build.nvidia.com, DeepInfra, OpenRouter, Perplexity, and 15+ cloud providers

Architecture

The hybrid design interleaves three layer types. Mamba-2 layers handle sequential token processing with 4x better memory and compute efficiency than pure attention; LatentMoE layers activate four expert specialists per token through a latent projection that compresses tokens before routing; and select Transformer attention layers supply the global reasoning capability the Mamba layers lack.
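To make the routing step concrete, here is a minimal NumPy sketch of top-4 latent routing: tokens are compressed into a smaller latent space, a router picks four experts there, and the result is projected back to model width. Every dimension, the expert count, and the random weights are illustrative assumptions, not NVIDIA's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL, D_LATENT = 1024, 256   # latent projection compresses tokens before routing
N_EXPERTS, TOP_K = 64, 4        # four experts activated per token, per the article

# Illustrative random parameters; a real model learns these.
W_down = rng.standard_normal((D_MODEL, D_LATENT)) / np.sqrt(D_MODEL)
W_gate = rng.standard_normal((D_LATENT, N_EXPERTS)) / np.sqrt(D_LATENT)
experts = rng.standard_normal((N_EXPERTS, D_LATENT, D_LATENT)) / np.sqrt(D_LATENT)
W_up = rng.standard_normal((D_LATENT, D_MODEL)) / np.sqrt(D_LATENT)

def latent_moe(x):
    """Route each token through its top-4 experts in the compressed latent space."""
    z = x @ W_down                                   # compress before routing
    logits = z @ W_gate
    top = np.argsort(logits, axis=-1)[:, -TOP_K:]    # indices of the 4 chosen experts
    out = np.zeros_like(z)
    for t in range(x.shape[0]):
        weights = np.exp(logits[t, top[t]])
        weights /= weights.sum()                     # renormalized gate weights
        for w, e in zip(weights, top[t]):
            out[t] += w * (z[t] @ experts[e])
    return out @ W_up                                # project back to model width

tokens = rng.standard_normal((8, D_MODEL))
y = latent_moe(tokens)
print(y.shape)  # (8, 1024)
```

The point of routing in the latent space is that the gate and experts operate on 256-wide vectors instead of 1024-wide ones, which is where the sparse-compute savings come from in this sketch.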

Multi-Token Prediction (MTP) adds shared-weight prediction heads that generate multiple future tokens simultaneously. NVIDIA claims this enables native speculative decoding at 3x faster inference without needing a separate draft model. The combination means the model runs at the computational cost of a 12B-parameter dense model while drawing on 120B parameters of learned capacity.
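The draft-then-verify loop behind speculative decoding can be sketched with toy stand-ins for both models: the MTP heads propose k tokens in one pass, the full forward pass verifies them, and generation keeps every token up to the first mismatch. The deterministic toy "models" below are assumptions for illustration only.

```python
# Toy speculative-decoding loop: MTP-style heads draft k future tokens at once,
# the full model verifies them, and agreement determines how many are kept.
# Both "models" here are deterministic stand-ins, not Nemotron itself.

def mtp_draft(context, k=3):
    """Stand-in for shared-weight MTP heads: propose k tokens in one pass."""
    return [(context[-1] + i + 1) % 100 for i in range(k)]

def full_model_next(context):
    """Stand-in for the full forward pass: the verified next token."""
    return (context[-1] + 1) % 100

def speculative_step(context, k=3):
    drafts = mtp_draft(context, k)
    accepted = []
    for tok in drafts:
        truth = full_model_next(context + accepted)
        if tok != truth:
            accepted.append(truth)   # first mismatch: keep the verified token, stop
            break
        accepted.append(tok)         # match: the drafted token comes for free
    return accepted

ctx = [0]
for _ in range(4):
    ctx += speculative_step(ctx)
print(ctx)  # [0, 1, 2, ..., 12]: each step accepted all 3 drafts
```

When the draft heads agree with the full model (as they always do in this toy), each verification pass yields k tokens instead of one, which is where the claimed 3x speedup would come from.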

Training used NVFP4 quantization-aware training from the start - not post-training quantization - across the majority of linear layers, with BF16 and MXFP8 preserved for latent projections, attention, and embedding layers. On Blackwell GPUs, the NVFP4 precision runs 4x faster than FP8 on Hopper.
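A simplified fake-quantization pass shows what QAT's forward step looks like for block-scaled FP4. The E2M1 magnitude grid is the standard FP4 value set; the single per-block scale here is a simplification of NVFP4's actual scaling scheme, and the block size is an assumption.

```python
import numpy as np

# FP4 (E2M1) representable magnitudes; NVFP4 adds block scaling on top of this grid.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(w, block=16):
    """Quantize-dequantize weights blockwise, as in a QAT forward pass.

    Real QAT backpropagates through this via a straight-through estimator;
    only the forward rounding is shown here."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / FP4_GRID[-1]  # map block max to 6.0
    scale[scale == 0] = 1.0
    scaled = w / scale
    # Round each magnitude to the nearest FP4 grid point, preserving sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[idx] * scale).reshape(-1)

w = np.random.default_rng(1).standard_normal(64).astype(np.float32)
wq = fake_quant_fp4(w)
print(np.abs(w - wq).max())  # rounding error bounded by the per-block scale
```

Training against this rounded forward pass from the start is what lets the final NVFP4 checkpoint avoid the accuracy cliff that post-training quantization to 4 bits usually hits.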

Benchmarks

NVIDIA benchmarks Nemotron 3 Super against two comparably sized open models: Qwen3.5-122B-A10B and GPT-OSS-120B.

| Benchmark | Nemotron 3 Super | Qwen3.5-122B-A10B | GPT-OSS-120B |
|---|---|---|---|
| MMLU-Pro | 83.73 | 86.70 | 81.00 |
| AIME 2025 | 90.21 | 90.36 | 92.50 |
| GPQA (no tools) | 79.23 | 86.60 | 80.10 |
| LiveCodeBench | 81.19 | 78.93 | 88.00 |
| SWE-Bench (OpenHands) | 60.47 | 66.40 | 41.90 |
| RULER @ 1M | 91.75 | 91.33 | 22.30 |
| RULER @ 256K | 96.30 | 96.74 | 52.30 |
| HMMT Feb 2025 | 93.67 | 91.40 | 90.00 |
| IFBench (prompt) | 72.56 | 73.77 | 68.32 |
| Arena-Hard-V2 | 73.88 | 75.15 | 90.26 |

The long-context numbers are the standout. At 1 million tokens, Nemotron 3 Super scores 91.75% on RULER, matching Qwen3.5 and crushing GPT-OSS-120B's 22.30%. The Mamba layers are doing the heavy lifting here - they process long sequences without the quadratic attention cost that cripples pure Transformer models at extreme context lengths.
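A back-of-envelope comparison shows the scaling gap: per-layer attention work grows with the square of sequence length, while a state-space scan grows linearly. Constants are omitted, so this illustrates scaling only, not measured cost.

```python
# Rough scaling comparison: attention cost is quadratic in sequence length,
# a Mamba-style state-space scan is linear. Constants are deliberately ignored.
for n in (256_000, 1_000_000):
    attn_ops = n * n   # pairwise token interactions per attention layer
    ssm_ops = n        # one recurrent scan step per token
    print(f"{n:>9} tokens: attention/SSM op ratio = {attn_ops // ssm_ops:,}")
```

At 1M tokens the ratio is a factor of a million per layer, which is why hybrid designs keep only a handful of full-attention layers at these context lengths.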

On general benchmarks, it trades blows. Qwen3.5-122B leads on MMLU-Pro (86.70 vs 83.73), GPQA (86.60 vs 79.23), and SWE-Bench (66.40 vs 60.47). Nemotron 3 Super takes math reasoning on HMMT (93.67 vs 91.40) and edges ahead on LiveCodeBench (81.19 vs 78.93). Against GPT-OSS-120B, Nemotron wins most categories except Arena-Hard-V2 where GPT-OSS holds a strong lead (90.26 vs 73.88).

Training Pipeline

NVIDIA published the full training recipe - a rarity at this scale:

  1. Pre-training: 25+ trillion tokens of crawled and synthetic data covering code, math, science, and general knowledge. Built with Megatron-LM using NVFP4 quantization-aware training from the start.

  2. Supervised fine-tuning: Synthetic datasets for code, math, tool calling, instruction following, and structured outputs, with emphasis on long-range retrieval and multi-document aggregation. Generated using NVIDIA's Data Designer library.

  3. Reinforcement learning: Asynchronous Group Relative Policy Optimization (GRPO) across 15 environments covering math, code, science, tool use, multi-turn conversations, and structured output. Training and inference run on fully decoupled GPU devices using NeMo RL and NeMo Gym.
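The "group relative" part of GRPO can be shown in a few lines: instead of a learned value network, each sampled completion is scored against the mean and standard deviation of its own group of rollouts. The reward values below are made up for illustration.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward against the
    mean and std of its own sampled group, so no value network is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against uniform-reward groups
    return [(r - mean) / std for r in rewards]

# One prompt, a group of 4 sampled completions, scalar rewards from an RL env.
rewards = [1.0, 0.0, 0.5, 0.5]
adv = grpo_advantages(rewards)
print([round(a, 3) for a in adv])  # [1.414, -1.414, 0.0, 0.0]
```

Dropping the value network is what makes the asynchronous, decoupled-GPU setup described above practical: the trainer only needs scalar rewards back from the environment workers.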

The total training corpus spans 15.6 trillion tokens across 153 datasets, 20 languages, and 43 programming languages. Data was collected between 2013 and February 2026. NVIDIA released over 10 trillion tokens of the pre-training datasets and all 15 RL environments on Hugging Face.

Availability

Nemotron 3 Super is available in BF16, FP8, and NVFP4 checkpoints on Hugging Face. The default context configuration is 256K tokens due to VRAM constraints; the full 1M-token window is enabled via an explicit flag.

The model requires a minimum of 8x H100-80GB GPUs. It supports NVIDIA Ampere (A100), Hopper (H100, H200), and Blackwell (GB200) hardware, with inference via vLLM, SGLang, or TensorRT-LLM.
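For self-hosters, a vLLM launch might look like the following. The `--tensor-parallel-size` and `--max-model-len` flags are standard vLLM options; the checkpoint name is a placeholder assumption, and the exact flag for unlocking the 1M window may differ in NVIDIA's model card.

```shell
# Hypothetical launch on 8x H100: shard across all GPUs and extend the
# context window beyond the 256K default. Checkpoint name is a placeholder.
vllm serve nvidia/Nemotron-3-Super \
  --tensor-parallel-size 8 \
  --max-model-len 1000000
```

Check the model card before deploying; long-context serving at 1M tokens will also need KV-cache headroom beyond the minimum 8-GPU configuration.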

Cloud and API availability is broad: NVIDIA's build.nvidia.com offers a free trial, with paid deployment through Google Cloud Vertex AI, Oracle Cloud, AWS Bedrock (forthcoming), Microsoft Azure, CoreWeave, Crusoe, Nebius, and Together AI. Inference providers DeepInfra, Fireworks AI, Baseten, Cloudflare, Lightning AI, Modal, and FriendliAI all list the model. OpenRouter hosts a free NVIDIA-backed endpoint.

Partners already integrating the model include Perplexity for search, CodeRabbit and Factory for code review, and enterprise customers Palantir, Cadence, Dassault Systemes, and Siemens.

What Matters

The real story is the architecture bet. By combining Mamba-2 for efficient sequence processing, LatentMoE for sparse computation, and MTP for speculative decoding, NVIDIA built a model that runs at 12B active-parameter cost while competing with dense models 10x its inference weight. For agentic workloads - where models make dozens of tool calls in sequence across long contexts - that efficiency gap compounds fast.

The open release of datasets and RL environments, not just weights, is also significant. Most open-weight releases are "open" in the way a restaurant kitchen is "open" - you can see in, but you can't reproduce the meal. NVIDIA published the actual ingredients and recipe.

Under the NVIDIA Nemotron Open Model License, the model is commercially usable with attribution requirements. It's not Apache 2.0 - the license includes a patent termination clause if you litigate against NVIDIA - but it's permissive enough for most production use cases.

No third-party pricing is available yet. The model launched today and providers are still onboarding it. Self-hosting on 8x H100s or equivalent is the immediate path for production workloads.

About the author: AI Infrastructure & Open Source Reporter

Sophie is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.