NVIDIA Nemotron 3 Super 120B-A12B
NVIDIA Nemotron 3 Super is a 120B-parameter open model with 12B active at inference, combining Mamba-2, LatentMoE, and Multi-Token Prediction for agentic workloads with a 1M token context window.

Nemotron 3 Super is NVIDIA's open model play for agentic AI. The pitch: 120 billion parameters of learned capacity, 12 billion active at inference, running at throughput levels that make multi-agent pipelines economically viable. The architecture is genuinely novel - a Mamba-2 and Transformer hybrid with latent Mixture-of-Experts routing and Multi-Token Prediction - and NVIDIA released not just the weights but the full training dataset (10T+ tokens) and all 15 RL environments. That level of openness from a company this size is unusual.
TL;DR
- 120B total / 12B active MoE with Mamba-2 + Transformer hybrid architecture
- 1M token context (91.75% RULER@1M), competitive with Qwen3.5-122B across benchmarks
- 5x throughput over previous Nemotron Super, NVFP4 quantization-aware training from day one
- Full open release: BF16/FP8/NVFP4 weights, 153 datasets, 15 RL environments on Hugging Face
- Self-host on 8x H100-80GB minimum; available via 15+ cloud providers
The benchmark picture is mixed in the way that matters. It beats GPT-OSS-120B on most tasks - often by wide margins - but trades wins and losses with Qwen3.5-122B-A10B. Where it genuinely leads: long-context retrieval (RULER@1M: 91.75%), math competition (HMMT Feb 2025: 93.67%), and code generation (LiveCodeBench: 81.19%). Where it falls behind: general knowledge (MMLU-Pro: 83.73 vs Qwen's 86.70), science reasoning (GPQA: 79.23 vs 86.60), and agentic coding (SWE-Bench: 60.47 vs 66.40).
The efficiency angle is the real differentiator. Running at 12B active parameters with Mamba-2 layers handling sequence processing means this model can serve agentic workloads - dozens of tool calls across long contexts - at a fraction of the cost of a dense 120B model. NVIDIA claims 5x throughput over the previous Nemotron Super and 4x faster inference on Blackwell NVFP4 vs Hopper FP8.
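As a back-of-envelope check on that efficiency claim, assume the common rule of thumb of roughly 2 FLOPs per active parameter per generated token. This ignores attention and Mamba state costs, so it is only a sketch of the dense-vs-MoE gap, not a throughput prediction:

```python
total_params = 120e9   # dense-equivalent capacity
active_params = 12e9   # parameters actually used per token

# Rough rule of thumb: decoder forward pass costs ~2 FLOPs per
# active parameter per token (attention/Mamba costs ignored).
flops_dense = 2 * total_params
flops_moe = 2 * active_params

speedup = flops_dense / flops_moe
print(speedup)  # → 10.0
```

The ~10x compute gap is why the claimed 5x end-to-end throughput gain is plausible even after routing and memory-bandwidth overheads eat part of it.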
Key Specifications
| Specification | Details |
|---|---|
| Provider | NVIDIA |
| Model Family | Nemotron 3 |
| Parameters | 120B total, 12B active (LatentMoE) |
| Architecture | Mamba-2 + Transformer Hybrid LatentMoE with MTP |
| Context Window | 1,000,000 tokens (256K default config) |
| Training Data | 15.6T tokens, 153 datasets, 20 languages, 43 programming languages |
| Pre-training Cutoff | June 2025 |
| Post-training Cutoff | February 2026 |
| Quantization | NVFP4 (native), FP8, BF16 |
| Min Hardware | 8x H100-80GB |
| License | NVIDIA Nemotron Open Model License |
| Pricing | Free trial on build.nvidia.com; self-host or third-party providers |
| Release Date | March 11, 2026 |
The 1M context window defaults to 256K in the shipped configuration due to VRAM constraints. Extending to the full million requires setting `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` and adjusting `--max-model-len`. Practical long-context use at 1M needs significant GPU memory beyond the 8x H100 minimum.
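A minimal launch sketch for the full-context override, assuming the model ships under a hypothetical Hugging Face id and is served with vLLM's OpenAI-compatible server:

```python
import os

# Hypothetical Hugging Face repo id - check the actual model card.
MODEL_ID = "nvidia/Nemotron-3-Super-120B-A12B"

# The shipped config caps context at 256K; vLLM refuses a longer
# --max-model-len unless this override is set in the environment.
os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1"

serve_cmd = [
    "vllm", "serve", MODEL_ID,
    "--tensor-parallel-size", "8",  # 8x H100-80GB minimum
    "--max-model-len", "1000000",   # raise the default 256K cap to 1M
]
# subprocess.run(serve_cmd) would launch the OpenAI-compatible server.
print(" ".join(serve_cmd))
```

Expect KV-cache and Mamba-state memory to grow with the longer limit; in practice the 1M setting wants more than the 8-GPU minimum.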
Benchmark Performance
All scores from NVIDIA's model card. Comparison models are Qwen3.5-122B-A10B and GPT-OSS-120B - the two closest open models by parameter class.
| Benchmark | Nemotron 3 Super | Qwen3.5-122B-A10B | GPT-OSS-120B |
|---|---|---|---|
| MMLU-Pro | 83.73 | 86.70 | 81.00 |
| AIME 2025 (no tools) | 90.21 | 90.36 | 92.50 |
| HMMT Feb 2025 (no tools) | 93.67 | 91.40 | 90.00 |
| HMMT Feb 2025 (with tools) | 94.73 | 89.55 | - |
| GPQA (no tools) | 79.23 | 86.60 | 80.10 |
| GPQA (with tools) | 82.70 | - | 80.09 |
| LiveCodeBench | 81.19 | 78.93 | 88.00 |
| SWE-Bench (OpenHands) | 60.47 | 66.40 | 41.90 |
| SWE-Bench Multilingual | 45.78 | - | 30.80 |
| RULER @ 256K | 96.30 | 96.74 | 52.30 |
| RULER @ 512K | 95.67 | 95.95 | 46.70 |
| RULER @ 1M | 91.75 | 91.33 | 22.30 |
| IFBench (prompt) | 72.56 | 73.77 | 68.32 |
| Arena-Hard-V2 | 73.88 | 75.15 | 90.26 |
| HLE (no tools) | 18.26 | 25.30 | 14.90 |
| HLE (with tools) | 22.82 | - | 19.00 |
| TauBench V2 (avg) | 61.15 | 74.53 | 61.00 |
| MMLU-ProX (multilingual) | 79.36 | 85.06 | 76.59 |
The RULER long-context scores tell the clearest story. At 1M tokens, Nemotron 3 Super holds 91.75% - virtually tied with Qwen3.5 at 91.33% - while GPT-OSS-120B collapses to 22.30%. The Mamba-2 layers avoid the quadratic attention scaling that kills pure Transformer performance at extreme lengths.
On agentic benchmarks, the results are more nuanced. SWE-Bench OpenHands at 60.47% is solid but roughly 6 points behind Qwen3.5. TauBench V2 averaging 61.15% matches GPT-OSS but trails Qwen3.5's 74.53% by a significant margin. The model excels at tool-augmented reasoning (HMMT improves from 93.67 to 94.73 with tools, GPQA from 79.23 to 82.70), which suggests it's better suited as a tool-calling agent than as a standalone coder.
Arena-Hard-V2 at 73.88% is a weakness - GPT-OSS-120B scores 90.26% on the same benchmark. Open-ended conversational quality isn't this model's strength.
Key Capabilities
Architecture Innovation
Three design choices define this model. Mamba-2 layers handle sequential token processing at linear cost instead of quadratic attention. LatentMoE projects tokens into a compressed latent space before routing to experts - activating four specialists for the cost of one at inference. Multi-Token Prediction uses shared-weight heads to predict multiple future tokens simultaneously, enabling native speculative decoding without a separate draft model.
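A toy numpy sketch of the latent-routing idea - compress the token into a latent space, gate over the top-4 experts there, then project back. The dimensions, random weights, and softmax-over-top-k gating are illustrative only, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_latent = 64, 16   # toy sizes; real dims are far larger
n_experts, top_k = 32, 4     # four active experts, per the description

W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
router = rng.standard_normal((d_latent, n_experts)) / np.sqrt(d_latent)
experts = rng.standard_normal((n_experts, d_latent, d_latent)) / np.sqrt(d_latent)

def latent_moe(x):
    z = x @ W_down                     # compress token into latent space
    logits = z @ router                # route *in* the latent space
    top = np.argsort(logits)[-top_k:]  # pick the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()               # softmax over selected experts only
    out = sum(g * (z @ experts[i]) for g, i in zip(gates, top))
    return out @ W_up                  # project back to model dimension

y = latent_moe(rng.standard_normal(d_model))
```

The point of the down-projection is that both routing and expert matmuls happen at `d_latent` instead of `d_model`, which is where the "four specialists for the cost of one" framing comes from.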
The NVFP4 quantization-aware training is notable because it was applied during pre-training, not bolted on afterward. This means the model learned to be accurate at 4-bit precision rather than losing accuracy from post-training compression. On Blackwell GPUs, NVFP4 runs 4x faster than FP8 on Hopper.
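To see why block-scaled 4-bit can preserve accuracy, here is a toy quantize-dequantize round trip over the E2M1 magnitude grid that FP4 elements use, with one scale per 16-value block. Real NVFP4 stores compact block scales; this sketch keeps them in full precision, so it only illustrates the rounding behavior, not the storage format:

```python
import numpy as np

# Magnitudes representable by an E2M1 (FP4) element.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_roundtrip(x, block=16):
    """Quantize then dequantize with one scale per block of values."""
    out = np.empty_like(x, dtype=float)
    for i in range(0, len(x), block):
        chunk = x[i:i + block]
        # scale so the block's largest magnitude maps to the grid max (6.0)
        scale = max(np.abs(chunk).max(), 1e-12) / FP4_GRID[-1]
        # snap each scaled magnitude to the nearest FP4 grid point
        idx = np.abs(np.abs(chunk[:, None]) / scale - FP4_GRID).argmin(axis=1)
        out[i:i + block] = np.sign(chunk) * FP4_GRID[idx] * scale
    return out
```

Because each block gets its own scale, the worst-case rounding error is bounded relative to that block's largest value rather than the whole tensor's, which is what makes 4-bit elements workable; quantization-aware pre-training then lets the weights adapt to the remaining error.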
Long-Context Performance
The 1M token context window isn't just a marketing number - the RULER benchmark scores back it up. 91.75% at 1M tokens, 95.67% at 512K, 96.30% at 256K. Degradation from 256K to 1M is under 5 percentage points. For RAG pipelines, multi-document analysis, and codebase-scale agentic tasks, this is directly useful.
Open Training Artifacts
NVIDIA published the full recipe: 10+ trillion tokens of pre-training data, all post-training datasets in the Nemotron-Post-Training-v3 collection, and all 15 RL training environments. The three-stage pipeline (pre-training with Megatron-LM, SFT with Data Designer, RL with NeMo RL/NeMo Gym) is documented in the technical report. Synthetic data was generated from frontier models including GPT-OSS-120B, DeepSeek-V3, and Qwen3-235B.
Tool Use and Agentic Workflows
The model supports native tool calling with structured outputs, configurable reasoning (on/off/low-effort via chat template), and multi-turn conversations. NVIDIA's own AI-Q research agent, built on Nemotron 3 Super, topped the DeepResearch Bench and DeepResearch Bench II leaderboards.
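A request sketch for the tool-calling path, assuming an OpenAI-compatible endpoint such as a local vLLM server. The model id and the `reasoning` chat-template kwarg name are assumptions for illustration - check the model card's chat template for the real knob names:

```python
import json

payload = {
    # Hypothetical repo id - substitute the actual one.
    "model": "nvidia/Nemotron-3-Super-120B-A12B",
    "messages": [
        {"role": "user", "content": "What's the weather in Oslo right now?"}
    ],
    # Standard OpenAI-style function tool schema.
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    # vLLM forwards extra chat-template kwargs; the key name is assumed.
    "chat_template_kwargs": {"reasoning": "low"},
}
body = json.dumps(payload)  # POST this to /v1/chat/completions
```

The response would carry a `tool_calls` entry naming `get_weather` with JSON arguments; the agent loop executes it and appends the result as a `tool` message.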
Pricing and Availability
No standardized third-party pricing exists yet - the model launched today. Current options:
| Channel | Cost | Context | Notes |
|---|---|---|---|
| build.nvidia.com | Free (trial) | 262K | Prompts logged, not for production |
| OpenRouter (NVIDIA) | Free (trial) | 262K | Prompts logged |
| Self-hosted (8x H100) | Hardware cost | Up to 1M | BF16, FP8, or NVFP4 checkpoints |
| NIM microservice | License required | Up to 1M | Enterprise deployment |
Cloud providers offering deployment: Google Cloud Vertex AI, Oracle Cloud, AWS Bedrock (forthcoming), Microsoft Azure, CoreWeave, Crusoe, Nebius, Together AI, Baseten, Cloudflare, DeepInfra, Fireworks AI, Lightning AI, Modal, FriendliAI.
For cost comparison context: self-hosting a 12B-active MoE on 8x H100s is dramatically cheaper per token than serving a dense 120B model. The efficiency advantage is the pricing story here - once providers publish rates, expect this to undercut similarly-sized dense models significantly.
Strengths
- Exceptional long-context: 91.75% RULER@1M, one of the best scores at any parameter class
- 12B active inference cost: MoE efficiency means frontier-class capability at a fraction of the compute
- Full open release: Weights, 10T+ tokens of training data, 15 RL environments - not just "open weights"
- Architecture innovation: Mamba-2 + LatentMoE + MTP is a genuine technical advance, not a scaled-up Transformer
- Native NVFP4: Quantization-aware training from pre-training means minimal accuracy loss at 4-bit
- Broad deployment: 15+ cloud and inference providers at launch
- Strong math reasoning: 93.67% HMMT, 90.21% AIME 2025, improves further with tool use
Weaknesses
- Trails Qwen3.5 on key benchmarks: MMLU-Pro, GPQA, SWE-Bench, TauBench V2 - Qwen3.5-122B wins on breadth
- Weak conversational quality: Arena-Hard-V2 at 73.88% is well below GPT-OSS-120B's 90.26%
- HLE score is low: 18.26% (22.82% with tools) trails Qwen3.5's 25.30% on the hardest reasoning benchmark
- 8x H100 minimum: Self-hosting barrier is high for smaller teams
- 1M context needs explicit config: Default ships at 256K, full context needs manual override and more GPUs
- NVIDIA-specific license: Not Apache 2.0 - patent termination clause if you litigate against NVIDIA
- No third-party pricing yet: Hard to evaluate cost-effectiveness until providers publish rates
- Limited multilingual: only 7 languages officially supported (training data spans 20) vs broader coverage from Qwen
Related Coverage
- News: NVIDIA Ships Nemotron 3 Super
- Leaderboards: Reasoning Benchmarks | Coding Benchmarks | Cost Efficiency
- Compare: Qwen 3.5 27B | Claude Opus Distilled | Grok 4
Last verified March 11, 2026
