NVIDIA Nemotron 3 Super 120B-A12B

NVIDIA Nemotron 3 Super is a 120B-parameter open model with 12B active at inference, combining Mamba-2, LatentMoE, and Multi-Token Prediction for agentic workloads with a 1M token context window.

Nemotron 3 Super is NVIDIA's open model play for agentic AI. The pitch: 120 billion parameters of learned capacity, 12 billion active at inference, running at throughput levels that make multi-agent pipelines economically viable. The architecture is genuinely novel - a Mamba-2 and Transformer hybrid with latent Mixture-of-Experts routing and Multi-Token Prediction - and NVIDIA released not just the weights but the full training dataset (10T+ tokens) and all 15 RL environments. That level of openness from a company this size is unusual.

TL;DR

  • 120B total / 12B active MoE with Mamba-2 + Transformer hybrid architecture
  • 1M token context (91.75% RULER@1M), competitive with Qwen3.5-122B across benchmarks
  • 5x throughput over previous Nemotron Super, NVFP4 quantization-aware training from day one
  • Full open release: BF16/FP8/NVFP4 weights, 153 datasets, 15 RL environments on Hugging Face
  • Self-host on 8x H100-80GB minimum; available via 15+ cloud providers

The benchmark picture is mixed in the way that matters. It beats GPT-OSS-120B on most tasks - often by wide margins - but trades wins and losses with Qwen3.5-122B-A10B. Where it genuinely leads: long-context retrieval (RULER@1M: 91.75%), math competition (HMMT Feb 2025: 93.67%), and code generation relative to Qwen (LiveCodeBench: 81.19 vs 78.93, though GPT-OSS-120B tops that benchmark at 88.00). Where it falls behind: general knowledge (MMLU-Pro: 83.73 vs Qwen's 86.70), science reasoning (GPQA: 79.23 vs 86.60), and agentic coding (SWE-Bench: 60.47 vs 66.40).

The efficiency angle is the real differentiator. Running at 12B active parameters with Mamba-2 layers handling sequence processing means this model can serve agentic workloads - dozens of tool calls across long contexts - at a fraction of the cost of a dense 120B model. NVIDIA claims 5x throughput over the previous Nemotron Super and 4x faster inference on Blackwell NVFP4 vs Hopper FP8.
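As a rough illustration of that efficiency claim, the standard ~2 FLOPs-per-active-parameter rule of thumb (an approximation that ignores sequence-length-dependent attention and Mamba terms) puts the per-token compute gap at about 10x:

```python
# Back-of-envelope inference FLOPs per token: roughly 2 * active parameters
# (a common approximation for a decoder forward pass; ignores
# sequence-length-dependent costs).
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense_120b = flops_per_token(120e9)  # hypothetical dense 120B model
nemotron = flops_per_token(12e9)     # 12B active parameters

ratio = dense_120b / nemotron
print(f"Dense 120B needs ~{ratio:.0f}x the per-token compute of a 12B-active MoE")
```

This is only a first-order sketch - real serving cost also depends on memory bandwidth, KV/state cache size, and batching - but it captures why 12B active parameters is the headline number.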

Key Specifications

Specification           Details
Provider                NVIDIA
Model Family            Nemotron 3
Parameters              120B total, 12B active (LatentMoE)
Architecture            Mamba-2 + Transformer hybrid LatentMoE with MTP
Context Window          1,000,000 tokens (256K default config)
Training Data           15.6T tokens, 153 datasets, 20 languages, 43 programming languages
Pre-training Cutoff     June 2025
Post-training Cutoff    February 2026
Quantization            NVFP4 (native), FP8, BF16
Min Hardware            8x H100-80GB
License                 NVIDIA Nemotron Open Model License
Pricing                 Free trial on build.nvidia.com; self-host or third-party providers
Release Date            March 11, 2026

The 1M context window defaults to 256K in the shipped configuration due to VRAM constraints. Extending to the full million requires setting VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 and adjusting --max-model-len. Practical long-context use at 1M needs significant GPU memory beyond the 8x H100 minimum.
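The override described above can be sketched as a launch command. This is a hedged sketch: the `VLLM_ALLOW_LONG_MAX_MODEL_LEN` variable and `--max-model-len` flag come from the text, while the Hugging Face model ID and tensor-parallel size are assumptions, not confirmed values.

```python
import os
import shlex

# Hedged sketch of the vLLM launch implied by the flags described above.
# The model ID and tensor-parallel size are assumptions.
env = dict(os.environ, VLLM_ALLOW_LONG_MAX_MODEL_LEN="1")  # permit > shipped max
cmd = [
    "vllm", "serve", "nvidia/Nemotron-3-Super-120B-A12B",  # hypothetical model ID
    "--tensor-parallel-size", "8",   # 8x H100-80GB minimum from the spec table
    "--max-model-len", "1000000",    # extend from the 256K default to 1M
]
print(shlex.join(cmd))  # pass env=env to subprocess.run(...) to actually launch
```

Remember that actually filling a 1M-token context needs substantially more GPU memory than the 8x H100 floor.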

Benchmark Performance

All scores from NVIDIA's model card. Comparison models are Qwen3.5-122B-A10B and GPT-OSS-120B - the two closest open models by parameter class.

Benchmark                     Nemotron 3 Super   Qwen3.5-122B-A10B   GPT-OSS-120B
MMLU-Pro                      83.73              86.70               81.00
AIME 2025 (no tools)          90.21              90.36               92.50
HMMT Feb 2025 (no tools)      93.67              91.40               90.00
HMMT Feb 2025 (with tools)    94.73              89.55               -
GPQA (no tools)               79.23              86.60               80.10
GPQA (with tools)             82.70              -                   80.09
LiveCodeBench                 81.19              78.93               88.00
SWE-Bench (OpenHands)         60.47              66.40               41.90
SWE-Bench Multilingual        45.78              -                   30.80
RULER @ 256K                  96.30              96.74               52.30
RULER @ 512K                  95.67              95.95               46.70
RULER @ 1M                    91.75              91.33               22.30
IFBench (prompt)              72.56              73.77               68.32
Arena-Hard-V2                 73.88              75.15               90.26
HLE (no tools)                18.26              25.30               14.90
HLE (with tools)              22.82              -                   19.00
TauBench V2 (avg)             61.15              74.53               61.00
MMLU-ProX (multilingual)      79.36              85.06               76.59

The RULER long-context scores tell the clearest story. At 1M tokens, Nemotron 3 Super holds 91.75% - virtually tied with Qwen3.5 at 91.33% - while GPT-OSS-120B collapses to 22.30%. The Mamba-2 layers avoid the quadratic attention scaling that kills pure Transformer performance at extreme lengths.
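A toy cost model (arbitrary units) shows why: full self-attention work grows with the square of sequence length, while a Mamba-2-style state-space scan grows linearly, so the gap widens by another factor of the length itself as contexts stretch.

```python
# Toy scaling comparison in arbitrary units: full self-attention over a
# sequence of length L does O(L^2) pairwise work, while a Mamba-2-style
# state-space scan does O(L) work with a fixed-size recurrent state.
def attention_cost(seq_len: int) -> int:
    return seq_len ** 2

def ssm_cost(seq_len: int) -> int:
    return seq_len

for L in (256_000, 512_000, 1_000_000):
    ratio = attention_cost(L) // ssm_cost(L)
    print(f"L={L:>9,}: attention/SSM cost ratio = {ratio:,}x")
```

Real hybrid models keep some attention layers, so the practical gap is smaller than this idealized ratio, but the asymptotic picture is why pure-Transformer models struggle at 1M tokens.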

On agentic benchmarks, the results are more nuanced. SWE-Bench OpenHands at 60.47% is solid but roughly 6 points behind Qwen3.5. TauBench V2 averaging 61.15% matches GPT-OSS but trails Qwen3.5's 74.53% by a wide margin. The model does excel at tool-augmented reasoning (HMMT improves from 93.67 to 94.73 with tools, GPQA from 79.23 to 82.70), which suggests it's stronger as a tool-calling agent than as a standalone coder.

Arena-Hard-V2 at 73.88% is a weakness - GPT-OSS-120B scores 90.26% on the same benchmark. Open-ended conversational quality isn't this model's strength.

Key Capabilities

Architecture Innovation

Three design choices define this model. Mamba-2 layers handle sequential token processing at linear cost instead of quadratic attention. LatentMoE projects tokens into a compressed latent space before routing to experts - activating four specialists for the cost of one at inference. Multi-Token Prediction uses shared-weight heads to predict multiple future tokens simultaneously, enabling native speculative decoding without a separate draft model.
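A minimal sketch of the latent-routing idea, with made-up dimensions and a plain top-k softmax gate standing in for whatever NVIDIA actually uses - this is an illustration of the concept, not the released implementation:

```python
import math
import random

random.seed(0)

# Toy LatentMoE-style router: project the token into a smaller latent space,
# score the experts there, and activate only the top-k. All dimensions and
# the gating scheme here are arbitrary illustrations.
D_MODEL, D_LATENT, N_EXPERTS, TOP_K = 16, 4, 8, 4

token = [random.gauss(0, 1) for _ in range(D_MODEL)]
down_proj = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(D_LATENT)]
gate = [[random.gauss(0, 1) for _ in range(D_LATENT)] for _ in range(N_EXPERTS)]

# Down-project to the latent space, then score each expert against it.
latent = [sum(w * x for w, x in zip(row, token)) for row in down_proj]
scores = [sum(w * z for w, z in zip(row, latent)) for row in gate]

# Softmax over the top-k expert scores only; the other experts stay inactive.
top = sorted(range(N_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
exp_s = {i: math.exp(scores[i]) for i in top}
total = sum(exp_s.values())
weights = {i: exp_s[i] / total for i in top}

print(f"active experts: {sorted(weights)} (of {N_EXPERTS})")
```

Routing in a compressed latent space keeps the gating computation (and the routed expert inputs) small relative to the full model dimension, which is where the claimed "four specialists for the cost of one" efficiency comes from.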

The NVFP4 quantization-aware training is notable because it was applied during pre-training, not bolted on afterward. This means the model learned to be accurate at 4-bit precision rather than losing accuracy from post-training compression. On Blackwell GPUs, NVFP4 runs 4x faster than FP8 on Hopper.
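A toy version of the idea: FP4 E2M1 can represent only eight magnitudes, so each value is snapped to the nearest one under a per-block scale. Real NVFP4 uses FP8 scale factors over small fixed-size blocks; the plain float scale here is a simplification.

```python
# FP4 E2M1 representable magnitudes (plus sign bit). NVFP4 pairs these with a
# per-block scale factor; this sketch uses a plain float scale for clarity.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(values):
    # Scale so the largest magnitude maps to E2M1's max value of 6.0.
    scale = max(abs(v) for v in values) / 6.0 or 1.0
    q = [min(E2M1, key=lambda g: abs(abs(v) / scale - g)) * (1 if v >= 0 else -1)
         for v in values]
    return [x * scale for x in q], scale  # dequantized values, block scale

weights = [0.11, -0.42, 0.93, -1.50, 0.07, 0.66, -0.28, 1.02]
deq, scale = quantize_block(weights)
err = max(abs(a - b) for a, b in zip(weights, deq))
print(f"scale={scale:.3f}, max abs error={err:.3f}")
```

Training with this rounding in the loop (quantization-aware training) lets the weights settle into values that survive the snap, which is why doing it during pre-training beats compressing afterward.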

Long-Context Performance

The 1M token context window isn't just a marketing number - the RULER benchmark scores back it up. 91.75% at 1M tokens, 95.67% at 512K, 96.30% at 256K. Degradation from 256K to 1M is under 5 percentage points. For RAG pipelines, multi-document analysis, and codebase-scale agentic tasks, this is directly useful.

Open Training Artifacts

NVIDIA published the full recipe: 10+ trillion tokens of pre-training data, all post-training datasets in the Nemotron-Post-Training-v3 collection, and all 15 RL training environments. The three-stage pipeline (pre-training with Megatron-LM, SFT with Data Designer, RL with NeMo RL/NeMo Gym) is documented in the technical report. Synthetic data was generated from frontier models including GPT-OSS-120B, DeepSeek-V3, and Qwen3-235B.

Tool Use and Agentic Workflows

The model supports native tool calling with structured outputs, configurable reasoning (on/off/low-effort via chat template), and multi-turn conversations. NVIDIA's own AI-Q research agent, built on Nemotron 3 Super, topped the DeepResearch Bench and DeepResearch Bench II leaderboards.
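A hedged sketch of what a tool-calling request might look like against an OpenAI-compatible endpoint. The serving model ID, the `get_weather` tool, and the exact reasoning-toggle mechanism are assumptions for illustration - the source says only that reasoning is configurable via the chat template, and the control syntax varies by deployment.

```python
import json

# Hedged sketch of an OpenAI-compatible /v1/chat/completions payload for a
# tool-calling turn. The model ID and the reasoning-toggle syntax in the
# system message are assumptions, not documented values.
payload = {
    "model": "nvidia/nemotron-3-super-120b-a12b",  # hypothetical serving ID
    "messages": [
        {"role": "system", "content": "You are a research agent."},
        {"role": "user", "content": "What's the weather in Chicago right now?"},
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
print(json.dumps(payload)[:80] + "...")
```

In a multi-turn agent loop, the model's `tool_calls` response gets executed client-side and the result is appended as a `tool` role message before the next completion request.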

Pricing and Availability

No standardized third-party pricing exists yet - the model launched today. Current options:

Channel                  Cost               Context    Notes
build.nvidia.com         Free (trial)       262K       Prompts logged, not for production
OpenRouter (NVIDIA)      Free (trial)       262K       Prompts logged
Self-hosted (8x H100)    Hardware cost      Up to 1M   BF16, FP8, or NVFP4 checkpoints
NIM microservice         License required   Up to 1M   Enterprise deployment

Cloud providers offering deployment: Google Cloud Vertex AI, Oracle Cloud, AWS Bedrock (forthcoming), Microsoft Azure, CoreWeave, Crusoe, Nebius, Together AI, Baseten, Cloudflare, DeepInfra, Fireworks AI, Lightning AI, Modal, FriendliAI.

For cost comparison context: self-hosting a 12B-active MoE on 8x H100s is dramatically cheaper per token than serving a dense 120B model. The efficiency advantage is the pricing story here - once providers publish rates, expect this to undercut similarly-sized dense models significantly.

Strengths

  • Exceptional long-context: 91.75% RULER@1M, one of the best scores at any parameter class
  • 12B active inference cost: MoE efficiency means frontier-class capability at a fraction of the compute
  • Full open release: Weights, 10T+ tokens of training data, 15 RL environments - not just "open weights"
  • Architecture innovation: Mamba-2 + LatentMoE + MTP is a genuine technical advance, not a scaled-up Transformer
  • Native NVFP4: Quantization-aware training from pre-training means minimal accuracy loss at 4-bit
  • Broad deployment: 15+ cloud and inference providers at launch
  • Strong math reasoning: 93.67% HMMT, 90.21% AIME 2025, improves further with tool use

Weaknesses

  • Trails Qwen3.5 on key benchmarks: MMLU-Pro, GPQA, SWE-Bench, TauBench V2 - Qwen3.5-122B wins on breadth
  • Weak conversational quality: Arena-Hard-V2 at 73.88% is well below GPT-OSS-120B's 90.26%
  • HLE score is low: 18.26% (22.82% with tools) trails Qwen3.5's 25.30% on the hardest reasoning benchmark
  • 8x H100 minimum: Self-hosting barrier is high for smaller teams
  • 1M context needs explicit config: Default ships at 256K, full context needs manual override and more GPUs
  • NVIDIA-specific license: Not Apache 2.0 - patent termination clause if you litigate against NVIDIA
  • No third-party pricing yet: Hard to evaluate cost-effectiveness until providers publish rates
  • Limited multilingual: 7 languages supported vs broader coverage from Qwen

Last verified March 11, 2026

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.