Nemotron 3 Nano 30B-A3B

NVIDIA's hybrid Mamba2+MoE model packs 31.6B total parameters but activates only 3.2B per token, delivering frontier-class reasoning with 3.3x the throughput of comparable models on a single H200 GPU.

Nemotron 3 Nano 30B-A3B is NVIDIA's bet that you can build a frontier-competitive reasoning model by activating roughly 10% of your total parameters. The model uses a hybrid architecture - 23 Mamba-2 layers, 23 MoE layers, and 6 grouped query attention layers across 52 total layers - to achieve throughput numbers that make competing MoE models look sluggish.

TL;DR

  • 31.6B total / 3.2B active hybrid Mamba2+MoE architecture - 128 routed experts with 6 active per token
  • 89.1% on AIME 2025 (no tools), 99.2% with tool use - 38.8% SWE-bench Verified (OpenHands)
  • 1M token context with 92.9% RULER accuracy at 256K and 86.3% at 1M
  • 3.3x higher throughput than Qwen3-30B-A3B on identical hardware (single H200)
  • NVIDIA Open Model License - not quite Apache 2.0, but commercially permissive

Overview

The architecture is genuinely novel. Rather than pure Transformer or pure Mamba, NVIDIA interleaves three different layer types in a repeating pattern: Mamba-2 layers handle sequential state tracking with linear complexity, MoE layers provide capacity through 128 routed experts (6 active per token plus 1 shared), and grouped query attention layers (with 2 groups) provide the global context that pure state-space models struggle with. The result is a model that processes long sequences far more efficiently than a standard Transformer while retaining the quality benefits of attention where it counts.
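The published numbers above pin down the layer mix and expert routing, even though NVIDIA has not published the exact interleaving order. A minimal sketch of the composition, with the per-type counts taken from the spec and everything else illustrative:

```python
# Illustrative sketch of Nemotron 3 Nano's layer mix. Only the per-type
# counts and expert numbers come from NVIDIA's published specs; the rest
# (names, structure) is for illustration.
LAYER_COUNTS = {"mamba2": 23, "moe": 23, "gqa": 6}  # 52 layers total

ROUTED_EXPERTS = 128   # routed experts available per MoE layer
ACTIVE_EXPERTS = 6     # routed experts selected per token
SHARED_EXPERTS = 1     # always-on shared expert

def active_expert_fraction() -> float:
    """Fraction of routed experts a single token actually touches."""
    return ACTIVE_EXPERTS / ROUTED_EXPERTS

total_layers = sum(LAYER_COUNTS.values())
print(total_layers)                       # 52
print(f"{active_expert_fraction():.1%}")  # 4.7%
```

The 6-of-128 routing is what drives the 3.2B-active figure: each token touches under 5% of the routed expert pool, plus the shared expert and the dense Mamba-2/GQA layers.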

Training used 25 trillion tokens with a Warmup-Stable-Decay learning rate schedule. The post-training pipeline is equally interesting - NVIDIA used synchronous GRPO (Group Relative Policy Optimization) with Qwen3-Nemotron-235B-A22B-GenRM as the reward model, targeting math, code, science, tool use, and structured output. The model supports a reasoning ON/OFF toggle with configurable thinking budgets, so you can trade reasoning depth for latency depending on the task.

The throughput story is where NVIDIA's hardware-software co-design pays off. On a single H200 GPU, Nemotron 3 Nano achieves 3.3x the throughput of Qwen3-30B-A3B - a model with similar active parameter counts. That gap comes from the Mamba-2 layers processing sequences in linear time and NVIDIA's optimized inference stack.

Key Specifications

| Specification | Details |
|---|---|
| Provider | NVIDIA |
| Model Family | Nemotron 3 |
| Architecture | Hybrid Mamba-2 / Transformer MoE |
| Total Parameters | 31.6B |
| Active Parameters | 3.2B (3.5B with embeddings) |
| Total Layers | 52 |
| Layer Composition | 23 Mamba-2 + 23 MoE + 6 GQA |
| MoE Configuration | 128 routed experts + 1 shared (6 active per token) |
| GQA Groups | 2 |
| Context Window | 1M tokens (default config: 256K) |
| Training Data | 25T tokens (cutoff June 2025) |
| Languages | 20+ (including 43 programming languages) |
| Inference Engines | vLLM, TensorRT-LLM, SGLang, llama.cpp |
| Release Date | December 15, 2025 |
| License | NVIDIA Open Model License |

Benchmark Performance

| Benchmark | Nemotron 3 Nano | Qwen3-30B-A3B | GPT-OSS-20B |
|---|---|---|---|
| AIME 2025 (no tools) | 89.1 | 85.0 | 91.7 |
| AIME 2025 (with tools) | 99.2 | - | 98.7 |
| MMLU-Pro | 78.3 | 80.9 | 75.0 |
| GPQA Diamond (no tools) | 73.0 | 73.4 | 71.5 |
| LiveCodeBench v6 | 68.3 | 66.0 | 61.0 |
| Arena-Hard-V2 (Avg) | 67.7 | 57.8 | 48.6 |
| SWE-bench (OpenHands) | 38.8 | 22.0 | 34.0 |
| IFBench | 71.5 | 51.0 | 65.0 |
| RULER-100 @ 256K | 92.9 | 89.4 | - |
| RULER-100 @ 1M | 86.3 | 77.5 | - |
| TauBench V2 (Avg) | 49.0 | 47.7 | 48.7 |

The benchmark picture shows clear specialization. Nemotron 3 Nano dominates on agentic tasks (SWE-bench 38.8 vs Qwen3's 22.0), long-context retrieval (86.3% RULER at 1M tokens), and instruction following (IFBench 71.5 vs 51.0). It trails Qwen3 on broad knowledge (MMLU-Pro 78.3 vs 80.9) and GPQA by a hair. The AIME 2025 result with tools - 99.2% - is eye-catching but should be interpreted carefully since tool-augmented math is a different capability than raw reasoning.

Key Capabilities

The long-context performance is the standout feature. At 256K tokens, Nemotron 3 Nano scores 92.9% on RULER-100, beating Qwen3-30B-A3B's 89.4%. At the full 1M context, it still holds 86.3% versus Qwen3's 77.5%. The Mamba-2 layers with their linear complexity do the heavy lifting here - you are not paying quadratic attention costs across the entire context.
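A back-of-envelope comparison makes the scaling argument concrete. Constants and the attention/Mamba cost-per-step differences are ignored here, so only the ratios are meaningful:

```python
# Back-of-envelope scaling: self-attention's pairwise score computation grows
# with n^2, while a state-space (Mamba-2) scan grows with n. Constants are
# ignored -- treat this purely as scaling intuition.
def quadratic_cost(n: int) -> int:
    return n * n  # pairwise token interactions

def linear_cost(n: int) -> int:
    return n      # one recurrent scan step per token

for ctx in (4_096, 262_144, 1_048_576):
    ratio = quadratic_cost(ctx) / linear_cost(ctx)  # simplifies to ctx
    print(f"{ctx:>9} tokens: quadratic/linear ratio = {ratio:,.0f}x")
```

The ratio grows linearly with context length, so the relative advantage of the Mamba-2 layers is roughly 256x larger at 1M tokens than at 4K. The 6 GQA layers still pay quadratic cost, but on only 6 of 52 layers.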

For agentic workloads - the use case NVIDIA is explicitly targeting - the model posts strong numbers on SWE-bench Verified (38.8%), TauBench V2 (49.0%), and BFCL v4 (53.8% for function calling). The reasoning ON/OFF toggle is a practical feature for multi-agent systems where you want fast responses for simple routing decisions and deep reasoning for complex subtasks. You can set a thinking budget in tokens to control the cost-quality tradeoff per request.

Hardware requirements are reasonable for the total parameter count. The full BF16 weights need approximately 60GB of VRAM, fitting on a single H100-80GB or H200. FP8 quantized weights are available from NVIDIA directly, and GGUF quantizations from the community bring this into RTX 4090 territory. NVIDIA lists official support for H100, A100, B200, RTX PRO 6000, Jetson Thor, and DGX Spark.
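The ~60GB figure follows directly from the parameter count. The sketch below reproduces that arithmetic; it ignores activation memory, KV/state caches, and framework overhead, which add headroom requirements on top of the weights:

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Ignores activations, KV/state caches, and framework overhead.
TOTAL_PARAMS = 31.6e9

def weight_gb(bytes_per_param: float) -> float:
    return TOTAL_PARAMS * bytes_per_param / 1e9  # decimal GB

print(f"BF16: {weight_gb(2):.1f} GB")  # 63.2 GB -> fits an 80GB H100/H200
print(f"FP8:  {weight_gb(1):.1f} GB")  # 31.6 GB
```

FP8 halves the footprint to roughly 32GB, which is why community GGUF quantizations at 4-5 bits can reach 24GB consumer cards like the RTX 4090.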

Pricing and Availability

The model weights are available on HuggingFace under the NVIDIA Open Model License - commercially permissive but not Apache 2.0. DeepInfra offers API access at $0.06/M input tokens and $0.24/M output tokens. OpenRouter lists a free tier for testing. NVIDIA NIM provides hosted inference through their own infrastructure.
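At DeepInfra's listed rates, per-request cost is simple arithmetic. The request size below is hypothetical; only the per-token rates come from the listing:

```python
# Cost sketch at DeepInfra's listed rates: $0.06/M input, $0.24/M output.
INPUT_RATE = 0.06 / 1_000_000   # dollars per input token
OUTPUT_RATE = 0.24 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical long-context agentic request: 200K tokens in, 4K tokens out.
cost = request_cost(200_000, 4_000)
print(f"${cost:.4f}")  # $0.0130
```

Even a 200K-token context costs about a cent per request at these rates, which is the economic argument for pairing cheap long-context models with agent loops.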

For self-hosting, vLLM 0.12.0+ and SGLang both support the model with reasoning parsing plugins. TensorRT-LLM integration leverages NVIDIA's optimized kernels for maximum throughput on their GPUs.

Strengths

  • Hybrid Mamba2+MoE architecture delivers 3.3x throughput advantage over similar-sized models
  • Best-in-class long-context performance (86.3% RULER at 1M tokens)
  • Strong agentic benchmarks (SWE-bench 38.8, BFCL 53.8) for the 3B active parameter range
  • Configurable reasoning ON/OFF with token budgets for cost optimization
  • 43 programming languages in training data

Weaknesses

  • MMLU-Pro (78.3) trails Qwen3-30B-A3B on broad knowledge tasks
  • NVIDIA Open Model License is more restrictive than Apache 2.0 - read the fine print
  • Mamba-2 layer support still maturing in third-party inference frameworks
  • 31.6B total weights must reside in memory despite only 3.2B activating
  • Self-reported benchmarks from NVIDIA's own evaluation pipeline - independent verification still ongoing

About the author

James, AI Benchmarks & Tools Analyst, is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.