Nemotron 3 Nano 30B-A3B

NVIDIA's hybrid Mamba2+MoE model packs 31.6B total parameters but activates only 3.2B per token, delivering frontier-class reasoning with 3.3x the throughput of comparable models on a single H200 GPU.

Nemotron 3 Nano 30B-A3B is NVIDIA's bet that you can build a frontier-competitive reasoning model by activating roughly 10% of your total parameters. The model uses a hybrid architecture - 23 Mamba-2 layers, 23 MoE layers, and 6 grouped query attention layers across 52 total layers - to achieve throughput numbers that make competing MoE models look sluggish.

TL;DR

  • 31.6B total / 3.2B active hybrid Mamba2+MoE architecture - 128 routed experts with 6 active per token
  • 89.1% on AIME 2025 (no tools), 99.2% with tool use - 38.8% SWE-bench Verified (OpenHands)
  • 1M token context with 92.9% RULER accuracy at 256K and 86.3% at 1M
  • 3.3x higher throughput than Qwen3-30B-A3B on identical hardware (single H200)
  • NVIDIA Open Model License - not quite Apache 2.0, but commercially permissive

Overview

The architecture is genuinely novel. Rather than pure Transformer or pure Mamba, NVIDIA interleaves three different layer types in a repeating pattern: Mamba-2 layers handle sequential state tracking with linear complexity, MoE layers provide capacity through 128 routed experts (6 active per token plus 1 shared), and grouped query attention layers (with 2 groups) provide the global context that pure state-space models struggle with. The result is a model that processes long sequences far more efficiently than a standard Transformer while retaining the quality benefits of attention where it counts.
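The published numbers above pin down the layer mix and expert routing, even though NVIDIA has not published the exact interleaving order. A minimal sketch of the composition, with the per-type counts taken from the spec and everything else illustrative:

```python
# Illustrative sketch of Nemotron 3 Nano's layer mix. Only the per-type
# counts and expert numbers come from NVIDIA's published specs; the rest
# (names, structure) is for illustration.
LAYER_COUNTS = {"mamba2": 23, "moe": 23, "gqa": 6}  # 52 layers total

ROUTED_EXPERTS = 128   # routed experts available per MoE layer
ACTIVE_EXPERTS = 6     # routed experts selected per token
SHARED_EXPERTS = 1     # always-on shared expert

def active_expert_fraction() -> float:
    """Fraction of routed experts a single token actually touches."""
    return ACTIVE_EXPERTS / ROUTED_EXPERTS

total_layers = sum(LAYER_COUNTS.values())
print(total_layers)                       # 52
print(f"{active_expert_fraction():.1%}")  # 4.7%
```

The 6-of-128 routing is what drives the 3.2B-active figure: each token touches under 5% of the routed expert pool, plus the shared expert and the dense Mamba-2/GQA layers.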

Training used 25 trillion tokens with a Warmup-Stable-Decay learning rate schedule. The post-training pipeline is equally interesting - NVIDIA used synchronous GRPO (Group Relative Policy Optimization) with Qwen3-Nemotron-235B-A22B-GenRM as the reward model, targeting math, code, science, tool use, and structured output. The model supports a reasoning ON/OFF toggle with configurable thinking budgets, so you can trade reasoning depth for latency depending on the task.

The throughput story is where NVIDIA's hardware-software co-design pays off. On a single H200 GPU, Nemotron 3 Nano achieves 3.3x the throughput of Qwen3-30B-A3B - a model with similar active parameter counts. That gap comes from the Mamba-2 layers processing sequences in linear time and NVIDIA's optimized inference stack.

Key Specifications

| Specification | Details |
|---|---|
| Provider | NVIDIA |
| Model Family | Nemotron 3 |
| Architecture | Hybrid Mamba-2 / Transformer MoE |
| Total Parameters | 31.6B |
| Active Parameters | 3.2B (3.5B with embeddings) |
| Total Layers | 52 |
| Layer Composition | 23 Mamba-2 + 23 MoE + 6 GQA |
| MoE Configuration | 128 routed experts + 1 shared (6 active per token) |
| GQA Groups | 2 |
| Context Window | 1M tokens (default config: 256K) |
| Training Data | 25T tokens (cutoff June 2025) |
| Languages | 20+ (including 43 programming languages) |
| Inference Engines | vLLM, TensorRT-LLM, SGLang, llama.cpp |
| Release Date | December 15, 2025 |
| License | NVIDIA Open Model License |

Benchmark Performance

| Benchmark | Nemotron 3 Nano | Qwen3-30B-A3B | GPT-OSS-20B |
|---|---|---|---|
| AIME 2025 (no tools) | 89.1 | 85.0 | 91.7 |
| AIME 2025 (with tools) | 99.2 | - | 98.7 |
| MMLU-Pro | 78.3 | 80.9 | 75.0 |
| GPQA Diamond (no tools) | 73.0 | 73.4 | 71.5 |
| LiveCodeBench v6 | 68.3 | 66.0 | 61.0 |
| Arena-Hard-V2 (Avg) | 67.7 | 57.8 | 48.6 |
| SWE-bench (OpenHands) | 38.8 | 22.0 | 34.0 |
| IFBench | 71.5 | 51.0 | 65.0 |
| RULER-100 @ 256K | 92.9 | 89.4 | - |
| RULER-100 @ 1M | 86.3 | 77.5 | - |
| TauBench V2 (Avg) | 49.0 | 47.7 | 48.7 |

The benchmark picture shows clear specialization. Nemotron 3 Nano dominates on agentic tasks (SWE-bench 38.8 vs Qwen3's 22.0), long-context retrieval (86.3% RULER at 1M tokens), and instruction following (IFBench 71.5 vs 51.0). It trails Qwen3 on broad knowledge (MMLU-Pro 78.3 vs 80.9) and GPQA by a hair. The AIME 2025 result with tools - 99.2% - is eye-catching but should be interpreted carefully since tool-augmented math is a different capability than raw reasoning.

Key Capabilities

The long-context performance is the standout feature. At 256K tokens, Nemotron 3 Nano scores 92.9% on RULER-100, beating Qwen3-30B-A3B's 89.4%. At the full 1M context, it still holds 86.3% versus Qwen3's 77.5%. The Mamba-2 layers with their linear complexity do the heavy lifting here - you are not paying quadratic attention costs across the entire context.
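A back-of-envelope comparison makes the scaling argument concrete. Constants and the attention/Mamba cost-per-step differences are ignored here, so only the ratios are meaningful:

```python
# Back-of-envelope scaling: self-attention's pairwise score computation grows
# with n^2, while a state-space (Mamba-2) scan grows with n. Constants are
# ignored -- treat this purely as scaling intuition.
def quadratic_cost(n: int) -> int:
    return n * n  # pairwise token interactions

def linear_cost(n: int) -> int:
    return n      # one recurrent scan step per token

for ctx in (4_096, 262_144, 1_048_576):
    ratio = quadratic_cost(ctx) / linear_cost(ctx)  # simplifies to ctx
    print(f"{ctx:>9} tokens: quadratic/linear ratio = {ratio:,.0f}x")
```

The ratio grows linearly with context length, so the relative advantage of the Mamba-2 layers is roughly 256x larger at 1M tokens than at 4K. The 6 GQA layers still pay quadratic cost, but on only 6 of 52 layers.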

For agentic workloads - the use case NVIDIA is explicitly targeting - the model posts strong numbers on SWE-bench Verified (38.8%), TauBench V2 (49.0%), and BFCL v4 (53.8% for function calling). The reasoning ON/OFF toggle is a practical feature for multi-agent systems where you want fast responses for simple routing decisions and deep reasoning for complex subtasks. You can set a thinking budget in tokens to control the cost-quality tradeoff per request.

Hardware requirements are reasonable for the total parameter count. The full BF16 weights need approximately 60GB of VRAM, fitting on a single H100-80GB or H200. FP8 quantized weights are available from NVIDIA directly, and GGUF quantizations from the community bring this into RTX 4090 territory. NVIDIA lists official support for H100, A100, B200, RTX PRO 6000, Jetson Thor, and DGX Spark.
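The ~60GB figure follows directly from the parameter count. The sketch below reproduces that arithmetic; it ignores activation memory, KV/state caches, and framework overhead, which add headroom requirements on top of the weights:

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Ignores activations, KV/state caches, and framework overhead.
TOTAL_PARAMS = 31.6e9

def weight_gb(bytes_per_param: float) -> float:
    return TOTAL_PARAMS * bytes_per_param / 1e9  # decimal GB

print(f"BF16: {weight_gb(2):.1f} GB")  # 63.2 GB -> fits an 80GB H100/H200
print(f"FP8:  {weight_gb(1):.1f} GB")  # 31.6 GB
```

FP8 halves the footprint to roughly 32GB, which is why community GGUF quantizations at 4-5 bits can reach 24GB consumer cards like the RTX 4090.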

Pricing and Availability

The model weights are available on HuggingFace under the NVIDIA Open Model License - commercially permissive but not Apache 2.0. DeepInfra offers API access at $0.06/M input tokens and $0.24/M output tokens. OpenRouter lists a free tier for testing. NVIDIA NIM provides hosted inference through their own infrastructure.
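At DeepInfra's listed rates, per-request cost is simple arithmetic. The request size below is hypothetical; only the per-token rates come from the listing:

```python
# Cost sketch at DeepInfra's listed rates: $0.06/M input, $0.24/M output.
INPUT_RATE = 0.06 / 1_000_000   # dollars per input token
OUTPUT_RATE = 0.24 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical long-context agentic request: 200K tokens in, 4K tokens out.
cost = request_cost(200_000, 4_000)
print(f"${cost:.4f}")  # $0.0130
```

Even a 200K-token context costs about a cent per request at these rates, which is the economic argument for pairing cheap long-context models with agent loops.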

For self-hosting, vLLM 0.12.0+ and SGLang both support the model with reasoning parsing plugins. TensorRT-LLM integration leverages NVIDIA's optimized kernels for maximum throughput on their GPUs.

Strengths

  • Hybrid Mamba2+MoE architecture delivers 3.3x throughput advantage over similar-sized models
  • Best-in-class long-context performance (86.3% RULER at 1M tokens)
  • Strong agentic benchmarks (SWE-bench 38.8, BFCL 53.8) for the 3B active parameter range
  • Configurable reasoning ON/OFF with token budgets for cost optimization
  • 43 programming languages in training data

Weaknesses

  • MMLU-Pro (78.3) trails Qwen3-30B-A3B on broad knowledge tasks
  • NVIDIA Open Model License is more restrictive than Apache 2.0 - read the fine print
  • Mamba-2 layer support still maturing in third-party inference frameworks
  • 31.6B total weights must reside in memory despite only 3.2B activating
  • Self-reported benchmarks from NVIDIA's own evaluation pipeline - independent verification still ongoing

About the author

James, AI Benchmarks & Tools Analyst, is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.