Nemotron 3 Nano Omni

NVIDIA's first open omni-modal model: 30B total / 3B active hybrid Mamba-MoE that processes text, images, audio, and video in a single inference loop, with 9x higher throughput than comparable open omni models.

NVIDIA released Nemotron 3 Nano Omni on April 28, 2026 - the first open model from the company that processes text, images, audio, and video natively inside a single inference loop. Where the earlier Nemotron 3 Nano 30B-A3B handled text only, the Omni variant grafts a C-RADIOv4-H vision encoder and NVIDIA's Parakeet audio encoder onto the same hybrid Mamba-Transformer-MoE backbone, bringing it to four modalities without ballooning compute. It is arguably the most efficiency-focused open multimodal model currently available.

TL;DR

  • Best open model for document intelligence, video understanding, and agentic computer use
  • 30B total / 3B active parameters; 256K context; free on OpenRouter
  • Beats Qwen3-Omni on nearly every published multimodal benchmark (ScreenSpot-Pro is the lone exception) while delivering 9x higher throughput on video tasks

Overview

The Omni model shares the same 30B-A3B sparsity profile as its text-only sibling, meaning it activates only 3 billion parameters per token at inference time. The backbone interleaves 23 Mamba selective state-space layers (efficient for long sequences), 23 Mixture-of-Experts layers with 128 experts and top-6 routing, and 6 grouped-query attention layers. On top of that base sit two adapters: a vision encoder that handles images and video via dynamic-resolution patch tokenization and an efficient video sampling (EVS) layer that prunes static tokens across frames, and an audio encoder trained on NVIDIA's Parakeet model that processes up to 20 minutes of audio per inference call.
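NVIDIA has not published the router implementation, but the top-6-of-128 expert selection described above follows the standard top-k gating pattern. A toy sketch, with all shapes and the softmax-over-selected-experts renormalization assumed rather than confirmed:

```python
import numpy as np

def moe_route(hidden, gate_w, top_k=6):
    """Toy top-k MoE router: score all experts, keep top_k per token.

    hidden:  (tokens, d_model) activations
    gate_w:  (d_model, n_experts) router weights
    Returns per-token expert indices and renormalized mixing weights.
    """
    logits = hidden @ gate_w                           # (tokens, n_experts)
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of the top_k experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # softmax over only the selected experts, so mixing weights sum to 1
    w = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return top_idx, w

tokens, d_model, n_experts = 4, 32, 128
rng = np.random.default_rng(0)
idx, w = moe_route(rng.standard_normal((tokens, d_model)),
                   rng.standard_normal((d_model, n_experts)))
print(idx.shape, w.shape)  # (4, 6) (4, 6)
```

The efficiency story follows from this: only the 6 routed experts (plus shared layers) run per token, which is how a 30B-parameter model ends up with a 3B-active compute cost.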

The published technical report (arXiv 2604.24954) credits a three-stage training recipe: multimodal alignment and context extension, preference optimization, and a multimodal reinforcement learning pass using NeMo-RL across 2.3 million environment rollouts. Synthetic data from NeMo Data Designer contributed 11.4 million visual QA pairs, and the RL phase covered 25 verifier configurations across math reasoning, GUI grounding, and ASR verification. That's a serious post-training investment for a 3B-active model.

On the positioning front, this isn't a general reasoning leader. NVIDIA didn't claim top scores on GPQA Diamond or MMLU, and the published benchmark table doesn't include them. This model is optimized for throughput and multimodal workload efficiency - specifically for agentic pipelines that need to ingest documents, screen recordings, and audio streams without spinning up separate specialized models.

Key Specifications

| Specification | Details |
|---|---|
| Provider | NVIDIA |
| Model Family | Nemotron 3 |
| Parameters | 30B total / 3B active (MoE, top-6 routing) |
| Context Window | 256K tokens |
| Input Modalities | Text, images, video, audio, documents, charts, GUIs |
| Output | Text only |
| Input Price | Free (OpenRouter); commercial providers unlisted |
| Output Price | Free (OpenRouter); commercial providers unlisted |
| Release Date | April 28, 2026 |
| License | NVIDIA Open Model Agreement (commercial use permitted) |

Benchmark Performance

NVIDIA claims the top spot on six multimodal leaderboards. The numbers below come from the HuggingFace model card and the NVIDIA technical blog, comparing against Qwen3-Omni-30B-A3B (the most direct open-source competitor) and the prior Nemotron Nano V2 VL text-vision model.

| Benchmark | Nemotron 3 Nano Omni | Qwen3-Omni-30B | Nano V2 VL |
|---|---|---|---|
| MMLongBench-Doc | 57.5 | 49.5 | 38.0 |
| CharXiv (chart reasoning) | 63.6 | 61.1 | 41.3 |
| OCRBenchV2-En | 65.8 | - | 61.2 |
| OSWorld (computer use) | 47.4 | 29.0 | 11.0 |
| ScreenSpot-Pro (GUI) | 57.8 | 59.7 | - |
| Video-MME | 72.2 | 70.5 | 63.0 |
| WorldSense (video + audio) | 55.4 | 54.0 | - |
| DailyOmni (cross-modal) | 74.1 | 73.6 | - |
| VoiceBench | 89.4 | 88.8 | - |
| HF Open ASR (WER, lower is better) | 5.95 | 6.55 | - |

The OSWorld result is the most striking: 47.4 for agentic computer use versus 29.0 for Qwen3-Omni, a 63% relative improvement on one of the hardest real-world agentic benchmarks. The GUI/ScreenSpot-Pro result is the only area where Qwen3-Omni edges ahead (59.7 vs 57.8), so don't call this a clean sweep across the board.

For production throughput, Coactive's MediaPerf evaluation (a real-world video processing benchmark) puts Nemotron at 9.91 hours of video processed per wall-clock hour on video tagging - 2.6x faster than Qwen3-VL, 5x faster than GPT-5.1, and roughly 33x faster than Gemini 3.0 Pro on the same task. On multi-document workloads, NVIDIA reports 7.4x higher effective system capacity than comparable alternatives. These numbers measure deployed throughput at a fixed interactivity threshold, not just raw generation speed, which makes them more useful for planning actual infrastructure.
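The hours-per-hour framing translates directly into capacity planning. A back-of-envelope sketch using the reported 9.91 figure and a hypothetical ingest volume (the 10,000-hours-per-day number is an illustration, not from the source):

```python
# Capacity planning from the reported MediaPerf throughput figure.
throughput = 9.91           # hours of video processed per GPU-hour (reported)
daily_video_hours = 10_000  # hypothetical daily ingest volume

gpu_hours_needed = daily_video_hours / throughput  # GPU-hours per day
gpus_for_24h = gpu_hours_needed / 24               # GPUs running around the clock
print(round(gpu_hours_needed), round(gpus_for_24h, 1))  # 1009 42.0
```

At a 2.6x throughput gap, the same workload on the nearest competitor would need roughly 2.6x the fleet, which is where the efficiency claims become budget line items.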

Memory footprint at different quantization levels: NVFP4 runs in roughly 18GB, FP8 in approximately 30GB, and BF16 in roughly 60GB. A single RTX 4090 can run the NVFP4 version, which is a meaningful bar for local deployment.
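Those footprints line up with weight-only arithmetic: parameters times bits per weight, with the gap to the reported numbers (e.g. ~18GB vs a raw 15GB for NVFP4) presumably covered by embeddings, KV cache, and runtime overhead. A rough estimator, with the overhead factor an assumption rather than a measured value:

```python
def weight_memory_gb(total_params_b, bits, overhead=1.0):
    """Weight-only memory estimate: params x bits/8. The overhead factor is
    an assumed fudge for KV cache and activations, not a measured number."""
    return total_params_b * 1e9 * bits / 8 / 1e9 * overhead

for name, bits in [("BF16", 16), ("FP8", 8), ("NVFP4", 4)]:
    print(name, round(weight_memory_gb(30, bits), 1), "GB weights")
# BF16 60.0 GB weights / FP8 30.0 GB weights / NVFP4 15.0 GB weights
```

Note that a sparse-MoE model must hold all 30B weights in memory even though only 3B are active per token, which is why the footprint tracks total rather than active parameters.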

Key Capabilities

Document and Long-Context Reasoning

The model's strongest individual use case is multi-page document analysis. MMLongBench-Doc measures reasoning across long mixed-text-and-image documents - contracts, technical reports, multi-chart papers - and Nemotron's 57.5 puts it well ahead of the field among open models. NVIDIA credits NeMo Data Designer's synthetic QA pipeline, which produced 11.4 million training pairs from real PDFs. The 2.19x accuracy improvement on MMLongBench-Doc attributed to synthetic data alone is unusually large and worth watching when fine-tuning recipes get published.

For OCR specifically, the dynamic-resolution image tokenizer matters. Images are encoded as between 1,024 and 13,312 visual patches - roughly equivalent to 512x512 to 1840x1840 pixel resolution - without the tiling artifacts that crop up in fixed-resolution models. Dense document scans and high-DPI screenshots benefit directly from that range.
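The patch range maps cleanly onto pixel resolution if you assume a 16x16-pixel patch size (NVIDIA has not published the tokenizer's exact patch dimensions, so treat this as a consistency check rather than a spec):

```python
def patch_count(width, height, patch=16):
    """Visual patches for an image, assuming square 16 px patches
    (patch size is an assumption; not confirmed by NVIDIA)."""
    return (width // patch) * (height // patch)

print(patch_count(512, 512))    # 1024, the stated lower bound
print(patch_count(1840, 1840))  # 13225, just under the 13,312 cap
```

The arithmetic matching the published 1,024-to-13,312 range at exactly 512x512 and roughly 1840x1840 suggests the 16-pixel-patch assumption is close to right.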

Agentic Computer Use

The OSWorld result of 47.4 puts this model near or above everything except the top closed-source frontier models on the computer use leaderboard. At 3B active parameters, it's also the most compute-efficient model near that score. The vision encoder processes screen captures at 1920x1080 native resolution, and the MoE backbone handles the multi-step planning that OSWorld requires. Whether this translates to production reliability in real desktop automation pipelines is the next question - OSWorld is a controlled benchmark, not a user study.

Audio and Video Understanding

Audio is treated as a first-class modality, not an afterthought. The Parakeet encoder runs at 16 kHz and supports up to five hours of audio via the 256K context window, though the training cap was 20 minutes per call. VoiceBench scores of 89.4 versus 88.8 for Qwen3-Omni are close enough that user-specific workloads will matter more than the headline gap. The ASR word-error rate of 5.95 on the HuggingFace Open ASR leaderboard is a cleaner number: it puts Nemotron at the top of the open-model ASR rankings.
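The five-hour figure implies a token budget worth sanity-checking. If the full 256K context were devoted to audio (an upper-bound assumption - real calls also carry the prompt and any text), the implied rate is about 14 context tokens per second of audio:

```python
context = 256_000          # tokens in the context window
audio_seconds = 5 * 3600   # claimed maximum audio duration

tokens_per_second = context / audio_seconds
print(round(tokens_per_second, 1))  # 14.2
```

That rate is in the same ballpark as other audio tokenizers, which makes the five-hour ceiling plausible as a context-limited rather than encoder-limited bound.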

For video, the EVS layer is the key architectural decision. Rather than encoding all frames uniformly, EVS keeps the first frame intact and prunes unchanged regions in subsequent frames, doubling effective frame coverage without doubling compute. On Video-MME (72.2) and the combined WorldSense (55.4) and DailyOmni (74.1) benchmarks, this approach delivers slightly better accuracy than Qwen3-Omni with significantly lower inference cost.
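The pruning logic can be sketched in a few lines. This is illustrative only - the real EVS criterion and threshold are unpublished - but it captures the keep-frame-zero, drop-unchanged-patches idea:

```python
import numpy as np

def evs_keep_mask(frames, threshold=0.05):
    """Toy Efficient Video Sampling: keep every patch of frame 0; in later
    frames keep only patches whose mean absolute change vs the previous
    frame exceeds `threshold`. (Illustrative; real EVS is unpublished.)

    frames: (T, P, D) patch embeddings for T frames of P patches each.
    Returns a boolean (T, P) mask of retained patch tokens.
    """
    T, P, _ = frames.shape
    mask = np.zeros((T, P), dtype=bool)
    mask[0] = True                                    # first frame kept intact
    diff = np.abs(frames[1:] - frames[:-1]).mean(-1)  # (T-1, P) per-patch change
    mask[1:] = diff > threshold
    return mask

rng = np.random.default_rng(1)
# Four identical frames: a fully static clip.
static = np.repeat(rng.standard_normal((1, 64, 8)), 4, axis=0)
m = evs_keep_mask(static)
print(m.sum())  # 64: only the first frame's patches survive
```

On a static clip the token count collapses to a single frame's worth, which is the mechanism behind "doubling effective frame coverage without doubling compute."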

[Figure: Benchmark results from NVIDIA's technical blog comparing Nemotron 3 Nano Omni and Qwen3-Omni across document, video, and audio tasks. Source: developer.nvidia.com]

Pricing and Availability

The model is available now on Hugging Face in three quantization formats: BF16 (60GB), FP8 (30GB), and NVFP4 (18GB). OpenRouter offers free access at $0/M input and $0/M output tokens for the free-tier route. NVIDIA NIM microservices are available on build.nvidia.com, AWS, and Oracle Cloud.

Inference providers that have rolled it out include Baseten, Clarifai, DeepInfra, Eigen AI, fal.AI, FriendliAI, and Fireworks AI, though none had published public pricing at launch. As a reference point, the predecessor text model on DeepInfra runs at $0.06/M input and $0.24/M output; the Omni version is likely priced higher given the multimodal processing overhead.

Supported inference frameworks: vLLM, SGLang, and NVIDIA TensorRT-LLM. The official inference examples live in the Megatron-Bridge repository on GitHub, under NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/models/vlm/nemotron_3_omni.
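For hosted access, OpenRouter exposes models through an OpenAI-compatible chat-completions API. A minimal sketch of a multimodal request payload - the model slug and image URL here are hypothetical placeholders, so check the OpenRouter model page for the real slug before use:

```python
# Hypothetical slug; verify on openrouter.ai before using.
MODEL = "nvidia/nemotron-3-nano-omni:free"

payload = {
    "model": MODEL,
    "messages": [{
        "role": "user",
        "content": [
            # OpenAI-style multimodal content parts: text plus an image.
            {"type": "text",
             "text": "Summarize the key figures in this chart."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
}
# POST this payload to https://openrouter.ai/api/v1/chat/completions
# with an Authorization: Bearer <API key> header.
print(payload["model"])
```

Audio and video inputs follow the same content-parts pattern on providers that support them, though the exact part types vary by provider.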

For comparison, Qwen3.5-Omni (the nearest commercially hosted competitor) charges $0.40/M input and $4.80/M output for the Plus tier. If Nemotron's commercial pricing lands below that, the efficiency advantage becomes even clearer on paper.

Strengths and Weaknesses

Strengths

  • Best open-model OSWorld score for agentic computer use (47.4)
  • Tops six multimodal leaderboards in document intelligence, video, and audio
  • Runs on a single consumer GPU in NVFP4 quantization (~18GB)
  • Free at point of use via OpenRouter
  • NVIDIA Open Model Agreement allows commercial deployment
  • Joint audio-video-text token interleaving removes stitching latency in agentic pipelines

Weaknesses

  • Text output only - no image generation, voice synthesis, or video generation
  • 256K context window is shorter than the 1M-token window on the text-only Nemotron Nano
  • No public benchmark results for GPQA Diamond, MMLU, or standard reasoning tests
  • Commercial inference pricing from providers not yet published
  • ScreenSpot-Pro trails Qwen3-Omni (57.8 vs 59.7)

Sources

Last verified April 30, 2026

James Kowalski
About the author AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.