NVIDIA SANA-WM
NVIDIA's SANA-WM is a 2.6B-parameter hybrid linear diffusion transformer that generates 60-second 720p video with 6-DoF camera control on a single H100, built for embodied AI and robotics simulation.

NVIDIA NVLabs released SANA-WM (World Model) on May 14, 2026, as an arXiv preprint, with weights publicly available in the following days. It's a 2.6B-parameter model that generates 60-second, 720p video clips with precise 6-DoF camera control - on a single H100 GPU. That last detail is the headline: most of the competing systems in the paper's comparison need 8 GPUs and still deliver far lower throughput.
TL;DR
- Produces 60-second 720p video with 6-DoF camera control on a single H100 at 24.1 clips/hour
- Hybrid linear diffusion transformer (2.6B params): 15 Gated DeltaNet blocks + 5 softmax attention blocks
- Targets embodied AI and robotics simulation, not general text-to-video - it's a specialist, not a generalist
The intended use cases are narrow and specific: embodied AI training data, robotics simulation, autonomous driving synthetic data generation, and VR/AR scene creation. SANA-WM isn't competing with Veo 3.1 for consumer video creation. It's competing with internal research pipelines that currently require large GPU clusters.
An optional 17B refiner (built on LTX-2 from Lightricks) is available for improved quality. With the refiner active, rotation error drops from 7.59 to 4.50 degrees and camera motion consistency (CamMC) improves from 1.63 to 1.41 - at a throughput cost of roughly two clips per hour.
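The refiner trade-off is easier to judge as relative changes. The short sketch below only does arithmetic on figures quoted in this article (the throughput values are the ones in the benchmark table); the helper name `relative_change` is introduced here for illustration.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after`, as a percentage."""
    return (after - before) / before * 100.0


# Figures quoted in the article: base model -> base model + refiner.
rot_err = relative_change(7.59, 4.50)     # rotation error, degrees
cam_mc = relative_change(1.63, 1.41)      # camera motion consistency
throughput = relative_change(24.1, 22.0)  # clips per hour

print(f"RotErr:     {rot_err:+.1f}%")
print(f"CamMC:      {cam_mc:+.1f}%")
print(f"Throughput: {throughput:+.1f}%")
```

On these numbers the refiner cuts rotation error by roughly 41% and CamMC by roughly 13%, for a throughput loss of under 9%.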
Key Specifications
| Specification | Details |
|---|---|
| Provider | NVIDIA NVLabs |
| Model Family | SANA |
| Parameters | 2.6B (generator); 17B optional refiner |
| Output Length | 60 seconds at 720p (1280x720) |
| Input Price | Free (open weights) |
| Output Price | Free (open weights) |
| Release Date | 2026-05-14 |
| License | Apache 2.0 (code); LTX-2 Community License (refiner weights) |
| Min Hardware | Single NVIDIA H100 (51.1 GB); RTX 5090 with NVFP4 quantization |
Architecture
SANA-WM uses what the team calls a hybrid linear diffusion transformer. The backbone interleaves 20 blocks of two kinds: 15 Gated DeltaNet blocks for long-range temporal modeling and 5 standard softmax attention blocks for local feature refinement.
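One way to picture the 15 + 5 split is as a simple layout list. The article does not give the exact positions of the softmax blocks, so the spacing below (one softmax block closing every group of four) is purely an assumption chosen because it reproduces the stated counts; `build_layout` and the block labels are illustrative names, not identifiers from the SANA codebase.

```python
from collections import Counter

NUM_BLOCKS = 20    # total block count stated in the article
SOFTMAX_EVERY = 4  # ASSUMPTION: every fourth block is softmax attention


def build_layout(num_blocks: int = NUM_BLOCKS,
                 softmax_every: int = SOFTMAX_EVERY) -> list[str]:
    """Return a list of block-type labels, one per backbone block."""
    layout = []
    for i in range(num_blocks):
        is_softmax = (i + 1) % softmax_every == 0
        layout.append("softmax" if is_softmax else "gated_deltanet")
    return layout


layout = build_layout()
counts = Counter(layout)
print(layout)
print(counts)  # 15 gated_deltanet blocks, 5 softmax blocks
```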
The design choice is deliberate. Softmax attention's quadratic cost becomes prohibitive at 60-second video lengths. Gated DeltaNet keeps complexity linear in sequence length while preserving the selective state behavior needed for consistent long-range scene generation. The 5 softmax blocks are placed at intervals through the stack to inject fine-grained spatial detail that linear attention tends to blur.
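To make the linear-cost claim concrete, here is a minimal pure-Python sketch of one recurrent step of the gated delta rule as it is commonly formulated in the Gated DeltaNet literature: `S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T`, with output `o_t = S_t q_t`. It is not SANA-WM's implementation (production kernels are chunked and run on the GPU). The function name, the list-of-lists state and the toy dimensions are all assumptions for illustration. The point is that the state `S` has a fixed size, so the cost per token does not grow with clip length.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of the gated delta rule.

    S     : d_v x d_k memory matrix (list of rows)
    q, k  : query and key vectors of length d_k
    v     : value vector of length d_v
    alpha : decay gate in [0, 1] (how much old memory is kept)
    beta  : write strength in [0, 1]
    """
    d_v, d_k = len(S), len(k)
    # What the current memory would return for this key: S @ k
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # S_new = alpha * (S - beta * (S k) k^T) + beta * v k^T
    S_new = [
        [alpha * (S[i][j] - beta * pred[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]
    # Read out with the query: o = S_new @ q
    out = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, out


# Toy demo: write a value under a key, then overwrite it under the same key.
S = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
S, out1 = gated_delta_step(S, key, key, [3.0, 4.0], alpha=1.0, beta=1.0)
S, out2 = gated_delta_step(S, key, key, [5.0, 6.0], alpha=1.0, beta=1.0)
print(out1, out2)
```

In the demo the second write replaces the first value stored under the same key instead of adding to it, which is the "selective" behavior the delta rule contributes. Lowering `alpha` additionally decays everything held in memory.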
Camera control runs through a dual-branch 6-DoF system. One branch processes the semantic content of the scene; the other processes the 6-DoF camera trajectory (3 translational + 3 rotational degrees of freedom). The two branches are fused at each block, letting the model reason simultaneously about what's in the scene and how the camera is moving through it.
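For readers new to the term, a 6-DoF camera trajectory is just one pose per frame with three translation and three rotation values. The sketch below shows one plausible way to represent and generate such a trajectory. The `Pose6DoF` class, the Euler-angle convention, the units and the linear interpolation are all assumptions for illustration; the article does not describe the input format SANA-WM actually expects.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class Pose6DoF:
    # Translation (scene units, assumed) and rotation (degrees, assumed).
    x: float
    y: float
    z: float
    roll: float
    pitch: float
    yaw: float


def interpolate(start: Pose6DoF, end: Pose6DoF, num_frames: int) -> list[Pose6DoF]:
    """Linearly interpolate all six components between two poses."""
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    a, b = astuple(start), astuple(end)
    poses = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        poses.append(Pose6DoF(*(p + t * (q - p) for p, q in zip(a, b))))
    return poses


# Slide 2 units along x while panning 30 degrees in yaw.
start = Pose6DoF(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
end = Pose6DoF(2.0, 0.0, 0.0, 0.0, 0.0, 30.0)
trajectory = interpolate(start, end, num_frames=5)
for pose in trajectory:
    print(pose)
```

Interpolating Euler angles linearly is fine for small pans like this one; larger rotations would normally call for quaternion interpolation.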
15 Gated DeltaNet blocks + 5 softmax attention blocks - enough linear complexity to run on one GPU, enough local detail to keep scenes coherent across a full minute.
On an RTX 5090 with NVFP4 quantization, producing a 60-second clip takes 34 seconds. That's consumer-grade inference for a task that previously required a multi-GPU workstation.
Benchmark Performance
SANA-WM was assessed on three metrics: rotation error (RotErr, in degrees - lower is better), camera motion consistency (CamMC - lower is better), and VBench visual quality (higher is better). Throughput is measured in videos per hour on one GPU unless otherwise noted.
| Method | Params | GPUs | RotErr (lower is better) | CamMC (lower is better) | VBench (higher is better) | Throughput |
|---|---|---|---|---|---|---|
| SANA-WM | 2.6B | 1 | 7.59 | 1.63 | 79.29 | 24.1 vid/hr |
| SANA-WM + refiner | 2.6B + 17B | 1 | 4.50 | 1.41 | 80.62 | 22.0 vid/hr |
| LingBot-World | 14B + 14B | 8 | 10.47 | 2.05 | 81.82 | 0.6 vid/hr |
| Matrix-Game 3.0 | 5B | 8 | 12.96 | 1.92 | 78.53 | 3.1 vid/hr |
| Infinite-World | 1.3B | 1 | 16.55 | 2.08 | 79.18 | 5.9 vid/hr |
| HY-WorldPlay | 8B | 8 | 12.96 | 2.45 | 68.82 | 1.1 vid/hr |
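RotErr in the table is reported in degrees. The article does not spell out how the paper computes or aggregates it, so the sketch below uses the standard geodesic angle between a ground-truth and a predicted rotation matrix, `arccos((trace(R_gt^T R_pred) - 1) / 2)`, as an assumed stand-in. The helper names `rot_z` and `rotation_error_deg` are illustrative.

```python
import math


def rot_z(deg: float) -> list[list[float]]:
    """3x3 rotation matrix for a rotation of `deg` degrees about the z axis."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_error_deg(R_gt, R_pred) -> float:
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    # trace(R_gt^T @ R_pred) equals the element-wise product summed over all entries.
    trace = sum(R_gt[i][j] * R_pred[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


err = rotation_error_deg(rot_z(0.0), rot_z(10.0))
print(f"{err:.2f} degrees")
```

A camera whose predicted yaw is off by 10 degrees reports an error of about 10 degrees under this definition, which gives a feel for what the 4.50 to 16.55 range in the table means per comparison.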
The camera control numbers are where SANA-WM earns its place. At 7.59 degrees rotation error without the refiner and 4.50 with it, SANA-WM beats every listed competitor by a wide margin - LingBot-World's 10.47 is the closest, and that system needs 8 GPUs running at 0.6 clips per hour. That's a 40x throughput disadvantage.
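The 40x figure compares raw clips per hour. Normalizing by GPU count widens the gap further. The sketch below recomputes both ratios from the table, on the assumption that each throughput figure is for the whole system with the GPU count listed in its row.

```python
# (name, GPUs, clips per hour) copied from the benchmark table.
ROWS = [
    ("SANA-WM", 1, 24.1),
    ("SANA-WM + refiner", 1, 22.0),
    ("LingBot-World", 8, 0.6),
    ("Matrix-Game 3.0", 8, 3.1),
    ("Infinite-World", 1, 5.9),
    ("HY-WorldPlay", 8, 1.1),
]


def gpu_hours_per_clip(gpus: int, clips_per_hour: float) -> float:
    """GPU-hours consumed to produce one clip."""
    return gpus / clips_per_hour


base_gpus, base_rate = 1, 24.1  # SANA-WM without the refiner
base_cost = gpu_hours_per_clip(base_gpus, base_rate)

ratios = {}
for name, gpus, rate in ROWS:
    throughput_ratio = base_rate / rate
    cost_ratio = gpu_hours_per_clip(gpus, rate) / base_cost
    ratios[name] = (throughput_ratio, cost_ratio)
    print(f"{name:<18} throughput x{throughput_ratio:6.1f}   GPU-hours/clip x{cost_ratio:6.1f}")
```

Under that assumption LingBot-World comes out at roughly 40x lower throughput and roughly 320x more GPU-hours per clip than the base SANA-WM model.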
VBench, which measures general video quality, tells a more nuanced story. SANA-WM + refiner scores 80.62 versus LingBot-World's 81.82. The gap is small but real. Teams that need maximum visual fidelity and don't care about cost per clip might still favor LingBot-World. Most robotics simulation pipelines care far more about camera precision and generation volume than marginal VBench gains.
See our video generation benchmarks leaderboard for a broader comparison across text-to-video models.
Key Capabilities
Precise camera control for simulation data. The 6-DoF dual-branch system is the technical differentiator. Robotics training requires consistent point-of-view trajectories - a robot arm learning to pick objects needs thousands of clips from specific angles with specific camera motions. SANA-WM can generate those at 24 clips per hour on hardware that fits in a standard server rack.
Long-form coherence. Sixty seconds at 720p is significantly longer than most open video models manage without quality degradation. The linear attention backbone is purpose-built for this: it maintains scene consistency across the full clip without the memory cost that would require parallelizing across multiple GPUs.
Hardware accessibility. Single-GPU inference is the practical unlock. Research groups and small robotics teams that can't justify or afford 8-GPU nodes can now generate synthetic training data locally. The NVFP4 quantized version runs on a consumer RTX 5090 in 34 seconds per clip. This is relevant for the broader NVIDIA push into robotics tooling - compare with NVIDIA's Alpamayo robotics model and the Emerald AI Grid for flexible manufacturing, where synthetic data generation at this scale makes physical testing less necessary.
NVIDIA NVLabs also released a companion world-modeling effort in early 2026, around the time of the funding round for the LeCun-adjacent AMI Labs. SANA-WM fits a broader industry trend toward compact world models that don't require hyperscaler budgets.
Pricing and Availability
SANA-WM is free under Apache 2.0 for the code and model weights. The optional refiner uses the LTX-2 Community License, which restricts commercial deployment - teams integrating into production products need to review the full license terms before shipping.
The model and code are available in the SANA GitHub repo at github.com/NVlabs/Sana, which hosts SANA-WM alongside the rest of the SANA family. The project page is at nvlabs.github.io/Sana/WM/, and the weights are on Hugging Face.
No API or hosted inference is available from NVIDIA. Teams run this self-hosted on their own hardware.
Strengths and Weaknesses
Strengths
- Best camera control accuracy in its class by a significant margin (4.50 RotErr with refiner)
- 24.1 clips/hour on a single H100 is about 40x the throughput of LingBot-World, the next-best system on rotation error, which needs 8 GPUs
- Apache 2.0 code license permits commercial use under its standard notice and attribution terms (refiner weights excluded)
- RTX 5090 support with NVFP4 quantization makes consumer deployment viable
- 60-second output at 720p exceeds what most open models offer in duration and resolution
Weaknesses
- Specialist model only - no general text-to-video capability for consumer use cases
- VBench visual quality (80.62 with refiner) trails LingBot-World (81.82)
- Refiner weights under LTX-2 Community License restrict commercial deployment
- No hosted API or managed inference option - self-hosting required
- Limited to 720p; generalist models like Veo 3.1 offer 4K output
FAQ
Can SANA-WM run on a consumer GPU?
Yes. With NVFP4 quantization it runs on a single RTX 5090 and generates a 60-second 720p clip in 34 seconds. An H100 (51.1 GB VRAM) is the recommended configuration for unquantized runs.
Is SANA-WM suitable for general text-to-video tasks?
No. It's built for embodied AI training, robotics simulation, and autonomous driving synthetic data. Teams looking for consumer video generation should assess Veo 3.1 or similar generalist models instead.
What license does SANA-WM use?
The code and main generator weights are Apache 2.0, allowing commercial use. The optional 17B refiner uses the LTX-2 Community License, which restricts commercial deployment - review terms before using the refiner in production.
How does SANA-WM's camera control compare to competitors?
SANA-WM achieves 4.50 degrees rotation error with the refiner - the lowest of any model tested in the paper. LingBot-World (the next-best) reaches 10.47 degrees but requires 8 GPUs running at 0.6 clips per hour.
Where can I access SANA-WM weights?
The code is at github.com/NVlabs/Sana and weights are on Hugging Face. No managed API is available from NVIDIA - self-hosting is required.
Sources
✓ Last verified May 26, 2026
