Nvidia Cosmos 3 Is the First Open Physical AI Model

Nvidia launched Cosmos 3 today at GTC Taipei, and the claim deserves to be taken at face value: it is the first fully open omnimodel for physical AI. That means a single model that understands text, images, video, and ambient sound, and that produces not just video but robot action data - joint angles, gripper positions, trajectory waypoints - in one forward pass. No orchestration layer. No separate video model bolted onto a policy network.

The announcement landed at Computex amid a flurry of hardware announcements, but Cosmos 3 is the one that matters for anyone building robots or autonomous vehicles in 2026.

TL;DR

First open omnimodel for physical AI - produces video, images, and robot action data from a single model
Trained on 20 trillion tokens including nearly 1 billion images and 400 million real and synthetic videos
Three tiers: Super (64B, datacenter), Nano (16B, workstation), Edge (coming soon)
Tops 7 open-weight leaderboards for physical AI reasoning and world generation
Free under the OpenMDW 1.1 license from the Linux Foundation; available on Hugging Face now

Model Tiers at a Glance

Variant	Parameters	Target Hardware	Primary Use Case
Cosmos 3 Super	64B	Hopper / Blackwell GPUs (datacenter)	Robotics and AV post-training, synthetic data
Cosmos 3 Nano	16B	RTX PRO 6000 (workstation)	Real-time video and action reasoning
Cosmos 3 Edge	Not disclosed	Edge devices	Real-time inference on-device (coming soon)

The Nano is the more interesting number for independent developers. At 16B parameters, it runs on a single workstation-class GPU - the kind of hardware that's actually available outside a hyperscaler. If you want to experiment with running large open models locally, the Nano is Nvidia's pitch for why that should include physical AI.
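To put the parameter counts in hardware terms, here is a rough back-of-the-envelope sketch of weight storage at a few common precisions. It assumes weights only (no activations, caches, or framework overhead), and the precision table is a generic assumption rather than anything Nvidia has published about how Cosmos 3 ships.

```python
# Approximate weight memory for the two disclosed Cosmos 3 sizes.
# Assumption: weights only; activations, caches, and runtime overhead are excluded.

BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}

# Parameter counts (in billions) as stated in the tier table.
MODELS = {"Cosmos 3 Super": 64, "Cosmos 3 Nano": 16}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Return approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for name, size in MODELS.items():
        for precision in ("bf16", "fp8", "int4"):
            gb = weight_memory_gb(size, precision)
            print(f"{name} ({size}B) @ {precision}: {gb:.0f} GB")
```

At BF16 this works out to roughly 32 GB of weights for Nano and 128 GB for Super. Whether either fits on a given card depends on that card's memory and on how much headroom the runtime needs on top of the weights.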

Image: Nvidia Cosmos 3 combines world simulation with native action generation in a single model architecture. Source: nvidianews.nvidia.com

What Cosmos 3 Does That Others Don't

The Omnimodel Architecture

Cosmos 3 uses what Nvidia calls a Mixture-of-Transformers design with two interconnected towers. The Reasoner Tower is a vision-language model using an autoregressive architecture - it reads multimodal input and builds a physical understanding of the scene, including object motion, spatial relationships, and causal structure. The Generator Tower then uses diffusion-based processes to produce outputs conditioned on that understanding.

The key point: these two towers share state. The Generator isn't given a text summary from the Reasoner. It gets the full representation. That's what allows the model to produce physically coherent outputs rather than plausible-looking hallucinations.
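The difference between passing a summary and passing the full representation can be illustrated with a toy example. The sketch below is not Cosmos 3's implementation: it assumes, purely for illustration, that a generator reads reasoner hidden states through plain scaled dot-product attention, and it uses mean pooling as a stand-in for a lossy summary. All names and numbers are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(query: Vector, states: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over all states.

    Keys and values are the states themselves (no learned projections).
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, s) / scale for s in states])
    dim = len(states[0])
    return [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]


def mean_pool(states: List[Vector]) -> Vector:
    """Lossy stand-in for a summary: collapse all positions into one average."""
    n = len(states)
    return [sum(s[i] for s in states) / n for i in range(len(states[0]))]


# Hypothetical reasoner output: one hidden vector per scene element.
reasoner_states = [
    [4.0, 0.0],  # e.g. "cup on the table"
    [0.0, 4.0],  # e.g. "gripper approaching"
]

# Two generator queries that care about different parts of the scene.
query_cup = [1.0, 0.0]
query_gripper = [0.0, 1.0]

# Full representation: each query pulls out mostly the state it matches.
full_cup = attend(query_cup, reasoner_states)
full_gripper = attend(query_gripper, reasoner_states)

# Summary only: both queries can see nothing but the same averaged vector.
summary = mean_pool(reasoner_states)
summary_cup = attend(query_cup, [summary])
summary_gripper = attend(query_gripper, [summary])
```

With the full states, the two queries retrieve different, scene-specific vectors. With only the pooled summary, both receive the same averaged vector, so the per-element detail is gone before generation starts.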

From Pixels to Joint Angles

Most AI video models stop at video. Cosmos 3 doesn't. The model produces numerical action data - the specific joint angles, gripper positions, and trajectory points that tell a physical robot what to do. Ming-Yu Liu, VP of Nvidia's Cosmos Lab, put it simply: "Action data is what makes Cosmos different. Autonomous actions are key."

This matters because the gap between visual understanding and physical execution has been one of the hardest problems in robotics. A model that can generate both a predictive world simulation and the corresponding action sequence closes that gap in training - robots can learn manipulation skills from synthetic data rather than thousands of hours of costly teleoperation.
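To make "numerical action data" concrete, here is a minimal sketch of how a sequence of joint angles, gripper positions, and trajectory waypoints might be represented and sanity-checked before being sent to a controller. The schema, field names, and units are hypothetical; Nvidia's actual output format is not described in the announcement.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ActionStep:
    """One timestep of robot action data (hypothetical schema)."""
    t: float                              # seconds from start of the sequence
    joint_angles: Tuple[float, ...]       # radians, one entry per joint
    gripper_opening: float                # 0.0 = closed, 1.0 = fully open
    waypoint: Tuple[float, float, float]  # end-effector target (x, y, z), meters


def validate(steps: List[ActionStep], num_joints: int) -> None:
    """Raise ValueError if the sequence is malformed."""
    prev_t = float("-inf")
    for i, step in enumerate(steps):
        if len(step.joint_angles) != num_joints:
            raise ValueError(f"step {i}: expected {num_joints} joint angles")
        if not 0.0 <= step.gripper_opening <= 1.0:
            raise ValueError(f"step {i}: gripper opening out of range")
        if step.t <= prev_t:
            raise ValueError(f"step {i}: timestamps must strictly increase")
        prev_t = step.t


def max_joint_delta(steps: List[ActionStep]) -> float:
    """Largest per-joint change between consecutive steps (a cheap smoothness check)."""
    return max(
        (abs(b - a)
         for s0, s1 in zip(steps, steps[1:])
         for a, b in zip(s0.joint_angles, s1.joint_angles)),
        default=0.0,
    )


# Illustrative two-step trajectory for a two-joint arm.
trajectory = [
    ActionStep(t=0.0, joint_angles=(0.0, 0.5), gripper_opening=1.0,
               waypoint=(0.30, 0.00, 0.20)),
    ActionStep(t=0.1, joint_angles=(0.25, 0.5), gripper_opening=0.5,
               waypoint=(0.32, 0.00, 0.18)),
]

validate(trajectory, num_joints=2)
```

A check like `max_joint_delta` is one simple way to reject generated sequences that jump further between steps than a real actuator could follow, which matters when the actions come from a generative model rather than from recorded teleoperation.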

20 Trillion Tokens of Training Data

Nvidia trained Cosmos 3 on 20 trillion tokens of multimodal data, including nearly 1 billion images, 400 million real and synthetic videos, ambient audio, text, and action data from humans and robots. For comparison, most large language models train on 1-15 trillion tokens of text. Cosmos 3's training set covers a fundamentally different distribution: it's weighted toward the kinds of visual and physical sequences that robots actually encounter.

The company also released six synthetic training datasets covering robotics manipulation, physics simulation, spatial reasoning, human motion, autonomous driving, and warehouse operations.

Benchmarks: What the Leaderboards Say

Cosmos 3 Super tops seven open-model leaderboards: Physics-IQ, PAI-Bench, R-Bench, RoboLab, RoboArena, VANTAGE-Bench, and TAR. On Artificial Analysis's Text-to-Image and Image-to-Video leaderboards, it also leads among open-weight models. These are specialized physical AI benchmarks, not MMLU or SWE-Bench. The comparison class is narrow, and the nearest competitors are either smaller or less open.

The benchmarks matter but should be read carefully. PAI-Bench covers physical AI across video understanding and generation in robotics, autonomous vehicles, and physics common sense. Physics-IQ specifically measures whether a model actually understands physical causation. Leading both says something real about model quality in this domain.

No independent reproduction of these numbers has appeared yet. The scores come from Nvidia's own release materials.

Image: Humanoid robots in industrial settings are one of the primary targets for Cosmos 3's action generation capabilities. Source: unsplash.com

The Cosmos Coalition

Nvidia launched the Cosmos Coalition alongside the model: Agile Robots, Black Forest Labs, Generalist, LTX, Runway, and Skild AI are founding members, collectively committing to advance open world models. The inclusion of Black Forest Labs - the team behind FLUX - and Runway signals that Cosmos 3 is being positioned not just as a robotics infrastructure play but as a foundation for video generation too.

Industrial adopters already using Cosmos include LG Electronics, Samsung, Doosan Robotics, Li Auto, Centific, and Milestone Systems. These aren't proof-of-concept deployments; Li Auto is using it for autonomous vehicle scenario generation and LG for factory vision AI.

The model is free to use under the OpenMDW 1.1 license from the Linux Foundation - permissive enough for commercial training, modification, and deployment. Access is available on Hugging Face, through NVIDIA NIM microservices, and on cloud platforms including Microsoft Azure, CoreWeave, and Baseten.

This positions Nvidia differently from the arc Google has taken. While Google recently closed the open-source Gemini CLI in favor of a proprietary platform, Nvidia is doubling down on open-weight models as a distribution strategy - and using its hardware ecosystem to make sure the open weights still run best on Nvidia silicon.

What It Does Not Tell You

Cosmos 3 is a foundation model, not a production robotics system. The gap between generating physically plausible action sequences and deploying them on real robots is substantial. The training datasets Nvidia released are synthetic and curated - they don't reflect the full distribution of edge cases a robot encounters in uncontrolled environments.

The benchmark leadership is real but narrow. Physical AI is still a young benchmark category, and the evaluation protocols for PAI-Bench and Physics-IQ are not yet as standardized as SWE-Bench or MMLU. Independent evaluations will take months to emerge.

Pricing for NVIDIA NIM-based API access hasn't been disclosed. For teams without datacenter access, the Nano's workstation requirement is still a significant hardware commitment. The Edge variant - the one that would matter for truly embedded deployment - has no release timeline.

Nvidia also announced Nemotron 3 Ultra at the same event, a 550-billion-parameter language model with a Hybrid Mamba-Transformer MoE architecture and a claimed 5x inference speed advantage over comparable open models. The dual launch makes GTC Taipei Nvidia's biggest open-model release day to date, but it also means Cosmos 3 is competing for attention with another major announcement from the same company.

The open-weight physical AI space was effectively empty before today. There were robotics-adjacent video models, and there were separate policy networks, but no single open model that integrated world understanding with action generation at this scale. Cosmos 3 fills that gap - imperfectly, with caveats about synthetic training data and the distance from simulation to deployment, but at a weight class and openness level that no prior release matched.
