NVIDIA Cosmos 3 — The Open Omnimodel That Reasons Before It Generates a World, and It’s Yours to Download

On May 31 at GTC Taipei, NVIDIA dropped Cosmos 3 and slapped a wild label on it: the world’s first fully open omnimodel. One model that looks at a scene, reasons about how it actually works, then generates the next frames, the ambient sound, and the actions to match — text, image, video, audio, motion, all in one network. The weights are sitting on Hugging Face right now, under a license that lets you train, modify and ship it.

The Story

Cosmos 3 is a world foundation model — the same family we’ve been tracking all spring (Genie 3, World Labs, NVIDIA’s own Lyra). The difference is the word open. Where most world models are locked behind a closed API, Cosmos 3 ships under OpenMDW 1.1 from the Linux Foundation: you can fine-tune it, modify it, and deploy it commercially. NVIDIA trained it on a frankly absurd diet — 20 trillion tokens, including nearly a billion images and 400 million real and synthetic videos, plus ambient audio, text, and action data from humans and robots.

The architecture is where it gets genuinely interesting. Instead of one transformer brute-forcing pixels, Cosmos 3 uses a mixture-of-transformers: a reasoning block that interprets a moving scene — object interactions, motion, spatial-temporal relationships — paired with an expert generation block that uses that understanding to produce physically accurate output. In plain English: it thinks about how the world works before it renders what happens next. That’s the opposite of the “predict a plausible-looking pixel and hope physics survives” approach that gives older video models their melting-spaghetti moments.

Figure: The Cosmos 3 mixture-of-transformers backbone, with a reasoning transformer feeding an expert generation transformer: reason about the scene first, then generate. Source: NVIDIA on Hugging Face
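
To make "reason first, then generate" concrete, here is a deliberately tiny toy in plain Python. It is not NVIDIA's architecture or any part of the Cosmos 3 code: there are no transformers in it, and every name (`SceneState`, `reason`, `generate`) is our own invention for illustration. It only shows the shape of the idea, assuming a ball in free fall: one stage infers the dynamics from what it observed, and a second stage uses that inferred state to produce the next "frames".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SceneState:
    """Hypothetical summary that the 'reasoning' stage hands to the 'generation' stage."""
    height: float    # last observed height, in metres
    velocity: float  # estimated vertical velocity, in m/s
    accel: float     # estimated vertical acceleration, in m/s^2


def reason(heights: List[float], dt: float) -> SceneState:
    """Stage 1: infer the dynamics from observed heights using finite differences."""
    if len(heights) < 3:
        raise ValueError("need at least three observations")
    v_prev = (heights[-2] - heights[-3]) / dt
    v_last = (heights[-1] - heights[-2]) / dt
    accel = (v_last - v_prev) / dt
    return SceneState(height=heights[-1], velocity=v_last, accel=accel)


def generate(state: SceneState, dt: float, steps: int) -> List[float]:
    """Stage 2: roll the inferred state forward to produce the next 'frames'."""
    frames = []
    h, v = state.height, state.velocity
    for _ in range(steps):
        v += state.accel * dt  # update velocity from the inferred acceleration
        h += v * dt            # then update height from the new velocity
        frames.append(h)
    return frames


if __name__ == "__main__":
    observed = [10.0, 9.9, 9.7]  # three observed heights, 0.1 s apart
    state = reason(observed, dt=0.1)
    print(state)
    print(generate(state, dt=0.1, steps=2))
```

The point of the split is that the generator never guesses at the physics. It only extrapolates from what the reasoning stage concluded, so the next frames stay consistent with the motion already seen.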

It’s an omnimodel: one network handles text, images, video, ambient sound, and action, with no swapping checkpoints. It comes in three sizes: Cosmos 3 Super for the highest physics accuracy, Cosmos 3 Nano for high-quality video and reasoning “in fractions of a second,” and Cosmos 3 Edge (coming soon) for real-time inference on-device. NVIDIA says it ranks first among open models for world-generation accuracy on Artificial Analysis, Physics-IQ, PAI-Bench and R-Bench, and that it also leads the open-model vision and action leaderboards.

Figure: How Cosmos 3 powers perception, prediction and action, folded into a single open model. Source: NVIDIA Blog

Why You Should Care

Let’s be honest up front, because that’s the Lab way: Cosmos is robotics-first. The headline use cases are humanoids, autonomous vehicles, warehouse safety. This is not a “type a prompt, get a playable level” toy, and anyone selling it that way is fibbing.

But look at who NVIDIA put in its new Cosmos Coalition: alongside robotics labs sit Black Forest Labs (the FLUX people), Runway, and LTX — three of the names behind the image and video models creative folks actually use every day. When the studios building your generative tools are building on the same open world model, that model quietly becomes the substrate under your next workflow.

  • Physically-grounded video + sound. A model that reasons about physics produces motion and scenes that hold together. For previs, environment loops, and animated b-roll, “it doesn’t fall apart” is the whole game.
  • Open weights you can own. Fine-tune Cosmos on your own footage and assets, run it locally, no API gatekeeper metering your renders. For indie game devs and small 3D studios, that’s the line between renting and owning a pipeline.
  • Free synthetic data. NVIDIA published six synthetic datasets alongside the model — one of them is digital humans, directly useful if you’re training avatar or character systems.
  • World generation as infrastructure. We’ve watched Gaussian splats and “world APIs” become plumbing (World Labs’ World API, OpenUSD splats). Cosmos 3 is the open-weights answer to that same trend.

Figure: Leaderboard chart showing Cosmos 3 ranking first among open models across world generation, vision and action. Source: NVIDIA Blog

Try It / Follow Them

  • Grab the weights on Hugging Face, get the code on GitHub, or try the hosted version at build.nvidia.com.
  • License: OpenMDW 1.1 (Linux Foundation) — commercial use, fine-tuning and redistribution are all on the table.
  • Read the deep dives: NVIDIA’s technical blog and the launch announcement.
  • Reality check: this is a frontier model. Super wants serious GPU muscle; Nano is the variant to watch if you want video and reasoning on accessible hardware.

IK3D Lab Take

We almost skipped this one because “open model for robots and self-driving cars” is not exactly our beat. Then we read the Coalition list and the modality stack, and it clicked: Cosmos 3 isn’t a creative tool — it’s the foundation creative tools will be poured on. A world model that reasons before it generates, ships open weights, generates video and ambient sound, and hands you six free datasets to fine-tune on? That’s exactly the kind of unglamorous infrastructure that, eighteen months from now, every “magic” world-builder app turns out to be quietly running on top of.

It’s not plug-and-play for a solo 3D artist today — you’ll need an ML pipeline and a GPU that doesn’t flinch. But the trajectory is unmistakable: the world-model layer is going open, and the people who make the tools we love are already standing on it. Download Nano, point it at your own footage, and you’re tinkering with the same substrate as the labs. That’s the part that gets us out of bed.
