NVIDIA Cosmos is a developer-first world foundation model platform designed to help Physical AI developers build their systems better and faster. Cosmos contains:
- pre-trained models, available via Hugging Face under the NVIDIA Open Model License that allows commercial use of the models for free
- training/fine-tuning scripts under the Apache 2 License, offered through the NVIDIA NeMo Framework, for adapting the models to various downstream Physical AI applications
Details of the platform are described in the Cosmos paper. Preview access is available at build.nvidia.com.
- Pre-trained Diffusion-based world foundation models for Text2World and Video2World generation where a user can generate visual simulation based on text prompts and video prompts.
- Pre-trained Autoregressive-based world foundation models for Video2World generation where a user can generate visual simulation based on video prompts and optional text prompts.
- Video tokenizers for tokenizing videos into continuous tokens (latent vectors) and discrete tokens (integers) efficiently and effectively.
- Post-training scripts to post-train the pre-trained world foundation models for various Physical AI setups.
- Video curation pipeline for building your own video dataset. [Coming soon]
- Training scripts for building your own world foundation model. [Diffusion] [Autoregressive].
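The difference between the continuous and discrete tokens mentioned in the tokenizer item above can be illustrated with a toy sketch. This is not the Cosmos tokenizer or its API: the functions `to_continuous_tokens` and `to_discrete_tokens`, the averaging "encoder", and the three-entry codebook are all hypothetical, chosen only to show that continuous tokens are real-valued latents while discrete tokens are integer indices into a codebook.

```python
from typing import List, Sequence

def to_continuous_tokens(frames: Sequence[Sequence[float]], patch: int) -> List[List[float]]:
    """Toy stand-in for an encoder: average each run of `patch` pixels into one latent value."""
    return [
        [sum(frame[i:i + patch]) / patch for i in range(0, len(frame) - patch + 1, patch)]
        for frame in frames
    ]

def to_discrete_tokens(latents: Sequence[Sequence[float]], codebook: Sequence[float]) -> List[List[int]]:
    """Quantize each latent value to the index of its nearest codebook entry."""
    return [
        [min(range(len(codebook)), key=lambda k: abs(codebook[k] - value)) for value in row]
        for row in latents
    ]

# Hypothetical two-frame "video", four pixels per frame.
video = [[0.0, 0.2, 0.9, 1.0], [0.1, 0.1, 0.5, 0.7]]
codebook = [0.0, 0.5, 1.0]  # illustrative codebook, not a learned one

continuous = to_continuous_tokens(video, patch=2)  # floats (latent values)
discrete = to_discrete_tokens(continuous, codebook)  # integers (codebook indices)
print(continuous)
print(discrete)
```

In this sketch the continuous form keeps real-valued latents, the kind of representation a diffusion model would consume, while the discrete form yields integer IDs of the kind an autoregressive model predicts one at a time.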
Follow the Cosmos Installation Guide to set up the Docker environment. For inference with the pre-trained models, please refer to Cosmos Diffusion Inference and Cosmos Autoregressive Inference.
The code snippet below provides a gist of the inference usage.
PROMPT="A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of field that keeps the focus on the robot whi