Nvidia unveiled Nemotron 3 Nano Omni, an open multimodal AI model that integrates vision, audio, and language capabilities into a single architecture.
The model aims to address fragmented pipelines in enterprise AI systems by accepting multiple input types, including text, images, audio, and video, and generating text as output. Nvidia said it offers the knowledge capacity of larger models at a lower computational cost.
Built on a 30-billion-parameter hybrid mixture-of-experts architecture, Nemotron 3 Nano Omni activates approximately 3 billion parameters per inference. The architecture consolidates several components into one model, including a Parakeet speech encoder for audio and a C-RADIOv4-H vision encoder, a design intended to improve performance.
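To make the gap between total and active parameters concrete, the following is a minimal Python sketch of sparse mixture-of-experts routing. It is an illustration under stated assumptions, not Nvidia's implementation: the eight-expert router, the choice of k=2, and the function names `active_fraction` and `top_k_route` are hypothetical. Only the 30-billion and 3-billion figures come from the article.

```python
import math

TOTAL_PARAMS = 30e9   # total parameters reported in the article
ACTIVE_PARAMS = 3e9   # approximate parameters activated per inference (article)


def active_fraction(active: float, total: float) -> float:
    """Share of the model's weights that take part in one inference."""
    return active / total


def top_k_route(router_logits: list[float], k: int) -> dict[int, float]:
    """Select the k highest-scoring experts and softmax-normalize their weights.

    Toy illustration of sparse routing. The expert count and k used below
    are made-up values, not Nemotron's real configuration.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    print(f"active share of weights: {share:.0%}")

    # Hypothetical router scores for eight experts; only two are used.
    weights = top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 0.9], k=2)
    print(weights)
```

In this toy setup only the selected experts run, which is the general reason a mixture-of-experts model can hold many parameters while spending compute on a small share of them for each input.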
Nvidia claims that the model delivers up to 9x higher throughput than comparable open omni models. For video reasoning tasks, it achieves around 3x greater throughput with 2.75x lower compute. The model also supports a 256K-token context window and tops six leaderboards for complex document intelligence and media understanding.
Foxconn, Palantir, and H Company have adopted the model. Gautier Cloix, CEO of H Company, stated, “Utilizing the Nemotron 3 Nano Omni allows our agents to swiftly analyze full HD screen recordings, a capability that was previously unfeasible.”
Companies such as Dell, Oracle, and Infosys are currently evaluating the model. It is available on Hugging Face, OpenRouter, Amazon SageMaker JumpStart, and Vultr, as well as more than 25 partner platforms.
Nvidia released Nemotron 3 Nano Omni with open weights, datasets, and training recipes so that developers can customize it. The model is a key part of Nvidia’s broader Nemotron 3 family, which includes Super and Ultra models designed for heavier workloads and has recorded more than 50 million downloads in the past year.





