The Decoder • 91 days ago

Nvidia releases Nemotron 3 Nano Omni

IMP

9/10

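Key summary

Nvidia has released Nemotron 3 Nano Omni, an open-source multimodal model that processes text, images, video, and audio together. The model was trained on 717 billion tokens, including synthetic data generated by competing models (Qwen, gpt-oss, and others) and Nvidia's own audio datasets. It is optimized for agentic applications and cleared for commercial use. Most notably, Nvidia is releasing not only the model weights but also parts of the training data, the training pipelines, and the reinforcement learning recipes, which makes the release significant for the open-source ecosystem.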

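Translated article

Nvidia has released Nemotron 3 Nano Omni, an open multimodal model that processes text, images, video, and audio. The most interesting part of the announcement is not just the model's performance but how its training data was assembled, drawing on competing models such as Qwen, GPT-OSS, Kimi, and DeepSeek-OCR.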

Nemotron 3 Nano Omni is an open-source multimodal model that processes text, images, video, and audio in a single architecture. The 30-billion-parameter model uses a Mamba-Transformer hybrid with mixture-of-experts, activating about 3 billion parameters per query. It builds on Nvidia's own C-RADIOv4-H vision encoder and Parakeet-TDT audio encoder and supports a context window of up to 256,000 tokens. English is the only officially supported language.
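To make "about 3 billion of 30 billion parameters active per query" concrete, here is a minimal sketch of top-k expert routing, the general mechanism behind mixture-of-experts layers. It is a generic illustration, not Nvidia's router: the expert count, the value of `k`, and the router scores are all hypothetical, and the parameter totals are the article's rounded figures.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Indices of the k largest router scores.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected scores only, so the weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical router scores for 8 experts (illustrative values only).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 1.0]
routes = top_k_route(logits, k=2)

# Share of the model that is active per query, from the article's figures.
TOTAL_PARAMS = 30e9
ACTIVE_PARAMS = 3e9
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(routes)
print(f"active fraction: {active_fraction:.0%}")
```

Only the selected experts run for a given token, which is why compute per query tracks the active parameter count rather than the total.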

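According to the technical report, Nemotron 3 Nano Omni is built mainly for agentic applications. Specific use cases include document processing, computer-use agents, video and audio analysis, and voice interaction. On benchmarks such as OCRBenchV2, MMLongBench-Doc, WorldSense, and VoiceBench, the model outperforms its predecessor, Nemotron Nano V2 VL, and performs on par with Alibaba's Qwen3-Omni. On OSWorld, a benchmark for GUI agents, accuracy rises from 11.1 to 47.4 points compared with the previous version. Nvidia says throughput at the same interactivity level is up to nine times higher than Qwen3-Omni's.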
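As a quick check on the size of the OSWorld jump, the reported scores can be compared in both absolute and relative terms. The helper name `compare_scores` is made up for this sketch. Only the two scores come from the article.

```python
def compare_scores(old, new):
    """Return the absolute gain in points and the new/old ratio."""
    return {"point_gain": new - old, "ratio": new / old}

# OSWorld accuracy as reported: predecessor vs. Nemotron 3 Nano Omni.
osworld = compare_scores(11.1, 47.4)

print(f"gain: {osworld['point_gain']:.1f} points")
print(f"ratio: {osworld['ratio']:.2f}x")
```

The gain works out to 36.3 points, a little over four times the predecessor's score.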

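How rival models shaped the training data

Benchmark performance matters, but the details of the training data, the kind you only get with a true open-source release, are just as interesting. Nvidia processed roughly 717 billion tokens across seven training stages, with the context window expanding at each stage.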

A large share of the synthetic training data was generated by competing models. Image captions, question-answer pairs, and reasoning traces were generated with Qwen3-VL-30B-A3B-Instruct, Qwen3.5-122B-A10B, Qwen2.5-VL-72B-Instruct, OpenAI's gpt-oss-120b, Kimi-K2.5, GLM-4.1V-9B-Thinking, and DeepSeek-OCR. Nvidia also brought in GPT-4o and Gemini 3 Flash Preview to handle data filtering.
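The generate-then-filter pattern described here can be sketched in a few lines. This is an illustration of the general idea only, not Nvidia's pipeline: `build_synthetic_set`, `fake_teacher`, and `fake_judge` are hypothetical names, and the stand-in functions replace what would be calls to a generator model and a filtering model.

```python
from typing import Callable, Iterable

def build_synthetic_set(
    inputs: Iterable[str],
    generate: Callable[[str], str],
    keep: Callable[[str, str], bool],
) -> list:
    """Label each input with a teacher model, then keep only accepted pairs."""
    dataset = []
    for item in inputs:
        label = generate(item)      # e.g. a caption or QA pair from a teacher model
        if keep(item, label):       # e.g. an accept/reject decision from a judge model
            dataset.append((item, label))
    return dataset

# Stand-ins for real model calls (hypothetical; no actual API is used here).
def fake_teacher(image_id: str) -> str:
    return "" if image_id.endswith("_blank") else f"caption for {image_id}"

def fake_judge(image_id: str, caption: str) -> bool:
    return len(caption) > 0

pairs = build_synthetic_set(
    ["img_001", "img_002_blank", "img_003"], fake_teacher, fake_judge
)
print(pairs)
```

Keeping generation and filtering as separate callables mirrors the split the article reports, with one set of models producing data and another set screening it.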

Using other models to train new ones is common practice across the industry, though most developers are not this open about it. Companies such as OpenAI, Anthropic, and Google have repeatedly accused Chinese AI labs of large-scale distillation efforts.

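The audio data includes Nvidia's own Granary and SIFT-50M datasets, along with captions from Qwen's Omni-Captioner. For the reinforcement learning stage, the team built a five-stage pipeline spanning 25 environments that cover tasks such as visual grounding, chart and document understanding, GUI clicks, and automatic speech recognition.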

Along with the weights in BF16, FP8, and NVFP4, Nvidia is releasing parts of the training data, the training pipelines on Megatron-Bridge, and the reinforcement learning recipes on NeMo-RL. That sets this release apart from projects that ship only weights. Reasoning mode is on by default, so users have to turn it off manually for tasks that do not need chain-of-thought. The model is distributed under the NVIDIA Open Model Agreement, which allows commercial use.
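For a rough sense of what the three weight formats mean in practice, the sketch below estimates weight storage from nominal bit widths: 16 bits for BF16, 8 for FP8, 4 for NVFP4. It rests on several assumptions. The 30-billion parameter count is the article's rounded figure, every parameter is treated as stored in the named format, and scale-factor overhead, the KV cache, and activations are ignored, so real checkpoint sizes will differ.

```python
# Nominal storage per parameter, in bytes (scale-factor overhead ignored).
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

def weight_footprint_gb(num_params, fmt):
    """Approximate weight storage in decimal gigabytes for one format."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

NUM_PARAMS = 30e9  # rounded total from the article
footprints = {fmt: weight_footprint_gb(NUM_PARAMS, fmt) for fmt in BYTES_PER_PARAM}

for fmt, gb in footprints.items():
    print(f"{fmt}: ~{gb:.0f} GB")
```

Under these assumptions the weights need roughly 60 GB in BF16, 30 GB in FP8, and 15 GB in NVFP4.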


Original text (English)

With Nemotron 3 Nano Omni, Nvidia reveals what really goes into a modern multimodal model

Maximilian Schreiner, Apr 29, 2026

Key Points

Nvidia has released Nemotron 3 Nano Omni, an open AI model that processes text, images, video, and audio and is built for agentic applications. Training involved 717 billion tokens.

Much of the synthetic training data comes from competing models like Qwen, gpt-oss, and DeepSeek-OCR.

Along with the model weights, Nvidia is also releasing parts of the training data and pipelines. The model is cleared for commercial use.

Nvidia has released Nemotron 3 Nano Omni, an open multimodal model that handles text, images, video, and audio. The interesting part isn't just the performance - it's the training data, which draws on models like Qwen, GPT-OSS, Kimi, and DeepSeek-OCR.

Nemotron 3 Nano Omni is an open-source multimodal model that processes text, images, video, and audio in a single architecture. The 30-billion-parameter model uses a Mamba-Transformer hybrid with Mixture-of-Experts, activating about three billion parameters per query. It runs on Nvidia's own C-RADIOv4-H vision encoder and the Parakeet-TDT audio encoder, with a context window of up to 256,000 tokens. The only officially supported language is English.

According to the technical report, Nemotron 3 Nano Omni is built mainly for agentic applications: document processing, computer-use agents, video and audio analysis, and voice interaction. On benchmarks like OCRBenchV2, MMLongBench-Doc, WorldSense, and VoiceBench, the model beats its predecessor, Nemotron Nano V2 VL, and goes toe-to-toe with Alibaba's Qwen3-Omni. On OSWorld, a benchmark for GUI agents, accuracy jumps from 11.1 to 47.4 points compared to the previous version. Nvidia says throughput at the same interactivity level is up to nine times higher than Qwen3-Omni.

How rival models shaped the training data

The benchmarks are one thing, but there are also interesting details about the training data, the kind of detail you only get with a true open-source release. Nvidia processed roughly 717 billion tokens across seven training stages, with the context window expanding at each step.

A big chunk of the synthetic training data comes from competing models. Image captions, question-answer pairs, and reasoning traces were generated using Qwen3-VL-30B-A3B-Instruct, Qwen3.5-122B-A10B, Qwen2.5-VL-72B-Instruct, OpenAI's gpt-oss-120b, Kimi-K2.5, GLM-4.1V-9B-Thinking, and DeepSeek-OCR. Nvidia also pulled in GPT-4o and Gemini 3 Flash Preview to handle filtering.

Using other models to train new ones is common practice across the industry, though most developers aren't this upfront about it. Companies like OpenAI, Anthropic, and Google have repeatedly accused Chinese AI labs of large-scale distillation efforts.

The audio data includes Nvidia's own Granary and SIFT-50M datasets, along with captions from Qwen's Omni-Captioner. For the reinforcement learning stage, the team built a five-stage pipeline spanning 25 environments, covering tasks like visual grounding, chart and document understanding, GUI clicks, and automatic speech recognition.

Along with the weights in BF16, FP8, and NVFP4, Nvidia is releasing parts of the training data, the training pipelines on Megatron-Bridge, and the RL recipes on NeMo-RL. That sets this release apart from projects that only ship weights. Reasoning mode is on by default, so users have to turn it off manually for tasks that don't need chain-of-thought. The model ships under the NVIDIA Open Model Agreement, which allows commercial use.

Source: Technical Report | Nemotron 3 Nano Omni Weights | Training Data

Nvidia, multimodal, open source, training data, agents