The Decoder • 103 days ago

Nvidia unveils Lyra 2.0, which builds 3D environments from a single photo

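Key summary

Nvidia researchers have presented Lyra 2.0, a system that generates coherent 3D environments spanning up to about 90 meters from a single photo. It targets two long-standing weaknesses of earlier 3D generation models, spatial distortion and accumulated error, and according to Nvidia it outperforms six competing methods on nearly all measured criteria. The generated scenes can be exported to physics engines such as Nvidia Isaac Sim and used as simulation environments for robot training without capturing real-world 3D data, which could reduce the cost and time of that training.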

Translated text

Nvidia wants to scale robot simulation training with Lyra 2.0
By Maximilian Schreiner | April 16, 2026

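Key points:

Nvidia researchers have presented Lyra 2.0, a system that generates coherent 3D environments extending up to 90 meters from a single photo.
The system stores previously generated 3D geometry as reference data and is trained specifically against quality degradation, addressing two core weaknesses of earlier video models.
According to Nvidia, Lyra 2.0 outperforms six competing models, and its generated scenes can be exported to physics engines such as Isaac Sim to train robots in virtual environments.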

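Nvidia researchers have unveiled Lyra 2.0, a system that generates large, coherent 3D environments from a single photo. The resulting scenes can be explored in real time and used directly in robot simulation.

Existing AI models for 3D scene generation have struggled with long camera paths. The farther the virtual camera moves from its starting point, the more colors and structures distort. When the camera returns to a location it has already seen, the model often regenerates that environment from scratch. Nvidia researchers aim to solve this problem with Lyra 2.0.

The system takes a single photo as input and generates camera-controlled videos that resemble a virtual walk through the scene. These videos are automatically converted into a 3D representation that can be viewed in real time and used in simulation environments. According to the research paper, the generated scenes can span about 90 meters.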

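How Lyra 2.0 addresses the two biggest problems in 3D scene generation

According to the researchers, current video models fail at two fundamental challenges. First, the model forgets previously seen areas once they leave the frame. Second, small errors accumulate during step-by-step video generation and grow into significant distortions over time.

To address the first problem, Lyra 2.0 stores the 3D geometry of every generated frame. When the camera moves back toward a previously visited area, the system retrieves the earlier frames and uses their spatial information as a reference. Because the video model still produces the actual images, errors in the stored geometry do not carry over directly into new frames.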
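
The retrieval step can be pictured with a small sketch. The article does not describe how Lyra 2.0 selects earlier frames, so everything here is an assumption made for illustration: the `StoredFrame` record, the view-cone visibility test, and the `retrieve_frames` scoring are hypothetical and are not Nvidia's implementation. The idea is only that each frame keeps its world-space points, and a new camera pose pulls back the frames whose points it would see.

```python
import math
from dataclasses import dataclass


@dataclass
class StoredFrame:
    """Hypothetical memory entry: a frame id plus its world-space 3D points."""
    frame_id: int
    points: list  # list of (x, y, z) tuples


def visible_fraction(points, cam_pos, cam_forward, fov_deg=90.0, max_dist=30.0):
    """Fraction of points that fall inside a simple view cone of the camera."""
    if not points:
        return 0.0
    cos_half = math.cos(math.radians(fov_deg) / 2.0)
    norm = math.sqrt(sum(c * c for c in cam_forward))
    forward = tuple(c / norm for c in cam_forward)
    seen = 0
    for point in points:
        offset = tuple(p - c for p, c in zip(point, cam_pos))
        dist = math.sqrt(sum(c * c for c in offset))
        if dist == 0.0 or dist > max_dist:
            continue  # point is at the camera or too far away to count
        cos_angle = sum(a * b for a, b in zip(offset, forward)) / dist
        if cos_angle >= cos_half:
            seen += 1
    return seen / len(points)


def retrieve_frames(memory, cam_pos, cam_forward, top_k=2):
    """Return ids of the stored frames whose geometry is most visible now."""
    scored = [(visible_fraction(f.points, cam_pos, cam_forward), f.frame_id)
              for f in memory]
    scored = [s for s in scored if s[0] > 0.0]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [frame_id for _, frame_id in scored[:top_k]]


memory = [
    StoredFrame(0, [(0, 0, 5), (1, 0, 6), (-1, 0, 4)]),  # ahead of the start pose
    StoredFrame(1, [(20, 0, 5), (21, 0, 6)]),            # far to the side
    StoredFrame(2, [(0, 0, -5)]),                        # behind the start pose
]

at_start = retrieve_frames(memory, (0, 0, 0), (0, 0, 1))
after_move = retrieve_frames(memory, (20, 0, 0), (0, 0, 1))
turned_back = retrieve_frames(memory, (0, 0, 0), (0, 0, -1))
```

In this toy setup the retrieved frames would serve only as a spatial reference for the generator, which matches the article's point that the video model, not the stored geometry, produces the final images.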

To counter the second problem, drift, the researchers deliberately expose the model to its own flawed outputs during training. The model thereby learns to recognize and correct quality degradation rather than simply passing errors along.
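
One way to picture this training idea is at the data-construction level. The article gives no details of Nvidia's training recipe, so the sketch below is an assumption throughout: `build_training_batch`, `noisy_rollout`, and the `self_ratio` parameter are hypothetical names, and the "frames" are plain numbers. It shows only the general pattern of sometimes conditioning on the model's own imperfect output while keeping the clean ground-truth target.

```python
import random


def build_training_batch(clips, rollout_fn, self_ratio=0.5, seed=0):
    """Build (context, target, source) triples for one training batch.

    clips: list of (clean_context, target) pairs taken from real video.
    rollout_fn: hypothetical callable returning the model's own, imperfect
        version of a context.
    self_ratio: probability of swapping the clean context for that rollout.
    """
    rng = random.Random(seed)
    batch = []
    for clean_context, target in clips:
        if rng.random() < self_ratio:
            # Condition on degraded input, but still supervise with the
            # clean target so the model learns to correct the degradation.
            batch.append((rollout_fn(clean_context), target, "self"))
        else:
            batch.append((clean_context, target, "clean"))
    return batch


def noisy_rollout(context, gain=0.9):
    """Stand-in for a model rollout: dims every frame value slightly."""
    return [round(value * gain, 3) for value in context]


clips = [([1.0, 1.0], 1.0), ([0.5, 0.6], 0.7), ([0.2, 0.1], 0.0)]
all_self = build_training_batch(clips, noisy_rollout, self_ratio=1.0)
all_clean = build_training_batch(clips, noisy_rollout, self_ratio=0.0)
```

The design choice worth noting is that the target never changes: only the conditioning input is degraded, so the training signal always points back toward clean output.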

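Lyra 2.0 outperforms six competing methods

According to Nvidia, in benchmark tests on two datasets Lyra 2.0 outperformed six other methods, including GEN3C, Yume-1.5, and CaM, on nearly all measured criteria, such as image quality, style consistency, and camera control. A faster variant of the model generates videos about 13 times faster at comparable quality.

The generated 3D scenes can be explored step by step through an interactive interface, extracted as meshes, and exported to physics engines such as Nvidia Isaac Sim. The company says this could allow robots to be trained in fully generated environments without capturing and building real-world 3D data. For now, however, Lyra 2.0 supports only static scenes.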
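
For readers unfamiliar with what a mesh export involves, here is a minimal sketch. The article does not say which file format Lyra 2.0 writes, so the choice of Wavefront OBJ is an assumption made only because it is a simple plain-text format, and `mesh_to_obj` is a hypothetical helper. Isaac Sim scenes are based on USD, so a file like this would still need to be imported or converted before use.

```python
def mesh_to_obj(vertices, faces):
    """Serialize a triangle mesh to Wavefront OBJ text.

    vertices: list of (x, y, z) tuples.
    faces: list of tuples of 0-based vertex indices.
    """
    for face in faces:
        if len(face) < 3 or any(i < 0 or i >= len(vertices) for i in face):
            raise ValueError(f"invalid face: {face}")
    lines = [f"v {x:.4f} {y:.4f} {z:.4f}" for x, y, z in vertices]
    # OBJ face indices are 1-based, so shift each index by one.
    lines += ["f " + " ".join(str(i + 1) for i in face) for face in faces]
    return "\n".join(lines) + "\n"


# A flat 1 m x 1 m ground patch made of two triangles (toy data).
ground_vertices = [(0, 0, 0), (1, 0, 0), (1, 0, 1), (0, 0, 1)]
ground_faces = [(0, 1, 2), (0, 2, 3)]
obj_text = mesh_to_obj(ground_vertices, ground_faces)
```

Writing `obj_text` to a file with an `.obj` extension yields a mesh that common 3D tools can open, which is enough to inspect geometry before any simulator-specific conversion.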

Original text (English)

Nvidia wants to scale robot simulation training with Lyra 2.0
Maximilian Schreiner | Apr 16, 2026

Key Points

Nvidia researchers present Lyra 2.0, a system that generates coherent 3D environments with an extension of up to 90 meters from a single photo.
The system stores already generated 3D geometry as orientation and trains specifically against quality losses in order to solve two central weaknesses of previous video models.
According to Nvidia, Lyra 2.0 outperforms six competitors and can export the generated scenes to physics engines such as Isaac Sim in order to train robots in generated environments.

Nvidia researchers have unveiled Lyra 2.0, a system that generates large, coherent 3D environments from a single photograph. The resulting scenes can be explored in real time and used directly in robot simulations.

Existing AI models for 3D scene generation struggle with long camera paths: the further the virtual camera moves from its starting point, the more colors and structures distort. When the camera returns to a previously seen location, the model often reinvents the environment from scratch. Nvidia researchers aim to solve this problem with Lyra 2.0.

The system takes a single photo and generates camera-controlled videos that simulate a virtual walkthrough of a scene. These videos are then automatically converted into 3D representations that can be viewed in real time and used in simulation environments. According to the research paper, the generated scenes can span roughly 90 meters.

How Lyra 2.0 fixes the two biggest problems in 3D scene generation

Current video models fail at two fundamental challenges, according to the researchers. First, the model forgets previously seen areas as soon as they leave the frame. Second, small errors accumulate during step-by-step video generation, building up into significant distortions over time.

To tackle the first problem, Lyra 2.0 stores the 3D geometry for every generated frame. When the camera moves back toward a previously visited area, the system retrieves the earlier frames and uses their spatial information as a reference. The video model still handles the actual image generation, which means errors in the stored geometry don't bleed directly into new frames.

To prevent drift, the researchers deliberately expose the model to its own flawed outputs during training. This teaches it to recognize and correct quality degradation instead of passing errors along.

Lyra 2.0 outperforms six competing methods

In benchmark tests on two datasets, Lyra 2.0 beats six other methods (including GEN3C, Yume-1.5, and CaM) across nearly all measured criteria like image quality, style consistency, and camera control, according to Nvidia. A faster variant of the model generates videos roughly 13 times quicker at comparable quality.

The generated 3D scenes can be explored step by step through an interactive interface and exported as meshes to physics engines like Nvidia Isaac Sim. This could let robots train in fully generated environments without needing to capture real-world 3D data, the company says. For now, though, Lyra 2.0 only supports static scenes.

Source: Nvidia (Blog) | Arxiv | HuggingFace
