The Decoder • 73 days ago

AI robots gain the ability to simulate outcomes before they move

IMP

8/10

Key summary

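A recently published review paper introduces a framework for "World Action Models" (WAMs), which can be trained on everyday video data alone and let robots simulate the consequences of an action before taking it. Where earlier models simply map camera images to actions, these models predict how the physical environment will change, which helps them generalize better to unfamiliar environments. The researchers analyzed about a hundred related papers and sorted them into two core architectures: one predicts the outcome first and then generates the action, the other does both at once.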

Translated article

By Jonathan Kemper | Image: Nano Banana Pro prompted by THE DECODER | May 17, 2026

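Key points: A recent review paper introduces a systematic framework for "World Action Models" (WAMs), a model class for robotics that lets AI systems train on unlabeled everyday videos. Unlike conventional approaches, WAMs don't just learn which action should follow a given camera image. They also simulate how the environment will change as a result of that action, effectively building an internal model of the physical world. The roughly 100 papers analyzed in the review fall into two main architectural categories. One line of work first generates a predicted future video and then derives control commands from it; the other processes visual input and actions simultaneously in parallel.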

Today's robotics AI has a basic weakness: models learn to map camera images directly to movements, so they don't understand how the real world changes as a result of their own actions. A new survey paper from Fudan University, the Shanghai Innovation Institute, and the National University of Singapore is the first to systematically catalog a class of models designed to close that gap: World Action Models.

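Robots that simulate their own near future

Existing vision-language-action models mostly learn direct mappings from observations to matching actions. World Action Models go a step further: they model how the environment will change, then couple that prediction to action generation. The authors say the practical payoff is significant. A model that simulates the consequences of a movement before executing it generalizes better to unfamiliar objects and settings. More importantly, it can learn from footage in which no robot actions are labeled at all, such as everyday first-person videos. That kind of data was nearly useless for traditional robotics AI.

Pure video generators can produce plausible future frames, but they aren't tied to control signals. A research team at Peking University recently drew exactly that distinction in its unified definition of world models. World Action Models meet both conditions at once: they predict the future and tie that prediction to control.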

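Two core architectures

The researchers sort about a hundred papers into two architectural lines. The first, Cascaded WAMs, works in two steps. A world model first generates an image or video of what the scene should look like next. Then a second module pulls the right control commands from that output. Early work like UniPi generates complete videos and derives motion through a learned inverse model. Other approaches like AVDC or 3DFlowAction use motion fields from which the robot's trajectory can be computed geometrically. Still others, such as VPP or LAPA, skip visible images entirely and predict the future in compressed, abstract representations. That saves the compute otherwise needed to render every single pixel.

The second line, Joint WAMs, combines both tasks in a single model. Work like GR-1, GR-2, or WorldVLA treats images and actions as a unified token sequence. Diffusion-based variants such as PAD, UWM, or DreamZero generate the future frame and the movement in parallel. Nvidia's Cosmos Policy can use the same architecture as a controller, a simulator, or an evaluation model. Nvidia pursues a similar dual role with DreamDojo, a world model that takes control commands and generates a simulated visual future from them. The survey also discusses π0.7, which uses the world model not as a replacement but as a supplier: it feeds imagined future frames into the context of a pretrained robotics AI, which then generates the movement.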
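
As a rough illustration of the cascaded, two-stage idea described above, the following Python sketch uses a toy 2-D gripper position in place of camera frames. Everything in it is a hypothetical stand-in: `imagine_next` plays the role of the world model, `inverse_dynamics` the role of the learned inverse model, and `env_step` the role of the robot. None of it reflects how UniPi or any other surveyed system is actually implemented.

```python
import math
from typing import Tuple

Obs = Tuple[float, float]     # toy "frame": a 2-D gripper position (assumption)
Action = Tuple[float, float]  # toy control command: a displacement (assumption)


def imagine_next(obs: Obs, goal: Obs, max_step: float = 0.25) -> Obs:
    """Stage 1, stand-in for a world model: predict the next observation.

    Here the "prediction" is simply a bounded step toward the goal.
    """
    dx, dy = goal[0] - obs[0], goal[1] - obs[1]
    dist = math.hypot(dx, dy)
    if dist <= max_step:
        return goal
    return (obs[0] + dx / dist * max_step, obs[1] + dy / dist * max_step)


def inverse_dynamics(obs: Obs, predicted: Obs) -> Action:
    """Stage 2, stand-in for an inverse model: the action that explains
    the change from the current observation to the predicted one."""
    return (predicted[0] - obs[0], predicted[1] - obs[1])


def env_step(obs: Obs, action: Action) -> Obs:
    """Toy environment: applying a displacement moves the gripper."""
    return (obs[0] + action[0], obs[1] + action[1])


def run_cascaded(start: Obs, goal: Obs, max_iters: int = 100,
                 tol: float = 1e-9) -> Tuple[Obs, int]:
    """Alternate prediction and action extraction until the goal is reached."""
    obs, steps = start, 0
    while math.hypot(goal[0] - obs[0], goal[1] - obs[1]) > tol and steps < max_iters:
        predicted = imagine_next(obs, goal)          # imagine the next frame
        action = inverse_dynamics(obs, predicted)    # derive the control command
        obs = env_step(obs, action)                  # execute it
        steps += 1
    return obs, steps


if __name__ == "__main__":
    final_obs, n_steps = run_cascaded((0.0, 0.0), (1.0, 0.0))
    print(final_obs, n_steps)
```

The point of the split is that the first stage never sees an action label, which is why, in principle, it could be trained on video alone, while only the second stage needs paired observations and actions.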

The real bottleneck is data

A whole chapter of the paper digs into where training data comes from. Four sources shape the field. Teleoperation data from remotely controlled robots is precise but expensive.

Original article (English)

World Action Models give robots the ability to simulate consequences before they move

Jonathan Kemper | May 17, 2026 | Image: Nano Banana Pro prompted by THE DECODER

Key Points

A recent review paper introduces a systematic framework for "World Action Models" (WAMs), a model class for robotics that enables AI systems to be trained using unlabeled everyday videos. Unlike conventional approaches, WAMs don't just learn which action should follow a given camera image. They also simulate how the environment will change as a result of that action, effectively building an internal model of the physical world. The roughly one hundred papers analyzed in the review fall into two main architectural categories. One line of work first generates a predicted future video and then derives control commands from it, while the other processes visual input and actions simultaneously in parallel.

Today's robotics AI has a basic weakness: models learn to map camera images directly to movements. But they don't understand how the world actually changes as a result of their actions. A new survey paper from Fudan University, the Shanghai Innovation Institute, and the National University of Singapore is the first to systematically catalog a class of models designed to close that gap: World Action Models.

Robots that simulate their own near future

Existing vision-language-action models mostly learn direct mappings from observations to matching actions. World Action Models go further. They also model how the environment will likely change, then couple that prediction to action generation.

The payoff is practical, the authors say. A model that simulates the consequences of a movement before executing it generalizes better to unfamiliar objects and settings. More importantly, it can learn from video footage where no robot actions are labeled at all, such as everyday first-person videos. That kind of data was nearly useless for traditional robotics AI.

Pure video generators can produce plausible future frames, but they aren't tied to control signals. A research team at Peking University recently drew exactly that distinction in its unified definition of world models. World Action Models meet both conditions at once.

Two core architectures

The researchers sort about a hundred papers into two architectural lines. The first, Cascaded WAMs, works in two steps. A world model first generates an image or video of what the scene should look like next. Then a second module pulls the right control commands from that output. Early work like UniPi generates complete videos and derives motion through a learned inverse model.

Other approaches like AVDC or 3DFlowAction use motion fields from which the robot's trajectory can be computed geometrically. Still others, VPP or LAPA for instance, skip visible images entirely and predict the future in compressed, abstract representations. That saves the compute otherwise needed to render every single pixel.

The second line, Joint WAMs, combines both tasks in a single model. Work like GR-1, GR-2, or WorldVLA treats images and actions as a unified token sequence. Diffusion-based variants such as PAD, UWM, or DreamZero generate the future frame and the movement in parallel. Nvidia's Cosmos Policy can use the same architecture as a controller, a simulator, or an evaluation model.

Nvidia pursues a similar dual role with DreamDojo, a world model that takes control commands and generates a simulated visual future from them. The survey also discusses π0.7, which uses the world model not as a replacement but as a supplier. It feeds imagined future frames into the context of a pretrained robotics AI, which then generates the movement.

The real bottleneck is data

A whole chapter digs into where training data comes from. Four sources shape the field.

Teleoperation data from remotely controlled robots is precise but expensive and limited to a handful of environments. Datasets like Open X-Embodiment or DROID try to fix that by pooling data from many labs.

Portable demo tools like the Universal Manipulation Interface sidestep hardware dependency: people perform tasks with handheld grippers in everyday settings. The RDT2 dataset collects about 10,000 hours of material this way.

Simulations like RoboCasa or RoboTwin 2.0 deliver unlimited trajectories with perfect depth data but suffer from the well-known sim-to-real gap. Nvidia leans hard into this approach with GR00T N1, training humanoid robots mostly in synthetic environments.

Egocentric everyday videos from Ego4D offer unlimited variety but contain no action labels. This is where World Action Models show their edge. They could use those videos to predict future frames even when no motion data is available.

Evaluation can't keep up with development

The authors are especially critical about how well these models are actually tested. Visual quality gets measured with standard metrics like PSNR or FVD, but those say little about whether a video is physically plausible.

Specialized benchmarks test different slices of physical plausibility. VideoPhy evaluates physical interaction scenarios. Physics-IQ tests predictions of real physical events from video frames. WorldModelBench checks explicit rules like gravity, conservation of mass, rigid body mechanics, and impenetrability.

One especially sharp finding comes from the "Wow, Where, Val!" benchmark. It checks whether a generated video can actually yield an executable movement. Many visually convincing models drop to near-zero success rates on this test, the survey reports. So a video can look realistic and still contain nothing useful for control. The authors call this the core problem: there's no metric for whether the imagined future and the executed movement are causally consistent.
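
To make the limitation of pixel-level metrics concrete, here is a minimal Python sketch of PSNR using its standard definition, 10 * log10(MAX^2 / MSE). The 8-pixel "frames" are made-up toy data, not taken from the survey or from any benchmark: one generated frame drops the object entirely, the other keeps it but is washed out, and the two score almost the same.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], generated: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("frames must be non-empty and the same size")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 1-D "frames" (hypothetical data): a single bright object at index 3.
reference = [0, 0, 0, 255, 0, 0, 0, 0]
object_vanished = [0, 0, 0, 0, 0, 0, 0, 0]        # physically wrong: object is gone
washed_out = [90, 90, 90, 165, 90, 90, 90, 90]    # object still there, low contrast

if __name__ == "__main__":
    print("object vanished:", round(psnr(reference, object_vanished), 2), "dB")
    print("washed out:     ", round(psnr(reference, washed_out), 2), "dB")
```

Because PSNR only averages per-pixel error, it cannot tell a frame that violates object permanence from one that is merely low contrast, which is in line with the authors' complaint that such metrics say little about physical plausibility.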

Validation for Yann LeCun's JEPA approach

So far, the authors say, no controlled study compares the different architectures under identical conditions. Nearly all models work only with camera images, even though tasks with fine contact need tactile and force data. Compute is still a bottleneck, too. DreamZero manages about seven predictions per second; traditional robot controllers run at around fifty.

The authors also raise a safety question. A model that confidently predicts a wrong future can kick off long action chains that are hard to stop. But that same predictive ability could also check planned movements against physical rules before they're executed.

Meta's V-JEPA 2 showed a few months ago that self-supervised video world models can skip generating visible pixels entirely, predicting only abstract representations of the future instead. The survey authors see this as one of the most promising ways to cut the heavy compute cost of explicit video generation without losing the physical grounding that makes predictions useful.

A full list of all discussed papers is available on GitHub.

Source: Arxiv

Robotics, world models, video learning, simulation, AI architecture