The Decoder • 106 days ago

AI generates 45-minute real-time lip-synced video from a single photo

IMP

8/10

Key Summary

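Researchers have introduced LPM 1.0, an AI model that generates real-time video of a speaking, listening, or singing character from a single image. The model connects to voice AI systems such as ChatGPT, reportedly streams stably for up to 45 minutes, and supports photorealistic, anime, and 3D game character styles without additional training. The technology comes close to real-time deepfake capability, but it remains a research project: the team has no plans to release it and would consider opening access only once adequate safeguards are in place.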

Translated Article

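Researchers have introduced LPM 1.0, an AI model that generates real-time video of a person speaking, listening, or singing from a single image. The model processes text, audio, and reference images simultaneously to produce lip-synced speech, subtle expressions such as hesitation or gaze shifts, and emotional transitions. It can also connect directly to voice AI models such as ChatGPT or Doubao to create a real-time visual conversation partner.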

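LPM 1.0 works across a range of image styles, including photorealistic faces, anime, and 3D game characters, without additional training. Video generation runs as a real-time streaming process rather than rendering a finished video all at once. The system reportedly remains stable for videos up to 45 minutes long.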
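The difference between streaming generation and rendering a finished video all at once can be illustrated with a minimal Python sketch. This is not LPM 1.0's implementation, which has not been released; `render_frames`, `stream_video`, and `render_offline` are hypothetical names, and the "frames" are placeholder strings standing in for model output.

```python
from typing import Iterable, Iterator, List


def render_frames(audio_chunk: List[float], chunk_index: int) -> List[str]:
    """Hypothetical stand-in for the model: one placeholder frame label per audio sample."""
    return [f"chunk{chunk_index}-frame{i}" for i in range(len(audio_chunk))]


def stream_video(audio_chunks: Iterable[List[float]]) -> Iterator[str]:
    """Yield frames as each audio chunk arrives instead of waiting for the full clip."""
    for index, chunk in enumerate(audio_chunks):
        yield from render_frames(chunk, index)


def render_offline(audio_chunks: Iterable[List[float]]) -> List[str]:
    """Contrast case: collect every frame before returning anything."""
    return list(stream_video(audio_chunks))
```

In the streaming version, a viewer can receive the first frame as soon as the first audio chunk has been processed, whereas the offline version returns nothing until every chunk is done.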

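LPM 1.0 uses a technique the researchers call "multi-granularity identity conditioning": alongside the main image, the model receives reference images showing different angles and facial expressions. As a result, it does not have to invent details such as teeth, wrinkles tied to specific emotions, or profile views; it can draw them directly from the reference material.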

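The model recognizes three conversational states. When listening, it generates reactive expressions such as nodding or gaze shifts in response to incoming audio. When speaking, the response audio drives lip movements and body language. During pauses, it generates natural idle behavior based on text instructions.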
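The three states and the input that drives each one can be summarized in a small Python sketch. It only restates the article's description as a lookup table; the names `ConversationState`, `DRIVING_SIGNAL`, and `select_state` are hypothetical. The rule that speaking takes precedence when both audio streams are active is an assumption, since the source does not say how overlap is handled.

```python
from enum import Enum


class ConversationState(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"
    IDLE = "idle"


# Which input conditions the generated video in each state, per the article.
DRIVING_SIGNAL = {
    ConversationState.LISTENING: "incoming audio",
    ConversationState.SPEAKING: "response audio",
    ConversationState.IDLE: "text instructions",
}


def select_state(incoming_audio_active: bool, response_audio_active: bool) -> ConversationState:
    """Pick a conversational state from which audio streams are currently active."""
    # Assumption: speaking wins if both streams are active; the source does not specify this.
    if response_audio_active:
        return ConversationState.SPEAKING
    if incoming_audio_active:
        return ConversationState.LISTENING
    return ConversationState.IDLE
```

For example, `DRIVING_SIGNAL[select_state(False, False)]` evaluates to `"text instructions"`, matching the idle behavior described for pauses.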

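According to project manager Ailing Zeng, LPM 1.0 also supports offline video generation from existing audio in addition to real-time conversation, which could be useful for podcasts or film dialogue scenes. This opens up content creation outside of live chat. Video-based input control is not included in this version, but Zeng added that the framework could support it in the future.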

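Still a research project with no public release planned

The development team stresses that LPM 1.0 is purely a research project. There are currently no plans to release weights, code, or a public demo. All faces shown in the videos are AI-generated, not real people. The researchers acknowledge that the generated videos still contain visible artifacts and that a quantitative analysis shows a noticeable gap compared to real video quality. They also say they would consider opening access only if adequate safeguards and responsible-use frameworks are firmly in place.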

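Even as a research project, LPM 1.0 shows where AI is heading: systems that do not just communicate through text or voice, but appear as visually believable characters with facial expressions, eye contact, and emotional reactions. That could prove valuable in education, gaming, customer service, or virtual companions.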

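At the same time, the technology carries serious risks. It comes dangerously close to real-time deepfake infrastructure that bad actors could exploit for fraud, manipulation, or impersonation. These things are already happening; what keeps shrinking is the barrier to entry.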

Original Article (English)

New AI model generates 45-minute lip-synced video from one photo and runs in real time

Matthias Bastian
Apr 13, 2026

Image: Nano Banana Pro prompted by THE DECODER

Key Points

Researchers have introduced LPM 1.0, an AI model that generates real-time video of a speaking, listening, or singing character from just a single image, complete with lip-synced speech, subtle facial expressions like hesitation or gaze shifts, and smooth emotional transitions.

The model plugs directly into voice AI systems like ChatGPT and works across a wide range of visual styles, including photorealistic faces, anime, and 3D game characters.

The entire video generation runs as a real-time streaming process, with the system reportedly staying stable for videos up to 45 minutes long.

Researchers have introduced LPM 1.0, an AI model that generates real-time video of a speaking, listening, or singing figure from a single image. The model processes text, audio, and reference images simultaneously, producing lip-synchronized speech, subtle facial expressions like hesitation or shifts in gaze, and emotional transitions. It can plug directly into voice-audio AI models from ChatGPT or Doubao to create a visual conversation partner in real time.

LPM 1.0 works across different image styles, photorealistic faces, anime, and 3D game characters, without any additional training. The entire video generation runs as a streaming process in real time rather than rendering a finished video all at once. Videos up to 45 minutes long should remain stable.

LPM 1.0 utilizes what the researchers call "multi-granularity identity conditioning": alongside a main image, the model also receives reference images from different angles and with varying facial expressions. This means it doesn't have to invent details like teeth, wrinkles tied to specific emotions, or profile views on its own; it can pull them directly from the reference material.

The model recognizes three conversational states. When listening, it generates reactive facial expressions like nodding or gaze shifts based on incoming audio. When speaking, the response audio drives lip movements and body language. During pauses, LPM generates natural idle behavior based on text instructions.

Beyond real-time conversation, LPM 1.0 also supports offline video generation from existing audio, useful for podcasts or movie dialogs, according to project manager Ailing Zeng. This opens the door to content creation outside of live chats. Video-based input control isn't included in this version, but Zeng says the framework could support it in the future.

Still a research project with no public release planned

The development team stresses that LPM 1.0 is purely a research project. There are no plans to release weights, code, or a public demo. All faces shown are AI-generated, not real people. The researchers acknowledge that the generated videos still contain visible artifacts, and a quantitative analysis confirmed a noticeable gap compared to real video quality. The team also says they'd only consider opening access "if and when adequate safeguards and responsible-use frameworks are firmly in place." More details are available on the project page and in the technical report.

Even as a research project, LPM 1.0 points to where things are heading: AI systems that don't just communicate through text or voice, but show up as visually believable characters with facial expressions, eye contact, and emotional reactions. That could prove valuable for education, gaming, customer service, or virtual companions.

At the same time, the technology carries serious risks. It edges dangerously close to real-time deepfake infrastructure that bad actors could exploit for fraud, manipulation, or impersonation. All of those things are already happening, what keeps shrinking is the barrier to entry. The researchers are explicit that the system is not meant to mislead, deceive, or impersonate real people.

Source: Project page

Video generation • Real-time AI • Deepfake • Avatar • Research paper