MarkTechPost • 69 days ago

ByteDance releases Lance, a unified multimodal AI for image and video understanding, generation, and editing

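Key summary

ByteDance has released Lance, a unified model that handles image and video understanding, generation, and editing within a single model. It adopts a dual-stream mixture-of-experts (MoE) architecture in which understanding and generation are each handled by a separate expert network, which lets it reach high performance without interference between tasks. One model spans three modalities (text, image, and video), marking an important milestone for visual AI.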

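Translated text

Building a single model that can both understand and generate images and videos is harder than it sounds, because the two tasks pull the model in opposite directions. Understanding benefits from high-level semantic features that are tightly aligned with language, whereas generation needs low-level continuous representations that preserve texture, geometric structure, and temporal dynamics. Most systems resolve this tension by splitting the two tasks into different architectures and connecting them after the fact.

The ByteDance research team took a different approach with Lance. Rather than assembling separate components, the team designed a model that natively integrates understanding, generation, and editing across the image and video modalities, and trained it jointly from the start.

Key capabilities of Lance

Lance organizes its capabilities into three output families: text (X2T), images (X2I), and videos (X2V). On the understanding side, it covers image and video captioning, visual question answering (VQA), optical character recognition (OCR), visual grounding, and reasoning. On the generation side, it handles text-to-image, text-to-video, and image-to-video generation, as well as subject-driven generation, image editing, and video editing, including multi-turn consistency editing across both modalities. This all-in-one capability is an important milestone. Typical unified architectures often stop at basic image understanding and text-to-image generation, while Lance is one of the few models that natively bridges the whole image-video ecosystem across both understanding and generation tasks.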

How the architecture works

The architecture rests on two principles: unified context modeling and decoupled capability pathways. For unified context, Lance converts every input (text, images, and videos) into a single shared interleaved multimodal sequence. Text tokens come from the Qwen2.5-VL embedding layer. For understanding-oriented visual inputs, the Qwen2.5-VL ViT encoder produces compact semantic visual tokens. For generation-oriented visual inputs, the Wan2.2 3D causal VAE (variational autoencoder) encoder encodes images and videos into continuous latent representations, applying 16× spatial downsampling and 4× temporal downsampling. All of these heterogeneous token types (text, semantic visual, and latent visual) live in the same sequence. The model then runs generalized 3D causal attention over the full context, with text tokens using causal attention and visual tokens using bidirectional attention.
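The attention pattern described above can be illustrated with a small mask builder. This is only a sketch under stated assumptions, not Lance's implementation: it assumes that "generalized causal attention" means every token sees all earlier segments, text tokens are causal within their own segment, and visual tokens are bidirectional within theirs. The `Segment` type and the function names are hypothetical.

```python
from typing import List, Tuple

# Hypothetical segment description: (kind, length), kind is "text" or "visual".
Segment = Tuple[str, int]


def build_attention_mask(segments: List[Segment]) -> List[List[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    Assumed rule (not confirmed by the source): earlier segments are always
    visible; inside a segment, text is causal and visual is bidirectional.
    """
    seg_id: List[int] = []
    kinds: List[str] = []
    for idx, (kind, length) in enumerate(segments):
        seg_id.extend([idx] * length)
        kinds.extend([kind] * length)

    n = len(seg_id)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if seg_id[k] < seg_id[q]:
                mask[q][k] = True  # any earlier segment is visible
            elif seg_id[k] == seg_id[q]:
                # same segment: visual sees the whole segment, text only the past
                mask[q][k] = kinds[q] == "visual" or k <= q
    return mask


def render(mask: List[List[bool]]) -> List[str]:
    """Draw the mask as rows of 'x' (visible) and '.' (masked)."""
    return ["".join("x" if v else "." for v in row) for row in mask]


if __name__ == "__main__":
    # Two text tokens, three visual tokens, two text tokens.
    for row in render(build_attention_mask([("text", 2), ("visual", 3), ("text", 2)])):
        print(row)
```

In the printed grid, the three visual rows are identical because each visual token sees its whole segment, while the text rows form the usual lower-triangular staircase.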

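For decoupled pathways, Lance uses a dual-stream mixture-of-experts (MoE) architecture initialized from Qwen2.5-VL 3B. The understanding expert (LLMUND) processes text and semantic visual tokens and produces outputs for multimodal reasoning and text generation. The generation expert (LLMGEN) processes VAE latent tokens for visual synthesis and editing. Most importantly, both experts operate over the same shared interleaved sequence, so they share context but do not compete for the same parameters. The understanding expert is trained with a next-token prediction loss, and the generation expert is trained with a flow matching objective in continuous latent space. The two losses are combined with configurable weights throughout training.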
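A minimal sketch of the routing rule and the combined objective follows. It makes several assumptions the source does not confirm: the token-kind labels ("text", "vit", "vae") are hypothetical, the flow matching target uses a rectified-flow style straight path from noise to data, and the weights `w_und` and `w_gen` are placeholder names for the configurable weights mentioned above.

```python
import math
from typing import List, Sequence


def route(token_kind: str) -> str:
    """Pick an expert for a token kind (labels are hypothetical)."""
    if token_kind in ("text", "vit"):
        return "understanding"
    if token_kind == "vae":
        return "generation"
    raise ValueError(f"unknown token kind: {token_kind}")


def next_token_loss(probs_of_target: Sequence[float]) -> float:
    """Mean negative log-likelihood assigned to the correct next tokens."""
    return -sum(math.log(p) for p in probs_of_target) / len(probs_of_target)


def interpolate(noise: Sequence[float], data: Sequence[float], t: float) -> List[float]:
    """Point on the assumed straight path x_t = (1 - t) * noise + t * data."""
    return [(1 - t) * n + t * d for n, d in zip(noise, data)]


def flow_matching_loss(
    noise: Sequence[float], data: Sequence[float], predicted_velocity: Sequence[float]
) -> float:
    """Mean squared error against the straight-path velocity (data - noise)."""
    target = [d - n for n, d in zip(noise, data)]
    return sum((p - v) ** 2 for p, v in zip(predicted_velocity, target)) / len(target)


def combined_loss(l_und: float, l_gen: float, w_und: float = 1.0, w_gen: float = 1.0) -> float:
    """Weighted sum of the understanding and generation losses."""
    return w_und * l_und + w_gen * l_gen


if __name__ == "__main__":
    l_und = next_token_loss([0.5, 0.25])
    l_gen = flow_matching_loss([0.0, 1.0], [1.0, 3.0], [0.5, 2.0])
    print(route("vit"), route("vae"), combined_loss(l_und, l_gen, 1.0, 0.5))
```

Because each loss only touches the tokens routed to its own expert, changing the weights shifts the balance between the two objectives without the experts sharing parameters.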

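Modality-Aware Rotary Positional Encoding (MaPE)

Running ViT semantic tokens, clean VAE condition tokens, and noisy VAE target tokens through the same sequence creates a subtle problem. Standard 3D-RoPE encodes positions from spatiotemporal layout alone, so it has no way to tell these token groups apart. When several visual token groups occupy the same sequence, their positional boundaries become ambiguous, which can hurt cross-task alignment. Lance introduces Modality-Aware Rotary Positional Encoding (MaPE) to address this. MaPE applies a fixed temporal offset to each modality group based on its index in the sequence. Spatial coordinates are left unchanged, so the intrinsic layout within images and videos is preserved. The temporal offset alone is enough to separate the token groups in the global positional space without disturbing the temporal order within any individual video.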
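The offset idea can be sketched on plain (t, h, w) position indices, before any rotary embedding is computed. The formula below, group index times a fixed `stride`, is an assumption chosen for illustration; the source only says that each modality group receives a fixed temporal offset based on its index in the sequence. The helper names are hypothetical.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (t, h, w) index of one visual token


def grid_positions(frames: int, height: int, width: int) -> List[Position]:
    """Plain 3D positions for one visual token group, starting at t = 0."""
    return [(t, h, w) for t in range(frames) for h in range(height) for w in range(width)]


def apply_temporal_offset(groups: List[List[Position]], stride: int) -> List[List[Position]]:
    """Shift group g by g * stride along time only; h and w stay untouched.

    The stride must exceed the longest group's frame count so that
    shifted groups cannot overlap (an assumption of this sketch).
    """
    return [
        [(t + g * stride, h, w) for (t, h, w) in group]
        for g, group in enumerate(groups)
    ]


if __name__ == "__main__":
    # Three groups with the same layout, e.g. ViT tokens, clean VAE condition
    # tokens, and noisy VAE target tokens: 2 frames of 1 x 2 patches each.
    groups = [grid_positions(2, 1, 2) for _ in range(3)]
    shifted = apply_temporal_offset(groups, stride=100)
    print(len({p for g in groups for p in g}))   # distinct positions before the offset
    print(len({p for g in shifted for p in g}))  # distinct positions after the offset
```

Before the offset, the three groups collapse onto the same four positions; afterwards all twelve tokens have distinct positions, while the order of frames inside each group is unchanged.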

Original text (English)

Building a single model that can both understand and generate images and videos is harder than it sounds. The two tasks pull in opposite directions. Understanding benefits from high-level semantic features tightly aligned with language. Generation needs low-level continuous representations that preserve texture, geometry, and temporal dynamics. Most systems handle this tension by separating the two into distinct architectures, then bridging them post hoc.

The ByteDance research team took a different approach with Lance. Rather than assembling separate components, the team designed a model that natively integrates understanding, generation, and editing across both image and video modalities, trained jointly from the start.

What Lance Can Do

Lance organizes its capabilities into three output families: text (X2T), images (X2I), and videos (X2V). On the understanding side, this covers image and video captioning, visual question answering, OCR, visual grounding, and reasoning. On the generation side, it handles text-to-image, text-to-video, image-to-video, subject-driven generation, image editing, and video editing, including multi-turn consistency editing across both modalities.

This all-in-one capability is a major milestone. While standard unified architectures typically stop at basic image understanding and text-to-image generation, Lance is among the few to natively bridge the entire image-video ecosystem across both understanding and generation tasks.

How the Architecture Works

The architecture is based on two principles: unified context modeling and decoupled capability pathways. For unified context, Lance converts all inputs (text, images, and videos) into a single shared interleaved multimodal sequence. Text tokens come from the Qwen2.5-VL embedding layer. For understanding-oriented visual inputs, the Qwen2.5-VL ViT encoder produces compact semantic visual tokens.
For generation-oriented visual inputs, the Wan2.2 3D causal VAE encoder encodes images and videos into continuous latent representations, applying 16× spatial downsampling and 4× temporal downsampling. All these heterogeneous token types (text, semantic visual, and latent visual) live in the same sequence. The model then runs generalized 3D causal attention over the full context, with text tokens using causal attention and visual tokens using bidirectional attention.

For decoupled pathways, Lance uses a dual-stream mixture-of-experts architecture initialized from Qwen2.5-VL 3B. The understanding expert (LLMUND) handles text and semantic visual tokens, producing outputs for multimodal reasoning and text generation. The generation expert (LLMGEN) handles VAE latent tokens for visual synthesis and editing. Crucially, both experts operate over the same shared interleaved sequence: they share context but don't compete for the same parameters. The understanding expert is trained with a next-token prediction loss; the generation expert is trained with a flow matching objective in continuous latent space. The two losses are combined with configurable weights throughout training.

Modality-Aware Rotary Positional Encoding (MaPE)

Running ViT semantic tokens, clean VAE condition tokens, and noisy VAE target tokens through the same sequence creates a subtle problem. Standard 3D-RoPE encodes positions based on spatiotemporal layout alone, so it has no way to tell these token groups apart. When multiple visual token groups occupy the same sequence, their positional boundaries become ambiguous, which can hurt cross-task alignment. Lance introduces Modality-Aware Rotary Positional Encoding (MaPE) to fix this. MaPE applies a fixed temporal offset to each modality group based on its index in the sequence. Spatial coordinates stay unchanged, so the intrinsic layout within images and videos is preserved.
The temporal offset alone is enough to separate the token groups in the global positional space without disrupting temporal ordering within any individual video. Removing MaPE drops GenEval from 80.94 to 80.56, GEdit-Bench from 6.86 to 6.30, and VBench from 81.81 to 80.95, a consistent degradation across image generation, image editing, and video generation.

Training: Four Stages, One Unified Framework

Lance is trained through four sequential stages, each building on the last.

Pre-Training (PT) lays the foundation using approximately 1B image-text and 140M video-text pairs, covering 1.5T training tokens. This stage establishes basic multimodal alignment and generation capability. The VAE and ViT encoders are frozen here; only the backbone and connectors are trained.

Continual Training (CT) expands the task space by introducing interleaved multi-task data (editing samples, subject-driven generation samples, and multimodal understanding data) across approximately 300B tokens. A progressive data-mixture schedule gradually increases the proportion of harder tasks like editing as training proceeds.

Supervised Fine-Tuning (SFT) tightens instruction following, editing accuracy, and identity consistency using curated high-quality data across 72B tokens.

Reinforcement Learning (RL) uses Group Relative Policy Optimization (GRPO), with PaddleOCR serving as the reward model, to further sharpen text rendering accuracy and image-text alignment.

Everything fits within a maximum training budget of 128 GPUs.

Results

Image Generation. On GenEval, Lance scores 0.90 overall, matching TUNA for the top spot among unified models. Subcategory scores include counting (0.84), colors (0.97), and spatial position (0.87). On DPG-Bench, Lance scores 84.67 overall, with particularly strong relation modeling, though TUNA (86.76) and TUNA-2 (86.54) lead that benchmark. To put the parameter efficiency in perspective: Janus-Pro-7B scores 0.80 on GenEval; Show-o2 (7B) scores 0.76.
Lance matches the top unified model score at 3B activated parameters.

Video Generation. On VBench, Lance achieves a Total Score of 85.11 (using LLM rewriting), the highest among unified models. The next-best unified model, TUNA, scores 84.06. Lance also outscores dedicated generation-only models including HunyuanVideo (83.43) and Wan2.1-T2V (83.69).

Image Editing. On GEdit-Bench, Lance scores 7.30 Avg/G_O, the highest among unified models. It leads in background change, material modification, motion change, portrait beautification, subject removal, subject replacement, and tone transfer. Text modification is flagged as a remaining weakness.

Video Understanding. On MVBench, Lance achieves a 62.0 overall score, the highest among unified models. Show-o2 (7B), the next-best unified model, scores 55.7. Lance also outperforms several understanding-only models with more parameters, which is notable given that it is simultaneously trained for generation and editing.

Marktechpost's Visual Explainer: How-To Guide

Getting Started with Lance by ByteDance

A step-by-step guide to installing and running Lance, a 3B native unified multimodal model for image & video understanding, generation, and editing.

Step 01: Prerequisites. Check Your Environment First

Before cloning the repository, confirm your system meets the minimum software and hardware requirements. Lance requires CUDA-capable hardware with significant VRAM.

Python: 3.10 or higher (required)
CUDA: 12.4 or higher (required)
GPU VRAM: 40 GB minimum (for inference)
License: Apache 2.0 (open source)

Note: A GPU with at least 40 GB VRAM is required for running inference. CUDA 12.4+ is mandatory; lower versions are not officially supported.

Step 02: Clone the Repository. Clone from GitHub

Clone the official Lance repository from ByteDance on GitHub. The repository includes the inference scripts, Gradio interface, benchmark scripts, and model configuration files.
    git clone https://github.com/bytedance/Lance
    cd Lance

The repository structure you will see after cloning:

    inference_lance.py    Main inference script for all tasks
    inference_lance.sh    Shell wrapper with configurable