The Decoder • 76 days ago

Alibaba's Qwen-Image-2.0 doubles compression and cuts generation steps from 40 to 4

IMP

8/10

Key summary

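According to Alibaba's technical report on Qwen-Image-2.0, a new VAE (variational autoencoder) doubles spatial compression to 16x, and distilling the model cuts image generation from 40 steps to just 4. High-quality, complex images can therefore be generated much faster and with less compute, which matters because it makes practical image-generation pipelines considerably more efficient.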

Translated text

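Alibaba's technical report on Qwen-Image-2.0 describes how the team improved efficiency in both training and inference. The key changes are a VAE with a higher compression ratio, a reworked image transformer, and a dedicated module that expands users' terse prompts into rich descriptions.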

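Image models do not work on raw pixels directly. Instead, a separate neural network called a VAE (variational autoencoder) compresses each image into a much smaller latent representation and then reconstructs the image from it. The more aggressively this network compresses, the faster and cheaper the image model is to train. Most open-source models use a compressor that shrinks the image 8x in each direction; FLUX.1-dev and HunyuanVideo both do. According to the technical report, Qwen-Image-2.0 applies 16x spatial downsampling, doubling that factor.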
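To make the effect of the downsampling factor concrete, the sketch below counts latent grid positions for one image size. The function names and the 1024x1024 example are illustrative assumptions. The article does not state the latent channel count or any patching step, so the numbers cover only the spatial grid.

```python
def latent_grid(height: int, width: int, factor: int) -> tuple[int, int]:
    """Spatial size of the latent grid for a per-axis downsampling factor."""
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by the downsampling factor")
    return height // factor, width // factor


def grid_positions(height: int, width: int, factor: int) -> int:
    """Number of spatial positions in the latent grid."""
    rows, cols = latent_grid(height, width, factor)
    return rows * cols


if __name__ == "__main__":
    # Hypothetical 1024x1024 input, compared at the two factors named in the article.
    for factor in (8, 16):
        rows, cols = latent_grid(1024, 1024, factor)
        print(f"{factor}x downsampling: {rows}x{cols} = {rows * cols} positions")
```

Because the factor applies to both axes, doubling it from 8 to 16 leaves a quarter as many grid positions for the image model to process.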

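Doubling the compression ratio normally damages fine detail, but the Qwen team compensates in two ways. First, skip connections inside the compressor route fine-grained image information around the bottleneck layers. Second, the latent space is shaped during training to capture semantically meaningful structure, giving the image model a cleaner space to work in. The team notes that this alignment pressure is strong only early in training and is reduced later.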

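Notably, one standard training component has been removed entirely. Most VAEs use a second neural network, a discriminator, that learns to tell real images from reconstructed ones and thereby pushes the output toward sharper results. The Qwen team drops it altogether, calling it "largely redundant" at scale and a source of training instability. Despite the aggressive compression, the VAE posts higher reconstruction scores on the standard ImageNet dataset than competing models that use lower compression ratios.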

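Transformer architecture changes tame runaway activations

Qwen-Image-2.0 is built around a transformer that processes text and image tokens in a single stream. Text conditioning comes from Qwen3-VL, a vision-language model whose weights are frozen. The team made two architectural changes to the transformer itself. First, it simplified an internal scaling mechanism: the original design multiplied the signal by a learned factor and added a learned offset, and now only the multiplication remains. Second, it replaced the feed-forward blocks between attention layers with SwiGLU, a variant in which two parallel paths gate each other.

The move to SwiGLU stems from a specific training problem. When the model learns text and images jointly, some internal values spike to extreme magnitudes, and neurons can become permanently saturated early in training. Large language model (LLM) researchers call this "massive activations." SwiGLU keeps the values within a workable range.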
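The two changes can be sketched in a few lines of plain Python. This is a minimal illustration, not the report's implementation: `swiglu_ffn` follows the common SwiGLU formulation used in language models (a SiLU-activated gate path multiplied elementwise with a linear path, then projected down), and `scale_and_shift` / `scale_only` are hypothetical names for the before and after of the scaling mechanism. The article gives no dimensions, bias terms, or detail on where the scaling is applied, so all of that is assumed here.

```python
import math


def silu(v: float) -> float:
    """SiLU (Swish) activation, v * sigmoid(v), written to avoid overflow."""
    if v >= 0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)


def matvec(weights: list[list[float]], x: list[float]) -> list[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Feed-forward block: down(silu(gate(x)) * up(x)), biases omitted."""
    gate = [silu(g) for g in matvec(w_gate, x)]   # activated gate path
    up = matvec(w_up, x)                          # parallel linear path
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
    return matvec(w_down, hidden)


def scale_and_shift(x, scale, shift):
    """Assumed original form: learned factor plus learned offset."""
    return [xi * s + b for xi, s, b in zip(x, scale, shift)]


def scale_only(x, scale):
    """Assumed simplified form: only the multiplication remains."""
    return [xi * s for xi, s in zip(x, scale)]
```

In `swiglu_ffn` the hidden values are a product of two projections of the same input, which is the "two parallel paths" structure the article describes. Whether and how that bounds activations in practice depends on training details the article does not give.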

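Reverse-engineered training data powers the prompt module

Complex outputs such as infographics and posters require detailed prompts, but real users often type only short, vague requests. Qwen-Image-2.0 closes this gap with an upstream module based on Qwen3.5-9B that turns terse input into a specific description.

The module was trained in an unusual way. Instead of manually pairing short prompts with detailed ones, the team started from existing rich image descriptions and systematically removed specifics such as lighting, texture, and layout until each read like something a casual user would type. Each deletion step automatically produced its own training signal: a recipe for adding the missing detail back.

The module is trained in two phases. First, it learns from these synthetic pairs. Then it generates candidate prompts, a frozen image generator renders results from them, and the module is optimized so that the results match the intent and look good.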
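The detail-stripping idea can be illustrated with a toy sketch. Everything here is an assumption for illustration: the article does not say how descriptions are represented or rewritten, so this version models a description as a subject plus labeled detail phrases and uses plain string joining where the real pipeline presumably uses a model. The names `render` and `build_pairs` are hypothetical.

```python
def render(subject: str, details: list[tuple[str, str]]) -> str:
    """Join a subject with its remaining detail phrases."""
    return ", ".join([subject] + [phrase for _, phrase in details])


def build_pairs(subject: str, details: list[tuple[str, str]]) -> list[dict]:
    """Strip one detail at a time; each step yields a terse-to-richer pair."""
    pairs = []
    remaining = list(details)
    while remaining:
        richer = render(subject, remaining)
        kind, _ = remaining.pop()  # drop the last remaining detail
        terser = render(subject, remaining)
        # The pair teaches the module to restore the detail just removed.
        pairs.append({"input": terser, "target": richer, "restored": kind})
    return pairs


if __name__ == "__main__":
    example = build_pairs(
        "a poster for a jazz festival",
        [
            ("layout", "title centered at the top"),
            ("texture", "grainy paper texture"),
            ("lighting", "warm stage lighting"),
        ],
    )
    for pair in example:
        print(pair["restored"], "|", pair["input"], "->", pair["target"])
```

Each deletion produces one training pair, so a description with three details yields three pairs, the last of which starts from the bare subject.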

Five reward models steer the final tuning

For the final round of alignment to human preferences, the team deploys five separate reward models.


Original text (English)

Alibaba's Qwen-Image-2.0 doubles compression and cuts generation steps from 40 to 4

Jonathan Kemper
May 14, 2026

Alibaba's technical report on Qwen-Image-2.0 lays out how the team squeezed more efficiency out of both training and inference. The big moves: a harder-compressing VAE, a reworked image transformer, and a dedicated module that expands bare-bones user prompts into rich descriptions.

Image models don't operate on raw pixels. Instead, a separate neural network, a variational autoencoder or VAE, compresses each image into a much smaller latent representation, then reconstructs the full image from it. The harder this network compresses, the faster and cheaper training becomes for the image model itself. Most open-source models use compressors that shrink images eightfold in each direction; FLUX.1-dev and HunyuanVideo both work this way, for example. Qwen-Image-2.0, according to the technical report, goes twice as far with 16-fold spatial downsampling.

Doubling the compression ratio normally destroys fine detail, but the Qwen team counters this two ways. First, skip connections in the compressor shuttle fine-grained image information around the bottleneck layers. Second, the team shapes the latent space during training so it captures semantically meaningful structures, giving the image model a cleaner workspace. Notably, the team says this alignment pressure is only strong early on and gets dialed back later.

One standard training component is completely absent. Most VAEs use a discriminator, a second network that learns to spot the difference between real and reconstructed images, pushing output toward sharper results. The Qwen team drops this entirely, calling it "largely redundant" at scale and a source of training instability. Even with the more aggressive compression, the VAE posts higher reconstruction scores on the standard ImageNet dataset than competitors using gentler compression ratios.
Transformer architecture changes tame runaway activations

Qwen-Image-2.0 is built around a transformer that processes text and image tokens in a single stream. Text conditioning comes from Qwen3-VL, a vision-language model whose weights stay frozen. The team made two architectural changes to the transformer itself. First, they stripped down an internal scaling mechanism. Where the original design multiplied the signal by a learned factor and added a learned offset, only the multiplication survives. Second, the team replaced the feed-forward blocks between attention layers with SwiGLU, a variant where two parallel paths gate each other.

The SwiGLU swap traces back to a specific training problem: when the model learns text and image jointly, some internal values spike to extreme magnitudes, and neurons can permanently saturate early in training. Large language model researchers call this "massive activations." SwiGLU keeps values in a workable range.

Reverse-engineered training data powers the prompt module

Complex outputs like infographics or posters demand detailed prompts. But real users type short, vague requests. Qwen-Image-2.0 handles this gap with an upstream module built on Qwen3.5-9B that turns terse input into fleshed-out descriptions.

Training this module took an unusual path. Rather than manually pairing short prompts with detailed ones, the team started with existing rich image descriptions and systematically stripped out specifics (lighting, textures, and layout) until each one read like something a casual user would type. Every deletion step automatically produced its own training signal: a recipe for adding the missing detail back in.

The module trains in two phases. First, it learns from these synthetic pairs. Then it generates candidate prompts, a frozen image generator renders results from them, and the module gets optimized so those results look good and match the intent.
Five reward models steer the final tuning

For the last round of alignment to human taste, the team deploys five separate reward models. Three score freshly generated images on aesthetics, prompt fidelity, and portrait quality. The other two grade edited images on how well they follow instructions without drifting from the original.

One pragmatic shortcut stands out in the reinforcement learning setup. Classifier-free guidance, a standard trick that sharpens diffusion model output, only runs when generating training examples, not during the optimization loop itself. That cuts compute costs without a visible hit to quality.

A self-correcting data pipeline

The team built a self-optimizing pipeline for managing training data. When evaluations or user feedback surface bad outputs, the system automatically bins each failure into one of three root causes. If reinforcement learning is at fault, the reward signal gets adjusted. If the model is missing knowledge, an automated search combs the training data for gaps and patches them with targeted new examples. If the prompt module is the weak link, it gets retrained. The report says humans only step in for final review and filtering.

Training data moves through six stages as image resolution ramps from 256 up to 2,048 pixels. The ratio of generation data to editing data also shifts, from 9:1 early on to 7:3 in later stages.

Distillation cuts inference from 40 steps to four

Diffusion models typically build images through dozens of small denoising steps. To speed up inference, the team distills the full model into a lighter version that only needs four steps instead of 40. The distillation process doesn't try to replicate the step-by-step generation path; it just matches the final output. Visual quality stays comparable, according to the report.

These technical details flesh out what Alibaba showed when it first announced the model earlier this year.
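The three-way failure routing in the self-correcting data pipeline can be sketched as a simple dispatch table. This is only an illustration under stated assumptions: the article does not describe how the system decides a failure's root cause, so the cause is taken as given input here, and `RootCause`, `REMEDIES`, and `plan_fixes` are hypothetical names.

```python
from enum import Enum


class RootCause(Enum):
    """The three root causes named in the article."""
    REINFORCEMENT_LEARNING = "reinforcement_learning"
    MISSING_KNOWLEDGE = "missing_knowledge"
    PROMPT_MODULE = "prompt_module"


# Remedy per root cause, paraphrased from the article.
REMEDIES = {
    RootCause.REINFORCEMENT_LEARNING: "adjust the reward signal",
    RootCause.MISSING_KNOWLEDGE: "search training data for gaps and add targeted examples",
    RootCause.PROMPT_MODULE: "retrain the prompt module",
}


def plan_fixes(failures: list[tuple[str, RootCause]]) -> dict[str, list[str]]:
    """Group failure ids by the remedy their root cause maps to."""
    plan: dict[str, list[str]] = {}
    for failure_id, cause in failures:
        plan.setdefault(REMEDIES[cause], []).append(failure_id)
    return plan


if __name__ == "__main__":
    # Hypothetical failure ids, already labeled with a root cause.
    print(plan_fixes([
        ("f1", RootCause.PROMPT_MODULE),
        ("f2", RootCause.MISSING_KNOWLEDGE),
        ("f3", RootCause.PROMPT_MODULE),
    ]))
```

Grouping by remedy rather than by individual failure reflects the article's description of fixes applied per cause, with humans stepping in only for final review and filtering.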
Qwen-Image-2.0 initially shipped only as an invite-only API beta on Alibaba Cloud and a demo inside Qwen Chat. In blind comparisons on Alibaba's in-house Arena platform, it lands just behind the current leaders. OpenAI's GPT-Image-2 holds the top spot, with Google's Nano Banana Pro in second. Across the board, the leading models have converged at a high level for photorealism, text rendering, and precise editing; the gaps between top systems are slim.

Open-source release for Qwen-Image-2.0 is still up in the air. The weights haven't shipped yet, though Alibaba released the first Qwen image model under Apache 2.0 roughly a month after launch. Qwen-Image-2.0 also joins a growing wave of Chinese image models pushing hard on accurate text rendering, including Meituan's LongCat image and Zhipu AI's GLM image.

Image generation, Alibaba, model architecture, VAE, open source