Hacker News • 106 days ago

Introspective Diffusion Language Models (I-DLM)

IMP

9/10

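Key Summary

This post introduces the Introspective Diffusion Language Model (I-DLM), which overcomes the limitations of diffusion language models (DLMs) and matches the quality of autoregressive (AR) models. It addresses the lack of "introspective consistency" in existing DLMs and, with half the parameters, outperforms the 16B LLaDA-2.1-mini on reasoning and coding benchmarks. It also delivers high throughput under concurrent load and is compatible with existing AR serving infrastructure, which makes it significant in practice.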

Translated Text


Introspective Diffusion Language Models

Yifan Yu*, Yuqing Jian*, Junxiong Wang, Zhongzhu Zhou, Donglin Zhuang, Xinyu Fang, Sri Yanamandra, Xiaoxia Wu, Qingyang Wu, Shuaiwen Leon Song, Tri Dao, Ben Athiwaratkun, James Zou†, Fan Lai†◊, Chenfeng Xu†◊
Together AI • UIUC • Princeton • Stanford • UT Austin

* Equal contribution † Equal advising ◊ Corresponding author


AIME-24: 69.6 (I-DLM-8B) vs. 43.3 (LLaDA-2.1-mini)
LCB-v6: 45.7 (I-DLM-8B) vs. 30.4 (LLaDA-2.1-mini)
Throughput: 2.9-4.1x higher than LLaDA-2.1-mini at C=64
Lossless: bit-for-bit identical to the base AR model

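Abstract. Diffusion language models (DLMs) offer a compelling promise: parallel token generation could break the sequential bottleneck of autoregressive (AR) decoding. In practice, however, DLMs have consistently lagged behind AR models in quality. The authors argue that this gap stems from a fundamental failure of introspective consistency: AR models agree with what they generate, whereas DLMs often do not. They introduce the Introspective Diffusion Language Model (I-DLM), which uses introspective strided decoding (ISD) to verify previously generated tokens while advancing new ones in the same forward pass. Empirically, I-DLM-8B is the first DLM to match the quality of its same-scale AR counterpart. With half the parameters, it outperforms LLaDA-2.1-mini (16B) by +26 points on AIME-24 and +15 points on LiveCodeBench-v6, while delivering 2.9-4.1x higher throughput at high concurrency. With gated LoRA, ISD enables bit-for-bit lossless acceleration.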

Why Introspective Consistency? Key insight: AR training unifies generation and introspection in one forward pass. Existing DLMs miss this: they learn to denoise but not to introspect. The authors identify three fundamental bottlenecks in current DLMs: (1) Low introspective consistency. SDAR: 0.699 vs. I-DLM: 0.984. (2) Compute inefficiency. TiDAR: ~7.8x overhead vs. I-DLM: ~2.5x. (3) Infrastructure mismatch. Batching-efficiency slope of 84 for SDAR vs. 549 for I-DLM.

The I-DLM Method

Introspective-Consistency Training: converts pretrained AR models via causal attention, logit shift, and an all-masked objective.
Introspective Strided Decoding: generates N tokens per forward pass while verifying prior tokens via the p/q acceptance criterion.
AR-Compatible Serving: strict causal attention enables direct integration into SGLang with no custom infrastructure.
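
As a rough illustration of the verification half of ISD, the sketch below applies the acceptance test familiar from speculative sampling: a previously proposed token is kept with probability min(1, p/q), and verification stops at the first rejection. The page does not spell out the criterion, so this reading is an assumption: p is taken to be the probability the verifying pass assigns to the token and q the probability under which it was proposed. The names `accept_prefix`, `p_probs`, and `q_probs` are hypothetical and do not come from the I-DLM code.

```python
import random
from typing import Optional, Sequence


def accept_prefix(
    p_probs: Sequence[float],
    q_probs: Sequence[float],
    rng: Optional[random.Random] = None,
) -> int:
    """Return how many leading proposed tokens pass the p/q test.

    p_probs[i]: probability of proposed token i under the verifying pass (assumed).
    q_probs[i]: probability of token i when it was proposed (assumed, must be > 0).
    """
    if len(p_probs) != len(q_probs):
        raise ValueError("p_probs and q_probs must have the same length")
    rng = rng or random.Random()
    accepted = 0
    for p, q in zip(p_probs, q_probs):
        if q <= 0.0:
            raise ValueError("proposal probabilities must be positive")
        # Accept with probability min(1, p / q); stop at the first rejection.
        if rng.random() < min(1.0, p / q):
            accepted += 1
        else:
            break
    return accepted
```

Under this reading, a token whose verified probability is at least its proposal probability is always kept, which fits the page's framing that a model agreeing with its own output accepts nearly everything it generates.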

Decoding paradigm comparison. I-DLM is a drop-in replacement within AR serving infrastructure.

Results. I-DLM is the first DLM to match same-scale AR quality while surpassing all prior DLMs across 15 benchmarks.

End-to-End Quality. Blue = best non-AR model under 30B. Bold = best non-AR model under 100B. (Models compared: Qwen3 8B, Qwen3 32B, LLaDA-2.1-mini 16B, LLaDA-2.0-flash 100B, LLaDA-2.1-flash 100B, SDAR 8B, SDAR 30B, Mercury Coder, Gemini Diffusion, I-DLM 8B, I-DLM 32B)

Knowledge & Reasoning: ARC-C (I-DLM 8B: 95.8, I-DLM 32B: 96.8), MMLU (82.4, 86.8), MMLU-Pro (73.1, 79.7), GPQA-D (55.6, 62.1), GPQA (54.9, 58.7)
Math: GSM8K (95.0, 94.9), MATH-500 (96.8, 97.6), MathBench (89.1, 95.6), AIME-24 (69.6, 83.3), AIME-25 (60.8, 80.0)
Code: HumanEval (93.3, 96.3), MBPP (92.2, 94.6), LCB-v6 (45.7, 57.1)
Instruction Following: IFEval (84.7, 84.7)

Throughput. Throughput-latency tradeoff compared with DLMs across batch sizes (1, 4, 16, 64). I-DLM delivers 2.9-4.1x higher throughput than LLaDA-2.1-mini.


Original Text (English)

Introspective Diffusion Language Models

Yifan Yu*, Yuqing Jian*, Junxiong Wang, Zhongzhu Zhou, Donglin Zhuang, Xinyu Fang, Sri Yanamandra, Xiaoxia Wu, Qingyang Wu, Shuaiwen Leon Song, Tri Dao, Ben Athiwaratkun, James Zou†, Fan Lai†◊, Chenfeng Xu†◊
Together AI • UIUC • Princeton • Stanford • UT Austin
* Equal contribution † Equal advising ◊ Corresponding author

AIME-24: 69.6 (I-DLM-8B) vs. 43.3 (LLaDA-2.1-mini)
LCB-v6: 45.7 (I-DLM-8B) vs. 30.4 (LLaDA-2.1-mini)
Throughput: 2.9-4.1x over LLaDA-2.1-mini at C=64
Lossless: bit-for-bit identical to base AR model

Abstract

Diffusion language models (DLMs) offer a compelling promise: parallel token generation could break the sequential bottleneck of autoregressive (AR) decoding. Yet in practice, DLMs consistently lag behind AR models in quality. We argue that this gap stems from a fundamental failure of introspective consistency: AR models agree with what they generate, whereas DLMs often do not. We introduce the Introspective Diffusion Language Model (I-DLM), which uses introspective strided decoding (ISD) to verify previously generated tokens while advancing new ones in the same forward pass. Empirically, I-DLM-8B is the first DLM to match the quality of its same-scale AR counterpart, outperforming LLaDA-2.1-mini (16B) by +26 on AIME-24 and +15 on LiveCodeBench-v6 with half the parameters, while delivering 2.9-4.1x throughput at high concurrency. With gated LoRA, ISD enables bit-for-bit lossless acceleration.

Why Introspective Consistency?

Key Insight: AR training unifies generation and introspection in one forward pass. Existing DLMs miss this: they learn to denoise but not to introspect.

We identify three fundamental bottlenecks in current DLMs:
(1) Low introspective consistency. SDAR: 0.699 vs. I-DLM: 0.984.
(2) Compute inefficiency. TiDAR: ~7.8x overhead vs. I-DLM: ~2.5x.
(3) Infrastructure mismatch. SDAR slope=84 vs. I-DLM: 549.

The I-DLM Method

Introspective-Consistency Training: Convert pretrained AR models via causal attention, logit shift, and an all-masked objective.
Introspective Strided Decoding: Generate N tokens per forward pass while verifying prior tokens via the p/q acceptance criterion.
AR-Compatible Serving: Strict causal attention enables direct integration into SGLang with no custom infrastructure.

Decoding paradigm comparison. I-DLM is a drop-in replacement within AR serving infrastructure.

Results

I-DLM is the first DLM to match same-scale AR quality while surpassing all prior DLMs across 15 benchmarks.

End-to-End Quality

Blue = best non-AR <30B. Bold = best non-AR <100B.

| Benchmark | Qwen3 8B | Qwen3 32B | LLaDA-2.1-mini 16B | LLaDA-2.0-flash 100B | LLaDA-2.1-flash 100B | SDAR 8B | SDAR 30B | Mercury Coder | Gemini Diffusion | I-DLM 8B | I-DLM 32B |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Knowledge & Reasoning | | | | | | | | | | | |
| ARC-C | 95.8 | 97.2 | 90.2 | --- | --- | 91.9 | 93.2 | --- | --- | 95.8 | 96.8 |
| MMLU | 83.5 | 87.2 | 74.5 | --- | --- | 78.6 | 82.8 | --- | --- | 82.4 | 86.8 |
| MMLU-Pro | 75.1 | 80.1 | 64.8 | 74.8 | 76.6 | 56.9 | 61.5 | --- | --- | 73.1 | 79.7 |
| GPQA-D | 58.9 | 64.1 | 46.0 | --- | --- | 40.2 | 36.7 | --- | --- | 55.6 | 62.1 |
| GPQA | 55.4 | 65.0 | 53.3 | 62.3 | 67.3 | --- | --- | --- | --- | 54.9 | 58.7 |
| Math | | | | | | | | | | | |
| GSM8K | 96.0 | 94.7 | 89.0 | --- | --- | 91.7 | 91.4 | --- | --- | 95.0 | 94.9 |
| MATH-500 | 95.8 | 97.8 | 85.0 | --- | --- | 78.6 | 77.8 | --- | --- | 96.8 | 97.6 |
| MathBench | 93.1 | 95.5 | 84.2 | --- | --- | 76.9 | 79.3 | --- | --- | 89.1 | 95.6 |
| AIME-24 | 73.1 | 76.7 | 43.3 | --- | --- | 10.0 | 16.7 | --- | --- | 69.6 | 83.3 |
| AIME-25 | 65.4 | 80.0 | 43.3 | 60.0 | 63.3 | 10.0 | 10.8 | --- | --- | 60.8 | 80.0 |
| Code | | | | | | | | | | | |
| HumanEval | 95.1 | 96.3 | 86.0 | --- | --- | 78.7 | 87.2 | 90.0 | 89.6 | 93.3 | 96.3 |
| MBPP | 93.4 | 95.7 | 82.1 | --- | --- | 72.0 | 71.6 | 76.6 | 76.0 | 92.2 | 94.6 |
| LCB-v6 | 50.3 | 58.3 | 30.4 | 42.5 | 45.4 | 16.6 | 21.7 | --- | --- | 45.7 | 57.1 |
| Instruction Following | | | | | | | | | | | |
| IFEval | 84.7 | 84.5 | 83.2 | 82.6 | 83.6 | 61.4 | 60.6 | --- | --- | 84.7 | 84.7 |

Throughput

Throughput-latency tradeoff compared with DLMs across batch sizes (1, 4, 16, 64).
I-DLM delivers 2.9-4.1x higher throughput than LLaDA-2.1-mini and SDAR at C=64.

Speedup Factor Explorer

In the memory-bound decode regime, TPF closely approximates wall-clock speedup: a TPF of 2.5 represents roughly 2.5x faster decoding than AR. The explorer shows how acceptance rate and stride size affect this, with the following default settings:

I-DLM acceptance rate ($p$): 0.90
R-ISD LoRA overhead ($\alpha$): 1.12. Gated LoRA adds compute at MASK positions for bit-for-bit lossless output. $\alpha = 1.12$ matches empirical overhead.
SDAR acceptance rate ($p_{\text{sdar}}$): 0.50. SDAR uses confidence-based denoising with typically lower per-token acceptance rates than ISD.
Stride sizes plotted: N=2, N=3, N=4, N=8.
Batch size / Concurrency ($B$): 1.

Memory-Bound Regime (Low Concurrency). Speedup ≈ TPF. Forward pass latency is roughly constant regardless of token count.

Compute-Bound Regime (High Concurrency). At high concurrency, compute overhead matters. Additionally, SDAR must synchronize at the slowest block in each batch: $E[\max(S_1, \dots, S_B)]$ grows with batch size, degrading effective TPF. ISD has no sync penalty since every request advances independently. As $B$ grows, batch synchronization degrades SDAR's effective speedup, while the ISD curves stay flat.

Speedup = TPF² / query_size. This captures: (1) more tokens per forward, (2) fewer forwards needed, (3) query cost per forward. Speedup > 1 means fewer total FLOPs than AR.

Memory-bound: Speedup ≈ TPF = $(2 + p + \dots + p^{N-2}) / (2 - p^{N-1})$
R-ISD (lossless): Speedup ≈ TPF / $\alpha$; gated LoRA guarantees bit-for-bit AR output.
Compute-bound: Speedup = TPF² / query_size = TPF / OH. ISD: TPF² / $(2N-1)$. SDAR: TPF² / $N$. AR = 1. (Speedup > 1 means fewer total FLOPs than AR for the same output.)

Derivation: Compute-Bound Efficiency

Setup. Consider generating $L$ output tokens. Let $Q$ = query_size per forward (constant: $2N-1$ for ISD fixed, $N$ for SDAR).

Step 1: Total forwards.
TPF = E[tokens] / E[forwards] per cycle, so to produce $L$ tokens we need:
$$\text{total\_forwards} = \frac{L}{\mathrm{TPF}}$$

Step 2: Total queries. Each forward processes $Q$ queries:
$$\text{total\_queries} = \frac{L}{\mathrm{TPF}} \times Q$$

Step 3: Compare with AR. AR produces $L$ tokens in $L$ forwards of 1 query each = $L$ total queries:
$$\text{AR\_queries} = L$$

Step 4: Compute overhead. Overhead = Method cost / AR cost:
$$\text{Overhead} = \frac{(L / \mathrm{TPF}) \times Q}{L} = \frac{Q}{\mathrm{TPF}}$$
This is how many more total queries we use compared to AR, per output token.

Step 5: Compute-bound speedup. Speedup = how much faster than AR, accounting for both fewer forwards and larger queries:
$$\text{Speedup} = \frac{\mathrm{TPF}}{\text{Overhead}} = \frac{\mathrm{TPF}}{Q / \mathrm{TPF}} = \frac{\mathrm{TPF}^2}{Q}$$
Intuitively, TPF appears twice because it helps in two ways: each forward produces TPF tokens (numerator), and we need 1/TPF as many forwards (which reduces the denominator). The query cost $Q$ penalizes once. When Speedup > 1, parallel decoding uses fewer total FLOPs than AR.

Step 6: Show equivalence to TPF/OH. OH (compute overhead) = total queries / total output tokens:
$$\mathrm{OH} = \frac{E[\text{forwards}] \times Q}{E[\text{tokens}]} = \frac{Q}{\mathrm{TPF}}$$
Therefore:
$$\frac{\mathrm{TPF}}{\mathrm{OH}} = \frac{\mathrm{TPF}}{Q / \mathrm{TPF}} = \frac{\mathrm{TPF}^2}{Q} \quad \blacksquare$$
This identity holds for any method with constant query size per forward, regardless of acceptance rate $p$ or stride $N$.

How do DLMs perform as they approach compute-bound?

At high concurrency, forward pass latency scales with query count per forward. We can measure compute efficiency as TPF²/query_size, that is, how much useful output each FLOP produces relative to AR (efficiency = 1):

SDAR (N=4, p=0.5): TPF ≈ 1.1, processes N=4 queries/forward → compute efficiency = 1.1²/4 ≈ 0.31. Each FLOP produces only 31% as much output as AR. This pushes SDAR into compute-bound early, and its throughput plateaus (batching efficiency slope = 84, see motivation figure).
I-DLM (N=4, p=0.9): TPF ≈ 2.9, processes 2N−1=7 queries/forward → compute efficiency = 2.9²/7 ≈ 1.22. Each FLOP pro
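
The memory-bound and compute-bound formulas in this section can be checked numerically with a short script. This is a minimal sketch, not the page's explorer code: the function names `isd_tpf` and `compute_bound_speedup` are hypothetical, and the numerator of the TPF formula is read as $1 + \sum_{k=0}^{N-2} p^k$, which expands to $2 + p + p^2$ for $N = 4$.

```python
def isd_tpf(p: float, n: int) -> float:
    """Tokens per forward: (2 + p + ... + p^(N-2)) / (2 - p^(N-1)), for N >= 2."""
    if n < 2 or not 0.0 <= p <= 1.0:
        raise ValueError("expected N >= 2 and 0 <= p <= 1")
    # Numerator read as 1 + sum_{k=0}^{N-2} p^k (an assumption about the "..." terms).
    numerator = 1.0 + sum(p**k for k in range(n - 1))
    return numerator / (2.0 - p ** (n - 1))


def compute_bound_speedup(tpf: float, query_size: int) -> float:
    """Compute-bound speedup TPF^2 / Q, equivalently TPF / OH with OH = Q / TPF."""
    return tpf * tpf / query_size


# I-DLM example from the text: stride N = 4, acceptance rate p = 0.9.
tpf = isd_tpf(p=0.9, n=4)
print(f"TPF (memory-bound speedup): {tpf:.2f}")
print(f"R-ISD lossless speedup (alpha = 1.12): {tpf / 1.12:.2f}")
print(f"Compute-bound speedup (Q = 2N-1 = 7): {compute_bound_speedup(tpf, 7):.2f}")
```

With p = 0.9 and N = 4 this gives a TPF of about 2.92 and a compute-bound efficiency of about 1.22, in line with the I-DLM figures quoted above. The SDAR case is left out because the page gives no closed form for SDAR's TPF.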

Diffusion models • Language models • Autoregressive models • Deep learning research • AI infrastructure