r/LocalLLaMA • 109 days ago

National University of Singapore unveils 'DMax' to accelerate parallel decoding

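Key Summary

A research team at the National University of Singapore (NUS) has introduced DMax, a method that addresses a key limitation of existing diffusion language models (dLLMs) and substantially raises decoding parallelism. It recasts decoding as a progressive self-refinement process, so that errors in early predictions do not accumulate and the model can correct its own mistakes. On math and coding benchmarks it decodes roughly 2 to 3 times as many tokens per forward pass as the original model (TPF 2.04 to 5.47 on GSM8K, 2.71 to 5.86 on MBPP) while keeping accuracy close to the original.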

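Translated Post

TL;DR:

DMax mitigates error accumulation by reformulating decoding as a progressive self-refinement process. This lets the model correct its own erroneous predictions while it is generating text.

Abstract:

We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates the error accumulation that arises in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs, which decode through a binary transition from a masked token to a final token, DMax reformulates decoding as a progressive self-refinement from mask embeddings to token embeddings.

At the core of the approach is On-Policy Uniform Training, a novel training strategy that efficiently unifies masked dLLMs and uniform dLLMs. It equips the model to recover clean tokens not only from masked inputs but also from its own erroneous predictions. Building on this, we further propose Soft Parallel Decoding, which represents each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self-revision in embedding space.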
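The interpolation described in the abstract can be pictured with a minimal sketch. The embedding values, the `soft_state` helper, and the `weight` parameter below are all hypothetical and chosen only for illustration; the post does not say how the interpolation weight is computed.

```python
# Hypothetical toy values: 4-dimensional embeddings chosen only for illustration.
MASK_EMBEDDING = [0.0, 0.0, 0.0, 1.0]
TOKEN_EMBEDDINGS = {
    "cat": [1.0, 0.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0, 0.0],
}


def soft_state(predicted_token, weight):
    """Interpolate between the mask embedding and a predicted token embedding.

    weight = 0.0 gives the pure mask embedding (nothing decoded yet);
    weight = 1.0 gives the pure token embedding (fully committed).
    How the weight is chosen is an assumption; the source only says the
    intermediate state is an interpolation between the two embeddings.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    token_vec = TOKEN_EMBEDDINGS[predicted_token]
    return [
        weight * t + (1.0 - weight) * m
        for t, m in zip(token_vec, MASK_EMBEDDING)
    ]


# A tentative guess keeps most of the mask embedding, so a later pass
# can still move the position toward a different token.
tentative = soft_state("dog", 0.25)
```

Because the state stays partly "masked", a later pass is not forced to treat the earlier guess as fixed context, which is presumably what makes revision cheaper than undoing a hard commitment.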

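Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax. Compared with the original LLaDA-2.0-mini, our method improves TPF (tokens per forward pass) on GSM8K from 2.04 to 5.47 while preserving accuracy. On MBPP, it increases TPF from 2.71 to 5.86 while maintaining comparable performance. On two H200 GPUs at batch size 1, our model achieves an average of 1,338 TPS (tokens per second).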
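As a quick check on what those TPF figures imply, the ratios can be worked out directly. The helper name below is hypothetical. Note that TPF counts tokens per forward pass, so these ratios are not wall-clock speedups unless the cost per pass stays the same.

```python
def tpf_ratio(baseline_tpf, new_tpf):
    """Return how many times more tokens are produced per forward pass."""
    return new_tpf / baseline_tpf


# TPF figures quoted in the abstract (LLaDA-2.0-mini baseline vs. DMax).
gsm8k_ratio = tpf_ratio(2.04, 5.47)
mbpp_ratio = tpf_ratio(2.71, 5.86)

print(f"GSM8K: {gsm8k_ratio:.2f}x tokens per forward pass")  # 2.68x
print(f"MBPP:  {mbpp_ratio:.2f}x tokens per forward pass")   # 2.16x
```

Both values fall in the "2 to 3 times" range quoted in the summary.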

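Layman's Explanation:

Diffusion language models can, in principle, generate text faster than ordinary LLMs because they can fill in several tokens at the same time. In practice, that speed advantage is limited because early wrong guesses tend to snowball. Once the model commits to a wrong token, that token becomes part of the context for the next step, so quality can degrade sharply when decoding is too aggressive.

What DMax does is give the model a better way to recover from its own mistakes. Instead of following a rigid one-way path from masked slots to final tokens, it lets the model keep refining intermediate guesses until it commits to the final tokens.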
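The difference between committing once and being allowed to revise can be shown with a deliberately artificial toy. Everything here is an assumption made for illustration: the `predict` rule, the `CONTEXT_DEPENDENT` positions, and the token strings are invented, and no real model or the paper's decoding algorithm is involved.

```python
TARGET = ["the", "answer", "is", "forty", "two", "so", "we", "stop"]
# Hypothetical rule: these positions are only predicted correctly
# when their left neighbour is already correct.
CONTEXT_DEPENDENT = {3, 6}
MASK = "<mask>"
WRONG = "<wrong>"


def predict(position, state):
    """Toy stand-in for a model's prediction at one position."""
    needs_context = position in CONTEXT_DEPENDENT
    if needs_context and state[position - 1] != TARGET[position - 1]:
        return WRONG
    return TARGET[position]


def decode(num_passes, allow_revision):
    """Run fully parallel decoding for a fixed number of forward passes."""
    state = [MASK] * len(TARGET)
    for _ in range(num_passes):
        # Every position is predicted from the same snapshot of the state.
        snapshot = list(state)
        for position in range(len(TARGET)):
            # Without revision, a position is written once and then frozen.
            if allow_revision or snapshot[position] == MASK:
                state[position] = predict(position, snapshot)
    return state


def count_errors(state):
    return sum(1 for got, want in zip(state, TARGET) if got != want)


print(count_errors(decode(2, allow_revision=False)))  # 2
print(count_errors(decode(2, allow_revision=True)))   # 0
```

In this toy, the first pass gets two positions wrong because their context was still masked. With commit-once decoding those errors are permanent; with revision, the second pass sees the corrected context and fixes them.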

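The paper's two main ideas are quite intuitive. First, the model is trained on its own imperfect predictions (on-policy training), so it learns to clean up the kinds of errors it will actually make at inference time. Second, during decoding it uses a "softer" intermediate representation instead of treating every guess as final right away. This preserves uncertainty and makes revision easier.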
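The first idea can be sketched on the data side. This is a loose illustration, not the paper's training procedure: `build_training_input`, `model_guess`, and both probabilities are hypothetical, and a real model's guess would depend on the surrounding context rather than only on the position.

```python
import random

MASK = "[MASK]"


def build_training_input(clean_tokens, model_guess, corrupt_prob,
                         self_pred_prob, rng):
    """Return (noisy_input, targets) for one training example.

    Each position is either left clean, replaced by MASK, or replaced by
    the model's own (possibly wrong) guess. The target is always the
    clean token, so the model practises recovering from both masks and
    its own errors.
    """
    noisy = []
    for position, token in enumerate(clean_tokens):
        if rng.random() >= corrupt_prob:
            noisy.append(token)                  # position left clean
        elif rng.random() < self_pred_prob:
            noisy.append(model_guess(position))  # model's own guess
        else:
            noisy.append(MASK)                   # ordinary masking
    return noisy, list(clean_tokens)


# Usage with a stand-in "model" that always guesses "5".
clean = ["2", "+", "2", "=", "4"]
noisy, targets = build_training_input(
    clean, lambda position: "5", 0.5, 0.5, random.Random(0)
)
```

The design point is that the noisy input mixes the two corruption types the abstract names, masked slots and self-generated errors, while the supervision signal stays the clean sequence.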

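As a result, DMax allows far more parallel decoding without the usual collapse in quality. On the paper's math and coding benchmarks it achieves large speedups while keeping accuracy close to that of the original model, and in some lower-parallelism settings accuracy even improves slightly. So the main takeaway is not simply "a faster diffusion language model", but that diffusion language models can now revise themselves well enough to maintain quality.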

Original Post (English)

## TL;DR:

**DMax cleverly mitigates error accumulation by reforming decoding as a progressive self-refinement process, allowing the model to correct its own erroneous predictions during generation.**

---

## Abstract:

> We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs that decode through a binary mask-to-token transition, DMax reformulates decoding as a progressive self-refinement from mask embeddings to token embeddings.
>
> At the core of our approach is On-Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs, equipping the model to recover clean tokens from both masked inputs and its own erroneous predictions. Building on this foundation, we further propose Soft Parallel Decoding. We represent each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self-revising in embedding space.
>
> Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax. Compared with the original LLaDA-2.0-mini, our method improves TPF on GSM8K from 2.04 to 5.47 while preserving accuracy. On MBPP, it increases TPF from 2.71 to 5.86 while maintaining comparable performance. On two H200 GPUs, our model achieves an average of 1,338 TPS at batch size 1.

---

## Layman's Explanation:

The core idea is that diffusion language models should be able to generate text faster than normal LLMs because they can fill in multiple tokens at the same time. In practice, though, that speed advantage gets limited because early wrong guesses tend to snowball. Once the model commits to a bad token, that bad token becomes part of the context for the next step, so quality can fall apart fast when decoding gets too aggressive.

What DMax does is give the model a better way to recover from its own mistakes. Instead of moving in a rigid one-way path from masked slots to final tokens, it lets the model keep refining intermediate guesses before locking them in.

The paper’s two main ideas are pretty intuitive. First, the model is trained on its own imperfect predictions, so it learns how to clean up the kinds of errors it will actually make at inference time. Second, during decoding it uses a softer in-between representation rather than treating every guess as fully final right away, which helps preserve uncertainty and makes revision easier.

The result is that DMax pushes much more parallel decoding without the usual collapse in quality. On the paper’s math and coding benchmarks, it gets large speedups while keeping accuracy close to the original model, and in some lower-parallel settings it even improves accuracy a bit. So the main takeaway is not just “faster diffusion LLMs,” but diffusion LLMs that can revise themselves well enough

Diffusion language models • Parallel decoding • Inference acceleration • Self-refinement • NUS