Hacker News • 64 days ago

EAGLE, vLLM, and TorchSpec Teams Collaborate on an Inference Speed Breakthrough

IMP

8/10

Key Summary

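The EAGLE, vLLM, and TorchSpec teams have jointly announced EAGLE 3.1, a speculative decoding algorithm that significantly speeds up large language model (LLM) inference. The update addresses "attention drift," a problem that appears with long contexts and varied prompt environments, and achieves up to 2× longer acceptance length than EAGLE 3 in long-context workloads, which makes deployments considerably more stable. The release also matters for practical use in industry: a draft model for Kimi K2.6, a real-world serving model, has been open-sourced, and EAGLE 3.1 support has been merged into the vLLM main branch.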

Translated Body

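The EAGLE series, including EAGLE 1, EAGLE 2, and EAGLE 3, has become one of the most widely adopted and practically deployed families of speculative decoding algorithms across both research and production systems. Today, the EAGLE team, vLLM team, and TorchSpec team are excited to jointly introduce EAGLE 3.1, a major step forward in the robustness, efficiency, and deployability of speculative decoding.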

EAGLE 3.1 Innovations

Speculative decoding performs well in controlled settings, but performance often degrades under different chat templates, long-context inputs, or out-of-distribution system prompts. The EAGLE team traced this fragility to a phenomenon called "attention drift": as speculation depth increases, the draft model (drafter) gradually shifts attention away from sink tokens and toward its own generated tokens.

We identified two underlying causes. First, the fused input representation becomes increasingly imbalanced as higher-layer hidden states come to dominate the drafter input. Second, hidden-state magnitude grows across speculation steps because of the unnormalized residual path. Together, these effects make the drafter progressively less stable at deeper speculation depths.
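The second cause, magnitude growth along an unnormalized residual path, can be illustrated with a toy recurrence. This is only a sketch under stated assumptions: `residual_step` and its `branch_gain` parameter are hypothetical stand-ins for a drafter step, chosen so that the branch output is aligned with its input. It is not the EAGLE architecture.

```python
import math


def rms(vec):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(x * x for x in vec) / len(vec))


def residual_step(hidden, branch_gain=0.5):
    """One toy step: hidden + branch(hidden), with no normalization.

    The branch is a hypothetical stand-in that returns a scaled copy
    of its input, so every step adds to the state's magnitude.
    """
    return [h + branch_gain * h for h in hidden]


def magnitudes_over_depth(hidden, depth):
    """Record the RMS magnitude before the first step and after each step."""
    mags = [rms(hidden)]
    for _ in range(depth):
        hidden = residual_step(hidden)
        mags.append(rms(hidden))
    return mags


start = [1.0, -1.0, 1.0, -1.0]  # RMS magnitude of 1.0
growth = magnitudes_over_depth(start, depth=3)
print(growth)
```

In this toy setting the magnitude is multiplied by `1 + branch_gain` at every step, so deeper speculation sees progressively larger hidden states than shallower speculation does.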

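To address this, EAGLE 3.1 introduces two key architectural improvements:

- FC normalization applied after each target hidden state and before the FC (fully connected) layer
- Feeding post-norm hidden states into the next decoding step

Intuitively, the post-norm design makes the method behave more like recursively invoking the drafter at each decoding step, rather than simply appending additional layers to the target model. These changes significantly improve robustness across deployment scenarios. Compared with EAGLE 3, EAGLE 3.1 shows the following advantages: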

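- Better extrapolation from training time to inference time
- Stronger long-context robustness
- Higher resilience to chat template and system prompt variation
- More stable acceptance length across diverse serving environments

In long-context workloads, EAGLE 3.1 achieves up to 2× longer acceptance length than EAGLE 3.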
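The two architectural changes described above, FC normalization and post-norm feedback, can be sketched in plain Python. Several things here are assumptions made for illustration and are not taken from the EAGLE 3.1 code: RMS normalization as the norm, the hypothetical helpers `fuse`, `share_of_last`, and `draft_steps`, and the example layer scales.

```python
import math


def rms(vec):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(x * x for x in vec) / len(vec))


def rms_norm(vec, eps=1e-6):
    """Rescale a vector to approximately unit RMS (assumed norm type)."""
    scale = 1.0 / math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x * scale for x in vec]


def fuse(target_states, normalize):
    """Concatenate per-layer target hidden states into the FC layer's input.

    With normalize=True, each state is normalized on its own before fusion.
    """
    fused = []
    for state in target_states:
        fused.extend(rms_norm(state) if normalize else state)
    return fused


def share_of_last(target_states, normalize):
    """Fraction of the fused input's squared magnitude from the last layer."""
    fused = fuse(target_states, normalize)
    width = len(target_states[-1])
    total = sum(x * x for x in fused)
    return sum(x * x for x in fused[-width:]) / total


def draft_steps(hidden, depth, post_norm, branch_gain=0.5):
    """Run toy drafter steps; with post_norm, the normalized state is fed forward."""
    mags = []
    for _ in range(depth):
        hidden = [h + branch_gain * h for h in hidden]  # toy residual update
        if post_norm:
            hidden = rms_norm(hidden)
        mags.append(rms(hidden))
    return mags


# Hypothetical low, mid, and high layer states with growing scale.
layers = [[0.1, -0.1, 0.1, -0.1], [1.0, -1.0, 1.0, -1.0], [10.0, -10.0, 10.0, -10.0]]
raw_share = share_of_last(layers, normalize=False)
normed_share = share_of_last(layers, normalize=True)
raw_mags = draft_steps([1.0, -1.0, 1.0, -1.0], depth=3, post_norm=False)
normed_mags = draft_steps([1.0, -1.0, 1.0, -1.0], depth=3, post_norm=True)
print(raw_share, normed_share)
print(raw_mags, normed_mags)
```

Without per-state normalization the highest layer supplies almost all of the fused input's magnitude, while with it each layer contributes roughly a third. Likewise, feeding the normalized state forward keeps the magnitude near 1 at every depth in this toy setting instead of letting it grow.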

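EAGLE 3.1 Training with TorchSpec

TorchSpec now provides efficient training support for EAGLE 3.1 and future speculative decoding algorithms. By lowering training overhead and simplifying experimentation workflows, TorchSpec helps accelerate iteration and exploration for next-generation speculative decoding research and deployment.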

Based on TorchSpec and vLLM, we also trained and open-sourced an EAGLE 3.1 draft model for Kimi K2.6: https://huggingface.co/lightseekorg/kimi-k2.6-eagle3.1-mla

The model serves as an example of deploying EAGLE 3.1 on a real-world serving model, with TorchSpec training and vLLM serving support.

EAGLE 3.1 Integration with vLLM

EAGLE 3.1 lands in vLLM as a config-driven extension of the existing EAGLE 3 implementation. The integration includes:

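- FC normalization support
- Post-norm hidden-state feedback
- Removal of hardcoded assumptions around target hidden states

At the same time, backward compatibility with existing EAGLE 3 checkpoints is fully preserved. As a result, EAGLE 3.1 draft models can be plugged in directly through the same speculative-decoding code path, which makes draft-model upgrades in production vLLM serving smooth and easy.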

This support has already been merged into the current main branch of vLLM and will be available through vLLM's nightly release as well as the upcoming v0.22.0 release.
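As a rough sketch of plugging a draft model in through the same code path, the snippet below assembles a `vllm serve` command that uses vLLM's `--speculative-config` JSON option. Several things here are assumptions and not taken from the announcement: that the existing `eagle3` method name also applies to EAGLE 3.1 checkpoints, the `num_speculative_tokens` value, and the `TARGET_MODEL` placeholder, which must be replaced with the actual target checkpoint.

```python
import json
import shlex

# Assumption: the EAGLE 3 method name is reused for EAGLE 3.1 draft models,
# since the integration is described as an extension of the same code path.
speculative_config = {
    "method": "eagle3",
    "model": "lightseekorg/kimi-k2.6-eagle3.1-mla",  # draft model from the post
    "num_speculative_tokens": 3,  # illustrative value, not a recommendation
}

# Placeholder, not a real repository id: substitute the target checkpoint.
TARGET_MODEL = "<path-or-repo-of-Kimi-K2.6-NVFP4>"

cli_args = [
    "vllm", "serve", TARGET_MODEL,
    "--tensor-parallel-size", "4",  # TP=4, matching the benchmark setup
    "--speculative-config", json.dumps(speculative_config),
]

# Print a shell-safe command line; nothing is launched here.
print(shlex.join(cli_args))
```

The same dictionary should also be usable as the `speculative_config` argument of `vllm.LLM` for offline inference, under the same assumptions.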

As an early data point, we benchmarked the Kimi K2.6 EAGLE 3.1 draft model on Kimi-K2.6-NVFP4 with vLLM (TP=4, GB200, non-disagg) on the SPEED-Bench coding dataset. EAGLE 3.1 delivers 2.03× higher per-user output throughput at concurrency 1, and the speedup stays meaningful as concurrency scales (1.71× at C=4, 1.66× at C=16).
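To put the reported speedups in perspective, the short calculation below converts each factor into the implied reduction in per-token output time and into the share of the concurrency-1 speedup that is retained. The three speedup factors come from the benchmark above. The helper names `time_saved` and `retained` are hypothetical, and the conversion assumes per-user output throughput is simply the inverse of per-token output time.

```python
# Speedup factors reported in the benchmark, keyed by concurrency level.
speedups = {1: 2.03, 4: 1.71, 16: 1.66}


def time_saved(speedup):
    """Fraction of per-token output time removed by a throughput speedup."""
    return 1.0 - 1.0 / speedup


def retained(speedup, reference):
    """Share of the reference speedup that is still achieved."""
    return speedup / reference


for concurrency, factor in speedups.items():
    print(
        f"C={concurrency}: {factor:.2f}x, "
        f"per-token time cut by {time_saved(factor):.1%}, "
        f"{retained(factor, speedups[1]):.0%} of the C=1 speedup"
    )
```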

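Open-Source Collaboration Across the Ecosystem

This collaboration between the EAGLE team, vLLM team, and TorchSpec team is a strong example of open-source collaboration across algorithm research, system optimization, and training infrastructure.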

Original Text (English)

The EAGLE series, including EAGLE 1, EAGLE 2, and EAGLE 3, has become one of the most widely adopted and practically deployed families of speculative decoding algorithms across both research and production systems. Today, the EAGLE team, vLLM team, and TorchSpec team are excited to jointly introduce EAGLE 3.1, a major step forward in speculative decoding robustness, efficiency, and deployability.

EAGLE 3.1 Innovations

While speculative decoding performs well in controlled settings, performance often degrades under different chat templates, long-context inputs, or out-of-distribution system prompts. The EAGLE team traced this fragility to a phenomenon we call attention drift: as speculation depth increases, the drafter gradually shifts attention away from sink tokens and toward its own generated tokens.

We identified two underlying issues. First, the fused input representation becomes increasingly imbalanced as higher-layer hidden states dominate the drafter input. Second, hidden-state magnitude grows across speculation steps due to the unnormalized residual path. Together, these effects make the drafter progressively less stable at deeper speculation depths.

To address this issue, EAGLE 3.1 introduces two key architectural improvements:

- FC normalization after each target hidden state and before the FC layer
- Feeding post-norm hidden states into the next decoding step

Intuitively, the post-norm design makes the method behave more like recursively invoking the drafter across decoding steps, rather than simply appending additional layers to the target model. These changes significantly improve robustness across deployment scenarios.

Compared with EAGLE 3, EAGLE 3.1 demonstrates:

- Better training-time to inference-time extrapolation
- Stronger long-context robustness
- Higher resilience to chat template and system prompt variation
- More stable acceptance length across diverse serving environments

In long-context workloads, EAGLE 3.1 achieves up to 2× longer acceptance length compared with EAGLE 3.

EAGLE 3.1 Training with TorchSpec

TorchSpec now provides efficient training support for EAGLE 3.1 and future speculative decoding algorithms. By lowering training overhead and simplifying experimentation workflows, TorchSpec helps accelerate iteration and exploration for next-generation speculative decoding research and deployment.

Based on TorchSpec and vLLM, we also trained and open-sourced an EAGLE 3.1 draft model for Kimi K2.6: https://huggingface.co/lightseekorg/kimi-k2.6-eagle3.1-mla

The model serves as an example of deploying EAGLE 3.1 with TorchSpec training and vLLM serving support on a real-world serving model.

EAGLE 3.1 Integration with vLLM

EAGLE 3.1 lands in vLLM as a config-driven extension of the existing EAGLE 3 implementation. The integration includes:

- FC normalization support
- Post-norm hidden-state feedback
- Removal of hardcoded assumptions around target hidden states

At the same time, backward compatibility with existing EAGLE 3 checkpoints is fully preserved. As a result, EAGLE 3.1 draft models can be plugged directly through the same speculative-decoding code path. This makes draft-model upgrades in production vLLM serving smooth and easy.

The support has already been merged into the current main branch of vLLM and will be available via vLLM's nightly release as well as the upcoming v0.22.0 release.

As an early data point, we benchmarked the Kimi K2.6 EAGLE 3.1 draft model on Kimi-K2.6-NVFP4 with vLLM (TP=4, GB200, non-disagg) on the SPEED-Bench coding dataset. EAGLE 3.1 delivers 2.03× higher per-user output throughput at concurrency 1, and the speedup stays meaningful as concurrency scales (1.71× at C=4, 1.66× at C=16).

Open-Source Collaboration Across the Ecosystem

This collaboration between the EAGLE team, vLLM team, and TorchSpec team represents a strong example of open-source collaboration across algorithm research, system optimization, and training infrastructure. The EAGLE team continues advancing speculative decoding algorithms, vLLM helps bring these innovations into production inference systems at scale, and TorchSpec enables efficient training and rapid experimentation for future speculative decoding algorithms.

Together, we hope to continue raising the overall baseline for speculative decoding and driving further improvements in token efficiency across the broader LLM ecosystem.

Inference speed optimization • Speculative decoding • Open source • vLLM • Model serving