MarkTechPost • 108 days ago

MIT and NVIDIA Propose TriAttention, Cutting the KV Cache 10×

Key Summary

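Researchers from MIT, NVIDIA, and Zhejiang University have proposed TriAttention, a new KV cache compression method aimed at the memory bottleneck in large language models (LLMs). It addresses the limitations of earlier approaches: on mathematical reasoning benchmarks it keeps accuracy on par with Full Attention while raising throughput by 2.5× or cutting KV memory by up to 10.7×.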

Translated Text

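Long-chain reasoning is one of the most compute-intensive tasks in modern large language models. When a model such as DeepSeek-R1 or Qwen3 works through a complex math problem, it can generate tens of thousands of tokens before reaching an answer. Every one of those tokens must be stored in the KV cache, a memory structure that holds the Key and Value vectors the model needs to attend back to during generation. The longer the reasoning chain, the larger the KV cache grows, and in many deployment scenarios, especially on consumer hardware, this growth eventually exhausts GPU memory entirely.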
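
As a rough illustration of why the cache grows so quickly, the sketch below estimates KV cache size for a single sequence. The layer count, KV head count, head dimension, 16-bit storage, and sequence lengths are assumed values for a hypothetical grouped-query model, not figures taken from the article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Estimate the bytes needed to cache Keys and Values for one sequence."""
    # Factor 2: each token stores one Key and one Value vector
    # per layer and per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
ASSUMED_SHAPE = dict(num_layers=36, num_kv_heads=8, head_dim=128)

for seq_len in (1_024, 8_192, 32_768):
    size = kv_cache_bytes(seq_len=seq_len, **ASSUMED_SHAPE)
    print(f"{seq_len:>6} tokens -> {size / 2**30:.2f} GiB")
```

The estimate is linear in sequence length, so every additional generated token costs the same fixed amount of memory until the GPU runs out.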

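A research team from MIT, NVIDIA, and Zhejiang University has proposed a method called TriAttention that addresses this problem directly. On the AIME25 mathematical reasoning benchmark with 32K-token generation, TriAttention matches Full Attention accuracy while achieving 2.5× higher throughput or a 10.7× reduction in KV memory. At the same efficiency level, leading baselines reach only about half the accuracy.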

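The Problem with Existing KV Cache Compression

To understand why TriAttention matters, it helps to understand the standard approach to KV cache compression. Most existing methods, including SnapKV, H2O, and R-KV, work by estimating which tokens in the KV cache are important and evicting the rest. Importance is typically estimated from attention scores: if a key receives high attention from recent queries, it is judged important and kept.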

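The catch is that these methods operate in what the research team calls post-RoPE space. RoPE (Rotary Position Embedding) is the positional encoding scheme used by most modern LLMs, including Llama, Qwen, and Mistral. RoPE encodes position by rotating the Query and Key vectors in a frequency-dependent way. As a result, a query vector at position 10,000 looks very different from the same semantic query at position 100, because its direction has been rotated by the position encoding.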
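
The effect of the rotation can be seen in a toy example. The sketch below is a minimal illustration, not code from the paper: it stores each 2D pair of a head dimension as one complex number, assumes the common frequency schedule with base 10000, and uses made-up query and key values.

```python
import cmath
import math


def rope_frequencies(num_pairs, base=10000.0):
    """Per-pair rotation frequencies; base 10000 is an assumed common default."""
    return [base ** (-f / num_pairs) for f in range(num_pairs)]


def apply_rope(vec, position, freqs):
    """Rotate each 2D pair (stored as one complex number) by position * frequency."""
    return [v * cmath.exp(1j * w * position) for v, w in zip(vec, freqs)]


def logit(q, k):
    """Real dot product of the underlying 2D pairs."""
    return sum((a * b.conjugate()).real for a, b in zip(q, k))


freqs = rope_frequencies(num_pairs=4)
query = [1 + 2j, 0.5 - 1j, -1 + 0.3j, 2 + 0j]        # toy pre-RoPE query
key = [0.7 + 1j, 1 - 0.2j, -0.4 + 0.9j, 1.5 + 0.5j]  # toy pre-RoPE key

q_early = apply_rope(query, 100, freqs)
q_late = apply_rope(query, 10_000, freqs)

# Same pre-RoPE query, very different post-RoPE direction.
drift = math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(q_early, q_late)))
print(f"distance between rotated copies: {drift:.3f}")

# The logit itself depends only on the query-key gap (10 in both cases).
near = logit(q_early, apply_rope(key, 90, freqs))
far = logit(q_late, apply_rope(key, 9_990, freqs))
print(f"logit at gap 10: {near:.6f} vs {far:.6f}")
```

The two rotated copies of the query point in different directions, yet the logits agree whenever the query-key gap is the same. Only the relative distance survives the rotation.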

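Because of this rotation, only the most recently generated queries have orientations that are "up to date" for estimating which keys matter right now. Prior work has confirmed this empirically: widening the observation window for importance estimation does not help. Performance peaks at around 25 queries and declines beyond that. With such a small window, some keys that will become important later are permanently evicted.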

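The problem is especially acute for what the research team calls retrieval heads: attention heads whose role is to retrieve specific factual tokens from long contexts. The tokens relevant to a retrieval head can lie dormant for thousands of tokens before suddenly becoming essential to the reasoning chain. Post-RoPE methods, operating over a narrow observation window, see low attention on those tokens during the dormant period and evict them permanently. When the model later needs to recall that information, it is already gone, and the chain of thought breaks.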

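The Pre-RoPE Observation: Q/K Concentration

The key insight in TriAttention comes from looking at Query and Key vectors in pre-RoPE space, that is, before the RoPE rotation is applied. When the research team visualized Q and K vectors in this space, they found something consistent and striking: across the vast majority of attention heads, and across multiple model architectures, Q and K vectors cluster tightly around fixed, non-zero center points.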

The research team names this property Q/K concentration and measures it with the Mean Resultant Length R, a standard measure from directional statistics in which R close to 1 means tight clustering and R close to 0 means dispersion in all directions. On Qwen3-8B, roughly 90% of attention heads were found to have R > 0.95.
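
To make the statistic concrete, the sketch below computes the Mean Resultant Length in its textbook form: normalize each vector to unit length, average, and take the norm of the average. The data is synthetic, and whether the paper computes R over whole head vectors or per frequency band is not stated here, so treat the setup as an assumption.

```python
import math
import random


def mean_resultant_length(vectors):
    """Length of the average unit vector: near 1 = clustered, near 0 = dispersed."""
    dim = len(vectors[0])
    total = [0.0] * dim
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        for i, x in enumerate(v):
            total[i] += x / norm  # accumulate the direction only
    n = len(vectors)
    return math.sqrt(sum((t / n) ** 2 for t in total))


rng = random.Random(0)
dim, n = 8, 2000

# Synthetic stand-ins for pre-RoPE vectors: small noise around a non-zero center.
center = [1.0] * dim
clustered = [[c + rng.gauss(0, 0.1) for c in center] for _ in range(n)]

# Isotropic noise with no preferred direction.
dispersed = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

print(f"clustered R = {mean_resultant_length(clustered):.3f}")
print(f"dispersed R = {mean_resultant_length(dispersed):.3f}")
```

Vectors that share a direction keep most of their length when averaged, while vectors pointing everywhere cancel out, which is why a single number between 0 and 1 is enough to summarize how concentrated a head is.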

Original Text (English)

Long-chain reasoning is one of the most compute-intensive tasks in modern large language models. When a model like DeepSeek-R1 or Qwen3 works through a complex math problem, it can generate tens of thousands of tokens before arriving at an answer. Every one of those tokens must be stored in what is called the KV cache, a memory structure that holds the Key and Value vectors the model needs to attend back to during generation. The longer the reasoning chain, the larger the KV cache grows, and for many deployment scenarios, especially on consumer hardware, this growth eventually exhausts GPU memory entirely.

A team of researchers from MIT, NVIDIA, and Zhejiang University proposed a method called TriAttention that directly addresses this problem. On the AIME25 mathematical reasoning benchmark with 32K-token generation, TriAttention matches Full Attention accuracy while achieving 2.5× higher throughput or 10.7× KV memory reduction. Leading baselines achieve only about half the accuracy at the same efficiency level.

The Problem with Existing KV Cache Compression

To understand why TriAttention is important, it helps to understand the standard approach to KV cache compression. Most existing methods, including SnapKV, H2O, and R-KV, work by estimating which tokens in the KV cache are important and evicting the rest. Importance is typically estimated by looking at attention scores: if a key receives high attention from recent queries, it is considered important and kept.

The catch is that these methods operate in what the research team calls post-RoPE space. RoPE, or Rotary Position Embedding, is the positional encoding scheme used by most modern LLMs including Llama, Qwen, and Mistral.
RoPE encodes position by rotating the Query and Key vectors in a frequency-dependent way. As a result, a query vector at position 10,000 looks very different from the same semantic query at position 100, because its direction has been rotated by the position encoding.

This rotation means that only the most recently generated queries have orientations that are "up to date" for estimating which keys are important right now. Prior work has confirmed this empirically: increasing the observation window for importance estimation does not help. Performance peaks at around 25 queries and declines after that. With such a tiny window, some keys that will become important later get permanently evicted.

This problem is especially acute for what the research team calls retrieval heads: attention heads whose function is to retrieve specific factual tokens from long contexts. The relevant tokens for a retrieval head can remain dormant for thousands of tokens before suddenly becoming essential to the reasoning chain. Post-RoPE methods, operating over a narrow observation window, see low attention on those tokens during the dormant period and permanently evict them. When the model later needs to recall that information, it is already gone, and the chain of thought breaks.

The Pre-RoPE Observation: Q/K Concentration

The key insight in TriAttention comes from looking at Query and Key vectors before RoPE rotation is applied, in the pre-RoPE space. When the research team visualized Q and K vectors in this space, they found something consistent and striking: across the vast majority of attention heads and across multiple model architectures, both Q and K vectors cluster tightly around fixed, non-zero center points.

The research team terms this property Q/K concentration, and measures it using the Mean Resultant Length R, a standard directional statistics measure where R → 1 means tight clustering and R → 0 means dispersion in all directions.
On Qwen3-8B, approximately 90% of attention heads exhibit R > 0.95, meaning their pre-RoPE Q/K vectors are nearly perfectly concentrated around their respective centers. Critically, these centers are stable across different token positions and across different input sequences: they are an intrinsic property of the model's learned weights, not a property of any particular input. The research team further confirm that Q/K concentration is domain-agnostic: measuring Mean Resultant Length across Math, Coding, and Chat domains on Qwen3-8B yields nearly identical values of 0.977–0.980.

This stability is what post-RoPE methods cannot exploit. RoPE rotation disperses these concentrated vectors into arc patterns that vary with position. But in pre-RoPE space, the centers remain fixed.

From Concentration to a Trigonometric Series

The research team then show mathematically that when Q and K vectors are concentrated around their centers, the attention logit (the raw score before softmax that determines how much a query attends to a key) simplifies dramatically. Substituting the Q/K centers into the RoPE attention formula, the logit reduces to a function that depends only on the Q-K distance (the relative positional gap between query and key), expressed as a trigonometric series:

$$\text{logit}(\Delta) \approx \sum_{f} \underbrace{\|\bar{q}_f\| \|\bar{k}_f\|}_{\text{amplitude}} \cos(\omega_f \Delta + \underbrace{\bar{\phi}_f}_{\text{phase}}) = \sum_{f} [a_f \cos(\omega_f \Delta) + b_f \sin(\omega_f \Delta)]$$

Here, $\Delta$ is the positional distance, $\omega_f$ are the RoPE rotation frequencies for each frequency band $f$, and the coefficients $a_f$ and $b_f$ are determined by the Q/K centers. This series produces a characteristic attention-vs-distance curve for each head.
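
The two forms of the series can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' code: each frequency band's center is stored as one complex number, the phase is taken as the angle of the query center times the conjugate key center, the distance is query position minus key position, and the center values and frequencies are invented.

```python
import cmath
import math


def trig_logit(q_centers, k_centers, freqs, delta):
    """Amplitude-phase form: sum_f |q_f| |k_f| cos(w_f * delta + phi_f)."""
    total = 0.0
    for q, k, w in zip(q_centers, k_centers, freqs):
        amplitude = abs(q) * abs(k)
        phase = cmath.phase(q * k.conjugate())  # assumed sign convention
        total += amplitude * math.cos(w * delta + phase)
    return total


def fourier_coefficients(q_centers, k_centers):
    """Equivalent (a_f, b_f) pairs, from cos(x + p) = cos x cos p - sin x sin p."""
    coeffs = []
    for q, k in zip(q_centers, k_centers):
        prod = q * k.conjugate()
        coeffs.append((prod.real, -prod.imag))  # a_f = A cos(phi), b_f = -A sin(phi)
    return coeffs


def series_logit(coeffs, freqs, delta):
    """Cosine-sine form: sum_f a_f cos(w_f * delta) + b_f sin(w_f * delta)."""
    return sum(a * math.cos(w * delta) + b * math.sin(w * delta)
               for (a, b), w in zip(coeffs, freqs))


# Toy centers and frequencies, invented for illustration.
freqs = [1.0, 0.1, 0.01]
q_centers = [0.3 + 0.1j, 1.2 - 0.4j, 2.0 + 0.5j]
k_centers = [0.2 - 0.3j, 0.9 + 0.2j, 1.5 + 0.1j]
coeffs = fourier_coefficients(q_centers, k_centers)

for delta in (0, 1, 10, 100, 1000):
    a = trig_logit(q_centers, k_centers, freqs, delta)
    b = series_logit(coeffs, freqs, delta)
    print(f"delta={delta:>4}: amplitude-phase={a:+.4f}  cos-sin={b:+.4f}")
```

Both columns print the same values, and once the centers are fixed the curve is a function of the distance alone, which is what lets it be evaluated without observing any live query.
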
Some heads prefer nearby keys (local attention), others prefer very distant keys (attention sinks). The centers, computed offline from calibration data, fully determine which distances are preferred. The research team validated this experimentally across 1,152 attention heads in Qwen3-8B and across Qwen2.5 and Llama3 architectures. The Pearson correlation between the predicted trigonometric curve and the actual attention logits has a mean above 0.5 across all heads, with many heads achieving correlations of 0.6–0.9.

The research team further validates this on GLM-4.7-Flash, which uses Multi-head Latent Attention (MLA) rather than standard Grouped-Query Attention (GQA), a meaningfully different attention architecture. On MLA, 96.6% of heads exhibit R > 0.95, compared to 84.7% for GQA, confirming that Q/K concentration is not specific to one attention design but is a general property of modern LLMs.

How TriAttention Uses This

TriAttention is a KV cache compression method that uses these findings to score keys without needing any live query observations. The scoring function has two components:

The Trigonometric Series Score ($S_{\text{trig}}$) uses the Q center computed offline and the actual cached key representation to estimate how much attention the key will receive, based on its positional distance from future queries. Because a key may be attended to by queries at many future positions, TriAttention averages this score over a set of future offsets using geometric spacing.

$$S_{\text{trig}}(k, \Delta) = \sum_{f} \|\mathbb{E}[q_f]\| \cdot \|k_f\| \cdot \cos(\omega_f \Delta + \phi_f)$$

The Norm-Based Score ($S_{\text{norm}}$) handles the minority of attention heads where Q/K concentration is lower. It weights each frequency band by the expected query norm contribution, providing complementary information about token salience beyond distance preference alone.

$$S_{\text{norm}}^{(0)}(k) = \sum_{f} \mathbb{E}[\|q_f\|] \cdot \|k_f\|$$

The two scores are combined using the Mean Resultant Length R as an adaptive weight: when concentration is high, $S_{\text{trig}}$ do
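
The two score formulas above can be sketched in a few lines. This is a minimal illustration under several assumptions that are not confirmed by the text: keys are held as pre-RoPE complex pairs, the phase is the angle of the query center times the conjugate key, the future offsets double at each step (1, 2, 4, ...), and eviction simply keeps the highest-scoring keys up to a budget. All names are hypothetical, and the R-weighted combination of the two scores is not implemented.

```python
import cmath
import math


def geometric_offsets(max_offset):
    """Future query offsets 1, 2, 4, ... up to max_offset (assumed spacing)."""
    offsets, step = [], 1
    while step <= max_offset:
        offsets.append(step)
        step *= 2
    return offsets


def s_trig(q_center, key, key_pos, current_pos, freqs, offsets):
    """Average the trigonometric score of one key over future query offsets."""
    total = 0.0
    for offset in offsets:
        delta = (current_pos + offset) - key_pos  # future query-key distance
        for qc, k, w in zip(q_center, key, freqs):
            phase = cmath.phase(qc * k.conjugate())
            total += abs(qc) * abs(k) * math.cos(w * delta + phase)
    return total / len(offsets)


def s_norm(expected_q_norms, key):
    """Norm-based score: sum_f E[|q_f|] * |k_f|."""
    return sum(n * abs(k) for n, k in zip(expected_q_norms, key))


def keep_top_keys(scores, budget):
    """Indices of the `budget` highest-scoring keys, in position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy setup: two frequency bands, four cached keys at positions 0..3.
freqs = [0.5, 0.05]
q_center = [1.0 + 0.2j, 0.8 - 0.1j]
keys = [[0.9 + 0.1j, 0.7 + 0.0j], [-0.5 + 0.4j, 0.1 + 0.1j],
        [1.1 - 0.2j, 0.9 + 0.3j], [0.2 + 0.0j, -0.6 + 0.2j]]
offsets = geometric_offsets(64)

scores = [s_trig(q_center, k, pos, current_pos=4, freqs=freqs, offsets=offsets)
          for pos, k in enumerate(keys)]
print("kept key positions:", keep_top_keys(scores, budget=2))
```

Because the score needs only the offline query center and the cached key, it can be recomputed at any step without waiting for recent queries to reveal which keys matter.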
