MarkTechPost • 100 days ago

Moonshot AI Rethinks Multi-Datacenter LLM Serving

IMP

8/10

Key Summary

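Researchers from Moonshot AI and Tsinghua University have proposed PrfaaS (Prefill-as-a-Service), a multi-datacenter serving architecture that rethinks how large language model (LLM) inference is deployed. It moves compute-intensive prefill onto separate clusters and ships the much smaller KVCache produced by hybrid-attention models over commodity Ethernet. In the authors' case study this yields 54% higher serving throughput than a homogeneous PD baseline.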

Translated Body

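For years, the way large language models (LLMs) handle inference has been stuck inside a box, quite literally. The high-bandwidth RDMA networks that make modern LLM serving possible have confined prefill and decode to the same datacenter, sometimes even the same server rack. Researchers at Moonshot AI and Tsinghua University argue that this constraint is about to break down, and that the right architecture can already exploit the shift.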

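The team introduces Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the resulting KVCache over commodity Ethernet to local PD (prefill-decode) clusters for decode. In a case study using an internal 1-trillion-parameter (1T) hybrid model, PrfaaS delivered 54% higher serving throughput than a homogeneous PD baseline and 32% higher than a naive heterogeneous setup, while consuming only a fraction of the available cross-datacenter bandwidth. The researchers note that at equal hardware cost the throughput gain is approximately 15%; the full 54% advantage comes partly from pairing higher-compute H200 GPUs for prefill with H20 GPUs for decode.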

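Why the Existing Architecture Has Hit a Wall

To understand the problem PrfaaS solves, it helps to understand why LLM serving is split into two phases in the first place. Prefill is the step where the model processes all input tokens and generates the KVCache; it is compute-intensive. Decode is the step where the model generates output tokens one at a time; it is memory-bandwidth-intensive. Prefill-decode (PD) disaggregation places these two phases on different hardware, which improves utilization and lets each phase be optimized independently.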

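The problem is that separating prefill from decode creates a transport problem. Once prefill runs on one set of machines and decode on another, the KVCache produced by prefill must be transferred to the decode side before output generation can begin. In conventional dense-attention models that use Grouped Query Attention (GQA), this KVCache is large. The researchers benchmarked MiniMax-M2.5, a representative dense model with GQA, which produced KVCache at roughly 60 Gbps for a 32K-token request on a single 8×H200 instance. Moving that much data without stalling compute requires RDMA-class interconnects, which is why conventional PD disaggregation has been tightly bound to a single datacenter-scale network fabric. Moving prefill and decode to separate clusters, let alone across datacenters, has simply not been feasible.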
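To make the bandwidth pressure concrete, here is a minimal back-of-the-envelope sketch. It takes the roughly 60 Gbps per-instance figure from the paragraph above; the 100 Gbps Ethernet link, the 3,200 Gbps RDMA fabric, and the function name `max_prefill_instances` are illustrative assumptions, not values or APIs from the paper.

```python
import math


def max_prefill_instances(link_gbps: float,
                          kv_gbps_per_instance: float,
                          usable_fraction: float = 1.0) -> int:
    """Number of prefill instances whose KVCache stream fits on one link.

    usable_fraction models headroom kept free for other traffic
    (hypothetical knob, not from the article).
    """
    if kv_gbps_per_instance <= 0:
        raise ValueError("per-instance KV throughput must be positive")
    usable_gbps = link_gbps * usable_fraction
    return math.floor(usable_gbps / kv_gbps_per_instance)


DENSE_KV_GBPS = 60.0        # quoted: MiniMax-M2.5, 32K tokens, one 8xH200 instance
ETHERNET_LINK_GBPS = 100.0  # assumption: a commodity cross-cluster link
RDMA_FABRIC_GBPS = 3200.0   # assumption: e.g. 8 x 400 Gbps NICs on one node

on_ethernet = max_prefill_instances(ETHERNET_LINK_GBPS, DENSE_KV_GBPS)
on_ethernet_half = max_prefill_instances(ETHERNET_LINK_GBPS, DENSE_KV_GBPS, 0.5)
on_rdma = max_prefill_instances(RDMA_FABRIC_GBPS, DENSE_KV_GBPS)

print(on_ethernet, on_ethernet_half, on_rdma)
```

Under these assumed link sizes, a single dense-model prefill instance already consumes most of the Ethernet link, which is the intuition behind keeping conventional PD disaggregation on an RDMA fabric.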

Hybrid Attention Changes the Math

What makes PrfaaS timely is an architectural shift at the model level. A growing number of models, including Kimi Linear, MiMo-V2-Flash, Qwen3.5-397B, and Ring-2.5-1T, adopt hybrid attention stacks that interleave a small number of full-attention layers with a larger number of linear-complexity or bounded-state layers such as Kimi Delta Attention (KDA), Multi-head Latent Attention (MLA), and Sliding Window Attention (SWA).

In these architectures, only the full-attention layers produce KVCache that grows with sequence length. The linear-complexity layers maintain fixed-size recurrent states, so their memory footprint is negligible at long context. KV throughput, defined as KVCache size divided by prefill latency, makes this concrete. At 32K tokens, MiMo-V2-Flash produces KVCache at 4.66 Gbps versus 59.93 Gbps for MiniMax-M2.5, roughly a 13× reduction. Qwen3.5-397B reaches 8.25 Gbps versus 33.35 Gbps for Qwen3-235B, a 4× reduction. For Ring-2.5-1T specifically, the paper breaks down the savings: MLA provides roughly 4.5× compression over GQA, and the 7:1 hybrid ratio provides a further reduction of about 8×, for an overall KV memory saving of roughly 36×.
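The arithmetic behind these figures can be checked with a short sketch. The Gbps values and the 4.5× MLA factor are the ones quoted in the article. The helper names are hypothetical. Reading the 7:1 ratio as seven bounded-state layers per full-attention layer is an assumption about how the roughly 8× factor arises.

```python
def kv_throughput_gbps(kv_cache_bytes: float, prefill_latency_s: float) -> float:
    """KV throughput as defined in the article: KVCache size / prefill latency."""
    if prefill_latency_s <= 0:
        raise ValueError("prefill latency must be positive")
    return kv_cache_bytes * 8 / prefill_latency_s / 1e9  # bytes -> gigabits/s


def reduction_factor(baseline_gbps: float, hybrid_gbps: float) -> float:
    """How many times less KVCache traffic the hybrid model generates."""
    return baseline_gbps / hybrid_gbps


# 32K-token KV throughput figures quoted in the article (Gbps).
mimo_vs_minimax = reduction_factor(59.93, 4.66)
qwen35_vs_qwen3 = reduction_factor(33.35, 8.25)

# Ring-2.5-1T decomposition. Assumption: "7:1" means 7 bounded-state layers
# per full-attention layer, so only 1 layer in 8 keeps a growing KVCache.
linear_layers, full_layers = 7, 1
hybrid_factor = (linear_layers + full_layers) / full_layers
mla_factor = 4.5  # quoted: MLA compression over GQA
ring_total = mla_factor * hybrid_factor

print(f"MiMo-V2-Flash vs MiniMax-M2.5: {mimo_vs_minimax:.1f}x")
print(f"Qwen3.5-397B vs Qwen3-235B:   {qwen35_vs_qwen3:.1f}x")
print(f"Ring-2.5-1T combined saving:  {ring_total:.0f}x")
```

The two measured ratios round to the 13× and 4× quoted above, and the product of the two Ring-2.5-1T factors gives the roughly 36× overall saving.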

Original Text (English)

For years, the way large language models handle inference has been stuck inside a box, literally. The high-bandwidth RDMA networks that make modern LLM serving work have confined both prefill and decode to the same datacenter, sometimes even the same rack. A team of researchers at Moonshot AI and Tsinghua University is making the case that this constraint is about to break down, and that the right architecture can already exploit that shift.

The research team introduces Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the resulting KVCache over commodity Ethernet to local PD clusters for decode. The result, in a case study using an internal 1T-parameter hybrid model, is 54% higher serving throughput than a homogeneous PD baseline and 32% higher than a naive heterogeneous setup, while consuming only a fraction of available cross-datacenter bandwidth. The research team notes that when compared at equal hardware cost, the throughput gain is approximately 15%, reflecting that the full 54% advantage comes partly from pairing higher-compute H200 GPUs for prefill with H20 GPUs for decode.

Why the Existing Architecture Has Hit a Wall

To understand what PrfaaS solves, it helps to understand why LLM serving is split into two phases in the first place. Prefill is the step where the model processes all of the input tokens and generates the KVCache; it is compute-intensive. Decode is where the model generates output tokens one at a time; it is memory-bandwidth-intensive. Prefill-decode (PD) disaggregation separates these two phases onto different hardware, which improves utilization and allows each phase to be independently optimized.
The problem is that separating prefill from decode creates a transport problem. Once prefill runs on one set of machines and decode runs on another, the KVCache produced by prefill must be transferred to the decode side before output generation can begin. In conventional dense-attention models (those using Grouped Query Attention, GQA) this KVCache is enormous. The research team benchmarks MiniMax-M2.5, a representative dense model with GQA, producing KVCache at roughly 60 Gbps for a 32K-token request on a single 8×H200 instance. That volume of data requires RDMA-class interconnects to transfer without stalling compute, which is why conventional PD disaggregation is tightly bound to a single datacenter-scale network fabric. Moving prefill and decode to separate clusters, let alone across datacenters, has simply not been feasible.

Hybrid Attention Changes the Math

What makes PrfaaS timely is an architectural shift happening at the model level. A growing class of models, including Kimi Linear, MiMo-V2-Flash, Qwen3.5-397B, and Ring-2.5-1T, adopt hybrid attention stacks that interleave a small number of full-attention layers with a larger number of linear-complexity or bounded-state layers such as Kimi Delta Attention (KDA), Multi-head Latent Attention (MLA), and Sliding Window Attention (SWA).

In these architectures, only the full-attention layers produce KVCache that scales with sequence length. The linear-complexity layers maintain fixed-size recurrent states whose footprint is negligible at long context. The KV throughput numbers, defined as KVCache size divided by prefill latency, tell the story clearly. At 32K tokens, MiMo-V2-Flash produces KVCache at 4.66 Gbps versus 59.93 Gbps for MiniMax-M2.5, a 13× reduction. Qwen3.5-397B reaches 8.25 Gbps versus 33.35 Gbps for Qwen3-235B, a 4× reduction.
For Ring-2.5-1T specifically, the paper decomposes the savings: MLA contributes roughly a 4.5× compression over GQA, and the 7:1 hybrid ratio contributes another approximately 8× reduction, yielding an overall KV memory saving of roughly 36×. For the internal 1T model used in the case study, KV throughput at 32K tokens is just 3.19 Gbps, a level that modern inter-datacenter Ethernet links can actually sustain.

But the research team is careful to make a distinction that matters for AI devs building real systems: a smaller KVCache is necessary but not sufficient to make cross-datacenter PD disaggregation practical. Real workloads are bursty, request lengths are skewed, prefix caches are distributed unevenly across nodes, and inter-cluster bandwidth fluctuates. A naive design that routes every prefill to a remote cluster still runs into congestion and unstable queuing.

What PrfaaS Actually Does

The PrfaaS-PD architecture sits on top of three subsystems: compute, network, and storage. The compute subsystem separates clusters into two types: local PD clusters that handle end-to-end inference for short requests, and PrfaaS clusters with high-compute-throughput accelerators dedicated to long-context prefill. The network subsystem uses intra-cluster RDMA for fast local transfers and commodity Ethernet for cross-cluster KVCache transport. The storage subsystem builds a distributed hybrid prefix cache pool that handles linear attention recurrent states (request-level, fixed-size, exact-match only) and full-attention KVCache blocks (block-level, growing linearly with input length, supporting partial prefix matching) in separate groups backed by a unified block pool.

The key routing mechanism is length-based threshold routing. Let l denote the incremental prefill length of a request after subtracting any cached prefix, and t a routing threshold. If l > t, the request goes to the PrfaaS cluster and its KVCache is shipped over Ethernet to a decode node.
If l ≤ t, it stays on the local PD path. In the case study, the optimal threshold is t = 19.4K tokens, which routes approximately 50% of all requests (the longer ones) to the PrfaaS cluster.

Making the Ethernet path reliable in practice requires more than just low KV throughput. The research team specifies three concrete transport mechanisms: layer-wise prefill pipelining to overlap KVCache generation with transmission, multi-connection TCP transport to fully utilize available bandwidth, and congestion monitoring integrated with the scheduler to detect loss and retransmission signals early and prevent congestion accumulation.

On top of this, the research team introduces a dual-timescale scheduler. At short timescales, it monitors PrfaaS egress utilization and queue depth, adjusting routing when the link approaches its bandwidth ceiling. It also handles cache-affine routing: when bandwidth is scarce, each cluster's prefix cache is evaluated independently; when bandwidth is abundant, the scheduler considers the best cached prefix across all clusters and performs a cross-cluster cache transfer if it reduces redundant computation. At longer timescales, the scheduler rebalances prefill and decode node counts within the local PD cluster as traffic patterns shift, keeping the system near the throughput-optimal operating point.

The Numbers

In the case study, a PrfaaS cluster of 32 H200 GPUs is paired with a local PD cluster of 64 H20 GPUs, connected by a VPC network providing approximately 100 Gbps of cross-cluster bandwidth. The aggregate PrfaaS egress load under the optimal configuration is approximately 13 Gbps, just 13% of available Ethernet capacity, and the paper notes that the PrfaaS cluster remains compute-bound with substantial bandwidth headroom to spare.
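The length-based routing rule and the egress figures described above can be sketched as follows. This is a minimal illustration, not the paper's scheduler. The `Request` class, the `route` and `link_utilization` names, and the 0.9 utilization cutoff are hypothetical, and falling back to the local PD path when the link is nearly full is one assumed way of "adjusting routing" near the bandwidth ceiling. Only t = 19.4K tokens, 13 Gbps, and 100 Gbps come from the article.

```python
from dataclasses import dataclass

THRESHOLD_TOKENS = 19_400  # quoted: optimal t = 19.4K tokens in the case study


@dataclass
class Request:
    prompt_tokens: int
    cached_prefix_tokens: int = 0  # tokens already covered by a prefix cache hit

    @property
    def incremental_prefill_len(self) -> int:
        """l in the article: prompt length minus any cached prefix."""
        return max(self.prompt_tokens - self.cached_prefix_tokens, 0)


def link_utilization(load_gbps: float, capacity_gbps: float) -> float:
    """Fraction of cross-cluster bandwidth currently in use."""
    return load_gbps / capacity_gbps


def route(req: Request,
          threshold: int = THRESHOLD_TOKENS,
          egress_utilization: float = 0.0,
          max_utilization: float = 0.9) -> str:
    """Return "prfaas" for long prefills, "local_pd" otherwise."""
    # Assumption: when the Ethernet link nears its ceiling, keep work local.
    if egress_utilization >= max_utilization:
        return "local_pd"
    # Rule from the article: l > t goes remote, l <= t stays local.
    if req.incremental_prefill_len > threshold:
        return "prfaas"
    return "local_pd"


case_study_util = link_utilization(13.0, 100.0)  # quoted: ~13 of ~100 Gbps

long_req = Request(prompt_tokens=32_000)
cached_req = Request(prompt_tokens=32_000, cached_prefix_tokens=20_000)

print(route(long_req, egress_utilization=case_study_util))    # long prefill
print(route(cached_req, egress_utilization=case_study_util))  # mostly cached
```

Routing on the incremental length rather than the raw prompt length means a long prompt with a large prefix-cache hit stays local, since only the uncached part has to be prefilled and shipped.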
The research also projects this to larger deployments: even at the scale of a 10,000-GPU datacenter, the aggregate egress bandwidth required for KVCache transfer totals only about 1.8 Tbps, well within the capacity of modern inter-datacenter links. Mean Time to First Token (TTFT) drops by 50% and P90 TTFT drops by 64% compared to the homogeneous baseline. The naive heterogeneous configuration — all prefill on H200, all decode on H20, with no routing o

LLM serving, KVCache, distributed architecture, hybrid attention, Moonshot AI