Hacker News • 101 days ago

Next-Generation LLM Serving Architecture: KVCache That Crosses Datacenters

IMP

8/10

Key Summary

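This research paper proposes Prefill-as-a-Service (PrfaaS), a new distributed architecture for serving large language models (LLMs). It relies on recent hybrid-attention models, whose KVCache is substantially smaller, and ships that KVCache over commodity Ethernet to another datacenter to spread the compute load. This lets heterogeneous GPU clusters scale flexibly. In the paper's case study, the design achieved 54% higher serving throughput than a homogeneous PD deployment, a result with significant implications for operating large-scale AI infrastructure.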

Translated Text

Computer Science > Distributed, Parallel, and Cluster Computing (arXiv:2604.15039)

Title: Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

Authors: Ruoyu Qin, Weiran He, Yaoyu Wang, Zheming Li, Xinran Xu, Yongwei Wu, Weimin Zheng, Mingxing Zhang

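Abstract: Prefill-decode (PD) disaggregation has become the standard architecture for large-scale LLM serving, but in practice its deployment boundary is still set by KVCache transfer. In conventional dense-attention models, the prefill stage generates so much KVCache traffic that prefill and decode nodes must stay tightly coupled inside a single high-bandwidth network domain. This limits heterogeneous hardware deployment and elastic use of resources.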

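Recent hybrid-attention architectures substantially reduce KVCache size, which makes cross-cluster KVCache transfer increasingly plausible. A smaller KVCache alone, however, does not make PD serving across heterogeneous datacenters practical. Real workloads remain bursty, request lengths are highly skewed, prefix caches are unevenly distributed, and inter-cluster bandwidth fluctuates. A naive design that fully externalizes prefill can therefore still suffer from network congestion, unstable queueing, and poor utilization.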
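To give a feel for why KVCache size decides whether cross-cluster transfer is plausible, the following is a back-of-the-envelope sketch. It is not taken from the paper: the layer count, head dimensions, fraction of full-attention layers, and link speed are all hypothetical, and the fixed-size state kept by non-full-attention layers is ignored for simplicity.

```python
def kvcache_bytes(tokens, layers, kv_heads, head_dim,
                  bytes_per_elem=2, full_attn_fraction=1.0):
    """Approximate per-request KVCache size in bytes.

    Assumption: only full-attention layers store per-token K and V;
    any fixed-size state held by the other layers is ignored.
    """
    full_layers = round(layers * full_attn_fraction)
    # Factor 2 accounts for storing both K and V.
    return 2 * full_layers * kv_heads * head_dim * bytes_per_elem * tokens


def transfer_seconds(num_bytes, link_gbps):
    """Ideal transfer time over a link, ignoring protocol overhead."""
    return num_bytes * 8 / (link_gbps * 1e9)


# Hypothetical model shape and a 32K-token prompt.
dense = kvcache_bytes(32768, layers=64, kv_heads=8, head_dim=128)
hybrid = kvcache_bytes(32768, layers=64, kv_heads=8, head_dim=128,
                       full_attn_fraction=0.25)

for name, size in (("dense", dense), ("hybrid", hybrid)):
    secs = transfer_seconds(size, link_gbps=100)
    print(f"{name}: {size / 2**30:.1f} GiB, {secs:.2f} s over 100 Gbps")
```

Under these assumed numbers, keeping full attention in only a quarter of the layers cuts the transferred bytes, and hence the ideal transfer time, by a factor of four. Real hybrid models differ in how much state their other layers carry.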

The paper presents Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the resulting KVCache over commodity Ethernet to local PD clusters for decode.

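Rather than treating the reduced KVCache as sufficient on its own, PrfaaS combines model-side KV efficiency with system-side selective offloading, bandwidth-aware scheduling, and cache-aware request placement. This design removes the requirement that heterogeneous accelerators share the same low-latency RDMA fabric, so prefill and decode capacity can scale independently across loosely coupled clusters.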
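The abstract does not spell out the scheduling policy, so the following is only an illustrative sketch of how selective offloading, bandwidth awareness, and cache awareness might combine in one routing decision. Every name, threshold, and rule here (`Request`, `LinkState`, `route_prefill`, `min_offload_tokens`, `max_queue_seconds`) is hypothetical and is not the paper's algorithm.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    cached_prefix_tokens: int  # tokens already in the local prefix cache


@dataclass
class LinkState:
    capacity_bytes_per_s: float  # currently measured cross-datacenter bandwidth
    inflight_bytes: float        # KVCache bytes already queued on the link


def route_prefill(req, link, kv_bytes_per_token,
                  min_offload_tokens=8192, max_queue_seconds=1.0):
    """Return "remote" to offload prefill, or "local" to keep it in the PD cluster."""
    uncached = req.prompt_tokens - req.cached_prefix_tokens

    # Selective, cache-aware: short or mostly cached prompts stay local.
    if uncached < min_offload_tokens:
        return "local"

    # Bandwidth-aware: estimate how long the link needs to drain its queue
    # plus this request's KVCache (assumption: only uncached tokens are sent).
    kv_bytes = uncached * kv_bytes_per_token
    drain_seconds = (link.inflight_bytes + kv_bytes) / link.capacity_bytes_per_s
    if drain_seconds > max_queue_seconds:
        return "local"

    return "remote"


# Hypothetical usage: a 100 Gbps link (1.25e10 bytes/s) with an empty queue.
link = LinkState(capacity_bytes_per_s=1.25e10, inflight_bytes=0.0)
print(route_prefill(Request(32000, 0), link, kv_bytes_per_token=65536))
print(route_prefill(Request(32000, 30000), link, kv_bytes_per_token=65536))
```

Falling back to local prefill when the link is congested is one simple way to avoid the unstable queueing that a fully externalized design risks. A production scheduler would presumably also weigh cluster load and where prefix caches reside.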

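In a case study using an internal 1T-parameter hybrid model, a PrfaaS-augmented heterogeneous deployment achieved 54% higher serving throughput than homogeneous PD and 32% higher than a naive heterogeneous baseline, while consuming only modest cross-datacenter bandwidth.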

Original Text (English)

Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2604.15039 (cs) [Submitted on 16 Apr 2026]

Title: Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

Authors: Ruoyu Qin, Weiran He, Yaoyu Wang, Zheming Li, Xinran Xu, Yongwei Wu, Weimin Zheng, Mingxing Zhang

Abstract: Prefill-decode (PD) disaggregation has become the standard architecture for large-scale LLM serving, but in practice its deployment boundary is still determined by KVCache transfer. In conventional dense-attention models, prefill generates huge KVCache traffics that keep prefill and decode tightly coupled within a single high-bandwidth network domain, limiting heterogeneous deployment and resource elasticity. Recent hybrid-attention architectures substantially reduce KVCache size, making cross-cluster KVCache transport increasingly plausible. However, smaller KVCache alone does not make heterogeneous cross-datacenter PD serving practical: real workloads remain bursty, request lengths are highly skewed, prefix caches are unevenly distributed, and inter-cluster bandwidth fluctuates. A naive design that fully externalizes prefill can therefore still suffer from congestion, unstable queueing, and poor utilization. We present Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the resulting KVCache over commodity Ethernet to local PD clusters for decode. Rather than treating reduced KVCache as sufficient, PrfaaS combines model-side KV efficiency with system-side selective offloading, bandwidth-aware scheduling, and cache-aware request placement. This design removes the requirement that heterogeneous accelerators share the same low-latency RDMA fabric, enabling independent scaling of prefill and decode capacity across loosely coupled clusters. In a case study using an internal 1T-parameter hybrid model, a PrfaaS-augmented heterogeneous deployment achieves 54% and 32% higher serving throughput than homogeneous PD and naive heterogeneous baselines, respectively, while consuming only modest cross-datacenter bandwidth.

Comments: 16 pages, 5 figures, 6 tables
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as: arXiv:2604.15039 [cs.DC] (or arXiv:2604.15039v1 [cs.DC] for this version)
https://doi.org/10.48550/arXiv.2604.15039
Submission history: From Ruoyu Qin. [v1] Thu, 16 Apr 2026 14:07:41 UTC (244 KB)

Infrastructure scaling, KVCache, LLM serving, distributed computing, architecture