r/LocalLLaMA • 61 days ago

Zai replaces its GLM-5.1 inference network, cutting switch and optics costs by 33% and raising throughput by 15%

Importance: 8/10

Key summary

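Zai replaced the standard ROFT network in its thousand-GPU GLM-5.1 coding inference cluster with ZCube, a design developed jointly with Tsinghua University and HarnetsAI. With the same GPUs and the same software stack, changing only the network architecture cut switch and optical module costs by 33%, raised inference throughput by 15%, and reduced P99 first-token latency by 40.6%. The result matters to AI infrastructure practitioners because it runs against the usual trade-off: flattening the network to remove traffic bottlenecks improved performance while lowering hardware spend instead of adding to it.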

Translated text

I've been following the infrastructure side of AI more lately and came across this from Zai. They upgraded the network architecture of a thousand-GPU cluster running GLM-5.1 coding inference from the standard ROFT setup to a new design called ZCube, developed with Tsinghua University and HarnetsAI.

The numbers from production:

- Switch and optical module costs down 33%
- GPU inference throughput up 15%
- P99 tail latency on the first token down 40.6%

Same GPUs, same software stack, same model. Only the network architecture changed.

The problem they were actually solving is interesting. With Prefill-Decode (PD) disaggregated inference, KV Cache transfers create highly asymmetric traffic between nodes. The ROFT topology handles training workloads fine, but under PD disaggregation the traffic patterns do not match the static rail mapping. The result is hotspots concentrated on specific Leaf switches and a buildup of PFC (Priority Flow Control) backpressure.
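To make the hotspot argument concrete, here is a toy sketch of how a static rail mapping can concentrate load. Everything in it is an assumption made for illustration and not a detail from Zai's setup: 8 rails per node, a transfer charged only to the sender's leaf, and prefill workers pinned to GPU indices 0 and 1.

```python
from collections import Counter

RAILS = 8  # assumption: 8 GPUs per node, one leaf switch per GPU index (rail)


def leaf_for(gpu_index: int) -> int:
    """Static rail mapping: GPU k on every node always attaches to leaf k."""
    return gpu_index % RAILS


def leaf_load(transfers):
    """Sum the bytes each leaf carries for (src_gpu, dst_gpu, nbytes) transfers.

    Simplification: a transfer is charged to the sender's leaf only.
    """
    load = Counter()
    for src_gpu, _dst_gpu, nbytes in transfers:
        load[leaf_for(src_gpu)] += nbytes
    return load


def imbalance(load) -> float:
    """Busiest leaf divided by the mean load over all leaves (1.0 = even)."""
    return max(load.values()) / (sum(load.values()) / RAILS)


# Training-like pattern: every GPU index on 4 nodes sends the same amount.
symmetric = [(gpu, gpu, 100) for _node in range(4) for gpu in range(RAILS)]

# Hypothetical PD pattern: prefill workers sit on GPU indices 0 and 1 and
# push KV cache to decode workers on indices 2..7.
pd_like = [(src, dst, 100) for src in (0, 1) for dst in range(2, RAILS)]

symmetric_ratio = imbalance(leaf_load(symmetric))  # 1.0
pd_ratio = imbalance(leaf_load(pd_like))           # 4.0
```

In this toy model the symmetric pattern spreads evenly across the leaves, while the PD-like pattern puts four times the mean load on two of them. That is the kind of skew a fixed mapping cannot rebalance.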

ZCube addresses this by going fully flat: it removes the Spine layer entirely and uses a complete bipartite interconnect between two switch groups. That eliminates a whole category of congestion that ROFT cannot avoid by design.
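A complete bipartite interconnect is easy to sketch as a graph. The following is a minimal illustration of that general structure only, not ZCube's actual wiring: the group sizes, the switch names, and the omission of GPU attachment are all assumptions, since the post gives none of those details.

```python
from collections import deque
from itertools import product


def complete_bipartite(n_a: int, n_b: int) -> dict:
    """Link every switch in group A to every switch in group B."""
    group_a = [f"A{i}" for i in range(n_a)]
    group_b = [f"B{j}" for j in range(n_b)]
    adj = {switch: set() for switch in group_a + group_b}
    for a, b in product(group_a, group_b):
        adj[a].add(b)
        adj[b].add(a)
    return adj


def hops(adj: dict, src: str, dst: str) -> int:
    """Shortest switch-to-switch hop count via breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError(f"{dst} is unreachable from {src}")


fabric = complete_bipartite(4, 4)  # hypothetical group sizes
link_count = sum(len(peers) for peers in fabric.values()) // 2  # 16
diameter = max(hops(fabric, s, d) for s in fabric for d in fabric)  # 2
shared_paths = len(fabric["A0"] & fabric["A1"])  # 4
```

With two groups of four, any two switches are at most two hops apart. Two switches in the same group can reach each other through every switch in the other group, so traffic has several equal-length paths and is not forced through a shared upper layer.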

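The part that stands out is that cost went down while performance went up. Usually you pay more for better network hardware. Here they cut network hardware costs (switches and optical modules) by a third and still got 15% more throughput out of the same GPUs.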
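Combining the two headline figures gives a rough sense of the efficiency gain. This is back-of-the-envelope arithmetic on the reported percentages, and it covers only switch and optical module spend, not total cluster cost, which the post does not break down.

```python
# Reported figures from the post, expressed as ratios against the ROFT baseline.
network_cost_ratio = 1 - 0.33  # switch and optical module cost
throughput_ratio = 1 + 0.15    # GPU inference throughput

# Network hardware cost per unit of throughput, relative to the baseline.
cost_per_throughput = network_cost_ratio / throughput_ratio

# Fractional reduction in network cost per unit of throughput.
reduction = 1 - cost_per_throughput

print(f"relative network cost per unit throughput: {cost_per_throughput:.3f}")
print(f"reduction: {reduction:.1%}")
```

Under those figures, switch and optics spend per unit of throughput falls to roughly 58% of the baseline, a reduction of about 42%.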

Original post (English)

Been following the infrastructure side of AI more lately and stumbled on this from Zai. They upgraded the network architecture on a thousand-GPU cluster running GLM-5.1 coding inference from the standard ROFT setup to something they built called ZCube, developed with Tsinghua University and HarnetsAI.

The numbers from production:

- Switch and optical module costs down 33%
- GPU inference throughput up 15%
- P99 tail latency on first token dropped 40.6%

Same GPUs, same software stack, same model. Just the network architecture changed.

The actual problem they were solving is interesting. With Prefill-Decode disaggregated inference, KV Cache transfers create highly asymmetric traffic between nodes. ROFT topology handles training workloads fine, but with PD disaggregation the traffic patterns don't match the static rail mapping, so you get hotspots on specific Leaf switches and PFC backpressure building up.

ZCube addresses it by going fully flattened, removing the Spine layer entirely and using a complete bipartite interconnect between two switch groups. That eliminates a whole category of congestion that ROFT can't avoid by design.

The cost reduction while getting better performance is the part that stands out. Usually you pay more for better network hardware. Here they cut hardware costs by a third and got 15% more throughput out of the same GPUs.

Infrastructure optimization · Network architecture · GPU inference · GLM-5.1 · Cost reduction