Hacker News • 64 days ago

The Potential and Limits of AI Datacenters Without GPUs

IMP

8/10

Key summary

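Datacenters used to be built around north-south traffic, simply connecting servers and storage. AI clusters have instead become distributed supercomputers built around east-west traffic, in which thousands of GPUs exchange data. Large-scale data transfer and tight synchronization are now essential, so even a single delayed or lost packet can become a critical bottleneck for the speed of the whole training run. The lossless networking techniques introduced to address this (RoCEv2 with PFC) create bottlenecks of their own, and the industry currently relies on InfiniBand and rail optimization as its main answers.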

Translated text

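The old model

For the past few decades, building a datacenter was a well-understood, predictable exercise in infrastructure engineering. You provisioned compute servers, attached storage arrays, and built a network to tie them together. The goal was clear: maximize utilization while minimizing cost. The dominant traffic pattern was north-south (clients sending requests to servers, and servers responding with the results of database queries), plus some east-west traffic from servers to storage. Networks were built to handle bursty traffic, and if a packet was dropped, standard TCP/IP retransmitted it. In web hosting or cloud services, a minor delay meant an image loaded slightly slower or a request completed a few milliseconds later. That was tolerable.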

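The AI shift

In modern AI clusters, the network is no longer just infrastructure. It does not simply move data between machines; it directly determines accelerator utilization. When you train large models under the deep learning paradigm, you are not dealing with independent servers. You are dealing with a massive distributed supercomputer in which thousands of GPUs must continuously exchange parameters. The dominant traffic pattern shifts completely to east-west communication inside the cluster (server-to-server, GPU-to-GPU, and rack-to-rack). In contrast to localized, bursty spikes, AI workloads execute communication patterns such as all-to-all and all-reduce. Instead of millions of small independent flows, the network must carry a small number of extremely large elephant flows. During gradient synchronization phases, thousands of GPUs exchange data across the fabric at the same time, which causes severe network incast and rapidly saturates switch buffers.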
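To make the all-reduce pattern mentioned above concrete, here is a minimal single-process sketch of a ring all-reduce. The ring algorithm is an assumption chosen for illustration (the article does not say which all-reduce algorithm a given cluster uses), and `ring_all_reduce` is a hypothetical helper, not a library API.

```python
def ring_all_reduce(buffers):
    """Sum equal-length vectors across workers using a simulated ring.

    buffers[i] is worker i's local vector, already split into len(buffers)
    chunks (one number per chunk here, to keep the sketch small).
    Returns (per-worker results, number of chunks each worker sent).
    """
    n = len(buffers)
    if any(len(b) != n for b in buffers):
        raise ValueError("each worker needs exactly one chunk per worker")
    data = [list(b) for b in buffers]
    sent_per_worker = 0

    # Phase 1, reduce-scatter: in each step every worker sends one chunk to
    # its right-hand neighbour, which adds it to its own copy. After n - 1
    # steps, worker i holds the complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing values first so all sends in a step are simultaneous.
        outgoing = [((i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(outgoing):
            data[(i + 1) % n][chunk] += value
        sent_per_worker += 1

    # Phase 2, all-gather: the completed chunks travel around the ring again,
    # this time overwriting instead of adding.
    for step in range(n - 1):
        outgoing = [((i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(outgoing):
            data[(i + 1) % n][chunk] = value
        sent_per_worker += 1

    return data, sent_per_worker


grads = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
]
reduced, chunks_sent = ring_all_reduce(grads)
```

In this sketch every worker sends 2(n - 1) chunks, each 1/n of its buffer, so the per-worker traffic approaches twice the buffer size no matter how large the ring is. Every link is busy in every step, which is one way to see why these workloads look like a few long-lived, synchronized elephant flows rather than many small independent ones.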

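This shift broke many of the assumptions that standard networking was built on. When a modern accelerator can consume and generate data at 800 Gb/s, the critical metric flips from average latency to Job Completion Time (JCT) and tail latency. In deep learning training, workloads execute in tightly synchronized steps, so the entire workload progresses at the speed of the slowest participant. One delayed packet can stall thousands of GPUs.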
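The "slowest participant" effect can be illustrated with a small sketch. The numbers below (a 10 ms step, a 50 ms stall, a 0.1% stall chance per GPU per step) are made-up assumptions for illustration, not measurements from the article, and the function names are hypothetical.

```python
import random


def step_time(per_gpu_times_ms):
    """A synchronized step finishes only when the slowest GPU finishes."""
    return max(per_gpu_times_ms)


def prob_any_stall(num_gpus, stall_prob):
    """Chance that at least one of num_gpus independent GPUs stalls in a step."""
    return 1.0 - (1.0 - stall_prob) ** num_gpus


def mean_step_time(num_gpus, steps, base_ms=10.0, stall_ms=50.0,
                   stall_prob=0.001, seed=0):
    """Average step time when each GPU independently stalls with stall_prob."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        times = [
            base_ms + (stall_ms if rng.random() < stall_prob else 0.0)
            for _ in range(num_gpus)
        ]
        total += step_time(times)
    return total / steps


if __name__ == "__main__":
    for gpus in (8, 512, 4096):
        print(gpus, round(prob_any_stall(gpus, 0.001), 3),
              round(mean_step_time(gpus, steps=200), 1))
```

Under these assumptions a rare per-GPU delay becomes a near-certain per-step delay once the cluster is large enough, which is why tail latency matters more than the average.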

Figure 1: Synchronized elephant flows causing switch buffer saturation.

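RDMA and the PFC trap

Solving packet loss created a new problem: head-of-line blocking.

The transport layer that AI clusters rely on amplifies the sensitivity to packet delay. Modern distributed training makes heavy use of RDMA through RoCEv2 (RDMA over Converged Ethernet), which lets GPUs bypass the CPU and operating system entirely for low-latency direct memory access across GPUs. But while RoCEv2 dramatically reduces overhead, it is also highly sensitive to packet loss. A single dropped packet can trigger retransmissions, timeout cascades, and synchronization delays across the cluster.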

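To make the fabric lossless, standard RoCEv2 networks rely on Priority Flow Control (PFC). Conceptually, PFC acts like a pause mechanism: when a switch's buffers begin to fill, the switch tells upstream devices to temporarily stop transmitting. PFC prevents packet loss by propagating congestion backward through the network, and that creates a different problem. Under sustained load, unrelated traffic becomes trapped behind congested flows, which is head-of-line blocking. Congestion spreads across the fabric, queue depths grow, and entire sections of the network can become effectively synchronized around the slowest traffic path.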

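In distributed training environments, this is expensive. The compute cluster cannot advance until every synchronization operation completes, so GPUs sit idle while they wait for retransmitted packets or for congested flows to clear.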

InfiniBand and rail optimization

The incumbent answer to these problems: InfiniBand and rail optimization.

Original text (English)

The old model

For the past few decades, building a datacenter has been a well-understood, predictable exercise in utility engineering. You provisioned compute servers, attached storage arrays, and built a network to stitch them together. The objective was straightforward: maximize utilization while minimizing cost. The dominant traffic pattern was fundamentally north-south (clients sending requests to servers, and servers responding with the results of database queries), plus some east-west traffic from servers to storage. The networks were built to handle bursty traffic, and if a packet dropped, standard TCP/IP would retransmit it. In web hosting or cloud services, a minor delay meant an image loaded slightly slower or a request completed a few milliseconds later. It was tolerable.

AI training changed that model completely. The network is no longer infrastructure. It directly determines accelerator utilization.

The AI shift

In modern AI clusters, the network is no longer just infrastructure sitting beneath compute. It does not simply transport data between machines; it determines accelerator utilization. If you are training large models under the deep learning paradigm, you aren't dealing with independent servers. You are dealing with a massive, distributed supercomputer where thousands of GPUs must continuously swap parameters. The dominant traffic pattern shifts completely to east-west communication inside the cluster (server-to-server, GPU-to-GPU, and rack-to-rack). In contrast to localized, bursty spikes, AI workloads execute communication patterns like all-to-all and all-reduce. Instead of millions of small independent flows, the network must carry a small number of extremely large elephant flows. During gradient synchronization phases, thousands of GPUs may simultaneously exchange data across the fabric, creating severe network incast and rapidly saturating switch buffers.

This shift broke many of the assumptions standard networking was built on.
When a modern accelerator can consume and generate data at 800 Gb/s, the critical metric flips from average latency to Job Completion Time (JCT) and tail latency. In deep learning training, workloads execute in tightly synchronized steps, so the entire workload progresses at the speed of the slowest participant. One delayed packet can stall thousands of GPUs.

Figure 1: Synchronized elephant flows causing switch buffer saturation.

RDMA & the PFC trap

Solving packet loss created a new problem: head-of-line blocking.

The sensitivity to packet delay is amplified by the transport layer AI clusters rely on. Modern distributed training heavily uses RDMA through RoCEv2 (RDMA over Converged Ethernet), allowing GPUs to bypass the CPU and operating system entirely for low-latency direct memory access across GPUs. But while RoCEv2 dramatically reduces overhead, it is also highly sensitive to packet loss. A single dropped packet can trigger retransmissions, timeout cascades, and synchronization delays across the cluster.

To make the fabric lossless, standard RoCEv2 networks rely on Priority Flow Control (PFC). Conceptually, PFC acts like a pause mechanism: when switch buffers begin filling, the switch instructs upstream devices to temporarily stop transmitting traffic. PFC prevents packet loss by propagating congestion backward through the network, and that creates another problem. Under sustained load, unrelated traffic becomes trapped behind congested flows, which is head-of-line blocking. Congestion spreads across the fabric, queue depths increase, and entire sections of the network can become effectively synchronized around the slowest traffic path.

In distributed training environments, this is expensive. The compute cluster cannot advance until every synchronization operation completes. GPUs remain idle while waiting for retransmitted packets or congested flows to clear.
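The trade-off described above, where pausing avoids drops but blocks unrelated traffic, can be sketched with a toy model. Everything here is a simplifying assumption for illustration: one upstream link, one shared priority class, a single XOFF threshold, and a "hot" destination that drains slowly. Real PFC has separate XOFF/XON thresholds, headroom buffers, and per-priority queues, and `simulate_link` is a hypothetical name.

```python
from collections import deque


def simulate_link(ticks, pfc_enabled, xoff=4, drain_every=4):
    """Return (cold packets delivered, hot packets dropped).

    Two flows share one upstream FIFO and one priority class:
      'hot'  goes to a congested egress that drains 1 packet per drain_every ticks
      'cold' goes to an idle egress that is never congested
    """
    upstream = deque()      # shared FIFO on the upstream device
    hot_buffer = 0          # downstream buffer occupancy for the hot egress
    delivered_cold = 0
    dropped_hot = 0

    for t in range(ticks):
        # Each flow offers one packet per tick.
        upstream.append("hot")
        upstream.append("cold")

        # The congested egress drains slowly.
        if t % drain_every == 0 and hot_buffer > 0:
            hot_buffer -= 1

        # With PFC, a full buffer pauses the whole priority class on the link,
        # so cold packets wait behind hot ones (head-of-line blocking).
        paused = pfc_enabled and hot_buffer >= xoff
        if paused:
            continue

        # The link forwards up to two packets per tick.
        for _ in range(min(2, len(upstream))):
            packet = upstream.popleft()
            if packet == "cold":
                delivered_cold += 1
            elif hot_buffer < xoff:
                hot_buffer += 1
            else:
                dropped_hot += 1  # only reachable when PFC is disabled

    return delivered_cold, dropped_hot


cold_with_pfc, drops_with_pfc = simulate_link(100, pfc_enabled=True)
cold_without_pfc, drops_without_pfc = simulate_link(100, pfc_enabled=False)
```

In this toy model, enabling PFC removes the drops on the hot flow but sharply cuts delivery of the cold flow, even though the cold flow's own destination is idle.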
InfiniBand & rail optimization

The incumbent answer: InfiniBand and Rail Optimization

To maximize GPU utilization, the industry's immediate answer was to throw hardware at the problem. NVIDIA capitalized on this by dominating the AI datacenter landscape with InfiniBand, a native lossless fabric designed specifically for high-throughput, low-latency clustering. Unlike conventional Ethernet deployments, InfiniBand was built around deterministic transport behavior, hardware congestion management, adaptive routing, and tightly controlled latency characteristics.

To scale these clusters, engineering teams have had to navigate three distinct network vectors:

Scale Up: Maximizing the high-speed interconnectivity within a single chassis or node (e.g., stitching 8 GPUs together using NVLink).
Scale Out: Expanding horizontally by connecting these multi-GPU nodes across an entire data hall using a dedicated backend network fabric.
Scale Across / DCI (Datacenter Interconnect): Linking entire clusters together when physical power and cooling limits prevent a single site from expanding further.

Figure 1: Scale-up is for memory & Scale-out is for compute

We're entering the end of scale-up, as NVIDIA now delivers complete racks with every GPU accessing every other GPU's memory through NVLink (on the same chassis) and NVSwitch (in the same rack). The coming years will focus on using ConnectX NICs to connect different racks.

To manage the massive scale-out fabric, modern topologies are rigidly designed to be rail-optimized. In an 8-GPU node configuration, each of the 8 GPUs is mapped to a dedicated, independent network interface card (NIC). The network fabric is split into 8 parallel, isolated physical switch planes. GPU position 1 across every server communicates exclusively through rail 1, GPU position 2 through rail 2, and so on. This isolation reduces congestion interactions and improves failure containment.
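The rail mapping above can be written down in a few lines. This is a minimal sketch under the article's 8-GPU-per-node assumption; the function names are hypothetical, and the bandwidth calculation assumes all rails carry equal capacity.

```python
GPUS_PER_NODE = 8  # one NIC and one switch plane ("rail") per GPU position


def rail_of(gpu_position):
    """GPU position k (1-based, as in the text) always uses rail k."""
    if not 1 <= gpu_position <= GPUS_PER_NODE:
        raise ValueError(f"GPU position must be in 1..{GPUS_PER_NODE}")
    return gpu_position


def rail_members(num_nodes, rail):
    """All (node, gpu_position) endpoints attached to one rail."""
    return [(node, rail) for node in range(num_nodes)]


def remaining_bandwidth_fraction(failed_rails):
    """Share of aggregate bandwidth left when some rails are degraded."""
    healthy = GPUS_PER_NODE - len(set(failed_rails))
    return healthy / GPUS_PER_NODE


# GPU 3 on every node in a 4-node cluster shares rail 3 and nothing else.
peers_on_rail_3 = rail_members(num_nodes=4, rail=rail_of(3))
left_after_one_failure = remaining_bandwidth_fraction([3])
```

With equal-capacity rails, losing one of eight planes leaves seven eighths of the aggregate bandwidth, which is the containment property the topology is designed for.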
If one network plane experiences degradation, the cluster loses only a fraction of aggregate bandwidth rather than stalling the entire distributed workload.

Figure 1: 2-Tier Rail-optimized topology.

ECMP & elephant flows

Static routing was designed for mice, not elephants.

Rail-optimized architectures exposed another weakness in conventional networking: traditional routing cannot use this architecture efficiently. Standard IP networks rely on ECMP (Equal-Cost Multi-Path) to distribute traffic across paths. ECMP works by hashing fields of the packet header (the static 5-tuple) to assign a flow to a specific path. In web applications this works extremely well because traffic consists of large numbers of relatively small independent flows. AI traffic behaves differently because distributed training creates a small number of massive elephant flows. With so few flows, ECMP hashing inevitably creates collisions where multiple large flows become pinned to the same physical links while alternative paths remain underutilized. The result is buffer pressure, more congestion, packet drops, and tail latency spikes.

To counter this, modern AI switches utilize DLB (Dynamic Load Balancing) and packet-spraying mechanisms. Instead of routing by flow, the hardware breaks elephant flows apart and schedules traffic dynamically based on real-time port congestion. This is the environment that led to the emergence of the Ultra Ethernet Consortium.

The Ultra Ethernet Consortium

An open re-architecture of Ethernet for AI workloads.

InfiniBand works, but it is expensive, closed, and forces vendor lock-in. The broader ecosystem's response is the Ultra Ethernet Consortium (UEC): a comprehensive re-architecture of Ethernet designed specifically to challenge InfiniBand on AI workloads, without giving up Ethernet's vast ecosystem and economies of scale. Instead of relying on crude, per-priority pause mechanisms like PFC, Ultra Ethernet moves the intelligence to the transport layer.
It natively introduces Packet Spraying: rather than forcing an entire elephant flow down a single hashed path via ECMP, UEC switches chop the flow down to individual packets
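The difference between per-flow ECMP hashing and per-packet spraying can be sketched as follows. This is an illustration under stated assumptions, not how any particular switch works: `zlib.crc32` stands in for a hardware hash function, spraying is modeled as plain round-robin rather than congestion-aware scheduling, and the flow tuples are invented (4791 is the UDP destination port used by RoCEv2).

```python
import zlib
from collections import Counter

NUM_PATHS = 8
PACKETS_PER_FLOW = 1000


def ecmp_path(five_tuple, num_paths=NUM_PATHS):
    """Pick one path per flow by hashing the 5-tuple (stand-in hash)."""
    key = "|".join(str(field) for field in five_tuple).encode()
    return zlib.crc32(key) % num_paths


def ecmp_load(flows):
    """Packets per path when every packet of a flow follows the flow's hash."""
    load = Counter()
    for flow in flows:
        load[ecmp_path(flow)] += PACKETS_PER_FLOW
    return [load[path] for path in range(NUM_PATHS)]


def sprayed_load(flows):
    """Packets per path when packets are spread round-robin over all paths."""
    load = Counter()
    packet_no = 0
    for _ in flows:
        for _ in range(PACKETS_PER_FLOW):
            load[packet_no % NUM_PATHS] += 1
            packet_no += 1
    return [load[path] for path in range(NUM_PATHS)]


# Eight hypothetical elephant flows: (src IP, dst IP, src port, dst port, protocol).
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791, "UDP") for i in range(8)]

if __name__ == "__main__":
    print("ECMP   :", ecmp_load(flows))
    print("Sprayed:", sprayed_load(flows))
```

With only eight flows for eight paths, any hash collision leaves one link carrying two elephants while another sits idle, whereas spraying gives every path the same share by construction. The cost, not modeled here, is that sprayed packets can arrive out of order, so the receiving transport has to tolerate reordering.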

AI infrastructure, Datacenter, GPU networking, Large language models, Network bottlenecks