Hacker News • 33 days ago

A breakthrough in distributed AI training: Decoupled DiLoCo announced

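Key summary

A new distributed architecture, Decoupled DiLoCo, has been introduced for training large AI models across distant data centers. It sharply reduces the communication bandwidth required while staying resilient to hardware failures, allowing globally distributed pre-training to run more than 20 times faster than conventional synchronization methods. Its significance is that it opens the door to a new kind of AI infrastructure that can flexibly use idle compute spread around the world.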

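Translated text

April 23, 2026 | Research

Decoupled DiLoCo: A new frontier for resilient, distributed AI training

Arthur Douillard and the DiLoCo team

Our new distributed architecture helps train large language models (LLMs) across distant data centers, with lower bandwidth and greater hardware resiliency.

Training a frontier AI model has traditionally depended on a large, tightly coupled system in which identical chips must stay in near-perfect synchronization. This approach is highly effective for today's state-of-the-art models, but as we look toward the next generation of scale, maintaining this level of synchronization across thousands of chips becomes a significant logistical and operational challenge. In a new paper published today, we are excited to share a new approach to this problem, called Decoupled DiLoCo (Distributed Low-Communication).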

The architecture divides a large training run across decoupled "islands" of compute, with data flowing asynchronously between them. This isolates local disruptions so that the rest of the system can keep learning efficiently. The result is a more resilient and flexible way to train advanced models across globally distributed data centers. Crucially, Decoupled DiLoCo does not suffer the communication delays that made earlier distributed methods such as Data-Parallel impractical at global scale.
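The island idea can be illustrated with a toy sketch. This is not the authors' implementation: the scalar objective, the function names, the learning rates, and the round-based schedule are all assumptions made for illustration, and the rounds run one after another here rather than asynchronously. Each island takes many local steps from the shared parameters and only then sends back a single update, so the number of cross-island messages is far smaller than with per-step synchronization.

```python
import random


def local_steps(theta, target, steps, lr, rng):
    """Run `steps` noisy gradient steps on f(x) = 0.5 * (x - target)**2."""
    for _ in range(steps):
        grad = (theta - target) + rng.gauss(0.0, 0.01)
        theta -= lr * grad
    return theta


def train_islands(num_islands=4, rounds=20, inner_steps=50, seed=0):
    """Toy low-communication training: one exchange per island per round."""
    rng = random.Random(seed)
    target = 3.0          # hypothetical optimum of the toy objective
    global_theta = 0.0    # shared parameters (a single scalar here)
    messages = 0
    for _ in range(rounds):
        deltas = []
        for _ in range(num_islands):
            local = local_steps(global_theta, target, inner_steps, 0.05, rng)
            deltas.append(global_theta - local)  # what the island sends back
            messages += 1
        outer_grad = sum(deltas) / len(deltas)
        global_theta -= 1.0 * outer_grad  # outer rate 1.0 is plain averaging
    return global_theta, messages


theta, messages = train_islands()
per_step_messages = 4 * 20 * 50  # cost if every local step were synchronized
print(f"theta={theta:.3f}, messages={messages} vs {per_step_messages}")
```

With these toy settings the islands exchange 80 messages in total, where synchronizing after every step would need 4,000. A real system would exchange full parameter tensors and use proper inner and outer optimizers.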

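As frontier models continue to grow in scale and complexity, we are exploring diverse approaches to training models across more compute, more locations, and varied hardware.

Developing more fault-tolerant asynchronous training at scale

Decoupled DiLoCo builds on two earlier advances. The first is Pathways, which introduced a distributed AI system based on asynchronous data flow. The second is DiLoCo, which dramatically reduced the bandwidth required between distributed data centers and made it practical to train large language models across distant locations. Decoupled DiLoCo brings these ideas together to train AI models more flexibly at scale.

Built on top of Pathways, it enables asynchronous training across separate islands of compute (known as learner units), so a chip failure in one area does not interrupt the progress of the others. The infrastructure is also self-healing. In testing, we used a method called "chaos engineering" to introduce artificial hardware failures during training runs. Decoupled DiLoCo kept training after the loss of entire learner units and then seamlessly reintegrated them when they came back online.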
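A minimal sketch of the failure behavior described above, under stated assumptions: the outage schedule, the function name `train_with_outages`, and the toy scalar objective are hypothetical and do not come from the paper. Real chaos tests inject faults at random, but a fixed schedule keeps the example reproducible. The aggregation step averages over whichever learner units reported in a round, and a unit that comes back simply restarts from the current shared parameters.

```python
def train_with_outages(outages, num_learners=4, rounds=30, inner_steps=20):
    """Toy training loop that keeps going while some learner units are down.

    `outages` maps a learner index to the rounds in which it is offline.
    """
    target, global_theta = 3.0, 0.0
    live_counts = []
    for r in range(rounds):
        live = [i for i in range(num_learners) if r not in outages.get(i, ())]
        live_counts.append(len(live))
        if not live:
            continue  # nobody reported: keep the shared state and wait
        deltas = []
        for _ in live:
            theta = global_theta  # rejoining units start from shared state
            for _ in range(inner_steps):
                theta -= 0.05 * (theta - target)
            deltas.append(global_theta - theta)
        # Average only over the units that actually reported this round.
        global_theta -= sum(deltas) / len(deltas)
    return global_theta, live_counts


# Hypothetical fault schedule: unit 1 is down for rounds 5-14, unit 2 for 10-11.
theta, live_counts = train_with_outages({1: range(5, 15), 2: range(10, 12)})
print(f"theta={theta:.6f}, fewest live units in a round={min(live_counts)}")
```

In this toy every learner sees the same objective, so losing a unit changes nothing about the result. In a real run each unit would process different data, and fewer live units would mean less data per round.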

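Tests with Gemma 4 models showed that, when hardware fails, the system maintains much greater availability of the learning clusters than traditional training methods, while ultimately reaching the same level of machine learning (ML) benchmark performance.

Decoupled DiLoCo is not only more resilient to failures but also practical for running production-level, fully distributed pre-training. We successfully trained a 12 billion (12B) parameter model across four U.S. regions using 2-5 Gbps of wide-area networking. This level of bandwidth is relatively achievable with existing internet connectivity between data center facilities, without building new custom network infrastructure between them.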
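A back-of-the-envelope calculation shows why this bandwidth range rules out frequent full synchronization. The 2 bytes per parameter (16-bit values) and the idea of sending every parameter once are assumptions for illustration, not details taken from the source. What the system actually transmits may well be smaller.

```python
def transfer_seconds(num_params, bytes_per_param, gbps):
    """Time to send one full copy of the parameters over a link of `gbps`."""
    bits = num_params * bytes_per_param * 8
    return bits / (gbps * 1e9)


NUM_PARAMS = 12_000_000_000  # 12B parameters, from the article
BYTES_PER_PARAM = 2          # assumption: 16-bit values

slow = transfer_seconds(NUM_PARAMS, BYTES_PER_PARAM, 2)  # 96.0 seconds
fast = transfer_seconds(NUM_PARAMS, BYTES_PER_PARAM, 5)  # 38.4 seconds
print(f"2 Gbps: {slow:.1f} s, 5 Gbps: {fast:.1f} s per full exchange")
```

Under these assumptions a single full exchange takes tens of seconds, so exchanging after every training step would leave the chips idle most of the time.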

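Notably, the system achieved this training result more than 20 times faster than conventional synchronization methods. This is because it folds the required communication into longer periods of computation, avoiding the "blocking" bottlenecks in which one part of the system must wait for another.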
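The difference between blocking and overlapped communication can be written as a simple timing model. Every number below is hypothetical and chosen only to make the arithmetic easy to follow. None of them come from the paper, and the model is not meant to reproduce the reported speedup.

```python
def blocking_round_seconds(inner_steps, step_seconds, sync_seconds):
    """Compute for `inner_steps`, then stop and wait for the exchange."""
    return inner_steps * step_seconds + sync_seconds


def overlapped_round_seconds(inner_steps, step_seconds, sync_seconds):
    """The exchange runs in the background while the next steps compute."""
    return max(inner_steps * step_seconds, sync_seconds)


STEP_SECONDS = 1.0   # hypothetical compute time per training step
SYNC_SECONDS = 96.0  # hypothetical time for one cross-region exchange

# Exchange after every step, waiting each time: 97.0 seconds per step.
per_step_blocking = blocking_round_seconds(1, STEP_SECONDS, SYNC_SECONDS)

# Exchange once per 200 steps, hidden behind compute: 1.0 second per step.
per_step_overlapped = overlapped_round_seconds(200, STEP_SECONDS, SYNC_SECONDS) / 200

print(per_step_blocking, per_step_overlapped)
```

In this model the exchange is fully hidden once the computation between exchanges takes at least as long as the exchange itself. Below that point the link becomes the bottleneck again.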

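Driving the evolution of AI training infrastructure

At Google, we take a full-stack approach to AI training, spanning hardware, software infrastructure, and research. Increasingly, the biggest gains come from rethinking how these layers fit together. Decoupled DiLoCo is one example. By enabling training jobs at internet-scale bandwidth, it can tap any idle compute and turn stranded resources into useful capacity. Beyond efficiency and resilience, this training paradigm also unlocks the ability to mix heterogeneous hardware.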

Original text (English)

April 23, 2026 | Research

Decoupled DiLoCo: A new frontier for resilient, distributed AI training

Arthur Douillard and the DiLoCo team

Our new distributed architecture helps to train LLMs across distant data centers - with lower bandwidth and more hardware resiliency.

Training a frontier AI model traditionally depends on a large, tightly coupled system in which identical chips must stay in near-perfect synchronization. This approach is highly effective for today’s state-of-the-art models, but as we look toward future generations of scale, maintaining this level of synchronization across thousands of chips becomes a significant logistical challenge. Today, in a new paper we are excited to share a new approach to this problem, called Decoupled DiLoCo (Distributed Low-Communication).

By dividing large training runs across decoupled “islands” of compute, with asynchronous data flowing between them, this architecture isolates local disruptions so that other parts of the system can keep learning efficiently. The result is a more resilient and flexible way to train advanced models across globally distributed data centers. And crucially, Decoupled DiLoCo does not suffer the communication delays that made previous distributed methods like Data-Parallel impractical at global scale.

As frontier models continue to grow in scale and complexity, we’re exploring diverse approaches to train models across more compute, locations and varied hardware.

Developing more fault-tolerant asynchronous training at scale

Decoupled DiLoCo builds on two earlier advances: Pathways, which introduced a distributed AI system based on asynchronous data flow, and DiLoCo, which dramatically reduced the bandwidth required between distributed data centers, making it practical to train large language models across distant locations. Decoupled DiLoCo brings those ideas together to train AI models more flexibly at scale.

Built on top of Pathways, it enables asynchronous training across separate islands of compute (known as learner units) so that a chip failure in one area doesn’t interrupt the progress of the others. This infrastructure is also self-healing. In testing, we used a method called “chaos engineering” to introduce artificial hardware failures during training runs. Decoupled DiLoCo continued the training process after the loss of entire learner units, and then seamlessly reintegrated them when they came back online.

Testing Decoupled DiLoCo with Gemma 4 models demonstrated that, when hardware fails, the system maintains greater availability of learning clusters than more traditional training methods, while ultimately delivering the same benchmarked level of machine learning (ML) performance.

Decoupled DiLoCo is not only more resilient to failures, but is also practical for executing production-level, fully distributed pre-training. We successfully trained a 12 billion parameter model across four separate U.S. regions using 2-5 Gbps of wide-area networking (a level relatively achievable using existing internet connectivity between datacenter facilities, rather than requiring new custom network infrastructure between facilities).

Notably, the system achieved this training result more than 20 times faster than conventional synchronization methods. This is because our system incorporates required communication into longer periods of computation, avoiding the "blocking" bottlenecks where one part of the system must wait for another.

Driving the evolution of AI training infrastructure

At Google, we take a full-stack approach to AI training, spanning hardware, software infrastructure and research. Increasingly, gains are coming from rethinking how these layers fit together. Decoupled DiLoCo is one example. By enabling training jobs at internet-scale bandwidth, it can tap any unused compute wherever it sits, turning stranded resources into useful capacity.

Beyond efficiency and resilience, this training paradigm also unlocks the ability to mix different hardware generations, such as TPU v6e and TPU v5p, in a single training run. This approach not only extends the useful life of existing hardware, but also increases the total compute available for model training. In our experiments, chips from different generations running at different speeds still matched the ML performance of single-chip-type training runs, ensuring that even older hardware can meaningfully accelerate AI training.

What’s more, because new generations of hardware don’t arrive everywhere all at once, being able to train across generations can alleviate recurring logistical and capacity bottlenecks.

As we push the frontiers of AI infrastructure today, we’re continuing to explore approaches to resilient systems needed to unlock the next generation of AI.

Read our technical report

Acknowledgements

This work was done by a team of members across Google DeepMind and Google Research. The leads and core contributors behind Decoupled DiLoCo are Arthur Douillard, Keith Rush, Yani Donchev, Zachary Charles, Ayush Dubey, Blake Woodworth, Ionel Gog, Josef Dean, Nova Fallen, Zachary Garrett. Operational support was done by Nate Keating and Jenny Bishop. We are also grateful for the additional support and advising from Jeff Dean, Marc’Aurelio Ranzato, Raia Hadsell, Arthur Szlam, Edouard Yvinec, Henry Prior, Paul Barham, Michael Isard, Daniel Ramage, Brendan McMahan, Chase Hensel, and Zoltan Egyed.

distributed-training infrastructure Google large-language-models system-architecture