The Decoder • 83일 전

오픈AI, 빅테크와 'MRC' 개발로 슈퍼컴 병목 해결

IMP

8/10

핵심 요약

오픈AI가 AMD, 브로드컴, 인텔, 마이크로소프트, 엔비디아와 협력하여 대규모 AI 슈퍼컴퓨터의 데이터 전송 병목 현상을 해결하기 위한 새로운 네트워크 프로토콜 'MRC(Multipath Reliable Connection)'를 개발했습니다. 이 프로토콜은 패킷을 수백 개의 경로로 동시에 분산시켜 전송 속도를 높이고 장애 발생 시 마이크로초 단위로 복구하여 모델 학습의 안정성을 극대화합니다. MRC는 이미 오픈AI의 최대 규모 슈퍼컴퓨터에 적용되어 실제 프론티어 모델 학습에 사용 중이며, 오픈 컴퓨트 프로젝트(OCP)를 통해 사양이 공개되었습니다.

번역된 본문

오픈AI, AMD, 브로드컴, 인텔, 마이크로소프트, 엔비디아와 협력하여 AI 슈퍼컴퓨터 병목 현상 해결을 위한 네트워크 프로토콜 개발 (작성자: Matthias Bastian, 2026년 5월 6일)

오픈AI는 AMD, 브로드컴(Broadcom), 인텔, 마이크로소프트, 엔비디아(NVIDIA)와 협력하여 'MRC(Multipath Reliable Connection)'라는 새로운 네트워크 프로토콜을 개발했습니다. MRC는 대규모 AI 슈퍼컴퓨터 내 GPU 간 데이터 전송을 더 빠르고 예측 가능하며 안정적으로 만들도록 설계되었습니다. 이는 거대한 AI 모델을 학습시키기 위한 핵심 요구 사항입니다.

MRC는 각 전송을 단일 네트워크 경로로 보내는 대신, 패킷을 수백 개의 경로로 동시에 분산시켜 네트워크 코어의 혼잡을 줄입니다. 네트워크 경로, 링크 또는 스위치에 장애가 발생하면, MRC는 문제를 감지하고 마이크로초 단위로 우회 경로를 설정할 수 있습니다. 오픈AI에 따르면, 기존의 네트워크 패브릭은 장애 발생 후 안정화되는 데 수 초에서 수십 초가 걸릴 수 있습니다.

이를 통해 이전에는 학습을 중단시키거나 지연시켰을 네트워크 장애 및 유지 보수 작업이 발생하더라도 학습이 중단 없이 계속 진행될 수 있습니다. 오픈AI는 MRC의 다중 플레인(Multi-plane) 네트워크 설계 덕분에 기존 800Gb/s 네트워크에 필요한 3~4계층의 이더넷 스위치 대신 단 2계층의 스위치만 사용하여 10만 개 이상의 GPU를 연결할 수 있다고 밝혔습니다. 이는 전력 소비, 부품 수 및 전체적인 네트워크 비용을 줄여줍니다.

MRC는 이미 오픈AI의 최대 규모 슈퍼컴퓨터에 배포되어 있습니다.

MRC는 이미 텍사스주 애빌린에 위치한 오라클 클라우드 인프라(OCI) 사이트와 마이크로소프트의 Fairwater 슈퍼컴퓨터를 포함하여, 프론티어 모델 학습에 사용되는 오픈AI의 가장 큰 엔비디아 GB200 슈퍼컴퓨터 전체에 이미 배포되어 있습니다.

오픈AI는 최근 ChatGPT 및 Codex를 위한 프론티어 모델을 학습하는 동안 1계층(Tier-1) 스위치 4대를 재부팅해야 했다고 밝혔습니다. MRC 덕분에 회사는 클러스터에서 학습 작업을 실행하는 팀과 재부팅 일정을 조율할 필요가 없었습니다.

MRC 사양은 관련 연구 논문과 함께 오늘 오픈 컴퓨트 프로젝트(OCP)를 통해 공개되었습니다. 오픈AI 외에도 AMD, 브로드컴, 인텔, 마이크로소프트, 엔비디아가 이번 개발에 기여했습니다.

원문 보기

원문 보기 (영어)

OpenAI built a networking protocol with AMD, Broadcom, Intel, Microsoft, and NVIDIA to fix AI supercomputer bottlenecks Matthias Bastian View the LinkedIn Profile of Matthias Bastian May 6, 2026 Nano Banana Pro prompted by THE DECODER Ask about this article… Search OpenAI has teamed up with AMD, Broadcom, Intel, Microsoft, and NVIDIA to develop a new network protocol called MRC, short for Multipath Reliable Connection. MRC is designed to make data transfers between GPUs in large AI supercomputers faster, more predictable, and more resilient, a key requirement for training large AI models. Instead of sending each transfer over a single network path, MRC spreads packets across hundreds of paths simultaneously, reducing congestion in the core of the network. When network paths, links, or switches fail, MRC can detect the problem and route around it on a microsecond timescale. Conventional network fabrics can take seconds or even tens of seconds to stabilize after failures, according to OpenAI . Ad This helps training runs continue through network failures and maintenance events that would previously have disrupted or stalled them. OpenAI says MRC’s multi-plane network design can connect more than 100,000 GPUs using only two tiers of Ethernet switches, instead of the three or four tiers required by conventional 800 Gb/s networks. That reduces power consumption, component count, and overall network cost. Ad DEC_D_Incontent-1 MRC already runs on OpenAI's largest supercomputers MRC is already deployed across all of OpenAI's largest NVIDIA GB200 supercomputers used to train frontier models, including its Oracle Cloud Infrastructure site in Abilene, Texas, and Microsoft's Fairwater supercomputers. During the training of a recent frontier model for ChatGPT and Codex, OpenAI says it had to reboot four tier-1 switches. With MRC, the company did not need to coordinate the reboot with the teams running training jobs in the cluster. Ad The MRC specification was published today through the Open Compute Project (OCP) , along with an accompanying research paper . Besides OpenAI, AMD , Broadcom , Intel, Microsoft , and Nvidia contributed to the development. Ad DEC_D_Incontent-2 AI News Without the Hype – Curated by Humans Subscribe to THE DECODER for ad-free reading, a weekly AI newsletter, our exclusive "AI Radar" frontier report six times a year, full archive access, and access to our comment section. Subscribe now Source: OpenAI

오픈AI 네트워크 프로토콜 슈퍼컴퓨터 MRC 인프라