Hacker News • 112 days ago

Training a 100-Billion-Parameter LLM on a Single GPU

IMP

9/10

Key Summary

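MegaTrain is a newly introduced system that can train large language models (LLMs) with more than 100 billion parameters at full precision on a single GPU. It keeps parameters and optimizer states in CPU (host) memory rather than GPU memory, and uses pipelining and stateless layer templates to work around hardware limits. When training a 14-billion-parameter model, it achieves 1.84× the throughput of DeepSpeed ZeRO-3 with CPU offloading.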

Translated Text

Computer Science > Computation and Language arXiv:2604.05091 (cs) [Submitted on 6 Apr 2026]

Title: MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU
Authors: Zhengqing Yuan, Hanchi Sun, Lichao Sun, Yanfang Ye

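Abstract: This paper presents MegaTrain, a memory-centric system that efficiently trains large language models with 100B+ (over 100 billion) parameters at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats the GPU as a transient compute engine. For each layer, it streams parameters in and streams computed gradients out, minimizing the state that persists on the GPU.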
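The per-layer streaming idea can be illustrated with a toy, pure-Python sketch. This is not MegaTrain's code: the `Device` class, the scalar "layers", the squared loss, and the plain SGD update on the host are all assumptions chosen so the arithmetic is easy to check. The point is only that the device never holds more than one layer's weights at a time, while weights and gradients live on the host.

```python
from typing import List, Tuple


class Device:
    """Toy stand-in for GPU memory that counts how many layers are resident."""

    def __init__(self) -> None:
        self.resident = 0
        self.peak = 0

    def load(self, weight: float) -> float:
        self.resident += 1
        self.peak = max(self.peak, self.resident)
        return weight  # a real system would copy host -> device here

    def release(self) -> None:
        self.resident -= 1


def streamed_sgd_step(
    host_weights: List[float], x: float, lr: float, device: Device
) -> Tuple[float, List[float]]:
    """One SGD step on y = w_n * ... * w_1 * x with loss = 0.5 * y**2."""
    # Forward: stream each layer in, compute, release; keep only activations.
    activations = [x]
    for w_host in host_weights:
        w = device.load(w_host)
        activations.append(w * activations[-1])
        device.release()

    y = activations[-1]
    loss = 0.5 * y * y
    upstream = y  # dL/dy

    # Backward: stream layers in reverse and send each gradient to the host.
    host_grads = [0.0] * len(host_weights)
    for i in reversed(range(len(host_weights))):
        w = device.load(host_weights[i])
        host_grads[i] = upstream * activations[i]  # dL/dw_i, kept on the host
        upstream = upstream * w                    # dL/d(input of layer i)
        device.release()

    # The update runs on the host copy (plain SGD, an assumption for brevity).
    for i, g in enumerate(host_grads):
        host_weights[i] -= lr * g
    return loss, host_grads
```

Calling `streamed_sgd_step([2.0, 3.0], 1.0, 0.01, Device())` walks two layers forward and backward; afterwards the device's `peak` counter shows that only one layer was ever resident.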

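To address the CPU-GPU bandwidth bottleneck, the authors adopt two key optimizations. First, a pipelined, double-buffered execution engine overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, so that the GPU executes continuously without stalls. Second, persistent autograd graphs are replaced with stateless layer templates: weights are bound dynamically as parameters stream in, which eliminates persistent graph metadata while providing flexibility in scheduling.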
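Why overlapping helps can be seen in a small timing model. The sketch below is not the paper's engine and uses no CUDA. It assumes every layer has the same hypothetical prefetch, compute, and offload cost, that each of the three stages handles one layer at a time, and that a weight buffer is free for reuse once its layer's compute finishes. It then compares a fully serial schedule with a schedule that has `n_buffers` weight buffers.

```python
def serial_time(n_layers: int, t_load: float, t_compute: float, t_offload: float) -> float:
    """Total time when prefetch, compute, and offload never overlap."""
    return n_layers * (t_load + t_compute + t_offload)


def pipelined_time(
    n_layers: int,
    t_load: float,
    t_compute: float,
    t_offload: float,
    n_buffers: int = 2,
) -> float:
    """Makespan when the three stages overlap, limited by n_buffers weight buffers."""
    if n_layers == 0:
        return 0.0
    load_end = [0.0] * n_layers
    compute_end = [0.0] * n_layers
    offload_end = [0.0] * n_layers
    for i in range(n_layers):
        # The copy-in stage is busy until the previous prefetch finishes.
        prev_load = load_end[i - 1] if i >= 1 else 0.0
        # A buffer becomes free once the layer that used it has been computed.
        buffer_free = compute_end[i - n_buffers] if i >= n_buffers else 0.0
        load_end[i] = max(prev_load, buffer_free) + t_load

        # Compute needs this layer's weights and the previous layer's result.
        prev_compute = compute_end[i - 1] if i >= 1 else 0.0
        compute_end[i] = max(load_end[i], prev_compute) + t_compute

        # Gradient offload follows compute and is serialized on its own stage.
        prev_offload = offload_end[i - 1] if i >= 1 else 0.0
        offload_end[i] = max(compute_end[i], prev_offload) + t_offload
    return offload_end[-1]
```

With 4 layers and per-layer costs of 2, 3, and 1 time units, the serial schedule takes 24 units and the double-buffered one takes 15. Once transfers are hidden, the total is dominated by compute time, which is the effect the double-buffered design aims for.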

On a single H200 GPU with 1.5TB of host memory, MegaTrain reliably trains models of up to 120B parameters. When training a 14B-parameter model, it achieves 1.84× the training throughput of DeepSpeed ZeRO-3 with CPU offloading. It also makes it possible to train a 7B-parameter model with a 512k-token context on a single GH200.
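A quick back-of-the-envelope check puts the 1.5TB and 120B figures in proportion. The sketch below assumes a decimal terabyte (10^12 bytes). The abstract does not say which optimizer or tensor layout MegaTrain uses, so the two bytes-per-parameter layouts are purely hypothetical illustrations, not a description of the system.

```python
TB = 10**12  # decimal terabyte; an assumption, the source only says "1.5TB"

HOST_BYTES = 15 * TB // 10     # 1.5 TB of host memory
N_PARAMS = 120 * 10**9         # 120B parameters


def host_bytes_per_param(host_bytes: int, n_params: int) -> float:
    """Host memory budget available per parameter."""
    return host_bytes / n_params


def fits(n_params: int, bytes_per_param: int, host_bytes: int) -> bool:
    """True if a layout needing bytes_per_param per parameter fits in host memory."""
    return n_params * bytes_per_param <= host_bytes


# Hypothetical layouts (NOT taken from the paper):
FP32_WEIGHTS_PLUS_TWO_MOMENTS = 4 + 4 + 4                     # weights + two fp32 optimizer moments
WITH_PERSISTENT_FP32_GRADS = FP32_WEIGHTS_PLUS_TWO_MOMENTS + 4  # plus a full fp32 gradient copy

budget = host_bytes_per_param(HOST_BYTES, N_PARAMS)
```

Under these assumptions the budget is 12.5 bytes per parameter. A 12-byte layout fits, while a 16-byte layout that also keeps a full gradient copy does not. This gives a sense of how tight the host-memory budget is at 120B parameters.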

Subjects: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Operating Systems (cs.OS)
Cite as: arXiv:2604.05091 [cs.CL]
Submission history: Zhengqing Yuan, [v1] Mon, 6 Apr 2026 18:43:56 UTC (787 KB)

Original Text (English)

Computer Science > Computation and Language arXiv:2604.05091 (cs) [Submitted on 6 Apr 2026]

Title: MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU
Authors: Zhengqing Yuan, Hanchi Sun, Lichao Sun, Yanfang Ye

Abstract: We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state. To battle the CPU-GPU bandwidth bottleneck, we adopt two key optimizations. 1) We introduce a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, enabling continuous GPU execution. 2) We replace persistent autograd graphs with stateless layer templates, binding weights dynamically as they stream in, eliminating persistent graph metadata while providing flexibility in scheduling. On a single H200 GPU with 1.5TB host memory, MegaTrain reliably trains models up to 120B parameters. It also achieves 1.84× the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models. MegaTrain also enables 7B model training with 512k token context on a single GH200.

Subjects: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Operating Systems (cs.OS)
Cite as: arXiv:2604.05091 [cs.CL] (or arXiv:2604.05091v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2604.05091
Submission history: From: Zhengqing Yuan, [v1] Mon, 6 Apr 2026 18:43:56 UTC (787 KB)

LLM-Training Memory-Optimization GPU Open-Source Paper