Hacker News • 68 days ago

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

IMP

8/10

Key Summary

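To address the memory bottleneck that commonly arises in AI model training, the paper proposes CODA, an abstraction that fuses operations normally run as separate kernels into a single GPU kernel: a GEMM whose epilogue applies those operations to the output before it is written to memory. By minimizing data movement, the approach aims to combine framework-level productivity with hardware-level efficiency.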

Translated Text

arXiv:2605.19269 (cs) [Submitted on 19 May 2026 (v1), last revised 20 May 2026 (this version, v2)]

Title: CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
Authors: Han Guo, Jack Zhang, Arjun Menon, Driss Guessous, Vijay Thakkar, Yoon Kim, Tri Dao

Abstract: Transformer training systems are built around dense linear algebra, yet a nontrivial fraction of end-to-end time is spent on surrounding memory-bound operators. Normalization, activations, residual updates, reductions, and related computations repeatedly move large intermediate tensors through global memory while performing little arithmetic, making data movement an increasingly important bottleneck in otherwise highly optimized training stacks.
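
To make the data-movement argument concrete, the sketch below does a rough byte count for one GEMM followed by three memory-bound operators: bias add, GELU, and residual add. It assumes bf16 tensors and square 4096 shapes. It also ignores caches and the repeated operand reads inside a tiled GEMM. The operator sequence, the shapes, and the function names are illustrative assumptions, not taken from the paper.

```python
# Rough global-memory traffic model (hypothetical): Y = X @ W, then bias add,
# GELU, and residual add, all stored in bf16. Caches and the repeated operand
# reads inside a tiled GEMM are ignored.
BYTES_PER_ELEM = 2  # bf16


def unfused_bytes(m: int, n: int, k: int) -> int:
    """Every operator is its own kernel: it reads its inputs from and writes
    its result back to global memory."""
    gemm = m * k + k * n + m * n        # read X and W, write Y
    bias = m * n + n + m * n            # read Y and b, write Y1
    gelu = m * n + m * n                # read Y1, write Y2
    residual = m * n + m * n + m * n    # read Y2 and R, write Z
    return (gemm + bias + gelu + residual) * BYTES_PER_ELEM


def fused_bytes(m: int, n: int, k: int) -> int:
    """Bias, GELU and residual run on the output tile while it is on chip:
    read X, W, b and R once, write only Z."""
    return (m * k + k * n + n + m * n + m * n) * BYTES_PER_ELEM


m = n = k = 4096
u, f = unfused_bytes(m, n, k), fused_bytes(m, n, k)
print(f"unfused: {u / 2**20:.1f} MiB, fused: {f / 2**20:.1f} MiB, ratio: {u / f:.2f}x")
# unfused: 320.0 MiB, fused: 128.0 MiB, ratio: 2.50x
```

In this simplified model, fusing the three elementwise operators into the epilogue cuts traffic by roughly 2.5x. Those operators contribute only a handful of FLOPs per element, compared with the 2k FLOPs per output element of the GEMM itself.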

We introduce CODA, a GPU kernel abstraction that expresses these computations as GEMM (General Matrix Multiply)-plus-epilogue programs. CODA is based on the observation that many Transformer operators exposed as separate framework kernels can be algebraically reparameterized to execute while a GEMM output tile remains on chip, before it is written to memory. The abstraction fixes the GEMM mainloop and exposes a small set of composable epilogue primitives for scaling, reductions, pairwise transformations, and accumulation.
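
A well-known example of algebraic reparameterization is moving an RMSNorm past the linear layer that follows it. This is offered as an illustration, not as the paper's exact rewrite. For an input row x with RMSNorm gain g, rmsnorm(x)·W = (1/rms(x))·x·(diag(g)·W). The GEMM can therefore run on the raw input against a gain-folded weight, and the per-row factor 1/rms(x) becomes a scaling step applied to the output tile. The plain-Python sketch below checks this identity numerically for the forward pass only. The function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Naive matrix product of row-major lists of lists."""
    return [[sum(a_ik * b[k][j] for k, a_ik in enumerate(row))
             for j in range(len(b[0]))] for row in a]


def rms(row, eps=1e-6):
    """Root-mean-square of one row, as used by RMSNorm."""
    return math.sqrt(sum(v * v for v in row) / len(row) + eps)


def rmsnorm_then_linear(x, g, w):
    """Reference: materialize the normalized activations, then run the GEMM."""
    normed = [[v / rms(row) * gi for v, gi in zip(row, g)] for row in x]
    return matmul(normed, w)


def linear_with_rowscale_epilogue(x, g, w):
    """Reparameterized: fold the gain g into W (diag(g) @ W), run the GEMM on
    the raw x, then scale each output row by 1 / rms(x) in the 'epilogue'."""
    w_folded = [[gi * wij for wij in w_row] for gi, w_row in zip(g, w)]
    inv_rms = [1.0 / rms(row) for row in x]  # a row reduction in a real kernel
    acc = matmul(x, w_folded)                # unchanged GEMM mainloop
    return [[s * v for v in out_row] for s, out_row in zip(inv_rms, acc)]


random.seed(0)
x = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]  # 4 tokens, d=8
g = [random.uniform(0.5, 1.5) for _ in range(8)]                    # RMSNorm gain
w = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(8)]  # d=8 -> 3
ref = rmsnorm_then_linear(x, g, w)
out = linear_with_rowscale_epilogue(x, g, w)
err = max(abs(p - q) for rr, ro in zip(ref, out) for p, q in zip(rr, ro))
print(f"max abs difference: {err:.2e}")  # only floating-point rounding remains
```

The normalized activations are never materialized. Only the per-row statistic has to be available when the output tile is finished.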

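This constrained interface preserves the performance structure of expert-written GEMMs while remaining expressive enough to cover nearly all non-attention computation in the forward and backward pass of a standard Transformer block. Across representative Transformer workloads, both human- and LLM-authored CODA kernels achieve high performance, suggesting that GEMM-plus-epilogue programming offers a practical path toward combining framework-level productivity with hardware-level efficiency.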
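
The pure-Python sketch below illustrates what a fixed mainloop with composable epilogue primitives might look like. It computes each output tile of a GEMM, passes it through a list of epilogue steps, and only then stores it. The four factory functions mirror the primitive categories named in the abstract: scaling, reduction, pairwise transformation, and accumulation. Their names and signatures, and the RoPE-style rotation, are hypothetical choices made for this example, not CODA's actual interface.

```python
import math
from typing import Callable, List

Matrix = List[List[float]]
# Hypothetical epilogue signature (not CODA's real API): takes the on-chip
# output tile and its (row, col) offset, returns the transformed tile.
Epilogue = Callable[[Matrix, int, int], Matrix]


def gemm_epilogue(a: Matrix, b: Matrix, steps: List[Epilogue], tile: int = 2) -> Matrix:
    """Fixed tiled mainloop; each tile runs through `steps` before its single store."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            rows = range(i0, min(i0 + tile, m))
            cols = range(j0, min(j0 + tile, n))
            acc = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in cols] for i in rows]
            for step in steps:  # epilogue: the tile is never stored in between
                acc = step(acc, i0, j0)
            for di, i in enumerate(rows):
                out[i][j0:j0 + len(cols)] = acc[di]
    return out


def scale_rows(s: List[float]) -> Epilogue:
    """Scaling: multiply output row i by s[i]."""
    return lambda t, i0, j0: [[v * s[i0 + di] for v in row] for di, row in enumerate(t)]


def accumulate(r: Matrix) -> Epilogue:
    """Accumulation: add a residual tensor element-wise."""
    return lambda t, i0, j0: [[v + r[i0 + di][j0 + dj] for dj, v in enumerate(row)]
                              for di, row in enumerate(t)]


def row_sumsq(sink: List[float]) -> Epilogue:
    """Reduction: add each row's sum of squares into `sink` (a side output,
    e.g. statistics for a following RMSNorm); the tile passes through."""
    def step(t, i0, j0):
        for di, row in enumerate(t):
            sink[i0 + di] += sum(v * v for v in row)
        return t
    return step


def rotate_pairs(d: int, base: float = 10000.0) -> Epilogue:
    """Pairwise transformation: rotate columns (j, j+1), j even, by the angle
    pos * base**(-j/d), RoPE-style. Assumes even tile widths and offsets."""
    def step(t, i0, j0):
        new = []
        for di, row in enumerate(t):
            out_row = []
            for c in range(0, len(row), 2):
                ang = (i0 + di) * base ** (-(j0 + c) / d)
                x, y = row[c], row[c + 1]
                out_row += [x * math.cos(ang) - y * math.sin(ang),
                            x * math.sin(ang) + y * math.cos(ang)]
            new.append(out_row)
        return new
    return step


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]          # 3 tokens, k = 2
b = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]  # k = 2, n = 4
r = [[1.0] * 4 for _ in range(3)]                 # residual stream
sink = [0.0] * 3
z = gemm_epilogue(a, b, [accumulate(r), row_sumsq(sink)])
print(z)     # [[2.0, 3.0, 5.0, 8.0], [4.0, 5.0, 11.0, 16.0], [6.0, 7.0, 17.0, 24.0]]
print(sink)  # [102.0, 418.0, 950.0]
```

The reduction has to accumulate across column tiles, because a row's statistics are only complete once every tile in that row has been processed. In a real parallel kernel this typically calls for atomics or a cross-tile combine step. This sequential sketch sidesteps that issue.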

Subjects: Machine Learning (cs.LG). Cite as: arXiv:2605.19269 [cs.LG] (or arXiv:2605.19269v2 [cs.LG] for this version), https://doi.org/10.48550/arXiv.2605.19269

Submission history: from Han Guo. [v1] Tue, 19 May 2026 02:30:43 UTC (1,121 KB); [v2] Wed, 20 May 2026 17:38:24 UTC (493 KB)

Machine Learning, GPU Optimization, Kernel Development, Transformers, AI Hardware