Hacker News • 79 days ago

Training an LLM in Swift: Optimizing Matrix Operations

IMP

7/10

Key Summary

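This article covers writing matrix multiplication code from scratch, and optimizing it as far as possible, in order to train a Large Language Model (LLM) in Swift on Apple Silicon without external frameworks. The author ports Andrej Karpathy's llm.c project to Swift and experiments with the different compute units on Apple Silicon (CPU, SIMD, AMX and GPU) to make it faster than the existing C implementation. Along the way it shows the key techniques for optimizing ML math in Swift and gives a sense of the hardware performance limits of Apple devices.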



Original text (English)

In this article, I try to get my own handwritten matrix multiplication code running as fast as possible for training a Large Language Model (LLM) in Swift. The aim is to give some insight into the key steps for optimizing mathematics code in Swift. I also hope that these examples will offer a sense of scale about the capabilities of the different units on Apple Silicon – CPU, SIMD, AMX and GPU.

This will be the first in a series where I look at training neural networks in Swift on Apple Silicon. Future articles will look at the maybe-too-many frameworks Apple offer for machine learning on the Mac. Those established frameworks are what you should really use for matrix multiplication and machine learning (they’ve spent a few more years optimizing matrix kernels than I have). But until then, I’m having fun writing everything for myself in a “no frameworks, no libraries” plain code approach.

And I’m not just writing matrix multiplication kernels. The sample app will use these kernels as part of a full LLM implementation and the numbers I’ll quote will be for entire forward and backward training iterations.

The reference implementation for this series will be Andrej Karpathy’s llm.c (a plain C implementation of a GPT2-compatible model). It’s a fairly basic model but it does contain all the necessary components and is representative of real-world workloads. That means it’s time for my favorite game: optimize Swift until it’s faster than C.

Backstory

About two years ago, I dug up my engineering thesis from the early 2000s. It’s an image recognizer written in C++ that uses a neural network for classifying images. I wanted to get my old code running again but I hadn’t worked on ML code in a long time. It got annoying and I gave up.

For all the discussion around LLMs in early 2024, it felt like no one was training neural networks on the Mac. At least, not in languages like Swift.
I played with some Python libraries like PyTorch and TensorFlow but Python never does the calculations itself – it operates more like an orchestrator of another computational engine under the hood – and the separation left me feeling like I wasn’t in control.

A month later, Andrej Karpathy released llm.c. This reached me in a way that other machine learning content didn’t because nothing is hidden. It is around 1000 lines of plain C and (although it’s filled with some pretty cryptic variable names) it’s relatively readable.

So naturally, I immediately rewrote it in Swift. And it was a lot of fun to play with.

Of course, playing with the code required some work to make it run fast. Some foreshadowing, here: the initial Swift implementation was really super slow. But optimization is a constant process: there’s always something more you can try.

Which finally brings me to this article: I’m going to walk through the different explorations I wrote then (and a couple I’ve added in the last week) to make an LLM train fairly quickly without resorting to using a library. Most of the code will be in Swift (although I’ll show a Metal implementation at the end).

By the way, I will not be explaining how a neural network or an LLM works. If you’re interested, Karpathy’s video “Let’s build GPT: from scratch, in code, spelled out.” is practically the definitive guide to learning how GPT-like LLMs work and his earlier series starting with “The spelled-out intro to language modeling: building makemore” covers plenty of introductory concepts in a 5 video series if you want a more introductory lesson. Of course, both are in Python, so please come back here when you’re ready to see how we can do things in Swift.

llm.c

Machine learning is essentially the application of model weights to input data (called the forward pass, a.k.a. inference), then the calculation of error gradients and an update to those weights (the backward pass).
We typically package these calculations together and try to make them run as fast as possible. These packages of operations might be called: “linear tensor projection”, “matrix multiplication”, or even a series of “vector dot products” (depending on how big or small you slice the units of work). It’s ultimately a loop that performs z += x * y a lot of times.

Since these matrix multiplications represent so much of the work in machine learning, I’m going to focus on the code that does this. I will be updating the rest of the implementation as I go, but only using the same improvements I’m showing to matrix multiplication.

Let’s start by looking at the matmul_forward from llm.c which is the core matrix multiplication used on the forward pass. It iterates over the input (inp), multiplies by model weights (weight), and adds the result to the running total (val).

    void matmul_forward(float* out,
                        const float* inp, const float* weight, const float* bias,
                        int B, int T, int C, int OC) {
        for (int b = 0; b < B; b++) {
            for (int t = 0; t < T; t++) {
                int bt = b * T + t;
                for (int o = 0; o < OC; o++) {
                    float val = (bias != NULL) ? bias[o] : 0.0f;
                    for (int i = 0; i < C; i++) {
                        val += inp[bt * C + i] * weight[o * C + i];
                    }
                    out[bt * OC + o] = val;
                }
            }
        }
    }

The four layers of loops add some visual complexity but in reality, that val += inp[bt * C + i] * weight[o * C + i]; line is the heart of a neural network. Like I said: z += x * y a lot.

How much? The val line contains 2 floating point operations but Karpathy says the number of floating point operations in a full training iteration should be roughly 6 x N x D where N is the number of weights in the model (124,439,808 in our case) and D is B * T = 4 * 64 = 256 for our app. So we’re talking about 6 x 124,439,808 x 256 ≈ 1.911×10¹¹ ≈ 0.2 trillion floating point operations per training iteration.

So it’s got to run quick.
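The estimate above can be reproduced in a few lines of Swift. This is only a minimal sketch, not code from the article or from llm.c: the function name estimatedTrainingFlops and its parameter labels are hypothetical, and it encodes nothing more than the 6 x N x D rule of thumb quoted above.

```swift
/// Rough floating point operations per training iteration, using the
/// 6 x N x D rule of thumb (N = number of model weights, D = B * T tokens).
/// Hypothetical helper for illustration; not part of llm.c or the sample app.
func estimatedTrainingFlops(parameterCount: Int, batchSize: Int, sequenceLength: Int) -> Int {
    let tokensPerIteration = batchSize * sequenceLength  // D = B * T
    return 6 * parameterCount * tokensPerIteration
}

// Values quoted in the text: N = 124,439,808, B = 4, T = 64.
let flops = estimatedTrainingFlops(parameterCount: 124_439_808, batchSize: 4, sequenceLength: 64)
print(flops)  // 191139545088, i.e. about 1.911e11
```

Multiplying this figure by a measured rate of training iterations per second gives an approximate achieved flop/s for any of the implementations compared in this article.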
Model    Tokens/s    Training iterations/s
llm.c    0.92        0.174

The plain C code runs easily in a Swift Package. I’ve fixed the C implementation to always run at -O3 optimization level (regardless of Xcode settings). Even at this optimization level, the C implementation manages just one training iteration every 7 seconds and inference at less than 1 token per second. A wonderful proof of concept but 10 times slower than would ever be useful.

Basic Swift

I’ve tried my best to keep the basic Swift version as true to the C version as possible:

    static func matmul_forward(
        out: inout [Float],
        inp: [Float], weight: [Float], bias: [Float]?,
        B: Int, T: Int, C: Int, OC: Int
    ) {
        for b in 0..<B {
            for t in 0..<T {
                let bt = b * T + t
                for o in 0..<OC {
                    var value = bias?[o] ?? 0
                    for i in 0..<C {
                        value += inp[bt * C + i] * weight[o * C + i]
                    }
                    out[bt * OC + o] = value
                }
            }
        }
    }

Since the C code is inherently “unsafe”, I went ahead and gave the Swift code the same advantage by setting it to run with -remove-runtime-asserts (removing the runtime checking on array indices) and made sure to always run the app in “Release” configuration. So the Swift and C implementations should be fairly equivalent, right?

Don’t run in Debug. I will only be quoting Release configuration numbers. While I have run sections of this in Debug, I’ve never waited around for a full 20 iteration training run in Debug. I usually keep the Scheme in Xcode set to “Release” – even during debugging.

If you read the backstory, I’ve already mentioned: this was “extremely slow”.

Model          Tokens/s    Training iterations/s    Training versus llm.c
llm.c          0.926       0.175                    100%
Basic Swift    0.054       0.014                    7.3%

The Swift code is between 15 and 20 times slower. That’s an LLM producing 1 token every 19 seconds. Running 20 training iterations on this engine takes nearly 30 minutes. What on Earth is going on?

This performance represents about 2.8 Gflop/

Swift • LLM training • Apple Silicon • Performance optimization • llm.c