Hacker News • 67 days ago

Optimizing Deep Learning Speed from First Principles

IMP

8/10

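Key summary

This article presents a way to improve deep learning model performance by reasoning from first principles. It breaks the efficiency of deep learning workloads into three components (compute, memory, and overhead) and argues that identifying which bottleneck a system is currently in is essential both for avoiding unnecessary optimizations and for getting the most out of the GPU.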

Translated text

Original title: Making Deep Learning Go Brrrr from First Principles

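Do you want to improve the performance of your deep learning model? How would you approach such a task? People often fall back on a grab-bag of tips that worked before or that they saw on Twitter: "Use in-place operations! Set gradients to None! Install PyTorch 1.10.0, not 1.10.1!" It is understandable that users take this ad-hoc approach, because performance optimization on modern systems (particularly in deep learning) often feels as much like alchemy as science.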

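Even so, reasoning from first principles can rule out a large share of possible approaches, which makes the problem far easier to solve. For example, getting good performance on a dataset with deep learning also involves a lot of guesswork. But if your training loss is much lower than your test loss, you are in the "overfitting" regime, and trying to increase the capacity of your model is a waste of time. Conversely, if your training loss is identical to your validation loss, trying to regularize your model is a waste of time.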

Similarly, the efficiency of a deep learning system can be understood in terms of the following 3 components.

Compute: Time the GPU spends performing actual floating-point operations (FLOPS)
Memory: Time spent transferring tensors within the GPU
Overhead: Everything else

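Just as with training machine learning models, knowing which regime you are in lets you focus only on the optimizations that matter. For example, if you are spending all of your time on memory transfers (that is, you are memory-bandwidth bound), increasing the FLOPS of your GPU will not help at all. On the other hand, if you are spending all of your time on huge matrix multiplications (matmuls) (that is, you are compute bound), rewriting your model logic in C++ to reduce overhead will not help.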
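
One rough way to make the three regimes concrete is to estimate how long each component would take in isolation and see which one dominates. The sketch below is only an illustration under stated assumptions: the two peak FLOPS figures are the ones quoted in this article, while the bandwidth and overhead values, the function name `classify_regime`, and the "one FLOP per element" cost for a pointwise op are hypothetical placeholders rather than measurements.

```python
# Peak figures. 312e12 and 19.5e12 are quoted in the article; the bandwidth
# and overhead values are hypothetical placeholders, not measured numbers.
MATMUL_PEAK_FLOPS = 312e12     # FLOPS on the matrix-multiply path
GENERAL_PEAK_FLOPS = 19.5e12   # FLOPS for everything that is not a matmul
ASSUMED_BANDWIDTH = 1.5e12     # bytes/s (assumption)
ASSUMED_OVERHEAD = 10e-6       # seconds per operation (assumption)


def classify_regime(flops, bytes_moved, peak_flops,
                    bandwidth=ASSUMED_BANDWIDTH, overhead_s=ASSUMED_OVERHEAD):
    """Return (dominant_regime, per_component_seconds) for one operation."""
    times = {
        "compute": flops / peak_flops,       # time spent on arithmetic
        "memory": bytes_moved / bandwidth,   # time spent moving tensors
        "overhead": overhead_s,              # everything else
    }
    regime = max(times, key=times.get)
    return regime, times


# A large half-precision matmul: 2*n^3 FLOPs, three n x n matrices of 2 bytes.
n = 8192
big_matmul = classify_regime(2 * n**3, 3 * n * n * 2, MATMUL_PEAK_FLOPS)

# A large pointwise op on float32 data: read and write every element once.
# Counting one FLOP per element is a simplification.
elems = 1 << 26
big_pointwise = classify_regime(elems, 2 * elems * 4, GENERAL_PEAK_FLOPS)

# The same pointwise op on a tiny tensor.
tiny_pointwise = classify_regime(1024, 2 * 1024 * 4, GENERAL_PEAK_FLOPS)

for name, (regime, times) in [("big matmul", big_matmul),
                              ("big pointwise", big_pointwise),
                              ("tiny pointwise", tiny_pointwise)]:
    print(f"{name}: {regime}-bound")
```

Under these assumed figures the three cases land in the compute, memory, and overhead regimes respectively. Real hardware overlaps these costs, so treat the result as a first guess to confirm with a profiler.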

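So, if you want to keep your GPUs running at top speed (going brrrr), let's discuss the three components your system might be spending time on: compute, memory bandwidth, and overhead.

Note: This article mostly uses GPUs and PyTorch as examples (because I work on the PyTorch team), but these principles apply to almost any hardware and framework.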

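Compute

One perspective on optimizing deep learning systems is to maximize the time spent in the compute-bound regime. You paid for 312 teraflops, so ideally you should use all 312 teraflops. But to get your money's worth out of expensive matrix multiplication, you have to reduce the time spent everywhere else.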

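So why focus on maximizing compute rather than memory bandwidth? The reason is simple: overhead and memory costs can be reduced, but unless you change the operations you actually perform, the amount of computation required (for the most part) cannot be.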

What makes maximizing compute utilization hard is that compute performance grows much faster than memory bandwidth. A table comparing the doubling time of CPU FLOPS with the doubling time of memory bandwidth shows this.

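Compute can be thought of as a factory. To keep the factory running efficiently (compute), we send it instructions (overhead) and supply it with materials (memory bandwidth). So if the factory's efficiency grows faster than the rate at which we can supply materials, it becomes harder for the factory to reach peak efficiency.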

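This difficulty in utilizing compute not only guarantees permanent job security for machine learning systems engineers, it also makes understanding your system's bottlenecks even more important.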

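One more note on FLOPS: modern machine learning accelerators have hardware specialized for matrix multiplication, such as Nvidia's "Tensor Cores". So if you are not doing matrix multiplication, you can reach only 19.5 teraflops instead of the stated 312 teraflops. This is not unique to GPUs. In fact, TPUs are far less general-purpose than GPUs. The fact that GPUs are overwhelmingly...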

Original text (English)

Making Deep Learning Go Brrrr From First Principles

So, you want to improve the performance of your deep learning model. How might you approach such a task? Often, folk fall back on a grab-bag of tricks that might've worked before or that they saw in a tweet. "Use in-place operations! Set gradients to None! Install PyTorch 1.10.0 but not 1.10.1!"

It's understandable why users often take such an ad-hoc approach: performance on modern systems (particularly deep learning) often feels as much like alchemy as it does science. That being said, reasoning from first principles can still eliminate broad swathes of approaches, thus making the problem much more approachable.

For example, getting good performance on a dataset with deep learning also involves a lot of guesswork. But, if your training loss is way lower than your test loss, you're in the "overfitting" regime, and you're wasting your time if you try to increase the capacity of your model. Or, if your training loss is identical to your validation loss, you're wasting your time if you try to regularize your model.

Similarly, you can understand the efficiency of your deep learning regime as consisting of 3 different components.

Compute: Time spent on your GPU computing actual floating point operations (FLOPS)
Memory: Time spent transferring tensors within a GPU
Overhead: Everything else

Just like with training ML models, knowing what regime you're in allows you to narrow in on the optimizations that matter. For example, if you're spending all of your time doing memory transfers (i.e. you are in a memory-bandwidth bound regime), then increasing the FLOPS of your GPU won't help. On the other hand, if you're spending all of your time performing big chonky matmuls (i.e. a compute-bound regime), then rewriting your model logic into C++ to reduce overhead won't help.

So, if you want to keep your GPUs going brrrr, let's discuss the three components your system might be spending time on: compute, memory bandwidth, and overhead.

Note: Most of this post will use GPUs and PyTorch as examples (as I work on the PyTorch team), but the principles almost all generalize across hardware and frameworks.

Compute

One perspective on optimizing deep learning systems is that we'd like to maximize the time in the compute-bound regime. You paid for all of those 312 teraflops, and ideally, you'd get those 312 teraflops. But, in order to get your money's worth out of your expensive matrix multiplication, you need to reduce the amount of time spent in the other parts.

But why the focus on maximizing compute and not, say, memory bandwidth? The reason is simple: you can reduce the overhead or memory costs, but you (mostly) can't reduce the computation required without changing the actual operations you're performing.

Exacerbating the difficulty of maximizing compute utilization is the rate at which compute grows compared to memory bandwidth. Take this table on CPU FLOPS doubling times vs. memory bandwidth doubling times.

One way to think about compute is as a factory. We send instructions to our factory (overhead) and send it materials (memory bandwidth), all to keep our factory running efficiently (compute). So, if our factory increases efficiency faster than the rate at which we can supply it materials, it becomes harder for our factory to achieve its peak efficiency.

Along with implying permanent job security for ML systems engineers, this growing difficulty in utilizing our compute also makes understanding our bottlenecks even more important.

One more addendum about FLOPS. Modern machine learning accelerators all have hardware specialized for matrix multiplication, such as Nvidia's "Tensor Cores". So, if you aren't doing matrix multiplication, you'll only be able to achieve 19.5 teraflops instead of the stated 312. Note that this isn't unique to GPUs. In fact, TPUs are even less general than GPUs.
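
To get a feel for what the two peak figures above mean in practice, here is a small back-of-the-envelope sketch. It assumes the usual 2*m*n*k FLOP count for a dense matrix multiply and treats 312 and 19.5 teraflops as perfectly achievable peaks, which real kernels will not reach; the helper names and the 8192 matrix size are made up for the illustration.

```python
MATMUL_PEAK_FLOPS = 312e12     # quoted peak on the specialized matmul hardware
GENERAL_PEAK_FLOPS = 19.5e12   # quoted peak for everything else


def matmul_flops(m, n, k):
    """FLOPs for an (m x k) @ (k x n) product: one multiply and one add per term."""
    return 2 * m * n * k


def ideal_seconds(flops, peak_flops):
    """Lower bound on runtime if the hardware ran at its peak the whole time."""
    return flops / peak_flops


flops = matmul_flops(8192, 8192, 8192)
t_specialized = ideal_seconds(flops, MATMUL_PEAK_FLOPS)
t_general = ideal_seconds(flops, GENERAL_PEAK_FLOPS)

print(f"FLOPs in one 8192^3 matmul: {flops:.3e}")
print(f"ideal time at 312 TFLOPS:  {t_specialized * 1e3:.2f} ms")
print(f"ideal time at 19.5 TFLOPS: {t_general * 1e3:.2f} ms")
```

The same amount of arithmetic takes far longer when it cannot use the matrix-multiply path, which is why the headline figure only applies to matmul-shaped work.
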
The fact that GPUs are so much slower at everything that isn't a matrix multiply might seem problematic at first. What about our other operators like layer norm or activation functions? Well, the truth is, those operators are just rounding errors in terms of FLOPS. For example, let's look at this table of FLOP counts on BERT for different operator types from this paper, where "Tensor Contraction" = matmuls.

You can see that altogether, our non-matmul ops only make up 0.2% of our FLOPS, so it doesn't matter that our GPU computes non-matmul ops 15x slower. But, in this case, the normalization and pointwise ops actually achieve 250x less FLOPS and 700x less FLOPS than our matmuls respectively.

So why do our non-matmul ops take so much more time than they should? Going back to our analogy, the culprit is often how long it takes to transport materials to and from the factory. In other words, the memory bandwidth.

Bandwidth

Bandwidth costs are essentially the cost paid to move data from one place to another. This might be moving the data from CPU to GPU, from one node to another, or even from CUDA global memory to CUDA shared memory. This last one, in particular, is what we'll be focusing on here, and is typically referred to as "bandwidth cost" or "memory bandwidth cost". The other two (typically referred to as "data transfer costs" and "network costs" respectively) are certainly important, but going into distributed performance would cause me to never finish this post.

To understand what the memory bandwidth cost is, let's head back to our factory analogy. Although our factory is where we do the actual work, it's not suitable as a bulk storage unit. A large part of this is that since we're doing actual work here, all the storage is optimized for being fast to actually use (SRAM), instead of having a lot of it. So, where do we store the actual results and materials?
The typical approach is to have a warehouse, probably somewhere where land is cheap and we have a lot of space (DRAM). Then, we can ship supplies to and from our factories (memory bandwidth). This cost of moving stuff to and from our compute units is what's called the "memory bandwidth" cost. As an aside, your GPU's DRAM is what shows up in nvidia-smi, and is the primary quantity responsible for your lovely "CUDA Out of Memory" errors.

One thing to note is that every single time we perform a GPU kernel, we need to move our data from and back to our GPU's DRAM (i.e. our warehouse). Now, imagine what happens when we perform a unary operation like torch.cos. We need to ship our data from the warehouse to the factory, then perform a tiny bit of computation for each piece of data, and then ship the results back to the warehouse. Shipping things around is quite expensive. As a result, nearly all of our time here is spent shipping data around, and not on the actual computation itself.

Since we're spending all of our time on memory bandwidth, such an operation is called a memory-bound operation, and it means that we're not spending a lot of time on compute.

Ok, so that's not ideal. What can we do about it? Let's take a look at how a sequence of operators might look.

Hey! This is a very stupid arrangement. Why are we sending the same data to global memory and then back to the compute units, over and over? We should just keep the data at the factory, perform all of our compute, and then send it back!

This is operator fusion, the most important optimization in deep learning compilers. Simply put, instead of writing our data to global memory just to read it again, we elide the extra memory accesses by performing several computations at once. For example, if we perform x.cos().cos(), usually we need to perform 4 global reads and writes.

x1 = x.cos()  # Read from x in global memory, write to x1
x2 = x1.cos()  # Read from x1 in global memory, write to x2

But, with operator fusion, we only need 2 global memory reads and writes! So operator f
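
The counting argument for x.cos().cos() can be sketched as a tiny cost model. This is not how any compiler actually decides what to fuse; it only counts full-tensor trips to global memory for a chain of unary pointwise ops and converts them to an ideal transfer time. The function names, the tensor size, and the 1.5e12 bytes/s bandwidth are hypothetical values chosen for the illustration.

```python
ASSUMED_BANDWIDTH = 1.5e12   # bytes/s, hypothetical global-memory bandwidth


def global_memory_accesses(num_pointwise_ops, fused):
    """Count full-tensor global memory reads plus writes for a chain of unary ops."""
    if num_pointwise_ops == 0:
        return 0
    if fused:
        # One kernel: read the input once, write the final output once.
        return 2
    # One kernel per op: each reads its input and writes its output.
    return 2 * num_pointwise_ops


def ideal_transfer_seconds(num_elements, bytes_per_element, accesses,
                           bandwidth=ASSUMED_BANDWIDTH):
    """Time to move the tensor `accesses` times if bandwidth were the only cost."""
    return num_elements * bytes_per_element * accesses / bandwidth


# x.cos().cos() on a float32 tensor with 2**26 elements (size is made up).
num_elements, bytes_per_element = 1 << 26, 4
unfused = global_memory_accesses(2, fused=False)   # 4 reads and writes
fused = global_memory_accesses(2, fused=True)      # 2 reads and writes

t_unfused = ideal_transfer_seconds(num_elements, bytes_per_element, unfused)
t_fused = ideal_transfer_seconds(num_elements, bytes_per_element, fused)
print(f"unfused: {unfused} accesses, {t_unfused * 1e3:.3f} ms")
print(f"fused:   {fused} accesses, {t_fused * 1e3:.3f} ms")
```

In this model the fused version always costs 2 accesses no matter how long the chain is, while the unfused cost grows with every extra op, so longer pointwise chains benefit more.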

Deep learning optimization, PyTorch, GPU performance, Memory bottleneck, First principles