Hacker News • 61 days ago

Real-time LLM inference reaching 3k tokens/s on standard GPUs

IMP

9/10

Key Summary

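This is a tech preview demonstrating that co-designing the entire software stack (architecture, engine, kernels) lets standard datacenter GPUs reach ultra-fast LLM decoding speeds (3,000 tokens per second) on par with dedicated inference hardware. Because AI agents work sequentially and iteratively, single-request decode speed, rather than aggregate throughput, has become the key performance metric: a 50,000-token agent workflow that takes about 8 minutes at 100 tokens/s finishes in under 20 seconds at 3,000 tokens/s.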

Translated Text

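TL;DR: We show that AI inference on GPUs can reach the speed regime of dedicated inference hardware cards when the whole software stack is optimized through architecture, engine, and kernel co-design. Test this speed yourself in our live coding playground (playground.kog.ai).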

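This post explains why optimizing single-request LLM decoding speed matters for AI agents; why this is primarily a memory-bandwidth maximization problem rather than a FLOPS (floating-point operations per second) problem; why software bottlenecks leave the decoding-speed ceiling of standard datacenter GPU hardware far higher than current inference stacks expose; and how that ceiling can be reached (even on large MoE models) by co-designing the model architecture, runtime, and low-level GPU code as a single latency-optimized pipeline.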

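Our public tech preview is meant to prove that extremely fast single-request decoding is possible on the standard datacenter GPUs that enterprises, AI labs, and sovereign-AI buyers already own. The limiting factor has been that existing inference software stacks are not optimized for this type of workload. Opening up this GPU path makes that speed achievable without lock-in to proprietary silicon (dedicated chips).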

You can test the speed of our 2B coding model today. The model is small and not a frontier model (we have focused on speed rather than scale), but it is still quite capable when fine-tuned for specific software engineering tasks.

What autonomous agents change: single-request decode speed is now the metric that matters

Inference benchmarks typically conflate three metrics. Aggregate throughput (total tokens generated per second across all users) measures server utilization and rewards large batches. Time to first token measures prefill latency. Decode speed per request measures token generation speed and determines how long one user waits to receive the full response.

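That last metric governs every long serial interaction, and it is exactly where AI agents are bottlenecked. Agentic software engineering is a sequential loop: inspect, plan, edit, test, revise. Each step depends on the previous one. Tool time can sometimes dominate, since tests have to run and web pages have to load, but the generation-heavy steps (planning, code writing, trace analysis, debugging, refactoring) set the loop rate. Reasoning tokens then compound on top.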

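These numbers translate directly into product and user experience. If an agent needs to generate 50,000 tokens in a workflow, 100 tokens/s takes roughly 8 minutes, while 3,000 tokens/s takes under 20 seconds. That difference changes which products can be built at all. As agents become more autonomous, the productivity frontier shifts from intelligence alone to intelligence × iteration speed. The best agents will generate more useful tokens, reason more, and perform more tool calls, tests, and revisions within the same wall-clock budget.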
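The arithmetic behind those two figures can be checked with a minimal sketch. The function name `generation_time_seconds` is hypothetical, and the calculation assumes the whole workflow is pure decoding (no prefill, tool time, or network latency).

```python
def generation_time_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time spent purely on decoding; prefill and tool time are ignored (assumption)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


WORKFLOW_TOKENS = 50_000  # workflow size quoted in the text

slow_s = generation_time_seconds(WORKFLOW_TOKENS, 100)    # 100 tokens/s
fast_s = generation_time_seconds(WORKFLOW_TOKENS, 3_000)  # 3,000 tokens/s

print(f"100 tokens/s:   {slow_s / 60:.1f} minutes")
print(f"3,000 tokens/s: {fast_s:.1f} seconds")
print(f"speed-up:       {slow_s / fast_s:.0f}x")
```

The slow case works out to 500 seconds (a little over eight minutes) and the fast case to roughly 17 seconds, which matches the "about 8 minutes" and "under 20 seconds" in the text.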

This is why Kog optimizes single-request latency first, and why this preview runs at batch size 1. Large batches matter too and will be supported in production, but they answer a different question.

So what limits decode speed on GPUs? Memory bandwidth is the primary bottleneck for fast token generation (and GPU nodes have plenty of it)

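At batch size 1, autoregressive decoding is dominated by matrix-vector operations. For each generated token, all of the model's active weights must travel through the GPU's internal memory hierarchy, from HBM (High Bandwidth Memory) to the compute processors. A first-order bound is therefore: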

tokens/s ≤ effective_memory_bandwidth / (β × active_weight_bytes + KV cache)

Here β can exceed 1 when tiles are reloaded or cache reuse is imperfect.
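As a rough illustration, the bound can be written as a small function. This is a sketch, not the authors' code: the function name and the example figures (1 TB/s of bandwidth, 4 GB of weights, 0.2 GB of KV-cache traffic, β = 1.2) are hypothetical values chosen only to show how β and KV-cache reads lower the ceiling.

```python
def decode_bound_tokens_per_s(
    effective_bandwidth_bytes_per_s: float,
    active_weight_bytes: float,
    kv_cache_bytes: float = 0.0,
    beta: float = 1.0,
) -> float:
    """First-order upper bound on batch-size-1 decode speed.

    beta >= 1 models weight bytes that are re-read (tile reloads,
    imperfect cache reuse); kv_cache_bytes is KV-cache traffic per token.
    """
    if beta < 1.0:
        raise ValueError("beta cannot be below 1: each weight is read at least once")
    bytes_per_token = beta * active_weight_bytes + kv_cache_bytes
    return effective_bandwidth_bytes_per_s / bytes_per_token


# Hypothetical numbers, for illustration only.
ideal = decode_bound_tokens_per_s(1e12, 4e9)
degraded = decode_bound_tokens_per_s(1e12, 4e9, kv_cache_bytes=0.2e9, beta=1.2)

print(f"ideal streaming:        {ideal:.0f} tokens/s")
print(f"with beta=1.2 and KV:   {degraded:.0f} tokens/s")
```

With these illustrative inputs, re-reading 20% of the weights and adding a modest KV-cache read per token lowers the bound from 250 to about 200 tokens/s.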

The key fact is that low-batch decoding has very low arithmetic intensity. In FP16, a model weight occupies 2 bytes and contributes roughly one multiply-add (2 FLOPs), which is about 1 FLOP per byte. FP8 raises this to ~2 FLOPs/byte, and FP4 to ~4 FLOPs/byte. Modern AI GPUs, however, provide hundreds of peak FLOPs per byte of HBM bandwidth. NVIDIA's H200, for example, claims a peak balance of roughly 400 FLOPs per byte. Thus...


Original Text (English)

TL;DR: we show that AI inference on GPUs can be super-fast, reaching the speed regime of dedicated inference hardware cards when optimizing the whole software stack with architecture/engine/kernel co-design. Test the speed in our live coding playground: playground.kog.ai.

This post explains why optimizing for single-request LLM decoding speed is important for AI agents; why it's primarily a memory-bandwidth maximization problem, not a FLOPS one; why standard datacenter GPU hardware has a much higher decoding-speed ceiling than current inference stacks expose due to software bottlenecks; and how that ceiling can be reached (even on large MoE models) by co-designing the model architecture, runtime, and low-level GPU code as a single latency-optimized pipeline.

Our public tech preview is about proving that extremely fast single-request decoding is possible on the standard datacenter GPUs enterprises already own, including AI labs and sovereign-AI buyers. The limiting factor has been that existing inference software stacks are not optimized for this type of workload. Opening the GPU path could deliver that speed without the lock-in of proprietary silicon.

You can test the speed of our 2B coding model today. It's small and not a frontier model (we've been focused on speed rather than scale), though still quite capable when fine-tuned for specific software engineering tasks.

What autonomous agents change: single-request decode speed is now the metric that matters

Inference benchmarks typically conflate three quantities. Aggregate throughput (total tokens generated per second across all users) measures server utilization and rewards large batches. Time to first token measures prefill latency. Decode speed per request measures token generation speed and defines how long one user waits before receiving the full response.

That last one governs every long serial interaction, and it's what AI agents are bottlenecked on.
Agentic software engineering is a sequential loop: inspect, plan, edit, test, revise. Each step depends on the previous one. Tool time sometimes dominates, as tests have to run and web pages have to load, but the generation-heavy steps (planning, code writing, trace analysis, debugging, refactoring) set the loop rate. And reasoning tokens compound on top.

The numbers translate directly into product and user experience. If an agent needs to generate 50,000 tokens in a workflow, 100 tokens/s is roughly eight minutes; 3,000 tokens/s is under twenty seconds. The difference changes the product that can be built. As agents become more autonomous, the productivity frontier shifts from intelligence alone to intelligence × iteration speed. The best agents will generate more useful tokens, reason more, and perform more tool calls, tests, and revisions inside the same wall-clock budget.

This is why Kog optimizes single-request latency first, and why this preview runs at batch size 1. Large batches do matter and we will support them in production, but they answer a different question.

But what is limiting decode speed on GPUs?

Memory bandwidth is the primary bottleneck for fast token generation (and GPU nodes have plenty)

At batch size 1, autoregressive decoding is dominated by matrix-vector work. For each generated token, all the active weights of the model must move through the memory hierarchy inside the GPU, from HBM to compute processors. Thus, a first-order bound is:

tokens/s ≤ effective_memory_bandwidth / (β × active_weight_bytes + KV cache)

where β can be greater than one when tiles are reloaded or cache reuse is imperfect.

The key fact is that low-batch decode has very low arithmetic intensity. In FP16, a model weight occupies two bytes and contributes roughly one multiply-add (two FLOPs), which is about 1 FLOP/byte. FP8 raises it to ~2 FLOPs/byte; FP4 to ~4. However, modern AI GPUs expose hundreds of peak FLOPs per byte of HBM bandwidth.
NVIDIA's H200, for example, claims a peak balance of roughly 400 FLOPs/byte. Thus, token generation speed is capped by memory bandwidth before being limited by FLOPS.

This is why Memory Bandwidth Utilization (MBU) is the central metric for single-request speed, not Model FLOP Utilization (MFU). MFU can still be improved by batching several requests together, which can however increase the latency experienced by each user as more KV cache data needs to be streamed inside the GPU. For batch-size-1 decode, more memory bandwidth equals more tokens generated per second.

The good news is that memory bandwidth of GPUs is already very high. An 8× NVIDIA H200 node exposes roughly 30.7 TB/s of effective aggregate memory bandwidth (taking 80% of the 4.8 TB/s theoretical per GPU as a realistic ceiling). An 8× AMD MI300X node reaches about 33.6 TB/s in practice (assuming 4.2 TB/s achievable per GPU).

Let's take a 2B-parameter dense model in FP16 as an example. It has roughly 4 GB of active weights, so if weights alone could be streamed perfectly (ignoring KV cache traffic and potential β reloads), the speed-of-light upper bounds would be:

8× H200: 30.7 TB/s ÷ 4 GB ≈ 7,700 tokens/s
8× MI300X: 33.6 TB/s ÷ 4 GB ≈ 8,400 tokens/s

Let's consider a few more examples: at batch size 1, the same speed results apply to a MoE with 4B active parameters in FP8; and a 32B-active-parameter MoE in FP4 would be bounded at ~2,000 tokens/s. In a latency-first inference stack, a valid strategy is thus to parallelize inference on a full server node providing eight GPUs' worth of HBM bandwidth.

It should also be noted that the next GPU generations (Rubin and MI450) coming in H2 2026 will provide about 4x higher memory bandwidth, making it possible to reach the same speed for 4x bigger models, or with 4x fewer GPUs (potentially one or two instead of a full node). This will also help support bigger batch sizes at the same speed.
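The speed-of-light figures above can be reproduced with a short sketch. The bandwidth and model figures come from the text; the helper names (`NODE_BANDWIDTH`, `upper_bound_tokens_per_s`) are hypothetical, and the calculation assumes perfect weight streaming with no KV-cache traffic and β = 1.

```python
TB = 1e12  # bytes
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

# Effective aggregate HBM bandwidth per 8-GPU node, as stated in the text.
NODE_BANDWIDTH = {
    "8x H200": 8 * 4.8 * TB * 0.80,  # 80% of the 4.8 TB/s theoretical peak per GPU
    "8x MI300X": 8 * 4.2 * TB,       # 4.2 TB/s assumed achievable per GPU
}

# (label, active parameters, precision) for the examples in the text.
MODELS = [
    ("2B dense, FP16", 2e9, "fp16"),
    ("4B-active MoE, FP8", 4e9, "fp8"),
    ("32B-active MoE, FP4", 32e9, "fp4"),
]


def upper_bound_tokens_per_s(node: str, active_params: float, precision: str) -> float:
    """Weights-only bound: ignores KV-cache traffic and reloads (beta = 1)."""
    weight_bytes = active_params * BYTES_PER_PARAM[precision]
    return NODE_BANDWIDTH[node] / weight_bytes


for node in NODE_BANDWIDTH:
    for label, params, precision in MODELS:
        bound = upper_bound_tokens_per_s(node, params, precision)
        print(f"{node:10s} {label:22s} <= {bound:,.0f} tokens/s")
```

Both 4 GB cases come out near 7,680 and 8,400 tokens/s, and the 16 GB FP4 case lands near 1,900 to 2,100 tokens/s, consistent with the rounded numbers in the text.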
At the end of this post, we'll dig a bit more on this topic to show that a decoding speed of thousands of tokens per second should be achievable on datacenter GPUs for current large state-of-the-art MoE models.

There is a catch, though. These bounds do not take into account non-GEMM operations stalls, intra-GPU synchronization, inter-GPU communication, instruction overhead, and so on. The key question is how continuously the system can stream the active model parameters through HBM and cache without interruptions. It turns out that making an 8-GPU server behave like a single continuous memory-streaming machine is, indeed, a hard problem.

Where standard inference stacks lose precious microseconds

At 3,000 tokens/s, the per-token budget is roughly 333 microseconds, including all layers, LM head and sampling. On a 25-layer model, spending just an extra 1 microsecond per layer consumes 7.5% of the time budget!

The usual abstraction stack (model graph logic written in a high-level language or framework like PyTorch or Triton, lowered into many kernels, scheduled by a CPU runtime, synchronized at kernel boundaries, and mediated by framework-level communication libraries) is flexible, facilitates maintainability and integration, and is great for general-purpose serving, including maximizing aggregated throughput at high batch sizes. This is the approach usually taken for models running on inference engines like vLLM, SGLang, and TensorRT-LLM. It is, however, poorly matched to a 333-microsecond token budget.

A simple launch-overhead calculation shows the problem. If a kernel launch and cleanup costs about 4.5 µs (as per our measurements on AMD MI300X), ten kernels per Transformer layer over twenty-five layers create 1,125 µs of overhead per token before any useful work, thus capping the achievable speed at ~890 tokens/s. Even just five aggressively fused kernels per layer still produce ~563 µs of overhead, capping speed around 1,780 tokens/s.
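The launch-overhead arithmetic above is easy to reproduce. The sketch below uses the 4.5 µs launch cost and 25-layer depth from the text; the function names are hypothetical, and it assumes launches are fully serialized and counts only per-layer kernels (no LM head or sampling kernels).

```python
US_PER_SECOND = 1e6


def per_token_budget_us(target_tokens_per_s: float) -> float:
    """Time available per generated token at a target decode speed."""
    return US_PER_SECOND / target_tokens_per_s


def launch_overhead_us(kernels_per_layer: int, num_layers: int, launch_cost_us: float) -> float:
    """Per-token launch overhead, assuming launches do not overlap (assumption)."""
    return kernels_per_layer * num_layers * launch_cost_us


def overhead_capped_tokens_per_s(overhead_us: float) -> float:
    """Speed ceiling if launch overhead were the only cost per token."""
    return US_PER_SECOND / overhead_us


NUM_LAYERS = 25
LAUNCH_COST_US = 4.5  # measured on AMD MI300X, per the text

budget = per_token_budget_us(3_000)
print(f"budget at 3,000 tokens/s: {budget:.0f} us")
print(f"1 extra us per layer:     {NUM_LAYERS * 1.0 / budget:.1%} of the budget")

for kernels in (10, 5):
    overhead = launch_overhead_us(kernels, NUM_LAYERS, LAUNCH_COST_US)
    cap = overhead_capped_tokens_per_s(overhead)
    print(f"{kernels:2d} kernels/layer: {overhead:,.1f} us overhead -> cap ~{cap:,.0f} tokens/s")
```

In both cases the overhead alone exceeds the roughly 333 µs budget, so under these assumptions 3,000 tokens/s is out of reach regardless of how fast the kernels themselves run.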
And this is before taking into account the other sources of overhead, which compound on top of this. Turning theoretical HBM bandwidth into useful model bandwidth is thus a matter of systemat
