r/LocalLLaMA • 75 days ago

TurboQuant: An In-Depth Performance Analysis and Validation

IMP

8/10

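Key Summary

This is a comprehensive study validating the real-world performance of TurboQuant, a KV-cache quantization method that has recently drawn attention. Across models from 30B to more than 200B parameters, long-context tasks, and reasoning benchmarks, the existing FP8 approach showed almost no accuracy loss and came out ahead on throughput and latency. TurboQuant, by contrast, offers only slightly larger memory savings while slowing down processing, so FP8 is recommended as the better default for production environments.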

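Translated Text

Introduction

TurboQuant is a KV-cache quantization method that recently gained significant traction in the community, advertised as saving large amounts of GPU memory by quantizing a model's KV-cache to very low bit widths. Unlike FP8 KV-cache quantization, which quantizes both the KV-cache storage and the attention computation itself using hardware-native FP8 Tensor Core operations, TurboQuant compresses only the KV-cache storage to 3-4 bits and dequantizes back to BF16 for the attention computation. This architectural difference has significant implications for both accuracy and performance.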
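To make the storage-only idea concrete, here is a minimal Python sketch. It is not TurboQuant's algorithm: it swaps in a plain min/max uniform quantizer, and the names `quantize`, `dequantize`, and `attention_score` as well as the sample vectors are hypothetical. It only illustrates the data flow described above, where a key is stored as 4-bit codes and expanded back to floating point before the query-key dot product.

```python
def quantize(values, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] with a min/max uniform quantizer."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant vectors
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Expand integer codes back to floats before any arithmetic is done on them."""
    return [c * scale + lo for c in codes]


def attention_score(query, key):
    """Unscaled query-key dot product, computed in full precision."""
    return sum(q * k for q, k in zip(query, key))


# Hypothetical vectors, chosen only for illustration.
key = [0.12, -0.80, 0.45, 0.03, -0.27, 0.91, -0.55, 0.30]
query = [0.5, -0.1, 0.2, 0.7, -0.3, 0.4, 0.0, -0.6]

codes, scale, lo = quantize(key, bits=4)   # what would sit in the cache
restored = dequantize(codes, scale, lo)    # what the attention math sees
exact = attention_score(query, key)
approx = attention_score(query, restored)
print(f"exact={exact:.4f} approx={approx:.4f} abs_error={abs(exact - approx):.4f}")
```

The dequantization step runs every time the cached key is read, which is the extra work a storage-only scheme pays on top of the attention computation.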

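However, most results published so far were measured on small models with short-context benchmarks, which do not stress-test KV-cache quantization. To give the community more actionable data, we ran a comprehensive study covering four models from 30B to more than 200B parameters (both dense and MoE architectures) and five benchmarks, including prefill-heavy long-context retrieval and decode-heavy reasoning workloads.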

TL;DR

FP8 via --kv-cache-dtype fp8 remains the best default for KV-cache quantization. It provides 2x KV-cache capacity with negligible accuracy loss, matches BF16 on most performance metrics, and substantially improves them in memory-constrained serving scenarios.

TurboQuant k8v4 provides no significant advantage over FP8. Its only gain is a modest increase in KV-cache savings (2.4x vs 2x), which is not worth the consistent negative impact on throughput and latency metrics.

TurboQuant 4bit-nc is likely the most practical TurboQuant variant. It helps under KV-cache memory pressure, but pays for the extra capacity with moderate accuracy, latency, and throughput costs. It may still be a viable choice for edge deployments where memory is the dominant constraint.

TurboQuant k3v4-nc and 3bit-nc show noticeable accuracy drops, especially on reasoning and very long-context tasks, and also substantially degrade latency and throughput. This makes them poor candidates for production deployments.
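The capacity figures above can be turned into token counts with a little arithmetic. The sketch below is an illustration under stated assumptions, not a measurement: the attention shape (80 layers, 8 KV heads, head dimension 128) is assumed for Llama-3.3-70B-Instruct and should be checked against the model's config, the 40 GiB cache budget is hypothetical, and `KVShape` and the helper functions are made-up names. The 2x and 2.4x multipliers are the ones quoted in the TL;DR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVShape:
    num_layers: int
    num_kv_heads: int
    head_dim: int


def bf16_kv_bytes_per_token(shape: KVShape) -> int:
    # 2 tensors (K and V) * layers * KV heads * head_dim * 2 bytes per BF16 value
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * 2


def tokens_that_fit(budget_bytes: int, shape: KVShape, capacity_multiplier: float = 1.0) -> int:
    # A multiplier of 2.0 means each token takes half the BF16 footprint.
    per_token = bf16_kv_bytes_per_token(shape) / capacity_multiplier
    return int(budget_bytes // per_token)


# Assumed shape for Llama-3.3-70B-Instruct (verify against its config.json).
LLAMA_70B_ASSUMED = KVShape(num_layers=80, num_kv_heads=8, head_dim=128)
BUDGET = 40 * 1024**3  # hypothetical 40 GiB reserved for the KV-cache

for name, multiplier in [("bf16", 1.0), ("fp8", 2.0), ("turboquant_k8v4", 2.4)]:
    print(f"{name:16s} {tokens_that_fit(BUDGET, LLAMA_70B_ASSUMED, multiplier):>8d} tokens")
```

Under these assumptions, moving from FP8 to k8v4 adds roughly 20% more cached tokens, which is the gain being weighed against the throughput and latency costs.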

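Table of Contents

Experimental Setup
Accuracy Results
- Long-context Retrieval
- Reasoning
Performance Results
- Latency
- Throughput
- Serving Speed
Key Findings and Recommendations
Quick Start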

Experimental Setup

Quantization Schemes: We benchmarked four TurboQuant variants ( --kv-cache-dtype turboquant_{k8v4, 4bit_nc, k3v4_nc, 3bit_nc} ) against unquantized BF16 and FP8 KV-cache baselines. turboquant_k8v4 uses 8-bit keys and 4-bit values. turboquant_4bit_nc uses 4-bit keys and values with norm correction. turboquant_k3v4_nc uses 3-bit keys and 4-bit values with norm correction. turboquant_3bit_nc uses 3-bit keys and values with norm correction. The FP8 baseline ( --kv-cache-dtype fp8 ) stores queries, keys, and values in FP8 precision and also quantizes the attention computation itself. This is the key difference from TurboQuant, which only compresses storage. For more details on each TurboQuant variant, see the paper and the vLLM documentation. For more details on FP8 KV-cache quantization, see the FP8 KV-cache blog post.
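The variants listed above can be summarized in a small table in code. This is a sketch with hypothetical names (`KVScheme`, `nominal_ratio`, `serve_args`); the bit widths and `--kv-cache-dtype` values are taken from the paragraph above, the model id is assumed to be the Hugging Face repository name, and `nominal_ratio` is a naive 16-bit-to-average-bits ratio that ignores any per-block metadata, so it need not match measured savings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVScheme:
    dtype: str              # value passed to --kv-cache-dtype
    key_bits: int
    value_bits: int
    norm_correction: bool   # True only where the text mentions norm correction


SCHEMES = [
    KVScheme("fp8", 8, 8, False),
    KVScheme("turboquant_k8v4", 8, 4, False),
    KVScheme("turboquant_4bit_nc", 4, 4, True),
    KVScheme("turboquant_k3v4_nc", 3, 4, True),
    KVScheme("turboquant_3bit_nc", 3, 3, True),
]


def nominal_ratio(scheme: KVScheme) -> float:
    """Compression vs. 16-bit storage, counting payload bits only (no scales or norms)."""
    average_bits = (scheme.key_bits + scheme.value_bits) / 2
    return 16 / average_bits


def serve_args(model: str, scheme: KVScheme) -> list[str]:
    """Argument list for launching a vLLM server with the given KV-cache dtype."""
    return ["vllm", "serve", model, "--kv-cache-dtype", scheme.dtype]


MODEL = "meta-llama/Llama-3.3-70B-Instruct"  # assumed repository id
for scheme in SCHEMES:
    print(f"{nominal_ratio(scheme):.2f}x  {' '.join(serve_args(MODEL, scheme))}")
```

For k8v4 the payload-only ratio comes out near 2.67x, above the 2.4x quoted in the TL;DR. The text does not explain the gap; storage overhead beyond the raw key and value bits is one plausible reason.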

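Benchmarks: We evaluated on five benchmarks designed to stress-test KV-cache quantization under both prefill-heavy and decode-heavy workloads. For long-context retrieval (prefill-heavy), we used openai/mrcr, a challenging multi-round context retrieval task that tests sequence lengths up to each model's maximum supported length. For reasoning (decode-heavy), we used AIME25, GPQA:Diamond, MATH500, and LiveCodeBench-v6. All evaluations adopt the default non-greedy sampling parameters recommended by the model creators, to mimic real-world deployment.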
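Because sampling is non-greedy, scores vary from run to run, and the original write-up reports pass@1 averaged over repeated runs along with accuracy recovery relative to BF16. A minimal sketch of that bookkeeping follows. The function names and the sample outcomes are hypothetical, and defining recovery as the ratio of quantized to baseline score is an assumption, since the text does not spell out the formula.

```python
from statistics import mean, stdev


def pass_at_1(results):
    """Fraction of problems solved in one repetition (one sampled answer per problem)."""
    return sum(results) / len(results)


def summarize(repetitions):
    """Mean and sample standard deviation of pass@1 across repetitions."""
    scores = [pass_at_1(r) for r in repetitions]
    return mean(scores), stdev(scores)


def recovery_percent(quantized_score, baseline_score):
    """Assumed definition: quantized accuracy as a percentage of the baseline accuracy."""
    return 100.0 * quantized_score / baseline_score


# Hypothetical per-problem outcomes for three repetitions of a four-problem benchmark.
repetitions = [
    [True, False, True, True],
    [True, True, True, True],
    [True, True, False, False],
]
avg, spread = summarize(repetitions)
print(f"pass@1 = {avg:.3f} +/- {spread:.3f}")
print(f"recovery = {recovery_percent(0.72, avg):.1f}%")
```

Reporting the spread alongside the mean is what makes a statement such as "within the standard deviation of each other" checkable.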

Models: We focused on four models spanning both small and large scale, and both dense-only and MoE architectures. The evaluated models are Llama-3.3-70B-Instruct, Qwen3-30B-A3B-Instruct-2507, Qwen3-30B-A3B-Thinking-2507, and MiniMax-M2.7. At the time of writing, TurboQuant supports only models with standard attention mechanisms (e.g. GQA); models that use sliding-window or hybrid attention are not yet supported.


Original Text (English)

Introduction

TurboQuant, a method for KV-cache quantization, recently gained significant traction in the community due to the large advertised savings in GPU memory from very low bit-width quantization of a model's KV-cache. Unlike FP8 KV-cache quantization, which quantizes both the KV-cache storage and the attention computation itself using hardware-native FP8 Tensor Core operations, TurboQuant compresses only the KV-cache storage to 3-4 bits and dequantizes back to BF16 for the attention computation. This architectural difference has significant implications for both accuracy and performance.

However, most of the reported results were based on small models evaluated on short-context benchmarks that do not stress-test KV-cache quantization. To provide the community with more actionable data, we conducted a comprehensive study spanning four models (both dense-only and MoEs), from 30B to 200B+ parameters, and five benchmarks including both prefill-heavy long-context retrieval and decode-heavy reasoning workloads.

TL;DR

FP8 via --kv-cache-dtype fp8 remains the best default for KV-cache quantization: it provides 2x KV-cache capacity with negligible accuracy loss, while matching BF16 on most performance metrics and substantially improving them in memory-constrained serving scenarios.

TurboQuant k8v4 does not provide any significant advantage over FP8: it only provides modest KV-cache savings (2.4x vs 2x) which are not worth the consistent negative impact on throughput and latency metrics.

TurboQuant 4bit-nc is likely the most practical TurboQuant variant: it helps under KV-cache memory pressure, but trades the extra capacity for moderate accuracy, latency, and throughput costs. It may still be viable for edge deployments where memory is the dominant constraint.

TurboQuant k3v4-nc and 3bit-nc show meaningful accuracy drops, especially on reasoning and very long-context tasks, while also substantially degrading latency and throughput. This makes them poor candidates for production deployments.

Table of Contents

Experimental Setup
Accuracy Results
- Long-context Retrieval
- Reasoning
Performance Results
- Latency
- Throughput
- Serving Speed
Key Findings and Recommendations
Quick start

Experimental Setup

Quantization Schemes: We benchmark four TurboQuant variants ( --kv-cache-dtype turboquant_{k8v4, 4bit_nc, k3v4_nc, 3bit_nc} ) against unquantized BF16 and FP8 KV-cache baselines. turboquant_k8v4 uses 8-bit keys and 4-bit values; turboquant_4bit_nc uses 4-bit keys and values with norm correction; turboquant_k3v4_nc uses 3-bit keys and 4-bit values with norm correction; and turboquant_3bit_nc uses 3-bit keys and values with norm correction. The FP8 baseline ( --kv-cache-dtype fp8 ) stores queries, keys, and values in FP8 precision, and also quantizes the attention computation itself, a key difference from TurboQuant, which only compresses storage. For more details on each TurboQuant variant, please refer to the paper and vLLM documentation. For more details on FP8 KV-cache quantization, please refer to the FP8 KV-cache blog post.

Benchmarks: We evaluate on five benchmarks designed to stress-test KV-cache quantization across both prefill-heavy and decode-heavy workloads. For long-context retrieval (prefill-heavy), we use openai/mrcr, a challenging multi-round context retrieval task testing sequence lengths up to each model's maximum supported length. For reasoning (decode-heavy), we use AIME25, GPQA:Diamond, MATH500, and LiveCodeBench-v6. All evaluations adopt the default non-greedy sampling parameters suggested by model creators to mimic real-world deployment.

Models: We focus on four models spanning both small and large scale, and both dense-only and MoE architectures: Llama-3.3-70B-Instruct, Qwen3-30B-A3B-Instruct-2507, Qwen3-30B-A3B-Thinking-2507, and MiniMax-M2.7. At the time of writing, TurboQuant supports only models with standard attention mechanisms (e.g. GQA); models with sliding-window or hybrid attention are not yet supported.

Accuracy Results

Long-context Retrieval

For long-context evaluation, we use the openai/mrcr task, testing sequence lengths up to each model's maximum supported length. We report the average pass@1 score for each sequence-length bucket over 5 repetitions, and the Area-Under-Curve (AUC) as an aggregate metric across all tested lengths (Context Arena).

On Llama-3.3-70B-Instruct (Figure 3), the higher-bit TurboQuant variants (k8v4 and 4bit-nc) preserve long-context retrieval well and maintain competitive AUC (~52%). However, TQ k3v4-nc (48.6%) and 3bit-nc (50.3%) show noticeable and consistent degradation across all sequence lengths, with the gap widening at 64k context where the accuracy drop is up to 8 points.

On Qwen3-30B-A3B-Instruct-2507 (Figure 4), which supports longer contexts up to 256k, discrepancies are more pronounced. BF16 (45.8%), FP8 (43.1%), and TQ k8v4 (43.0%) remain within the standard deviation of each other. TQ 4bit-nc (42.3%) is also competitive. But the aggressive variants degrade substantially: TQ k3v4-nc drops to 33.5% AUC and TQ 3bit-nc to 31.2%, a ~30% relative degradation from BF16. The degradation is concentrated at the longest context lengths (128k-256k), suggesting that low-bit KV-cache quantization errors accumulate with sequence length.

Takeaway: TQ k8v4 and 4bit-nc are safe for long-context retrieval. TQ k3v4-nc and 3bit-nc show meaningful accuracy degradation, especially at very long contexts. FP8 matches the higher-bit TQ variants while providing better inference performance (shown later).

Reasoning

For decode-heavy reasoning benchmarks, we use AIME25, GPQA:Diamond, MATH500, and LiveCodeBench-v6. We report the average pass@1 score: over 10 repetitions for AIME25 and LiveCodeBench-v6, and over 5 repetitions for GPQA:Diamond and MATH500.

On Qwen3-30B-A3B-Thinking-2507 (Figure 5), we see a clear accuracy hierarchy. FP8 and TQ k8v4 are close to the BF16 baseline with >98% average accuracy recovery. TQ 4bit-nc shows a slightly larger drop with 96% recovery, whereas TQ k3v4-nc and 3bit-nc show drastic accuracy drops of ~20 points. Even on the relatively easy MATH500 benchmark, the accuracy drop is ~4 points, indicating that aggressive TurboQuant variants are not suitable for long-generation reasoning tasks.

On MiniMax-M2.7 (Figure 6), a much larger 200B+ parameter model, we observe similar patterns. FP8 and TQ k8v4 maintain >99% accuracy recovery, whereas TQ 4bit-nc shows a modest drop. Just like with the smaller Qwen model, aggressive TQ variants (k3v4-nc, 3bit-nc) show significant accuracy degradation, especially on AIME25 and LiveCodeBench-v6 with accuracy drops of up to ~8 points.

Takeaway: Aggressive TurboQuant variants (k3v4-nc, 3bit-nc) show significant accuracy degradation, especially on hard math and coding tasks like AIME25 and LiveCodeBench-v6. TQ 4bit-nc shows a modest accuracy drop, whereas TQ k8v4 performs on par with the unquantized BF16 baseline. FP8 also matches the unquantized baseline; however, it provides significantly better inference performance than any of the TurboQuant variants (shown later).

Performance Results

For performance benchmarking, we focus on Qwen3-30B-A3B-Instruct-2507 (2xH100) and Llama-3.3-70B-Instruct (4xH100). We measure latency, offline throughput, and online serving metrics (TPOT and TTFT) under various request rates. We deploy models with vLLM version 0.20.2 (commit 6ec9bbec3).

Latency

We measure latency with vllm bench latency using fixed synthetic requests with input length 1024 and output length 256, sweeping batch sizes 1, 8, 32, and 64. Each configuration used 10 warmup iterations followed by 30 measured iterations. Results are shown as slowdown relative to BF16 (lower is better).

FP8 consistently runs at negligible or no latency overhead across both models and all batch sizes; this is expected since FP8 quantizes the attention computation itself using hardware-native FP8 Tensor Core operations, avoiding dequant

KV-cache quantization, vLLM, FP8, model optimization, inference performance