r/LocalLLaMA • 95 days ago

Gemma 4 vs. Qwen 3.6: KV cache quantization performance comparison

IMP

7/10

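Key summary

A benchmark comparing KV cache quantization (q8_0, q4_0), a memory-saving technique, on Gemma 4 and Qwen 3.6 models. Gemma models lose substantial quality even at q8_0, which is commonly described as "lossless", and the MoE model is the most sensitive. Qwen models, by contrast, remain stable at q8_0 and even at q4_0, which makes tolerance to cache quantization an important criterion when choosing a model for memory-constrained local setups.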

Translated text

Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results

Four models tested with q8_0 and q4_0 KV cache against a full-precision baseline.

oobabooga • April 24, 2026

What this measures

KV cache quantization stores the key-value cache in lower precision to save memory. q8_0 halves the cache memory and q4_0 quarters it. The common wisdom is that q8_0 is "practically lossless."
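The halving and quartering figures can be checked with a little arithmetic. This is a minimal sketch, assuming ggml's block layout in which 32 values share one 16-bit scale, so q8_0 and q4_0 cost 8.5 and 4.5 bits per element rather than exactly 8 and 4. The model shape is hypothetical and not taken from the post.

```python
# Bits stored per cached element. The q8_0 and q4_0 figures assume ggml's
# block layout: 32 quantized values plus one shared 16-bit scale per block.
BITS_PER_ELEMENT = {
    "f16": 16.0,
    "q8_0": (32 * 8 + 16) / 32,  # 8.5 bits per element
    "q4_0": (32 * 4 + 16) / 32,  # 4.5 bits per element
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Approximate size in bytes of the K and V caches for one sequence."""
    # Factor 2 covers keys plus values; every layer caches both.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BITS_PER_ELEMENT[cache_type] / 8


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=32768)
    baseline = kv_cache_bytes(cache_type="f16", **shape)
    for cache_type in BITS_PER_ELEMENT:
        size = kv_cache_bytes(cache_type=cache_type, **shape)
        print(f"{cache_type}: {size / 2**30:.2f} GiB ({size / baseline:.0%} of f16)")
```

Under this layout the ratios work out to about 53% and 28% of the f16 cache, slightly above the rounded "half" and "quarter".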

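Each model was tested using the BF16 GGUF from Unsloth, loaded three times on the same machine: once each with an f16, q8_0, and q4_0 cache. The only variable that changes between runs is cache precision. These measurements include the recently added TurboQuant-inspired attention rotation that llama.cpp applies automatically.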
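For readers who want to try the same three-way setup outside the author's harness, the sketch below builds one llama.cpp server command per cache type. It assumes a stock `llama-server` binary and its `--cache-type-k` / `--cache-type-v` options. The post itself used TextGen with a patched llama.cpp, and the model file name here is hypothetical.

```python
import shlex

MODEL_PATH = "model-BF16.gguf"         # hypothetical file name
CACHE_TYPES = ("f16", "q8_0", "q4_0")  # baseline first, then the two test settings


def server_command(model_path, cache_type):
    """Return argv for llama-server with K and V caches at the same precision."""
    return [
        "llama-server",
        "-m", model_path,
        "--cache-type-k", cache_type,
        "--cache-type-v", cache_type,
    ]


if __name__ == "__main__":
    # Print the three commands; only the cache precision differs between them.
    for cache_type in CACHE_TYPES:
        print(shlex.join(server_command(MODEL_PATH, cache_type)))
```

Depending on the llama.cpp build, a quantized V cache may also require flash attention to be enabled.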

Results

Findings

Gemma is sensitive to KV cache quantization, Qwen is not

The claim that "q8_0 is practically lossless" is wrong for Gemma. With a q8_0 cache, Gemma 31B has a KL divergence of 0.108 and Gemma 26B A4B has 0.377. Qwen handles it much better: both Qwen models stay below KL 0.04 at q8_0, and even a q4_0 cache (KL 0.087 to 0.117) is usable.

MoE amplifies the problem for Gemma

Gemma 4 26B A4B is the most quantization-sensitive model tested so far, for both weights and KV cache. Its q8_0 cache KL (0.377) is about 3.5x that of the dense Gemma 31B (0.108), and its q4_0 cache reaches KL 1.088 with 68.0% top-1. Qwen's MoE (Mixture of Experts) model shows no such amplification (q8_0 KL 0.024 dense, 0.039 MoE).

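How cache quantization compares to weight quantization

Cache quantization and weight quantization are independent sources of quality loss. If you run Q4_K_M weights with a q8_0 cache, both penalties stack. The table in the original post maps each cache result to the weight quantization level that causes equivalent damage. Full weight quantization benchmarks: Gemma 4 31B, Gemma 4 26B A4B, Qwen 3.6 35B A3B.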

Per-category: Gemma degrades everywhere, Qwen only on long docs

Gemma degrades uniformly: even its best category at q8_0 (science, KL 0.214) is worse than Qwen's worst (long documents, KL 0.142). Qwen concentrates nearly all of its damage in long documents (KL 0.581 at q4_0) and tool calling (0.086), with the other categories staying near zero.

[Figure: per-category performance. Panels: Gemma 4 31B (Dense), Gemma 4 26B A4B (MoE), Qwen 3.6 27B (Dense), Qwen 3.6 35B A3B (MoE)]

Methodology

Inference: TextGen + patched llama.cpp (log-probability extraction from the prompt)
Reference: BF16 GGUF loaded with f16 KV cache (the llama.cpp default)
Test: the same BF16 GGUF loaded with q8_0 or q4_0 KV cache
Dataset: ~250,000 tokens across 6 categories (coding, general chat, tool calling, science, non-Latin scripts, long documents)
Metric: KL divergence, computed token by token between the f16-cache and quantized-cache top-40 log-probability distributions
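The metric can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: each top-k list is renormalized to sum to 1, a token missing from the quantized run's list gets a small floor probability, and "top-1" is taken to mean agreement with the reference run's most likely token. The function names and the `floor` parameter are hypothetical.

```python
import math


def _normalize(logprobs):
    """Turn a {token: logprob} mapping into probabilities that sum to 1."""
    probs = {tok: math.exp(lp) for tok, lp in logprobs.items()}
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}


def topk_kl(ref_logprobs, test_logprobs, floor=1e-10):
    """KL(ref || test) at one token position, summed over the reference's tokens.

    Assumption: a token absent from the test top-k is given probability `floor`.
    """
    p = _normalize(ref_logprobs)
    q = _normalize(test_logprobs)
    return sum(
        p_tok * math.log(p_tok / max(q.get(tok, 0.0), floor))
        for tok, p_tok in p.items()
    )


def summarize(ref_positions, test_positions):
    """Return (mean KL, top-1 agreement) across aligned token positions."""
    kls, matches = [], 0
    for ref, test in zip(ref_positions, test_positions):
        kls.append(topk_kl(ref, test))
        # Count positions where both runs rank the same token first.
        matches += max(ref, key=ref.get) == max(test, key=test.get)
    n = len(kls)
    return sum(kls) / n, matches / n


if __name__ == "__main__":
    # Two toy positions with two-token distributions (the post uses the top 40).
    ref = [{"a": math.log(0.9), "b": math.log(0.1)}, {"x": math.log(0.6), "y": math.log(0.4)}]
    test = [{"a": math.log(0.5), "b": math.log(0.5)}, {"x": math.log(0.6), "y": math.log(0.4)}]
    mean_kl, top1 = summarize(ref, test)
    print(f"mean KL = {mean_kl:.4f}, top-1 agreement = {top1:.1%}")
```

How truncated distributions are renormalized and how missing tokens are handled both shift the absolute KL values, so numbers from a sketch like this are only comparable to each other, not to the post's figures.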


Local AI • Performance benchmarks • Quantization • Model evaluation • Memory optimization