r/LocalLLaMA • 113일 전

Gemma 4 31B GGUF 양자화 모델 KL 발산 성능 순위

IMP

8/10

핵심 요약

oobabooga 사용자가 Hugging Face 주요 업로더들의 Gemma 4 31B GGUF 양자화 모델 52종의 품질을 KL 발산 지표로 비교 분석했습니다. 그 결과 파레토 최적화 기준 unsloth의 UD- 시리즈가 동일 용량 대비 가장 뛰어난 성능을 보여주었으며, 코딩 및 과학 분야보다 긴 문맥이나 비라틴어 텍스트 처리 시 품질 저하가 크게 나타났습니다. 이는 로컬 환경에서 LLM을 구동하는 사용자들에게 자신의 메모리 용량에 맞는 최적의 양자화 모델을 선택하는 중요한 가이드를 제공합니다.

번역된 본문

Gemma 4 31B GGUF 양자화 모델 KL 발산 기준 순위 (unsloth, bartowski, lmstudio-community, ggml-org)

oobabooga / 2026년 4월 7일

위 차트는 Hugging Face의 다음 업로더들이 제공한 Gemma 4 31B의 52개 GGUF 양자화 모델에 대해 측정된 KL 발산(KL divergence)을 보여줍니다:

unsloth (20개 모델)
bartowski (27개 모델)
ggml-org (2개 모델)
lmstudio-community (3개 모델)

KL 발산 수치는 낮을수록 좋습니다. 이는 양자화된 모델의 토큰 확률 분포가 원본 모델과 얼마나 다른지를 측정합니다. KL 수치가 0이라는 것은 양자화 모델이 원본 모델과 완전히 동일하다는 것을 의미합니다.

연구 방법론

모든 측정값은 text-generation-webui의 OpenAI 호환 API를 사용하여 수집되었습니다. llama.cpp는 기본적으로 프롬프트에서 logprobs를 추출하는 기능을 지원하지 않기 때문에, 저는 이를 패치했습니다. 이 패치는 최신 text-generation-webui 릴리즈에 포함된 llama.cpp 바이너리에 적용되어 있습니다. 소스 코드는 저의 llama.cpp 포크에 있는 해당 커밋에서 확인할 수 있습니다.

참조용 logprobs로는 unsloth의 BF16 GGUF 모델을 사용했습니다. 평가는 다음 세 단계로 진행됩니다:

모델에서 Jinja2 채팅 템플릿을 추출합니다. 모든 업로더 간의 비교를 동일하게 유지하기 위해 동일한 템플릿이 모든 양자화 모델에 사용됩니다.
각 샘플의 메시지는 instruction_template_str 키를 사용하여 /v1/internal/chat-prompt 엔드포인트를 통해 프롬프트 문자열로 변환됩니다.
각 프롬프트는 echo: true 및 logprobs: 40 설정과 함께 /v1/completions 엔드포인트로 전송되며, 이는 프롬프트 내의 모든 토큰에 대해 상위 40개 토큰 로그 확률을 반환합니다.
그런 다음 참조본과 양자화된 분포 사이에서 토큰 단위로 KL 발산이 계산됩니다.

데이터셋

대부분의 KL 발산 벤치마크는 컨텍스트 길이가 2048인 위키피디아 등을 사용합니다. 저는 실제 사용 사례를 바탕으로 KL 발산을 측정하고 싶었기 때문에, 6개 범주에 걸쳐 약 250,000개의 토큰으로 구성된 데이터셋을 구축했습니다:

코딩
일반 채팅
도구 호출
과학
비라틴 문자
긴 문서

각 샘플은 완전한 OpenAI 호환 입력 형식입니다. 도구 호출 샘플에는 도구 정의와 tool_calls가 포함되어 있으며, 이는 모델의 채팅 템플릿을 통해 프롬프트로 렌더링됩니다.

결과

위 차트를 로그 스케일로 나타낸 것은 다음과 같습니다:

unsloth와 bartowski가 동일한 양자화 타입(예: Q6_K, Q5_K_M, Q4_K_M 등)의 파일을 모두 제공하는 경우, bartowski의 파일은 최대 1.5GB 더 크지만 KL 수치가 약간 더 낮습니다. 파레토 최적화(Pareto frontier) 측면에서는 어느 한쪽이 절대적인 우위를 점하지 않으며, 파일 크기 범위에 따라 번갈아 가며 우수한 성능을 보여줍니다.

unsloth의 UD- 변형 모델들은 사용자 정의 양자화 방식을 사용하며, 동일한 크기 범위 내에서 표준 양자화 모델들을 능가하는 경향이 있습니다. 예를 들어, UD-Q3_K_XL(15.3GB, KL 0.87)은 크기가 1.5GB 더 작음에도 불구하고 bartowski의 Q3_K_L(16.8GB, KL 0.97)보다 더 나은 성능을 보여줍니다. 그러나 더 높은 비트율에서는 이러한 이점이 줄어듭니다. UD-Q6_K_XL(27.5GB, KL 0.20)은 bartowski의 Q6_K_L(27.1GB, KL 0.20)과 거의 동등한 성능을 보입니다.

lmstudio-community와 ggml-org의 양자화 모델들은 동일한 파일 크기에서 성능이 더 떨어집니다. 이들의 Q4_K_M 파일(18.7GB)은 unsloth의 것(18.3GB)과 크기가 비슷하지만 KL 수치가 unsloth의 0.61에 비해 0.76으로, 양자화 과정의 최적화가 덜 되었음을 시사합니다.

모든 업로더의 Q8_0 모델은 KL = 0.16으로 동일한 성능을 보입니다. 눈에 띄는 점은 unsloth의 UD-Q8_K_XL(35.0GB)이 Q8_0(32.6GB)보다 크기가 더 크고 성능(KL = 0.16)도 약간 떨어진다는 것입니다.

파레토 최적화(Pareto frontier)

파레토 최적화는 KL 발산이 더 낮으면서 더 작은 용량의 양자화 모델이 존재하지 않는 모델들의 집합을 의미합니다. 이는 사용자의 가용 메모리에 따라 선택할 수 있는 최적의 옵션들입니다:

Q5_K_S 수준 이하로 내려가면 품질이 급격히 떨어집니다. Top-1 일치율이 Q5_K_S에서 84%였던 것이 Q4_K_S에서는 80% 미만으로 떨어집니다.

범주별 세부 내용

KL 발산은 작업에 따라 균일하지 않습니다. Q8_0, Q6_K 및 Q5_K_S에 대한 세부 수치는 다음과 같습니다:

Q8_0 모델조차도 긴 문서에서는 0.45, 비라틴 문자에서는 0.24의 KL 발산을 보여줍니다. 모든 범주는 Q8_0에서 Q5_K_S로 갈수록 KL 수치가 대략 두 배가 되지만, 과학 및 도구 사용 범주는 전체적으로 가장 낮은 수치를 유지합니다(Q8_0에서 각각 0.07 및 0.08).

범주별 차트

범주별로 세분화된 전체 KL 발산 차트는 다음과 같습니다:

원문 보기

원문 보기 (영어)

Gemma 4 31B GGUF quants ranked by KL divergence (unsloth, bartowski, lmstudio-community, ggml-org) oobabooga Apr 07, 2026 7 Share The plot above shows the KL divergence measured for 52 GGUF quants of Gemma 4 31B, from the following uploaders on Hugging Face: unsloth (20 quants) bartowski (27 quants) ggml-org (2 quants) lmstudio-community (3 quants) Lower KL divergence is better. It measures how different the quantized model’s token probability distribution is relative to the original model. A KL of 0 would mean the quant is identical to the original. Methodology All measurements were collected using the OpenAI-compatible API in text-generation-webui . llama.cpp doesn’t natively support extracting logprobs from the prompt, so I patched it. The patch is included in the llama.cpp binary shipped with the latest text-generation-webui release. The source can be found in this commit on my llama.cpp fork. For the reference logprobs, I used the BF16 GGUF model by unsloth . The evaluation works in three steps: The Jinja2 chat template is extracted from the model. The same template is used for every quant to keep the comparison identical across uploaders. Each sample’s messages are converted into a prompt string using the /v1/internal/chat-prompt endpoint with the instruction_template_str key. Each prompt is sent to the /v1/completions endpoint with echo: true and logprobs: 40 , which returns the top-40 token log-probabilities for every token in the prompt. The KL divergence is then computed token-by-token between the reference and quantized distributions. Dataset Most KL divergence benchmarks use Wikipedia with a context length of 2048 or similar. I wanted to measure KL divergence across real-world use cases, so I built a dataset with ~250,000 tokens across 6 categories: Coding General chat Tool calling Science Non-Latin scripts Long documents Each sample is a full OpenAI-compatible input. Tool calling samples include tool definitions and tool_calls , which get rendered into the prompt via the model’s chat template. Results Here is the plot above in log scale: For every quant type where both unsloth and bartowski provide a file (Q6_K, Q5_K_M, Q4_K_M, etc.), bartowski’s file is up to 1.5 GB larger and has slightly lower KL. On the Pareto frontier, neither dominates the other: they alternate depending on the size range. Unsloth’s UD- variants use a custom quantization scheme and tend to beat standard quants in their size range. For example, UD-Q3_K_XL (15.3 GB, KL 0.87) outperforms bartowski’s Q3_K_L (16.8 GB, KL 0.97) despite being 1.5 GB smaller. At higher bit rates the advantage shrinks: UD-Q6_K_XL (27.5 GB, KL 0.20) is essentially tied with bartowski’s Q6_K_L (27.1 GB, KL 0.20). The lmstudio-community and ggml-org quants are worse at the same file size. Their Q4_K_M files (18.7 GB) are similar in size to unsloth’s (18.3 GB) but have a KL of 0.76 vs unsloth’s 0.61, suggesting a less optimized quantization process. Q8_0 is identical across all uploaders at KL = 0.16. Notably, unsloth’s UD-Q8_K_XL (35.0 GB) is both larger and slightly worse (KL = 0.16) than Q8_0 (32.6 GB). Pareto frontier The Pareto frontier is the set of quants for which no smaller quant exists with lower KL divergence. These are the optimal choices depending on how much memory you have: Below Q5_K_S, quality drops fast. Top-1 agreement falls from 84% at Q5_K_S to under 80% at Q4_K_S. Per-category breakdown KL divergence is not uniform across tasks. Here is the breakdown for Q8_0, Q6_K, and Q5_K_S: Even Q8_0 shows a KL of 0.45 on long documents and 0.24 on non-Latin scripts. All categories roughly double from Q8_0 to Q5_K_S, but science and tool use remain the lowest throughout (0.07 and 0.08 at Q8_0). Per-category plots Here is the full KL divergence plot broken down by category: 7 Share

로컬-LLM 양자화 Gemma-4 벤치마크

Gemma 4 기반 화면 관찐 워크플로 자동 스킬화

오픈소스 Mac 메뉴바 앱 AgentHandover가 로컬 Gemma 4(Ollama)로 화면을 관찰해 반복 워크플로를 구조화된 Skill 파일로 자동 생성합니다. MCP를 통해 Claude Code, Cursor 등 어떤 에이전트든 즉시 연동 가능하며, 전 과정이 온디바이스에서 암호화되어 처리되어 프라이버시가 강력합니다.

에이전트 로컬 모델 워크플로 자동화

r/LocalLLaMA • 112일 전

IMP 8

8GB VRAM으로 Gemma 4 로컬 파인튜닝 및 버그 수정 안내

Unsloth에서 무료 노트북을 통해 Gemma 4 E2B 및 E4B 모델을 파인튜닝할 수 있게 되었습니다. 단 8GB VRAM만으로도 로컬 환경에서 학습이 가능하며, 기존 대비 약 1.5배 빠르고 60% 적은 VRAM을 사용합니다. 또한 학습 시 Loss 폭주, 추론 오류 등 4가지 핵심 버그를 수정하여 안정적인 학습 및 추론 환경을 제공합니다.

Gemma-4 파인튜닝 오픈소스