r/LocalLLaMA • 102 days ago

Qwen3.6 GGUF Benchmarks and Quantization Error Corrections

IMP

8/10

Key Summary

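AI optimization company Unsloth has published KLD benchmark results for its Qwen3.6-35B-A3B GGUF quants, reporting that its own quants give the best trade-off between quality and disk size. The post also addresses a community misconception about Unsloth's frequent re-uploads, and presents data to argue that most past issues, including the MiniMax 2.7 NaN errors, stemmed from external factors rather than Unsloth's own mistakes.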

Post Body

Hey guys, we ran Qwen3.6-35B-A3B GGUF KLD (Kullback-Leibler divergence) performance benchmarks to help you choose the best quant.
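For readers new to the metric, here is a minimal sketch of the KL divergence between a reference model's next-token distribution and a quantized model's. The logits are hypothetical, and `softmax` and `kl_divergence` are illustrative helpers, not the tooling used for these benchmarks.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(P || Q) in nats, where P is the reference distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for a single token position (not real benchmark data).
reference_logits = [2.0, 1.0, 0.1]
quantized_logits = [1.9, 1.1, 0.1]

p = softmax(reference_logits)
q = softmax(quantized_logits)
kld = kl_divergence(p, q)
print(f"KLD = {kld:.6f}")
```

A lower KLD means the quantized model's predictions stay closer to the reference model's. A benchmark would average this value over many token positions.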

Unsloth quants have the best KLD vs. disk space in 21 of 22 cases on the Pareto frontier.
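The Pareto frontier here is the set of quants that no other quant beats on both disk size and KLD at once. A minimal sketch follows, assuming made-up quant names and measurements; `pareto_frontier` is a hypothetical helper, not the script behind the published chart.

```python
def pareto_frontier(quants):
    """Return quants not dominated on (size_gb, kld); lower is better for both."""
    frontier = []
    best_kld = float("inf")
    # Walk quants from smallest to largest. A quant joins the frontier only if
    # its KLD is lower than that of every quant at or below its size.
    for name, size_gb, kld in sorted(quants, key=lambda item: (item[1], item[2])):
        if kld < best_kld:
            frontier.append((name, size_gb, kld))
            best_kld = kld
    return frontier

# Hypothetical quants: (name, disk size in GB, mean KLD). Not real measurements.
quants = [
    ("A-Q2", 12.0, 0.120),
    ("B-Q2", 12.5, 0.150),  # larger and worse than A-Q2, so dominated
    ("A-Q4", 20.0, 0.020),
    ("B-Q4", 19.0, 0.030),
    ("B-Q6", 28.0, 0.025),  # larger and worse than A-Q4, so dominated
]

frontier = pareto_frontier(quants)
for name, size_gb, kld in frontier:
    print(f"{name}: {size_gb} GB, KLD {kld}")
```

Counting how many frontier entries belong to each provider gives a tally of the same kind as the 21 of 22 figure above.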

GGUFs: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

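We also want to clear up a few misunderstandings around our GGUF updates. Some people have said that we re-upload often because of our own mistakes, or that issues like the CUDA 13.2 gibberish output are just excuses.

We understand the concern, but the reality is that we tend to publicize issues quickly and tell people to update. In roughly 95% of cases, the root causes were out of our hands. We just try to be transparent and keep the community informed.

A few examples: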

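Gemma 4 was re-uploaded 4 times. Three of those were due to about 10 to 20 llama.cpp bug fixes, some of which we helped investigate and contributed fixes for. The fourth was an official Gemma chat template improvement from Google. Every provider had to update, not just us. The llama.cpp PR list (https://github.com/search?q=repo%3Aggml-org%2Fllama.cpp+%22gemma+4%22++is%3Amerged+created%3A%3E2026-04-01&type=pullrequests) shows about 30 PR fixes and improvements for Gemma-4.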

MiniMax 2.7 NaNs. We found NaNs in 38% of Bartowski's quants (10/26) and 22% of ours (5/23). We identified a fix and have already patched ours; see https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/ for details. Bartowski has not patched yet, but is actively working on it.

10/26 Bartowski quants (38%) had NaNs (https://huggingface.co/bartowski/MiniMaxAI_MiniMax-M2.7-GGUF). Chunk-32 failures (9): IQ3_XXS, IQ3_XS, IQ3_M, Q3_K_M, Q3_K_L, Q3_K_XL, Q4_K_S, Q4_1, Q5_K_S. Late failure (1): IQ1_S (crashed at chunk 311).
5/23 of our quants (22%) had NaNs, all fixed now (https://huggingface.co/unsloth/MiniMax-M2.7-GGUF): UD-Q4_K_S, UD-Q4_K_M, UD-Q4_K_XL, UD-Q5_K_S, MXFP4_MOE. All at block 32.
AesSedai's Q4_K_M (https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF) was re-provided with our Q6_K trick.
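The reported shares can be checked with simple arithmetic, and a NaN failure can be located by scanning per-chunk values. The sketch below is illustrative only: `share_percent` and `first_nan_chunk` are hypothetical helpers, the per-chunk values are made up, and the 1-based chunk index is an assumption, since the post does not say how chunks are numbered.

```python
import math

def share_percent(failed, total):
    """Share of quants that produced NaNs, rounded to the nearest whole percent."""
    return round(100 * failed / total)

def first_nan_chunk(per_chunk_values):
    """Return the 1-based index of the first NaN value, or None if there is none."""
    for index, value in enumerate(per_chunk_values, start=1):
        if math.isnan(value):
            return index
    return None

# Counts as reported in the post.
bartowski_share = share_percent(10, 26)
unsloth_share = share_percent(5, 23)
print(bartowski_share, unsloth_share)

# Hypothetical per-chunk metric values with a NaN injected at chunk 32.
values = [0.01] * 31 + [float("nan")] + [0.02]
print(first_nan_chunk(values))
```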

Qwen3.5 SSM issues. We shared 7TB of research artifacts showing which layers should not be quantized. The issue was not that other providers' quants were broken, but that they were not optimal, mainly around the ssm_out and ssm_* tensors. We have since improved ours and now lead on KLD vs. disk space for Qwen3.5 as well.
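One way to picture "which layers should not be quantized" is a per-tensor plan that keeps sensitive tensors at higher precision. This is a sketch under stated assumptions: only the `ssm_out` and `ssm_*` patterns come from the post. The full tensor names, the type labels, and `plan_quant_types` are hypothetical and do not describe Unsloth's actual recipe.

```python
from fnmatch import fnmatchcase

# Pattern taken from the post ("ssm_*" also covers "ssm_out").
# Matching it against full tensor names is an assumption.
SENSITIVE_PATTERNS = ["*ssm_*"]

def plan_quant_types(tensor_names, default_type="Q4_K", protected_type="BF16"):
    """Map each tensor name to a type label, protecting sensitive tensors."""
    plan = {}
    for name in tensor_names:
        sensitive = any(fnmatchcase(name, pattern) for pattern in SENSITIVE_PATTERNS)
        plan[name] = protected_type if sensitive else default_type
    return plan

# Hypothetical tensor names for illustration.
tensor_names = [
    "blk.0.ssm_out.weight",
    "blk.0.ssm_conv1d.weight",
    "blk.0.attn_q.weight",
    "blk.0.ffn_down_exps.weight",
]

plan = plan_quant_types(tensor_names)
for name, quant_type in plan.items():
    print(f"{name}: {quant_type}")
```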

Most if not all quant providers then take our findings and update their quants. We discussed our analysis and research at https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwen3535ba3b_unsloth_dynamic_ggufs_benchmarks/


Open Source, Local LLM, Quantization, GGUF, Benchmarks

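Alibaba's Qwen3.6 Outperforms Google's Gemma 4

Alibaba has released 'Qwen3.6-35B-A3B', a new open-source AI model with 35 billion parameters. The model uses a Mixture-of-Experts architecture to cut compute costs, while reportedly outperforming Google's Gemma 4 on coding and reasoning benchmarks and performing on par with Claude Sonnet 4.5. It is available immediately through Qwen Studio, the API, or Hugging Face.

Alibaba, Qwen3.6, Open-Source Models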