r/LocalLLaMA • 103 days ago

Bonsai models are nothing but hype

IMP

4/10

Key summary

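The newly released Bonsai-8B model, in both its 1-bit and 1.58-bit (ternary) quantized versions, gave markedly less intelligent and less accurate answers than Google's Gemma-4-E2B. The 1.58-bit model's file is also 33% larger than Gemma's, a serious drawback that suggests it has little practical value.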

Translated post

I'm using the https://github.com/PrismML-Eng/llama.cpp fork for Bonsai and regular llama.cpp for Gemma.

Sizes without embedding parameters:

- Gemma 4: 2.3B parameters at 4.8 bpw (Q4_K_M) = 1104 MB
- Bonsai-8B: 6.95B parameters at 1.125 bpw (Q1_0) = 782 MB (only 29% smaller)
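The size comparison above is just parameter count times bits per weight. Here is a minimal sketch of that arithmetic, assuming decimal megabytes (10^6 bytes) and 8 bits per byte; the function names are hypothetical, not from llama.cpp. Under that convention the absolute sizes come out near 1380 MB and 977 MB, whereas the quoted figures match dividing by 10 rather than 8. The 29% gap is unaffected, since the divisor cancels in the ratio.

```python
def weight_size_mb(params: float, bpw: float) -> float:
    """Approximate weight storage in MB (10^6 bytes): params * bits-per-weight / 8."""
    return params * bpw / 8 / 1e6


def relative_size(a_params: float, a_bpw: float, b_params: float, b_bpw: float) -> float:
    """Size of model A relative to model B; the bits-to-bytes divisor cancels out."""
    return (a_params * a_bpw) / (b_params * b_bpw)


# Figures quoted in the post (embedding parameters excluded).
gemma = (2.3e9, 4.8)          # Gemma 4, Q4_K_M
bonsai_1bit = (6.95e9, 1.125)  # Bonsai-8B, Q1_0

ratio = relative_size(*bonsai_1bit, *gemma)
print(f"Gemma 4 Q4_K_M: {weight_size_mb(*gemma):.0f} MB")
print(f"Bonsai-8B Q1_0: {weight_size_mb(*bonsai_1bit):.0f} MB")
print(f"Bonsai is {(1 - ratio) * 100:.0f}% smaller")
```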

I could have gone with a smaller quant of Gemma 4, but conventional wisdom is not to push small models below Q4_K_M, so I left it there. I might try their ternary model later, but I don't have much hope...

[UPDATE]

I tried the 1.58-bit/ternary model (https://huggingface.co/prism-ml/Ternary-Bonsai-8B-mlx-2bit), and its answers were somehow even more wrong than the 1-bit model's. 6.95B parameters at 2.125 bpw comes to 1477 MB, 33% larger than Gemma!
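One way to read the 1.125 and 2.125 bpw figures: a ternary weight carries log2(3) ≈ 1.585 bits of information, but a 2-bit container stores it in 2 bits, and per-group scales add overhead on top. The sketch below assumes one 16-bit scale per group of 128 weights. That layout is an assumption chosen because it reproduces both bpw figures; the post does not confirm it. The helper name is hypothetical.

```python
import math


def effective_bpw(code_bits: int, group_size: int = 128, scale_bits: int = 16) -> float:
    """Bits per weight including an assumed per-group scale (hypothetical layout)."""
    return code_bits + scale_bits / group_size


ternary_information = math.log2(3)  # information content of a value in {-1, 0, +1}
one_bit_bpw = effective_bpw(1)      # 1-bit codes plus assumed scale overhead
two_bit_bpw = effective_bpw(2)      # ternary values packed into 2-bit codes

# Ternary Bonsai-8B vs. Gemma 4 Q4_K_M, using the post's parameter counts and bpw.
ternary_vs_gemma = (6.95e9 * two_bit_bpw) / (2.3e9 * 4.8)

print(f"log2(3) = {ternary_information:.3f} bits")
print(f"1-bit: {one_bit_bpw} bpw, 2-bit container: {two_bit_bpw} bpw")
print(f"Ternary Bonsai is {(ternary_vs_gemma - 1) * 100:.1f}% larger than Gemma")
```

The ratio works out to roughly a third larger, in line with the 33% quoted above.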

Tested in the latest version of oMLX: https://i.imgur.com/NsNNwzj.png


Original text (English)

I'm using the https://github.com/PrismML-Eng/llama.cpp fork for Bonsai, regular llama.cpp for Gemma.

Without embedding parameters:

- Gemma 4 has 2.3B at 4.8 bpw (Q4_K_M) = 1104 MB
- Bonsai-8B has 6.95B at 1.125 bpw (Q1_0) = 782 MB (only 29% smaller)

I could've gone with a smaller quant of Gemma 4, it's just conventional wisdom to not push small models beyond Q4_K_M. I might try their ternary model later, but I don't have much hope...

# [UPDATE]

Tried the 1.58 bit/ternary model (https://huggingface.co/prism-ml/Ternary-Bonsai-8B-mlx-2bit), its answers were somehow even more wrong than the 1-bit one. 6.95B parameters at 2.125 bpw is 1477 MB, **33% LARGER** than Gemma!

Tested in latest version of oMLX: https://i.imgur.com/NsNNwzj.png

Model benchmark Quantization Open-source LLM Gemma