r/LocalLLaMA • 84 days ago

Dense model showdown: is slower actually faster?

IMP: 6/10

Key summary

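This post compares the recent small dense model Qwen3.6 27B with its predecessor (Qwen3.5 27B) and with Gemma 4 31B across several dimensions. Qwen3.6 shows clear gains on math and world-knowledge benchmarks, but Gemma 4 remains strongly competitive on non-agentic tasks overall and on instruction following. From a practical standpoint, it is a useful analysis of each model's accuracy and efficiency, and of what lies behind benchmark results that defy expectations.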

Translated text

Qwen3.6 27B vs Qwen3.5 27B vs Gemma 4 31B: Accuracy, Latency, Memory, and Token Efficiency Tested. Qwen3.6 improves on Qwen3.5, but Gemma 4 remains surprisingly competitive. (Benjamin Marie, May 5, 2026)

In a previous article, I found Gemma 4 31B to be superior or comparable to Qwen3.5 27B in most areas, with similar or better accuracy and lower latency. (Read: Gemma 4 31B vs Qwen3.5 27B: Inference Speed, Token-Efficiency, Accuracy, and Memory Consumption)

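However, the Qwen3.6 update likely changes that conclusion. On benchmarks, Qwen3.6 appears significantly stronger than Qwen3.5. The best accuracy in this model class (models of similar size) is therefore probably now achieved by Qwen3.6, but at what cost? And how does that trade-off vary with the task?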

The Kaitchup – AI on a Budget is a reader-supported publication.

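I'll answer these questions in this article. I measured accuracy, latency, and token efficiency with thinking mode both enabled and disabled, and compared the results with the numbers I previously obtained for Qwen3.5 27B and Gemma 4 31B. I used exactly the same setup so that all results are directly comparable.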

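Acknowledgments

This article would not have been possible without the compute sponsorship generously provided by Verda, whose B200 and RTX Pro 6000 GPUs I used throughout this work. Verda provides access to high-end GPUs such as the B200 and B300, with GB300 support coming soon, as well as smaller GPUs such as the RTX 6000 Ada, which are among the most affordable per hour on the market. Verda is a European, AI-focused cloud and GPU infrastructure provider with sovereignty, sustainability, data privacy, and performance at its core.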

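Qwen3.6 27B: Much Better Accuracy than Gemma 4 and Qwen3.5?

Before diving into the results, I should note that all the benchmarks I used are non-agentic and do not use tool calls. This setup is not particularly favorable to Qwen3.6, which appears to have been fine-tuned specifically for stronger agentic performance. Indeed, on several of the benchmarks I ran, Qwen3.6 falls behind both Qwen3.5 and Gemma 4, while on a few others it is significantly better.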

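Hard math questions (AIME): On hard math questions, as measured by AIME, Qwen3.6 significantly outperforms both Qwen3.5 and Gemma 4. It also performs better than Qwen3.5 on simpler math benchmarks such as Math 500.
Single-turn coding (LiveCodeBench): On single-turn coding tasks, measured with LiveCodeBench, Qwen3.6 improves over Qwen3.5 but still slightly underperforms Gemma 4.
World knowledge (MMLU Pro): Qwen3.6 also shows stronger world knowledge. On MMLU Pro, it answers more accurately than both Qwen3.5 and Gemma 4.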

The strange results

Some results were more surprising.

First, Qwen3.6 is significantly worse than Qwen3.5 at following instructions, as measured by IFBench.

Second, Qwen3.6 also scores significantly worse on GPQA Diamond. This was unexpected, since Qwen reports a 2.3-point improvement over Qwen3.5 on this benchmark. Was something wrong with my setup?

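Whenever I doubt my own results, I cross-check them against Artificial Analysis, since they also run some of these benchmarks. In this case, they found the same thing: Qwen3.5 performs better than Qwen3.6 on GPQA Diamond. This likely means Qwen ran something different: different hyperparameters, a different version of the benchmark, different post-processing, or some other variation in the evaluation setup. It is a useful reminder that benchmark scores published by different groups are not directly comparable.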

Overall, Qwen3.6 is only slightly better than Qwen3.5 on average, and it still underperforms Gemma 4 on this type of non-agentic task.

Of course, benchmark scores only tell part of the story. I ran additional analyses to better understand these results. One analysis I now run systematically is CoDeC scoring...


Original text (English)

Qwen3.6 27B vs Qwen3.5 27B vs Gemma 4 31B: Accuracy, Latency, Memory, and Token Efficiency Tested

Qwen3.6 improves on Qwen3.5, but Gemma 4 remains surprisingly competitive.

Benjamin Marie, May 05, 2026

In a previous article, I found Gemma 4 31B to be superior or comparable to Qwen3.5 27B in most areas, with similar or better accuracy and lower latency. (See: Gemma 4 31B vs Qwen3.5 27B: Inference Speed, Token-Efficiency, Accuracy, and Memory Consumption, Benjamin Marie, Apr 15)

However, the Qwen3.6 update likely changes that conclusion. On benchmarks, it appears to be significantly stronger than Qwen3.5. The best accuracy in this model class is therefore probably now achieved by Qwen3.6, but at what cost? And how does that trade-off vary depending on the task?

The Kaitchup – AI on a Budget is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

I'll answer these questions in this article. I measured accuracy, latency, and token efficiency, both with thinking enabled and disabled, and compared the results with the numbers I previously obtained for Qwen3.5 27B and Gemma 4 31B. I used the exact same setup to ensure that all results are directly comparable.

Acknowledgments

This article would not have been possible without the compute sponsorship generously provided by Verda, whose B200 and RTX Pro 6000 GPUs I used throughout this work. Verda provides access to high-end GPUs such as the B200 and B300, with GB300 support coming soon, as well as smaller GPUs such as the RTX 6000 Ada, which are among the most affordable per hour on the market. Verda is a European, AI-focused cloud and GPU infrastructure provider with sovereignty, sustainability, data privacy, and performance at its core.

Qwen3.6 27B: Much Better Accuracy than Gemma 4 and Qwen3.5?
Before diving into the results, I should note that all the benchmarks I used are non-agentic and do not use tool calls. This setup is not particularly favorable to Qwen3.6, which appears to have been fine-tuned specifically for stronger agentic performance. And indeed, on several of the benchmarks I ran, Qwen3.6 performs behind both Qwen3.5 and Gemma 4, while for a few benchmarks, it's significantly better.

Hard math questions (AIME): On hard math questions, as measured by AIME, Qwen3.6 significantly outperforms both Qwen3.5 and Gemma 4. It also performs better than Qwen3.5 on simpler math benchmarks such as Math 500.
Single-turn coding (LiveCodeBench): On single-turn coding tasks, measured with LiveCodeBench, Qwen3.6 improves over Qwen3.5 but still slightly underperforms Gemma 4.
World knowledge (MMLU Pro): Qwen3.6 also shows stronger world knowledge. On MMLU Pro, it answers more accurately than both Qwen3.5 and Gemma 4.

The strange results

Some results were more surprising.

First, Qwen3.6 is significantly worse than Qwen3.5 at following instructions, as measured by IFBench.

Second, Qwen3.6 is also significantly worse on GPQA Diamond. This was unexpected, since Qwen reports a 2.3-point improvement over Qwen3.5 on this benchmark. Was something wrong with my setup?

Whenever I doubt my own results, I cross-check them against Artificial Analysis, since they also run some of these benchmarks. In this case, they found the same thing: Qwen3.5 performs better than Qwen3.6 on GPQA Diamond. This likely means Qwen ran something different: different hyperparameters, a different version of the benchmark, different post-processing, or some other variation in the evaluation setup. It is a useful reminder that benchmark scores published by different groups are not directly comparable.

Overall, Qwen3.6 is only slightly better than Qwen3.5 on average, and it still underperforms Gemma 4 on this type of non-agentic task.

Of course, benchmark scores only tell part of the story.
I ran additional analyses to better understand these results. One analysis I now run systematically is CoDeC scoring, which identifies benchmarks where a model appears more comfortable, suggesting possible special training or fine-tuning on benchmark-like data. (See: Did the Model See the Benchmark During Training? Detecting LLM Contamination, Benjamin Marie, Feb 2)

My CoDeC run confirms that Qwen3.6 is much more comfortable with AIME-style benchmarks. It reaches a score above 62, which is very rare. By contrast, AIME-like data appears to be much newer to Gemma 4 31B IT. This helps explain why Qwen3.6 improves so much over Qwen3.5 on AIME.

I also examined how Qwen3.6 behaves on LiveCodeBench, since this is one of the benchmarks where Gemma 4 remains significantly ahead. Interestingly, the accuracy gap can be closed by leveraging random sampling, that is, by looking at pass@k rather than pass@1. At pass@4, Gemma 4 31B and Qwen3.6 27B achieve nearly the same accuracy. (See: How to Reduce LLM Inference Cost and Improve Accuracy with Pass@k and Majority Voting, Benjamin Marie, Apr 27)

However, as we will see later, this does not make Qwen3.6 the better choice in practice. Because of its much higher cost, using Gemma 4 31B remains substantially cheaper when targeting the same level of coding accuracy.

Qwen3.6 27B: Token Efficiency and Latency

This post is for paid subscribers.
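For readers unfamiliar with the pass@1 versus pass@k comparison used for LiveCodeBench above, here is a minimal Python sketch of how pass@k is commonly computed. It uses the widely cited unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of sampled solutions per problem and c the number that pass. The article does not say which estimator the author used, so treat this as an assumption. The function names and the toy counts are hypothetical and are not results from the article.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct.

    Assumption: this is the standard unbiased estimator, not necessarily
    the exact procedure used in the article.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k wrong samples: any draw of k contains a correct one.
        return 1.0
    # Probability that k samples drawn without replacement are all wrong,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given as (n_samples, n_correct) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem counts for illustration only (not article data).
toy_results = [(8, 8), (8, 4), (8, 1), (8, 0)]

for k in (1, 4):
    print(f"pass@{k} = {mean_pass_at_k(toy_results, k):.3f}")
```

Problems that are solved only occasionally contribute far more at k=4 than at k=1, which is how a model that trails at pass@1 can close the gap at pass@4. The catch is cost: each extra sample generates a full response, so token spend grows roughly in proportion to k.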

Open-source models, Benchmarks, Local LLM, AI performance comparison, Google vs Alibaba