r/LocalLLaMA • 87 days ago

Qwen3.6-27B vs Coder-Next: model comparison results

IMP

7/10

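Key summary

In roughly 20 hours of compute on two RTX PRO 6000 GPUs, an in-depth comparison of Qwen3.6-27B and Coder-Next found the two models statistically tied across the overall benchmark. Notably, Qwen3.6-27B shipped most consistently with thinking disabled, at a 95.8% success rate, while Coder-Next hit a perfect success rate on specific tasks such as bounded business-document writing at 60 to 100x lower cost. Each model showed distinct strengths.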

Translated post

I burned about 20 hours of compute on two RTX PRO 6000 Blackwell GPUs trying to get a definitive answer on which of these two models is clearly better. As with many things in life, after a lot of tokens and electricity the answer was "it depends."

These two models are remarkably well matched overall. They score similarly across a wide range of tests and scenarios, and they succeed and fail in different places. Across the 4 test cells I ran at N=10, Coder-Next shipped 25/40 and 27B-thinking shipped 30/40. With overlapping Wilson confidence intervals (CIs), that is a statistical tie.
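The "statistically tied" reading above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the standard Wilson score interval at z = 1.96 (about 95%); the function names `wilson_interval` and `intervals_overlap` are illustrative and not taken from the linked repository, and the post does not say exactly how its intervals were computed.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Ship counts quoted in the post: 4 cells at N=10 each, so 40 runs per model.
coder_next = wilson_interval(25, 40)
qwen_thinking = wilson_interval(30, 40)

print(f"Coder-Next   25/40: [{coder_next[0]:.1%}, {coder_next[1]:.1%}]")
print(f"27B-thinking 30/40: [{qwen_thinking[0]:.1%}, {qwen_thinking[1]:.1%}]")
print("Intervals overlap:", intervals_overlap(coder_next, qwen_thinking))
```

Overlapping intervals are a rough screen rather than a formal hypothesis test, but with only 40 runs per model they are enough to show that a 25 vs 30 split does not separate the two.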

On the face of it, that makes sense. 27B is a later-generation dense model that leans heavily on thinking. Coder-Next has roughly 3x the parameters but is a Mixture of Experts (MoE) model that activates only 3B at a time. Depending on what you are trying to do, either could be the right choice.

Interestingly, 27B with thinking disabled (--no-think) was the most consistent shipper of work: 95.8% across the full 12-cell grid at N=10 (Wilson 95% [90.5%, 98.2%]). The weights are the same as 27B-thinking, with only thinking turned off. A side-by-side hand-graded read of the cells where both variants shipped found that the substantive output is preserved; the difference is the verbosity of the reasoning prose, not the output decisions. The "thinking-trace as loop substrate" mechanism turned out to be real: the documented word-trim loop on doc-synthesis halves with thinking off (4/10 → 2/10).

For the record, 3.6-35B-A3B failed on tasks so often that it did not seem worth carrying forward against the other two. I kept its folder as failure-mode evidence.

Over a few days I threw a lot of difficult tests at these models and kept my two GPUs very warm and very busy. I started this mainly because traditional benchmarks felt like they were being gamed, with scores raised through shortcuts. So I wanted to drop these models into a genuinely messy environment, push them hard, and see what happened.

I gave them tasks ranging from ones they could win to ones they were essentially destined to fail, and studied how they succeeded and failed. The most lopsided single result: on a live market-research task, Coder-Next shipped 0/10 while 27B shipped 8/10 (Wilson 95% [0%, 27.8%] for the Coder-Next collapse, and the result was reproducible). The inverse: on bounded business-memo and doc-synthesis tasks, Coder-Next shipped 10/10 at 60 to 100x lower cost-per-shipped-run than either 27B variant. Same two models, very different shapes of "good at."
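For readers unfamiliar with the cost-per-shipped-run metric mentioned above, here is a minimal sketch of one plausible definition: total spend across every run in a cell, failed runs included, divided by the number of runs that shipped. That definition is an assumption, since the post does not spell out the formula, and every number below is a hypothetical placeholder rather than measured data.

```python
import math


def cost_per_shipped_run(run_costs: list[float], shipped: list[bool]) -> float:
    """Total cost of all runs (failures included) divided by the number that shipped.

    Assumed definition; the post does not state the exact formula.
    """
    if len(run_costs) != len(shipped):
        raise ValueError("run_costs and shipped must have the same length")
    ships = sum(shipped)
    if ships == 0:
        return math.inf  # nothing shipped, so each shipped run is infinitely expensive
    return sum(run_costs) / ships


# Hypothetical cell A: 10 cheap runs, all shipped.
cheap_cell = cost_per_shipped_run([0.01] * 10, [True] * 10)

# Hypothetical cell B: 10 expensive runs, 8 shipped. The 2 failures still cost money.
pricey_cell = cost_per_shipped_run([1.00] * 10, [True] * 8 + [False] * 2)

print(f"cheap cell:  {cheap_cell:.4f} per shipped run")
print(f"pricey cell: {pricey_cell:.4f} per shipped run")
print(f"ratio:       {pricey_cell / cheap_cell:.0f}x")
```

Charging failed runs to the shipped ones means a model that is cheap per run but rarely ships can still come out expensive, which is why the metric is more informative than raw cost per run.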

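There is a lot of data, which I tried to make easy to sort through, and for now this is all about thoroughly comparing these two models.

Either way, I'm sleepy now. Leave your thoughts or questions in the comments; the GitHub repository is linked below. I'll say more when I'm not about to pass out, lol.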

https://github.com/Light-Heart-Labs/MMBT-Messy-Model-Bench-Tests

Original post (English)

Burned about 20 hours of side-by-side compute on my two RTX PRO 6000 Blackwells trying to get a definitive answer on which of these two models was clearly better. As with many things in life, after many tokens and kWhs later the answer was "it depends."

These models in the aggregate are actually crazy well matched against each other: scoring similarly overall across a wide range of tests and scenarios, hitting and missing on different things, failing and succeeding in different ways. Across the 4 cells I ran at N=10, Coder-Next 25/40 ships, 27B-thinking 30/40, statistically tied with overlapping Wilson CIs.

On the face of that, it kind of makes sense. 27B is a later-gen dense model that's high on thinking. Coder-Next has roughly 3x the parameters to work with but only activates 3B at a time as it works. Depending on what you're trying to do, either could be the correct choice.

Kind of interestingly, 27B with thinking disabled was the most consistent shipper of work: 95.8% across the full 12-cell grid at N=10 (Wilson 95% [90.5%, 98.2%]). Same model weights as 27B-thinking, just `--no-think`. A side-by-side hand-graded read on the both-ship cells found substantive output is preserved; the difference is verbosity of reasoning prose, not output decisions. The "thinking-trace as loop substrate" mechanism turned out to be real: the documented word-trim loop on doc-synthesis halves with no-think (4/10 → 2/10).

3.6-35B-A3B pretty much fell flat on its face so often for tasking that it didn't seem worth carrying on to keep comparing against the other two. Folder kept as failure-mode evidence.

I tossed a lot of crazy stuff at these models over the course of a few days and kept my two GPUs very warm and very busy in the process. I jumped into this mainly because, for lack of a better term, I felt like the traditional benchmarks were being gamed. So I wanted to just chuck these guys in the dirt and abuse them and see what happened.

Give them tasks they could win, tasks where they were essentially destined to fail, study how they won and failed and what that looked like. The most lopsided single result: Coder-Next 0/10 on a live market-research task where 27B was 8/10 (Wilson 95% [0%, 27.8%] for the Coder-Next collapse, reproducible). Inverse: Coder-Next ships 10/10 on bounded business-memo and doc-synthesis tasks at 60–100x lower cost-per-shipped-run than either 27B variant. Same models, very different shapes of "good at."

There's a ton of data, I tried to make it easy to sort through, and right now this is all pretty much just about thoroughly comparing these two models.

Either way, I'm sleepy now. Let me know your thoughts or if you have any questions, and the repo is below. I'll talk more about this when I'm not looking to pass out lol.

https://github.com/Light-Heart-Labs/MMBT-Messy-Model-Bench-Tests

Open-source models, Benchmarks, Qwen, Local AI, Cost efficiency