r/LocalLLaMA · 14 days ago

Testing llama.cpp MTP support on Qwen3.6 - RTX 5090

Setup:

- RTX 5090, 32 GB, Linux
- Built llama.cpp from 4f13cb7. The official `ghcr.io/ggml-org/llama.cpp:server-cuda` image hasn't picked up the merge yet as of writing, so I had to docker build from source with CUDA_DOCKER_ARCH=120.
- Unsloth's Qwen3.6-27B-MTP-GGUF Q5_K_M and Qwen3.6-35B-A3B-MTP-GGUF UD-Q4_K_M
- 128k context, flash-attn, q8_0 KV cache, temp 0.8, --parallel 1 (required for MTP)
- Same GGUF for "MTP on" and "MTP off": only the `--spec-type draft-mtp --spec-draft-n-max 3` flags are toggled. This isolates MTP from quant differences.
- 2 prompts: "short story about a cat" (~400 tokens) and "Flappy Bird clone as a single HTML file" (~3000 tokens)
- 3 seeds per config, averaged
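For anyone wanting to reproduce the on/off comparison, here is a rough Python sketch of how the two server command lines and the per-seed averaging could be wired up. Only `--spec-type draft-mtp --spec-draft-n-max 3` and `--parallel 1` come from the setup above; the other flag spellings (`--ctx-size`, `--flash-attn on`, `--cache-type-k/v`), the `llama-server` binary name, and the helper names `server_command` and `speedup` are my assumptions, so check them against `llama-server --help` for your build. The model path in the usage note is a placeholder, not a real filename.

```python
from statistics import mean

# Flags shared by both runs, mirroring the setup list.
# ASSUMPTION: these spellings match the llama-server CLI at the tested commit.
BASE_FLAGS = [
    "--ctx-size", "131072",      # 128k context
    "--flash-attn", "on",
    "--cache-type-k", "q8_0",    # q8_0 KV cache
    "--cache-type-v", "q8_0",
    "--temp", "0.8",
    "--parallel", "1",           # the post notes this is required for MTP
]

# The only flags that differ between "MTP on" and "MTP off" (from the post).
MTP_FLAGS = ["--spec-type", "draft-mtp", "--spec-draft-n-max", "3"]


def server_command(model_path, mtp):
    """Build the argv list for one configuration (hypothetical helper)."""
    cmd = ["llama-server", "--model", model_path, *BASE_FLAGS]
    if mtp:
        cmd += MTP_FLAGS
    return cmd


def speedup(tps_off, tps_on):
    """Ratio of mean tokens/sec with MTP on vs off, averaged over seeds."""
    return mean(tps_on) / mean(tps_off)
```

Calling `server_command("path/to/model.gguf", mtp=True)` and `server_command("path/to/model.gguf", mtp=False)` gives two argv lists that differ only by the MTP flags, which is the same-GGUF isolation described above. Feed `speedup` the three per-seed tokens/sec readings for each configuration.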