r/LocalLLaMA • 86 days ago

Llama.cpp adds beta support for MTP (multi-token prediction)

IMP

8/10

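Key summary

llama.cpp, the local LLM inference engine, has added beta support for MTP (multi-token prediction), which drafts several tokens per step to speed up generation. The MTP model loads automatically from the same GGUF file as the main model, so no separate MTP file needs to be distributed, and the PR author reports a speed-up of more than 2x over baseline in his tests. This is a meaningful step toward faster responses from open-source large language models (LLMs) running locally.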

Translated text

ggml-org / llama.cpp

Contributor am17an commented on May 4, 2026 (edited)

Overview

This pull request (PR) adds support for MTP (Multi Token Prediction) heads. I tested it on Qwen3.6 27B and Qwen3.6 35BA3B, but in principle it should work for any MTP model. Detailed results are posted below; typically I see a steady-state acceptance rate of around 75% with 3 draft tokens, which is more than a 2x speed-up over baseline.
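As a rough sanity check on the figures above, here is a minimal sketch under a simplifying assumption that is not stated in the PR: each draft token is accepted independently with the same probability `alpha`, and every verification pass of the main model emits one token of its own (the correction on a mismatch, or a bonus token when all drafts are accepted). Under that model, the expected number of tokens per main-model pass with `k` draft tokens is (1 - alpha^(k+1)) / (1 - alpha). The function name is hypothetical, and the rate reported in the PR is an aggregate accepted/drafted ratio, so this is only an approximation.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per main-model verification pass.

    Assumes (hypothetically) that each of the k draft tokens is accepted
    independently with probability alpha, and that the pass always yields
    one extra token from the main model itself.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft accepted, plus the main model's own token.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # ~75% acceptance with 3 draft tokens, as quoted in the PR description.
    print(round(expected_tokens_per_step(0.75, 3), 3))  # 2.734
```

This counts tokens per pass, not wall-clock speed-up. Drafting and verifying add overhead, so a measured speed-up somewhat below this ratio would be expected.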

The design decisions I took to get to this stage are as follows:

The MTP model is a separate model that loads from the same GGUF file. The idea is that MTP should start automatically, without a separately distributed MTP GGUF, while still having its own context, KV cache, and so on.
In "[Speculative decoding] feat: add EAGLE3 speculative decoding support" (#18039), I saw that the hidden features were not propagated correctly across multiple ubatches, so this PR adds a separate "hook" for the MTP to consume after each ubatch.
The MTP speculative class is fairly trivial. It depends on "llama: allow partial seq_rm for GDN models for speculative decoding" (#22400), but could work without it.
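For readers new to speculative decoding, the draft-and-verify idea behind these decisions can be sketched with toy models. This is not the llama.cpp implementation: every name below is hypothetical, the "models" are plain Python functions, and acceptance is greedy (a draft token is kept only if it equals what the main model would have produced). A real engine verifies all draft positions in one batched pass, and in this PR the MTP head plays the role of the drafter.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(context: List[Token], target_next: NextToken,
                     draft_next: NextToken, n_draft: int) -> Tuple[List[Token], int]:
    """Run one draft-and-verify step; return (new tokens, accepted draft count)."""
    # 1. Draft n_draft tokens with the cheap predictor.
    drafted: List[Token] = []
    for _ in range(n_draft):
        drafted.append(draft_next(context + drafted))

    # 2. Verify position by position against the main (target) model.
    accepted: List[Token] = []
    for tok in drafted:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token, discard the remaining drafts.
            return accepted + [expected], len(accepted)
        accepted.append(tok)

    # 3. All drafts accepted: the target still contributes one more token.
    return accepted + [target_next(context + accepted)], len(accepted)


def generate(prompt: List[Token], target_next: NextToken, draft_next: NextToken,
             n_draft: int, n_predict: int) -> Tuple[List[Token], int, int]:
    """Generate n_predict tokens; return (tokens, total drafted, total accepted)."""
    out = list(prompt)
    drafted_total = accepted_total = 0
    while len(out) - len(prompt) < n_predict:
        new_tokens, n_accepted = speculative_step(out, target_next, draft_next, n_draft)
        out.extend(new_tokens)
        drafted_total += n_draft
        accepted_total += n_accepted
    return out[len(prompt):len(prompt) + n_predict], drafted_total, accepted_total


def toy_target(ctx: List[Token]) -> Token:
    """Toy main model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Toy drafter: agrees with the target except right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(generate([0], toy_target, toy_draft, n_draft=3, n_predict=8))
```

With greedy acceptance the output is identical to decoding with the main model alone; the drafter only changes how many main-model passes are needed.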

Performance

A simple benchmark for testing various prompts is here: https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090
The results are posted below.

Performance on DGX Spark 🧵

No MTP (baseline): ./llama-server -m ../qwen3.6-q8_0.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

code_python: pred=192, draft=0, acc=0, rate=n/a, tok/s=7.0
code_cpp: pred=192, draft=0, acc=0, rate=n/a, tok/s=7.3
explain_concept: pred=192, draft=0, acc=0, rate=n/a, tok/s=7.3
summarize: pred=53, draft=0, acc=0, rate=n/a, tok/s=7.1
qa_factual: pred=177, draft=0, acc=0, rate=n/a, tok/s=7.0
translation: pred=22, draft=0, acc=0, rate=n/a, tok/s=7.7
creative_short: pred=192, draft=0, acc=0, rate=n/a, tok/s=7.1
stepwise_math: pred=192, draft=0, acc=0, rate=n/a, tok/s=7.2
long_code_review: pred=192, draft=0, acc=0, rate=n/a, tok/s=7.0

Aggregate: { "n_requests": 9, "total_predicted": 1404, "total_draft": 0, "total_draft_accepted": 0, "aggregate_accept_rate": null, "wall_s_total": 201.07 }

MTP (--spec-draft-n-max 3): ./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 3

code_python: pred=192, draft=153, acc=139, rate=0.908, tok/s=21.6
code_cpp: pred=192, draft=176, acc=132, rate=0.750, tok/s=18.7
explain_concept: pred=192, draft=191, acc=126, rate=0.660, tok/s=16.3
summarize: pred=55, draft=51, acc=37, rate=0.726, tok/s=17.9
qa_factual: pred=177, draft=174, acc=118, rate=0.678, tok/s=16.5
translation: pred=22, draft=24, acc=13, rate=0.542, tok/s=13.9
creative_short: pred=192, draft=200, acc=123, rate=0.615, tok/s=15.8
stepwise_math: pred=192, draft=171, acc=133, rate=0.778, tok/s=19.3
long_code_review: pred=192, draft=179, acc=131, rate=0.732, tok/s=18.0

Aggregate: { "n_requests": 9, "total_predicted": 1406, "total_draft": 1319, "total_draft_accepted": 952, "aggregate_accept_rate": 0.7218, "wall_s_total": 83.8 }
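The aggregate figures for the baseline run and the 3-draft MTP run can be cross-checked with a few lines of arithmetic. The sketch below only reuses the numbers printed in the two aggregate lines; the helper names are hypothetical, and it assumes that `wall_s_total` covers the whole benchmark run (possibly including prompt processing), so the throughput it derives is not directly comparable to the per-prompt tok/s values.

```python
# Aggregate values copied from the benchmark output quoted above.
baseline = {"total_predicted": 1404, "wall_s_total": 201.07}
mtp_n3 = {
    "total_predicted": 1406,
    "total_draft": 1319,
    "total_draft_accepted": 952,
    "wall_s_total": 83.8,
}


def accept_rate(agg: dict) -> float:
    """Share of drafted tokens that the main model accepted."""
    return agg["total_draft_accepted"] / agg["total_draft"]


def throughput(agg: dict) -> float:
    """Predicted tokens per second of total wall-clock time."""
    return agg["total_predicted"] / agg["wall_s_total"]


wall_clock_speedup = baseline["wall_s_total"] / mtp_n3["wall_s_total"]

print(round(accept_rate(mtp_n3), 4))   # 0.7218, matches aggregate_accept_rate
print(round(wall_clock_speedup, 2))    # 2.4
print(round(throughput(baseline), 2), round(throughput(mtp_n3), 2))
```

The wall-clock ratio of roughly 2.4 is consistent with the "more than 2x" claim in the overview.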

MTP (--spec-draft-n-max 2): ./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 2

code_python: pred=192, draft=134, acc=123, rate=0.918, tok/s=17.4
code_cpp: pred=192, draft=145, acc=118, rate=0.814, tok/s=16.5
explain_concept: pred=192, draft=148, acc=116, rate=0.784, tok/s=16.1
summarize: pred=55, draft=44, acc=32, rate=0.727, tok/s=15.6
qa_factual: pred=192, draft=132, acc=125, rate=0.947, tok/s=18.2
translation: pred=22, draft=18, acc=12, rate=0.667, tok/s=15.2
creative_short: pred=192, draft=149, acc=116, rate=0.778, tok/s=16.1
stepwise_math: pred=192, draft=139, acc=121, rate=0.871, tok/s=17.2
long_code_review: pred=192, draft=153, acc=114, rate=0.745, tok/s=15.6

Aggregate: { "n_requests": 9, "total_predicted": ...
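To see where the extra draft token pays off, the per-prompt tok/s values of the two MTP runs above can be compared directly. This is a small sketch over the quoted numbers only; the variable names are hypothetical. Each prompt appears to have been run once, and `qa_factual` produced a different number of tokens in the two runs (177 vs 192), so small differences should not be over-interpreted.

```python
# Per-prompt tok/s copied from the two MTP runs quoted above.
tok_s_n3 = {
    "code_python": 21.6, "code_cpp": 18.7, "explain_concept": 16.3,
    "summarize": 17.9, "qa_factual": 16.5, "translation": 13.9,
    "creative_short": 15.8, "stepwise_math": 19.3, "long_code_review": 18.0,
}
tok_s_n2 = {
    "code_python": 17.4, "code_cpp": 16.5, "explain_concept": 16.1,
    "summarize": 15.6, "qa_factual": 18.2, "translation": 15.2,
    "creative_short": 16.1, "stepwise_math": 17.2, "long_code_review": 15.6,
}

# Prompts where drafting 3 tokens beat drafting 2, and vice versa.
faster_with_3 = sorted(p for p in tok_s_n3 if tok_s_n3[p] > tok_s_n2[p])
faster_with_2 = sorted(p for p in tok_s_n3 if tok_s_n2[p] > tok_s_n3[p])

for prompt in tok_s_n3:
    ratio = tok_s_n3[prompt] / tok_s_n2[prompt]
    print(f"{prompt:18s} n3/n2 = {ratio:.2f}")

print("faster with 3 drafts:", faster_with_3)
print("faster with 2 drafts:", faster_with_2)
```

In these runs, 3 draft tokens win on six of the nine prompts, while `qa_factual`, `translation`, and `creative_short` are faster with 2.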

Original text (English)

ggml-org / llama.cpp

Contributor am17an commented May 4, 2026 • edited

Overview

This PR adds support for MTP (Multi Token Prediction) heads. I tested this on Qwen3.6 27B and Qwen3.6 35BA3B, but in principle it should work for any MTP model. I've posted the detailed results below, but typically I see a steady-state acceptance of around 75% with 3 draft tokens, which is more than a 2x speed-up over baseline.

The design decisions I took to get to this stage are as follows:

The MTP model is a separate model which loads from the same GGUF. The idea is that MTP should start automatically and we shouldn't need to distribute the MTP GGUF separately, but it also has its own context/kv-cache etc.
I saw a problem in "[Speculative decoding] feat: add EAGLE3 speculative decoding support" #18039 where the hidden features weren't propagated correctly across multiple ubatches, so this PR adds a separate "hook" for the MTP to consume after each ubatch.
The MTP speculative class is fairly trivial (although it does depend on "llama: allow partial seq_rm for GDN models for speculative decoding" #22400, but could work without it).

Performance

A simple bench for testing various prompts is here: https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090
Posting the results below.

Performance on DGX Spark 🧵

No MTP (baseline): ./llama-server -m ../qwen3.6-q8_0.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

code_python: pred=192, draft=0, acc=0, rate=n/a, tok/s=7.0
code_cpp: pred=192, draft=0, acc=0, rate=n/a, tok/s=7.3
explain_concept: pred=192, draft=0, acc=0, rate=n/a, tok/s=7.3
summarize: pred=53, draft=0, acc=0, rate=n/a, tok/s=7.1
qa_factual: pred=177, draft=0, acc=0, rate=n/a, tok/s=7.0
translation: pred=22, draft=0, acc=0, rate=n/a, tok/s=7.7
creative_short: pred=192, draft=0, acc=0, rate=n/a, tok/s=7.1
stepwise_math: pred=192, draft=0, acc=0, rate=n/a, tok/s=7.2
long_code_review: pred=192, draft=0, acc=0, rate=n/a, tok/s=7.0

Aggregate: { "n_requests": 9, "total_predicted": 1404, "total_draft": 0, "total_draft_accepted": 0, "aggregate_accept_rate": null, "wall_s_total": 201.07 }

MTP (--spec-draft-n-max 3): ./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 3

code_python: pred=192, draft=153, acc=139, rate=0.908, tok/s=21.6
code_cpp: pred=192, draft=176, acc=132, rate=0.750, tok/s=18.7
explain_concept: pred=192, draft=191, acc=126, rate=0.660, tok/s=16.3
summarize: pred=55, draft=51, acc=37, rate=0.726, tok/s=17.9
qa_factual: pred=177, draft=174, acc=118, rate=0.678, tok/s=16.5
translation: pred=22, draft=24, acc=13, rate=0.542, tok/s=13.9
creative_short: pred=192, draft=200, acc=123, rate=0.615, tok/s=15.8
stepwise_math: pred=192, draft=171, acc=133, rate=0.778, tok/s=19.3
long_code_review: pred=192, draft=179, acc=131, rate=0.732, tok/s=18.0

Aggregate: { "n_requests": 9, "total_predicted": 1406, "total_draft": 1319, "total_draft_accepted": 952, "aggregate_accept_rate": 0.7218, "wall_s_total": 83.8 }

MTP (--spec-draft-n-max 2): ./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 2

code_python: pred=192, draft=134, acc=123, rate=0.918, tok/s=17.4
code_cpp: pred=192, draft=145, acc=118, rate=0.814, tok/s=16.5
explain_concept: pred=192, draft=148, acc=116, rate=0.784, tok/s=16.1
summarize: pred=55, draft=44, acc=32, rate=0.727, tok/s=15.6
qa_factual: pred=192, draft=132, acc=125, rate=0.947, tok/s=18.2
translation: pred=22, draft=18, acc=12, rate=0.667, tok/s=15.2
creative_short: pred=192, draft=149, acc=116, rate=0.778, tok/s=16.1
stepwise_math: pred=192, draft=139, acc=121, rate=0.871, tok/s=17.2
long_code_review: pred=192, draft=153, acc=114, rate=0.745, tok/s=15.6

Aggregate: { "n_requests": 9, "total_predicted": 1421, "total_draft": 1062, "total_draft_accepted": 877, "aggregate_accept_rate": 0.8258, "wall_s_total": 90.44 }

Draft model (Qwen3.5 0.8B) with spec-draft-n-max 16, with partial rollback: llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 16 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

code_python: pred=192, draft=188, acc=156, rate=0.830, tok/s=26.4
code_cpp: pred=192, draft=201, acc=126, rate=0.627, tok/s=16.8
explain_concept: pred=192, draft=263, acc=112, rate=0.426, tok/s=12.7
summarize: pred=57, draft=63, acc=39, rate=0.619, tok/s=16.9
qa_factual: pred=192, draft=178, acc=177, rate=0.994, tok/s=47.7
translation: pred=23, draft=18, acc=15, rate=0.833, tok/s=18.7
creative_short: pred=192, draft=189, acc=120, rate=0.635, tok/s=15.4
stepwise_math: pred=192, draft=190, acc=148, rate=0.779, tok/s=22.3
long_code_review: pred=192, draft=207, acc=120, rate=0.580, tok/s=14.5

Aggregate: { "n_requests": 9, "total_predicted": 1424, "total_draft": 1497, "total_draft_accepted": 1013, "aggregate_accept_rate": 0.6767, "wall_s_total": 81.39 }

Master with draft model, spec-draft-n-max 64, no partial rollback: llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 64 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

code_python: pred=192, draft=174, acc=159, rate=0.914, tok/s=27.2
code_cpp: pred=192, draft=138, acc=120, rate=0.870, tok/s=15.0
explain_concept: pred=192, draft=170, acc=101, rate=0.594, tok/s=11.4
summarize: pred=55, draft=48, acc=36, rate=0.750, tok/s=14.6
qa_factual: pred=177, draft=126, acc=106, rate=0.841, tok/s=13.9
translation: pred=22, draft=13, acc=13, rate=1.000, tok/s=16.5
creative_short: pred=192, draft=136, acc=104, rate=0.765, tok/s=12.8
stepwise_math: pred=192, draft=172, acc=147, rate=0.855, tok/s=22.0
long_code_review: pred=192, draft=160, acc=111, rate=0.694, tok/s=13.0

Aggregate: { "n_requests": 9, "total_predicted": 1406, "total_draft": 1137, "total_draft_accepted": 897, "aggregate_accept_rate": 0.7889, "wall_s_total": 97.13 }

How to use

I've uploaded the GGUF which I made by using the convert_hf_to_gguf.py changes in this PR. Here is another GGUF for the MoE (35BA3B) model.

Requirements

I have read and agree with the contributing guidelines.
AI usage disclosure: Yes, for debugging and reviewing. Also the convert_hf_to_gguf.py + model definitions. Writing bench for validation against vLLM.

github-actions bot added labels on May 4, 2026: model, testing, Nvidia GPU, Vulkan, examples, python, server, ggml

Contributor ngxson commented May 4, 2026

Nice, I think this is a fresh start better than my WIP #18886 (that I still never find the time to continue). There were some other attempts to add MTP support, but they all heavily rely on host <--> device data copy. I assume you tried to address this, right? (Maybe there was a discussion somewhere but I wasn't aware of it.)

Llama.cpp • Inference speed optimization • MTP (multi-token prediction) • Open-source AI • GGUF