r/LocalLLaMA • 72 days ago

llama.cpp: PR merged to speed up prompt processing with MTP

IMP

7/10

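Key Summary

A PR has been merged into the open-source project llama.cpp that speeds up prompt processing (PP) when MTP (multi-token prediction) is enabled. Previously, logits were copied for every token in the batch even though MTP prompt processing does not need them, which added memory traffic. With that copy removed, llama.cpp member pwilkin describes the PP slowdown caused by MTP as halved, although other users in the thread report smaller gains.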

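Translated Body

Overview
Prompt processing for MTP (Multi-Token Prediction) only requires the pre-norm tensor (t_h_pre_norm in the commit titles), so the PR skips copying the logits for every token in the batch. This cuts memory traffic considerably and speeds up prompt processing (PP) when MTP is in use.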
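To get a feel for why skipping the logits copy matters, here is a rough back-of-the-envelope sketch. It compares the bytes moved per batch when one float32 logits row is copied per token with the bytes moved when only one hidden-state row is copied per token. The batch size, vocabulary size, and hidden size are illustrative assumptions, not values taken from the PR or from any specific model, and the helper `copy_bytes` is hypothetical; the actual code path in llama.cpp may differ.

```python
# Rough estimate of output-copy traffic per prompt batch.
# All sizes below are illustrative assumptions, not values from the PR.

BYTES_F32 = 4  # assuming float32 outputs


def copy_bytes(n_tokens: int, row_width: int, elem_bytes: int = BYTES_F32) -> int:
    """Bytes moved when one row of `row_width` elements is copied per token."""
    return n_tokens * row_width * elem_bytes


n_batch = 2048     # assumed number of tokens in one prompt batch
n_vocab = 150_000  # assumed vocabulary size (width of a logits row)
n_embd = 5_120     # assumed hidden size (width of a pre-norm row)

logits_bytes = copy_bytes(n_batch, n_vocab)
hidden_bytes = copy_bytes(n_batch, n_embd)

print(f"logits copy:   {logits_bytes / 2**20:.1f} MiB per batch")
print(f"pre-norm copy: {hidden_bytes / 2**20:.1f} MiB per batch")
print(f"ratio:         {logits_bytes / hidden_bytes:.1f}x")
```

Under these assumed sizes the logits row is far wider than the hidden-state row, which is why dropping the per-token logits copy can remove most of the output traffic during prompt processing.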

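Participation and Review
• Contributor am17an opened the PR on May 17, 2026.
• Code owners ggerganov and CISC reviewed the code and approved it.
• ggerganov left review comments on src/llama-context.cpp and src/models/qwen35moe.cpp. Two follow-up commits, "review: update comment" and "llama-graph: call set_output for t_h_pre_norm", were pushed before approval.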

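Benchmark Results
Several users shared benchmark results from different setups.
• ggerganov (RTX 5090, Qwen3.6 27B Q4_K): posted a quick benchmark chart for this PR.
• cb88 (2x AMD MI50, Qwen 27B Q4_1): 500 t/s without MTP, 250 t/s with MTP before this PR, and 300 t/s with MTP and this PR.
• Mithras (Qwen3.6-27B-UD-Q5_K_XL with MTP): re-tested with the PR included and still saw an almost 50% PP hit.
• d-r-e asked whether the legend colors in the benchmark chart were swapped. pwilkin replied: "no, MTP does negatively impact prompt processing, but under this PR the negative impact is halved."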
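The three throughput figures cb88 reported can be turned into relative numbers with a few lines of arithmetic. The sketch below uses only those values; the helper name `slowdown` is made up for illustration. On that setup the MTP penalty works out to 50% of baseline throughput before the PR and 40% after it, a smaller improvement than the halving pwilkin described.

```python
# Throughput reported by cb88 in the thread (2x MI50, Qwen 27B Q4_1), in tokens/s.
baseline = 500.0    # without MTP
mtp_before = 250.0  # with MTP, before this PR
mtp_after = 300.0   # with MTP, with this PR


def slowdown(reference: float, measured: float) -> float:
    """Fraction of throughput lost relative to the reference."""
    return (reference - measured) / reference


hit_before = slowdown(baseline, mtp_before)
hit_after = slowdown(baseline, mtp_after)
speedup = mtp_after / mtp_before                     # PP gain from the PR itself
penalty_cut = (hit_before - hit_after) / hit_before  # share of the penalty removed

print(f"PP hit with MTP before the PR: {hit_before:.0%}")
print(f"PP hit with MTP after the PR:  {hit_after:.0%}")
print(f"speedup from the PR:           {speedup:.2f}x")
print(f"share of penalty removed:      {penalty_cut:.0%}")
```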

Result
am17an merged the PR into the master branch (commit 3e12fbd) on May 17, 2026, with 75 of 81 CI checks passing.

Original Text (English)

ggml-org / llama.cpp (Fork 18.3k, Star 111k)

am17an (Contributor) commented May 17, 2026 • edited

Overview
Avoid copying the logits for every token in the batch when doing prompt processing for MTP since it only requires the pre-norm. This reduces memory traffic quite a bit and in turn increases PP speed with MTP.

Requirements
• I have read and agree with the contributing guidelines
• AI usage disclosure: YES, for debugging and reviewing

Reactions: 👍 13, ❤️ 7, 🚀 14

Timeline
• Commit 0abcf8f: llama: avoid copying logits during prompt decode in MTP
• am17an requested review from a team, CISC and ggerganov as code owners (May 17, 2026 10:22)
• ggerganov reviewed (May 17, 2026): comment thread on src/llama-context.cpp (outdated, resolved)
• Commit e964f98: review: update comment
• ggerganov reviewed (May 17, 2026): comment thread on src/models/qwen35moe.cpp (resolved)
• Commit 70a7d0e: llama-graph: call set_output for t_h_pre_norm
• CISC approved these changes (May 17, 2026)
• github-actions bot added the labels model, examples, server (May 17, 2026)
• ggerganov approved these changes (May 17, 2026)

ggerganov (Member) left a comment
A quick bench on RTX 5090 with Qwen3.6 27B Q4_K
(benchmark chart not included)
Reactions: 🚀 9

am17an merged commit 3e12fbd into ggml-org:master May 17, 2026. 75 of 81 checks passed.
am17an deleted the mtp-pp-fix branch May 17, 2026 15:30.

d-r-e commented May 17, 2026
Quoting ggerganov: "A quick bench on RTX 5090 with Qwen3.6 27B Q4_K"
Are the legend colors swapped?

pwilkin (Member) commented May 17, 2026
@d-r-e no, MTP does negatively impact prompt processing, but under this PR the negative impact is halved.
Reactions: 👍 4, ❤️ 4

cb88 commented May 17, 2026
2xMI50 qwen 27b Q4_1 does see some improvement with this PR
MI50
without MTP = 500t/s
with MTP = 250t/s
with MTP this PR = 300t/s
Reactions: 👀 2

tha80 mentioned this pull request May 17, 2026: llama + spec: MTP Support #22673 (Merged, 11 tasks)

0cc4m (Contributor) commented May 17, 2026
Quoting pwilkin: "@d-r-e no, MTP does negatively impact prompt processing, but under this PR the negative impact is halved."
Why does it affect prompt processing?

Mithras commented May 17, 2026
Made sure this PR is included and re-tested: unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-UD-Q5_K_XL.gguf

| model     | test             | t/s             | peak t/s     | ttfr (ms)          | est_ppt (ms)       | e2e_ttft (ms)      |
|:----------|-----------------:|----------------:|-------------:|-------------------:|-------------------:|-------------------:|
| qwen36-27 | pp2048 @ d16384  | 1843.93 ± 14.82 |              | 9098.06 ± 124.65   | 9097.48 ± 124.65   | 9098.06 ± 124.65   |
| qwen36-27 | tg128 @ d16384   | 74.46 ± 3.24    | 84.00 ± 0.82 |                    |                    |                    |
| qwen36-27 | pp2048 @ d65536  | 1449.72 ± 9.78  |              | 42344.98 ± 292.27  | 42344.40 ± 292.27  | 42344.98 ± 292.27  |
| qwen36-27 | tg128 @ d65536   | 61.97 ± 2.35    | 68.33 ± 4.78 |                    |                    |                    |
| qwen36-27 | pp2048 @ d131072 | 1075.30 ± 2.36  |              | 112238.48 ± 281.78 | 112237.90 ± 281.78 | 112238.48 ± 281.78 |
| qwen36-27 | tg128 @ d131072  | 48.40 ± 2.53    | 55.00 ± 0.00 |                    |                    |                    |

pretty much the same as #22673 (comment) which probably had the PR already. Still almost 50% pp hit

pwilkin (Member) commented May 17, 2026
Quoting 0cc4m: "Why does it affect prompt processing?"
Due to the embeddings copy, most likely.

DrBearJew pushed a commit to DrBearJew/llama.cpp that referenced this pull request May 17, 2026
llama: avoid copying logits during prompt decode in MTP ( ggml-org#23198 ) … 899097b
* llama: avoid copying logits during prompt decode in MTP
* review: update comment
* llama-graph: call set_output for t_h_pre_norm

Reviewers: CISC (approved), ggerganov (approved)
Assignees: No one assigned
Labels: examples, model, server
Projects: None yet
Milestone: No milestone
8 participants

llama.cpp • Performance optimization • MTP • Open source • AI inference

Testing llama.cpp MTP support on Qwen3.6 - RTX 5090

r/LocalLLaMA • 74 days ago

IMP 6

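Good news: MTP approved for llama.cpp

Multi-Token Prediction (MTP) support has finally been approved for llama.cpp, the open-source AI inference library. Once the update lands, models will predict several tokens at a time, which is expected to substantially improve text generation speed and inference efficiency. Practitioners are already preparing their environments for the upcoming update.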

Open source • llama.cpp • Inference optimization