r/LocalLLaMA • 88 days ago

PFlash: 10× faster prefill than llama.cpp at 128K on an RTX 3090

Summary

The open-source PFlash applies speculative prefill: a small draft model scores token importance, and the target model processes only the important spans. On an RTX 3090 (24 GB), this cuts time to first token (TTFT) on a 128K-token prompt by about 10.4× compared with llama.cpp. It is written entirely in C++/CUDA, so the whole inference loop runs within 24 GB of memory with no Python overhead.

Full post

Hey fellow Llamas, thank you for all the nice words and great feedback on the last post. We have something new we thought would be useful to share. As always, your time is precious, so I'll keep it short.

We built speculative prefill for long-context decode on quantized 27B-parameter target models, in C++/CUDA only. A small drafter loaded in-process scores token importance over the full prompt, and the heavy target model prefills only the spans that matter.
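To make the "prefill only the spans that matter" step concrete, here is a minimal sketch in Python. It assumes per-token importance scores are already available and that spans are fixed-length chunks ranked by mean score. The function names, the span length, and the keep ratio are all hypothetical choices for illustration, not the PFlash implementation.

```python
def select_spans(scores, span_len, keep_ratio):
    """Return the highest-scoring fixed-length spans as (start, end) pairs in prompt order."""
    n = len(scores)
    spans = [(start, min(start + span_len, n)) for start in range(0, n, span_len)]
    # Rank spans by their mean token importance (an assumed ranking rule).
    ranked = sorted(
        spans,
        key=lambda s: sum(scores[s[0]:s[1]]) / (s[1] - s[0]),
        reverse=True,
    )
    n_keep = max(1, round(len(spans) * keep_ratio))
    # Restore prompt order so the target model sees the kept tokens in sequence.
    return sorted(ranked[:n_keep])


def compress_prompt(tokens, scores, span_len=4, keep_ratio=0.5):
    """Keep only the tokens that fall inside the selected spans."""
    kept = []
    for start, end in select_spans(scores, span_len, keep_ratio):
        kept.extend(tokens[start:end])
    return kept


# Toy example: 12 tokens, the middle span carries the important content.
tokens = list(range(12))
scores = [0.1, 0.1, 0.1, 0.1, 0.9, 0.8, 0.7, 0.9, 0.2, 0.1, 0.3, 0.2]
kept = compress_prompt(tokens, scores, span_len=4, keep_ratio=0.34)
print(kept)  # [4, 5, 6, 7]
```

The target model would then prefill only `kept` rather than the full prompt, which is where the TTFT saving comes from.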

Repo: github.com/Luce-Org/lucebox-hub (open source, MIT license).

Head-to-head on Qwen3.6-27B Q4_K_M, RTX 3090, single-shot: 24.8 s TTFT (time to first token) versus ~257 s for vanilla llama.cpp at 128K, about a 10.4× speedup (and 13.5 s versus 134.95 s at 64K, a 10.0× speedup). NIAH (Needle In A Haystack) retrieval is preserved end-to-end. There is no Python, Triton, or PyTorch in the inference loop.
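The speedup figures follow directly from the quoted timings. A quick check in Python, using only the numbers stated in the post (the helper name `speedup` is just for illustration):

```python
def speedup(baseline_s, new_s):
    """Ratio of baseline TTFT to the new TTFT."""
    return baseline_s / new_s


s128 = speedup(257.0, 24.8)   # vanilla llama.cpp vs PFlash at 128K
s64 = speedup(134.95, 13.5)   # vanilla llama.cpp vs PFlash at 64K
print(f"128K: {s128:.1f}x, 64K: {s64:.1f}x")  # 128K: 10.4x, 64K: 10.0x
```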

The problem

Q4_K_M Qwen3.6-27B on a 24 GB RTX 3090 decodes fast (~74 tok/s with DFlash spec decode), but prefill compute scales as O(S²). On a 131K-token prompt, vanilla llama.cpp takes 248.4 s cold (llama-bench pp131072 --no-warmup -r 1, 527.6 tok/s). That is 4.1 minutes staring at a blank screen before the first token. Decode is fast, but the wait kills the UX. Warmed steady-state is better (169.3 s at 128K) but still painful, and it grows quadratically as you push context.
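The cold-prefill figure can be reproduced from the reported throughput, and the O(S²) claim can be illustrated by counting query-key pairs. The sketch below is plain arithmetic in Python. The helper names are made up for illustration, and the pair count covers only the attention term, not the per-token costs that scale linearly.

```python
def prefill_seconds(prompt_tokens, tokens_per_second):
    """Wall-clock prefill time implied by an average throughput."""
    return prompt_tokens / tokens_per_second


def attention_pairs(seq_len):
    """Number of query-key pairs in full (non-causal) attention: S * S."""
    return seq_len * seq_len


cold_s = prefill_seconds(131072, 527.6)
print(f"{cold_s:.1f} s = {cold_s / 60:.1f} min")  # 248.4 s = 4.1 min

# Doubling the context from 64K to 128K quadruples the attention pairs.
ratio = attention_pairs(131072) / attention_pairs(65536)
print(ratio)  # 4.0
```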

Standing on shoulders

This work stands on two recent papers, both excellent reads, plus two open-source projects:

Speculative Prefill (Liu et al., arXiv 2502.02789) and Cross-Family Speculative Prefill (SambaNova, ICLR 2026). The insight: a small draft model's attention pattern over a long prompt faithfully predicts which tokens matter for the answer. Run the draft, score per-token importance, keep the top spans, drop the rest.
FlashPrefill (Fan et al., 2026). Block-sparse attention, so the drafter itself does not pay O(S²) at 128K.
mit-han-lab/Block-Sparse-Attention (BSA) for the FA-2-derived sm_80+ sparse forward.
ggml / llama.cpp for the runtime. We link libggml*.a and never libllama.
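As a rough illustration of the "draft attention predicts token importance" insight, the sketch below scores each prompt token by the strongest softmax attention weight it receives from a set of query vectors. The max aggregation, the use of raw query/key vectors, and every name here are assumptions made for illustration. The papers and the PFlash code may aggregate over heads, layers, and look-ahead tokens differently.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def token_importance(queries, keys):
    """Score each prompt token by the largest attention weight any query gives it."""
    dim = len(keys[0])
    importance = [0.0] * len(keys)
    for q in queries:
        logits = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(logits)
        # Assumed aggregation: keep the maximum weight seen across queries.
        importance = [max(a, w) for a, w in zip(importance, weights)]
    return importance


# Toy example: four prompt tokens with 2-d keys and one query.
keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [3.0, 0.0]]
queries = [[2.0, 0.0]]
scores = token_importance(queries, keys)
best = max(range(len(scores)), key=scores.__getitem__)
print(best)  # 3
```

Token 3 wins because its key is most aligned with the query. Scores like these would then feed a span-selection step that decides which parts of the prompt the target model prefills.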

Our contribution is the C++/CUDA composition of these two algorithms, in-process, on a 24 GB consumer card. As far as we are aware, the two papers had not been combined in an open implementation before.

What we built

In-process composition. Drafter forward (custom Qwen3-0.6B BF16 ggml graph), FlashPrefill scoring, sparse attention, target prefill, and DFlash spec decode all run in one C++/CUDA process sharing one ggml allocator. No subprocess, no IPC, and no Python, Triton, or PyTorch in the inference loop.
CUDA port of FlashPrefill. The reference implementation (qhfan/FlashPrefill) is written in Triton. We wrote 4 CUDA kernels from scratch (mean_K, score, select, sparse_fwd) and dispatched the sparse forward through mit-han-lab/Block-Sparse-Attention. BSA ships as a libtorch C++ extension; pulling 2 GB of libtorch into a 24 GB inference loop was a non-starter, so we wired it in via a 3-header ATen/c10 stub set under dflash/deps/bsa_stubs/.
24 GB memory orchestration. Memory is managed so that everything, including the drafter (1.3 GB weights + KV cache + ~600 MB ...), fits within the card's limited VRAM.
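The four kernel names suggest a block-sparse pipeline: summarize keys per block, score blocks against a query, pick the best blocks, then attend only within them. The Python sketch below is one guess at what stages with those names might do, written for a single query on the CPU. The post does not describe the kernels' internals, so the mean pooling, the dot-product scoring, the top-k rule, and all signatures are assumptions, not the FlashPrefill or PFlash algorithms.

```python
import math


def mean_k(keys, block_size):
    """Average the key vectors inside each block (one summary vector per block)."""
    blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    return [[sum(col) / len(block) for col in zip(*block)] for block in blocks]


def score(query, block_means):
    """Dot-product score of one query against each block summary."""
    return [sum(q * m for q, m in zip(query, mean)) for mean in block_means]


def select(block_scores, top_k):
    """Indices of the top_k highest-scoring blocks, in ascending order."""
    ranked = sorted(
        range(len(block_scores)), key=lambda i: block_scores[i], reverse=True
    )
    return sorted(ranked[:top_k])


def sparse_fwd(query, keys, values, block_ids, block_size):
    """Softmax attention restricted to the tokens of the selected blocks."""
    idx = [
        t
        for b in block_ids
        for t in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, keys[t])) for t in idx]
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    dim_v = len(values[0])
    return [
        sum(w * values[t][d] for w, t in zip(weights, idx)) / total
        for d in range(dim_v)
    ]


# Toy example: 8 tokens in 4 blocks of 2; blocks 1 and 3 align with the query.
keys = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [1.0, 0.0],
        [0.0, 1.0], [0.0, 1.0], [2.0, 0.0], [2.0, 0.0]]
values = [[float(i)] for i in range(8)]
query = [1.0, 0.0]

chosen = select(score(query, mean_k(keys, 2)), top_k=2)
out = sparse_fwd(query, keys, values, chosen, block_size=2)
print(chosen)  # [1, 3]
```

With `top_k=2` the forward pass touches 4 of the 8 tokens. At 128K the same idea lets the drafter skip most key blocks instead of paying for every query-key pair.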


Tags: inference speed optimization, llama.cpp, open-source LLM, CUDA, speculative prefill
