r/LocalLLaMA • 82 days ago

z-lab releases DFlash, a draft model for up to 3.7x faster inference

IMP

8/10

Key Summary

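z-lab has released 'gemma-4-26B-A4B-it-DFlash', a speculative decoding draft model that pairs with Google's gemma-4-26B-A4B-it and delivers up to 3.7x higher generation throughput. It uses a lightweight block diffusion model to draft multiple tokens in parallel, which the published benchmarks show running 1.8x to 3.7x faster than plain autoregressive decoding, depending on the task. It runs on vLLM and SGLang, although the quick-start instructions currently install both from pull-request branches. The authors credit David Wang for engineering support and Modal, InnoMatrix, and Yotta Labs for the compute used to train the draft model.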

Translated Body

z-lab has released the gemma-4-26B-A4B-it-DFlash model. Has anyone tried it yet?

Paper | GitHub | Blog

DFlash is a speculative decoding method that uses a lightweight block diffusion model to draft multiple tokens in parallel. This model is the drafter, and it must be paired with google/gemma-4-26B-A4B-it.
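
To make the draft-then-verify idea concrete, here is a toy sketch of greedy speculative decoding with block drafting. It is not the DFlash implementation: `target_next`, `draft_block`, `autoregressive_decode`, and `speculative_decode` are hypothetical stand-ins, the "models" are plain Python functions over small integers, and a real system would verify the whole block in one batched forward pass of the target model rather than in a Python loop.

```python
from typing import List, Tuple

Token = int


def target_next(prefix: List[Token]) -> Token:
    """Toy 'target model' (hypothetical): deterministic greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % 7


def draft_block(prefix: List[Token], block_size: int) -> List[Token]:
    """Toy 'drafter' (hypothetical): proposes a whole block of tokens.

    It agrees with the target most of the time and is deliberately wrong
    at some positions so that rejection can be observed.
    """
    block: List[Token] = []
    ctx = list(prefix)
    for _ in range(block_size):
        tok = target_next(ctx)
        if len(ctx) % 5 == 4:  # inject an occasional wrong guess
            tok = (tok + 1) % 7
        block.append(tok)
        ctx.append(tok)
    return block


def autoregressive_decode(prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(
    prompt: List[Token], max_new: int, block_size: int
) -> Tuple[List[Token], int]:
    """Draft a block, keep the longest prefix the target agrees with.

    Returns the generated tokens and the number of draft-and-verify cycles.
    """
    out = list(prompt)
    cycles = 0
    while len(out) - len(prompt) < max_new:
        cycles += 1
        for tok in draft_block(out, block_size):
            expected = target_next(out)
            if tok != expected:
                out.append(expected)  # target's token replaces the first mismatch
                break
            out.append(tok)  # drafted token accepted
        else:
            out.append(target_next(out))  # whole block accepted: one bonus token
    return out[len(prompt):len(prompt) + max_new], cycles


tokens, cycles = speculative_decode([3], max_new=20, block_size=16)
print(tokens == autoregressive_decode([3], 20), cycles)
```

Because every emitted token is either confirmed or supplied by the target, the greedy output matches plain autoregressive decoding while needing fewer cycles than tokens. The speedup in practice depends on how many drafted tokens are accepted per cycle and on how cheap the drafter is.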

Quick Start

Installation

vLLM: Until Gemma4 DFlash support is merged into the main branch, install vLLM from PR #41703:

uv pip install -U --torch-backend=auto \
    "vllm @ git+https://github.com/vllm-project/vllm.git@refs/pull/41703/head"

SGLang:

uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/23000/head#subdirectory=python"

Launch Server

vLLM:

vllm serve google/gemma-4-26B-A4B-it \
    --speculative-config '{"method": "dflash", "model": "z-lab/gemma-4-26B-A4B-it-DFlash", "num_speculative_tokens": 15, "attention_backend": "flash_attn"}' \
    --attention-backend triton_attn \
    --max-num-batched-tokens 32768 \
    --trust-remote-code
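
The `--speculative-config` value is a JSON object passed as a single shell argument, which is easy to break with mismatched quotes. As a minimal sketch, the command line can be assembled in Python so that the JSON and the shell quoting are generated rather than typed by hand. The flag names and values are copied from the command above; the helper `build_vllm_command` is a hypothetical name introduced only for this illustration.

```python
import json
import shlex

# Values copied from the launch command in the post.
speculative_config = {
    "method": "dflash",
    "model": "z-lab/gemma-4-26B-A4B-it-DFlash",
    "num_speculative_tokens": 15,
    "attention_backend": "flash_attn",
}


def build_vllm_command(target_model: str, spec_config: dict) -> str:
    """Return a shell-safe `vllm serve` command line (hypothetical helper)."""
    args = [
        "vllm", "serve", target_model,
        "--speculative-config", json.dumps(spec_config),
        "--attention-backend", "triton_attn",
        "--max-num-batched-tokens", "32768",
        "--trust-remote-code",
    ]
    # shlex.join quotes each argument so the JSON survives the shell intact.
    return shlex.join(args)


command = build_vllm_command("google/gemma-4-26B-A4B-it", speculative_config)
print(command)
```

The printed string can be pasted into a shell, or the argument list can be handed to `subprocess.run` directly, which avoids shell quoting altogether.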

SGLang:

# Optional: enable schedule overlapping (experimental, may not be stable)
# export SGLANG_ENABLE_SPEC_V2=1
# export SGLANG_ENABLE_DFLASH_SPEC_V2=1
# export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1

python -m sglang.launch_server \
    --model-path google/gemma-4-26B-A4B-it \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/gemma-4-26B-A4B-it-DFlash \
    --speculative-num-draft-tokens 16 \
    --tp-size 1 \
    --attention-backend triton \
    --speculative-draft-attention-backend fa4 \
    --trust-remote-code

Usage

For vLLM, use port 8000. For SGLang, use port 30000.

from openai import OpenAI

# Point the client at the local server (port 8000 for vLLM, 30000 for SGLang).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=4096,
    temperature=0.0,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.content)
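
Since the only difference between the two backends on the client side is the port, a small helper can pick the base URL. This is a sketch: `DEFAULT_PORTS` and `base_url_for` are hypothetical names, the ports are the ones stated above, and it assumes both servers expose the OpenAI-compatible API under `/v1` on the same host.

```python
# Ports as stated in the usage notes; names below are illustrative only.
DEFAULT_PORTS = {"vllm": 8000, "sglang": 30000}


def base_url_for(backend: str, host: str = "localhost") -> str:
    """Return the OpenAI-compatible base URL for a local serving backend."""
    key = backend.lower()
    if key not in DEFAULT_PORTS:
        raise ValueError(
            f"unknown backend {backend!r}; expected one of {sorted(DEFAULT_PORTS)}"
        )
    return f"http://{host}:{DEFAULT_PORTS[key]}/v1"


print(base_url_for("vllm"))
print(base_url_for("SGLang"))
```

The result can be passed as `base_url` when constructing the `OpenAI` client.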

Benchmark Results

Setup: a single NVIDIA B300 GPU per server/run, vLLM, thinking mode enabled, maximum output length 4096, greedy decoding.

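Throughput and Speedup

DFlash achieves up to 3.7x speedup, reached at concurrency 8. Figures are generated tokens per second, with the speedup over the autoregressive (AR) baseline in parentheses. Block size = 16.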

Task | Concurrency | AR (baseline) | DFlash (speedup)
Math500 | 1 | 259 | 925 (3.6x)
Math500 | 8 | 1296 | 4837 (3.7x)
Math500 | 32 | 3233 | 11435 (3.5x)
GSM8K | 1 | 256 | 825 (3.2x)
GSM8K | 8 | 1217 | 4241 (3.5x)
GSM8K | 32 | 3174 | 10306 (3.2x)
HumanEval | 1 | 246 | 818 (3.3x)
HumanEval | 8 | 1182 | 4240 (3.6x)
HumanEval | 32 | 2881 | 9150 (3.2x)
MBPP | 1 | 272 | 698 (2.6x)
MBPP | 8 | 1288 | 3387 (2.6x)
MBPP | 32 | 2950 | 7898 (2.7x)
MT-Bench | 1 | 272 | 492 (1.8x)
MT-Bench | 8 | 1146 | 2259 (2.0x)
MT-Bench | 32 | 2164 | 4829 (2.2x)
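
The speedup in parentheses is the DFlash throughput divided by the AR throughput. The short script below recomputes that ratio from the tokens-per-second figures in the table and checks it against the reported value, rounded to one decimal place. The variable and function names are illustrative; the numbers are copied from the table.

```python
# (task, concurrency, AR tokens/s, DFlash tokens/s, reported speedup)
ROWS = [
    ("Math500", 1, 259, 925, 3.6), ("Math500", 8, 1296, 4837, 3.7),
    ("Math500", 32, 3233, 11435, 3.5),
    ("GSM8K", 1, 256, 825, 3.2), ("GSM8K", 8, 1217, 4241, 3.5),
    ("GSM8K", 32, 3174, 10306, 3.2),
    ("HumanEval", 1, 246, 818, 3.3), ("HumanEval", 8, 1182, 4240, 3.6),
    ("HumanEval", 32, 2881, 9150, 3.2),
    ("MBPP", 1, 272, 698, 2.6), ("MBPP", 8, 1288, 3387, 2.6),
    ("MBPP", 32, 2950, 7898, 2.7),
    ("MT-Bench", 1, 272, 492, 1.8), ("MT-Bench", 8, 1146, 2259, 2.0),
    ("MT-Bench", 32, 2164, 4829, 2.2),
]


def speedup(ar_tps: float, dflash_tps: float) -> float:
    """Throughput ratio of DFlash over the autoregressive baseline."""
    return dflash_tps / ar_tps


# Rows whose recomputed ratio disagrees with the reported one.
mismatches = [
    (task, conc)
    for task, conc, ar, df, reported in ROWS
    if round(speedup(ar, df), 1) != reported
]

# Row with the largest recomputed speedup.
best = max(ROWS, key=lambda r: speedup(r[2], r[3]))

print("mismatches:", mismatches)
print("best:", best[0], "at concurrency", best[1])
```

All fifteen reported speedups agree with the recomputed ratios, and the largest one is Math500 at concurrency 8, which is the source of the headline 3.7x figure.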

Acceptance Length

Columns c1, c8, and c32 correspond to concurrency 1, 8, and 32.

Task | c1 | c8 | c32
Math500 | 8.61 | 8.55 | 8.60
GSM8K | 7.71 | 7.76 | 7.72
HumanEval | 7.80 | 7.87 | 7.83
MBPP | 6.09 | 5.99 | 6.03
MT-Bench | 4.33 | 4.33 | 4.24
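
Acceptance length and speedup can be related with a back-of-the-envelope calculation. The sketch below assumes, and the post does not state this, that acceptance length is the average number of tokens emitted per draft-and-verify cycle and that the AR baseline emits one token per target forward pass. Under those assumptions, acceptance length divided by speedup estimates how much longer one DFlash cycle takes than one AR step. `relative_cycle_cost` is a hypothetical name, and the inputs are the concurrency-1 figures from the two tables.

```python
def relative_cycle_cost(
    acceptance_length: float, ar_tps: float, dflash_tps: float
) -> float:
    """Estimated (time per DFlash cycle) / (time per AR step).

    Assumption: tokens per cycle == acceptance_length, and the AR baseline
    produces exactly one token per step.
    """
    speedup = dflash_tps / ar_tps
    return acceptance_length / speedup


# task -> (acceptance length at c1, AR tokens/s at c1, DFlash tokens/s at c1)
C1_ROWS = {
    "Math500": (8.61, 259, 925),
    "GSM8K": (7.71, 256, 825),
    "HumanEval": (7.80, 246, 818),
    "MBPP": (6.09, 272, 698),
    "MT-Bench": (4.33, 272, 492),
}

costs = {task: relative_cycle_cost(*row) for task, row in C1_ROWS.items()}
for task, cost in costs.items():
    print(f"{task}: {cost:.2f}")
```

With the concurrency-1 figures the ratio comes out at roughly 2.3 to 2.4 for every task, so under this simplified model the differences in speedup between tasks track the differences in acceptance length rather than differences in per-cycle cost.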

Acknowledgements

Special thanks to David Wang for his outstanding engineering support on this project. We are also grateful to Modal, InnoMatrix, and Yotta Labs for providing the compute resources used to train this draft model.

Citation

If you find DFlash useful, please cite our work. To share feedback on DFlash or request support for a new model, please fill out this form: DFlash Feedback.

@article{chen2026dflash,
  title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}

Downloads last month: 1,908
Format: Safetensors
Model size: 0.4B (about 400 million) parameters
Tensor type: BF16

Inference Providers (Text Generation): this model is not yet deployed by any Inference Provider.
Collection including z-lab/gemma-4-26B-A4B-it-DFlash: Block Diffusion for Flash Speculative Decoding (19 items, updated 3 days ago)
Paper for z-lab/gemma-4-26B-A4B-it-DFlash: arXiv 2602.06036, published Feb 5


Inference speed optimization, speculative decoding, open-source AI models, vLLM benchmarks