r/LocalLLaMA • 108 days ago

DFlash inference on Apple Silicon: 85 tok/s, up to 3.3x speedup

IMP

8/10

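Key summary

A native implementation of DFlash speculative decoding for the MLX framework on Apple Silicon (M5 Max) has been presented; the author plans to open source it once it is ready. A small draft model generates 16 tokens in parallel and the target model verifies them in a single forward pass, giving up to 3.3x faster generation on an unquantized 9B model and up to 2.5x on a quantized 27B model. The post also reports hardware-specific findings for Apple Silicon: on unified memory, stock GEMM outperformed the custom kernels the author tried, and with quantized targets the bf16 draft model becomes the bottleneck.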

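Translated post

I'm building a native MLX implementation of DFlash (paper: https://arxiv.org/abs/2602.06036) for Apple Silicon. A small draft model generates 16 tokens in parallel via block diffusion, and the target model verifies them in a single forward pass. The output is bit-for-bit identical to the baseline (exact match with greedy argmax decoding).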
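The draft/verify cycle described above can be sketched as follows. This is a minimal illustration of generic greedy speculative verification, not the author's runtime: `target_argmax` and `toy_target` are hypothetical stand-ins for one target forward pass, and the "bonus token" committed when every draft token is accepted is a common convention that the post does not confirm for DFlash.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    target_argmax: Callable[[Sequence[int]], List[int]],
    context: List[int],
    draft: List[int],
) -> List[int]:
    """Return the tokens committed by one draft/verify cycle.

    target_argmax(seq) is a hypothetical stand-in for a single target
    forward pass: for every position i it returns the target's greedy
    next token given seq[:i + 1]. `context` must be non-empty.
    """
    preds = target_argmax(context + draft)  # one pass scores all draft positions
    base = len(context) - 1                 # preds[base + j] is the target's pick for draft[j]
    committed: List[int] = []
    for j, tok in enumerate(draft):
        expected = preds[base + j]
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            committed.append(expected)
            return committed
        committed.append(tok)
    # Every draft token matched: the same pass also yields one extra token.
    committed.append(preds[base + len(draft)])
    return committed


def toy_target(seq: Sequence[int]) -> List[int]:
    """Toy 'model' (assumption for the demo): next token is previous + 1."""
    return [t + 1 for t in seq]


print(verify_greedy(toy_target, [1, 2, 3], [4, 5, 9, 7]))  # prints [4, 5, 6]
```

Because a draft token is kept only when it equals the target's own argmax, the committed sequence is the same one plain greedy decoding would produce, which is the property behind the "bit-for-bit identical" claim.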

Setup: M5 Max, 64GB, MLX, no CUDA.

Results

Qwen3.5-9B bf16

Gen length	DFlash	Baseline	Speedup
1024 tokens	85 tok/s	26 tok/s	3.3x
2048 tokens	80 tok/s	26 tok/s	3.1x

Qwen3.5-4B bf16

Gen length	DFlash	Baseline	Speedup
1024 tokens	109 tok/s	41 tok/s	2.7x
2048 tokens	133 tok/s	42 tok/s	3.2x

The 4B model actually gets faster at longer generation lengths. The model is small enough that the draft/verify balance stays healthy as the context grows.

Qwen3.5-27B quantized

Quant	Gen length	DFlash	Baseline	Speedup
8-bit	1024 tokens	35 tok/s	14 tok/s	2.5x
8-bit	2048 tokens	26 tok/s	11 tok/s	2.3x
4-bit	1024 tokens	44 tok/s	24 tok/s	1.9x
4-bit	2048 tokens	40 tok/s	23 tok/s	1.7x

8-bit gives better speedup ratios than 4-bit. int4 makes verification so fast that the bf16 draft model becomes the bottleneck. With int8, the draft/verify balance is healthier.
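A rough cost model helps show why a faster target can lower the speedup ratio. The sketch below assumes that one verify pass costs about as much as one baseline decode step; the baseline step times are derived from the 1024-token rows of the table (1000 / tok/s), while `TOKENS_PER_CYCLE` and `DRAFT_MS` are hypothetical values chosen only for illustration, not measurements from the post.

```python
def speedup(tokens_per_cycle: float, draft_ms: float,
            verify_ms: float, baseline_step_ms: float) -> float:
    """Speedup of speculative decoding over one-token-per-step decoding."""
    cycle_ms = draft_ms + verify_ms                  # one draft + one verify
    spec_tok_per_s = tokens_per_cycle / cycle_ms * 1000.0
    baseline_tok_per_s = 1000.0 / baseline_step_ms
    return spec_tok_per_s / baseline_tok_per_s


# Baseline step times implied by the 27B table (1024-token rows).
STEP_MS_8BIT = 1000.0 / 14   # 14 tok/s baseline
STEP_MS_4BIT = 1000.0 / 24   # 24 tok/s baseline

# Hypothetical values, NOT from the post: tokens committed per cycle and
# the time the bf16 draft model needs to propose one block.
TOKENS_PER_CYCLE = 5.0
DRAFT_MS = 30.0

# Assumption: a verify pass costs roughly one baseline decode step.
s8 = speedup(TOKENS_PER_CYCLE, DRAFT_MS, STEP_MS_8BIT, STEP_MS_8BIT)
s4 = speedup(TOKENS_PER_CYCLE, DRAFT_MS, STEP_MS_4BIT, STEP_MS_4BIT)
print(f"8-bit: {s8:.2f}x  4-bit: {s4:.2f}x")
```

With the draft time held fixed, shrinking the target step makes the draft a larger share of each cycle, so the modeled ratio is lower for the 4-bit target even though its absolute throughput is higher.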

All numbers cover generation only (first token to last token, excluding prefill). The draft-token acceptance rate is about 80-87% across all models.

Implementation details

No DFlash implementation for MLX existed, so I wrote the runtime from scratch. These are the changes that actually moved the numbers:

head_dim=256 patch: Qwen3.5-9B uses head_dim=256, which MLX's steel_attention did not support. A 2-line patch unlocked the fast SDPA path.

Sync elision: Restructured the pipeline to go from 2 GPU→CPU syncs per cycle to 1. At 80+ tok/s, each sync costs about 0.5ms.

Packed QKV projection: Replaced 3 matmuls with 1 matmul plus a split, which means fewer kernel dispatches per layer.
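The packed QKV idea can be illustrated without any framework. The sketch below is not MLX code: plain Python lists stand in for arrays so it runs anywhere, and the helper names (`matmul`, `pack_columns`, `packed_qkv`) are hypothetical. It shows that concatenating the three weight matrices along the output axis and splitting the result gives the same Q, K, and V as three separate products.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Naive (rows x inner) @ (inner x cols) matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def pack_columns(*weights: Matrix) -> Matrix:
    """Concatenate weight matrices along the output (column) axis."""
    return [sum(rows, []) for rows in zip(*weights)]


def packed_qkv(x: Matrix, w_qkv: Matrix,
               q_dim: int, k_dim: int) -> Tuple[Matrix, Matrix, Matrix]:
    """One matmul followed by a split, instead of three matmuls."""
    out = matmul(x, w_qkv)
    q = [row[:q_dim] for row in out]
    k = [row[q_dim:q_dim + k_dim] for row in out]
    v = [row[q_dim + k_dim:] for row in out]
    return q, k, v


# Tiny example weights (made up for the demo).
x = [[1.0, 2.0], [3.0, 4.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[2.0], [3.0]]
w_v = [[1.0, 1.0], [1.0, -1.0]]

w_qkv = pack_columns(w_q, w_k, w_v)   # packed once, at load time
q, k, v = packed_qkv(x, w_qkv, q_dim=2, k_dim=1)

# Same results as the three separate projections.
assert q == matmul(x, w_q)
assert k == matmul(x, w_k)
assert v == matmul(x, w_v)
```

The arithmetic is unchanged; the saving reported in the post comes from launching one kernel per layer for the projection instead of three.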

Lessons on Apple Silicon

On unified memory everything is bandwidth-bound, which changes the rules of speculative decoding:

Custom Metal kernels (batched GEMV, fused gated SiLU, custom SDPA) all ran slower than stock MLX steel GEMM (reported as 0.5 to 0.8x). I ended up reverting all of them.

Verify cost is almost flat from 4 to 16 tokens (57ms vs 59ms), because weight loading, not token count, dominates the time. As a result, verifying fewer tokens when confidence is low does not help here.

On quantized models, the optimization landscape flips: the draft model (bf16) becomes slower than the verify step (int4/int8). This is the opposite of the bf16 case and a structural limitation of speculative decoding on bandwidth-bound hardware with quantized targets.
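The verify timings quoted above (57ms for 4 tokens, 59ms for 16) can be turned into per-token costs to make the "almost flat" point concrete. The two timings come from the post; the helper names are hypothetical.

```python
def per_token_ms(verify_ms: float, n_tokens: int) -> float:
    """Amortized verify cost per verified token."""
    return verify_ms / n_tokens


def relative_extra_cost(small_ms: float, large_ms: float) -> float:
    """Fractional increase in verify time when going from small to large."""
    return large_ms / small_ms - 1.0


# Timings reported in the post.
VERIFY_MS_4_TOKENS = 57.0
VERIFY_MS_16_TOKENS = 59.0

print(per_token_ms(VERIFY_MS_4_TOKENS, 4))     # prints 14.25
print(per_token_ms(VERIFY_MS_16_TOKENS, 16))   # prints 3.6875
extra = relative_extra_cost(VERIFY_MS_4_TOKENS, VERIFY_MS_16_TOKENS)
print(f"{extra:.1%} more time for 4x the tokens")  # prints 3.5% more time for 4x the tokens
```

Verifying four times as many tokens costs only about 3.5% more wall time, so shortening the verified block saves almost nothing while giving up tokens that might have been accepted.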

Currently working on

Draft compression/distillation: to fix the bf16 draft bottleneck on quantized targets (for the 27B model).

Long-context stability: speedup degrades past 2K tokens as the KV cache grows.

MoE (Mixture of Experts) models: DFlash draft models exist for Qwen3.5-35B-A3B (35B total, 3B active). This gives the verify cost of a small model with the quality of a large one.

Everything is still under construction. I will open source it when ready.


Original post (English)

I'm building a native MLX implementation of DFlash ([paper](https://arxiv.org/abs/2602.06036)) for Apple Silicon. A small draft model generates 16 tokens in parallel via block diffusion, the target verifies them in one forward pass. Output is bit-for-bit identical to baseline (greedy exact argmax match).

**Setup:** M5 Max, 64GB, MLX, no CUDA.

# Results

**Qwen3.5-9B bf16**

|Gen length|DFlash|Baseline|Speedup|
|:-|:-|:-|:-|
|1024 tokens|85 tok/s|26 tok/s|3.3x|
|2048 tokens|80 tok/s|26 tok/s|3.1x|

**Qwen3.5-4B bf16**

|Gen length|DFlash|Baseline|Speedup|
|:-|:-|:-|:-|
|1024 tokens|109 tok/s|41 tok/s|2.7x|
|2048 tokens|133 tok/s|42 tok/s|3.2x|

The 4B actually gets *faster* at longer generation. The model is small enough that the draft/verify balance stays healthy as context grows.

**Qwen3.5-27B quantized**

|Quant|Gen length|DFlash|Baseline|Speedup|
|:-|:-|:-|:-|:-|
|8bit|1024 tokens|35 tok/s|14 tok/s|2.5x|
|8bit|2048 tokens|26 tok/s|11 tok/s|2.3x|
|4bit|1024 tokens|44 tok/s|24 tok/s|1.9x|
|4bit|2048 tokens|40 tok/s|23 tok/s|1.7x|

**8bit gives better speedup ratios than 4bit.** int4 makes the verify so fast that the bf16 draft becomes the bottleneck. With int8, the draft/verify balance is healthier.

All numbers are generation only (first token to last token, no prefill). Acceptance around 80-87% across all models.

# What I built

No DFlash MLX implementation existed. I wrote the runtime from scratch. What actually moved the numbers:

**head\_dim=256 patch.** Qwen3.5-9B uses head\_dim=256, which MLX's steel\_attention didn't support. A 2-line patch unlocked the fast SDPA path.

**Sync elision.** Restructured the pipeline from 2 GPU→CPU syncs per cycle to 1. At 80+ tok/s each sync costs \~0.5ms.

**Packed QKV projection.** 3 matmuls → 1 matmul + split. Fewer kernel dispatches per layer.

# Lessons on Apple Silicon

On unified memory everything is bandwidth-bound, which changes the speculative decoding game:

Custom Metal kernels (batched-GEMV, fused gated SiLU, custom SDPA) all came back 0.5 to 0.8x *slower* than stock MLX steel GEMM. Ended up reverting all of them.

Verify cost is almost flat from 4 to 16 tokens (57ms vs 59ms). Weight loading dominates, not token count. "Verify fewer tokens when confidence is low" doesn't help here.

On quantized models, the optimization landscape flips: the draft (bf16) becomes slower than the verify (int4/int8). This is the opposite of the bf16 case and is a structural limitation of speculative decoding on bandwidth-bound hardware with quantized targets.

# Currently working on

**Draft compression/distillation** for the 27B to fix the bf16 draft bottleneck on quantized targets.

**Long context stability.** Speedup degrades past 2K tokens due to KV cache growth.

**MoE models.** DFlash drafts exist for Qwen3.5-35B-A3B (35B total, 3B active). Verify cost of a small model, quality of a large one.

Everything is still very much under construction. Will open source when ready.

On-device AI, Apple Silicon, Speculative decoding, MLX framework, Inference optimization