r/LocalLLaMA • 79 days ago

ExLlamaV3 major update: DFlash support and large speed gains!

IMP

8/10

Key summary

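ExLlamaV3, a local AI inference library, has shipped a major update. New 'DFlash' support delivers up to 3x faster text generation than the baseline on agentic and coding tasks. The update also adds Gemma 4 model support and optimizes several major open-source models, with reported gains ranging from 0.0% to 72.4% depending on the model and GPU.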

Translated body

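Turboderp has been on an absolute tear recently, in the endless battle to cram new llamas (LLMs) into smaller, faster boxes.

We started off last month with the release of Gemma 4 support, and continued with improved caching efficiency.

DFlash support came two weeks ago with these impressive results: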

|Category|Baseline|N-gram/suffix|DFlash|
|:-|:-|:-|:-|
|Agentic, code|55.98 t/s|89.58 t/s (1.60x)|140.61 t/s (2.51x)|
|Agentic, curl|54.03 t/s|74.62 t/s (1.38x)|125.94 t/s (2.33x)|
|Coding|59.21 t/s|75.34 t/s (1.27x)|177.67 t/s (3.00x)|
|Creative|59.10 t/s|67.26 t/s (1.13x)|89.19 t/s (1.50x)|
|Creative (reasoning)|59.03 t/s|64.25 t/s (1.09x)|93.54 t/s (1.58x)|
|Translation|58.11 t/s|55.39 t/s (0.95x)|75.73 t/s (1.30x)|
|Translation (reasoning)|58.08 t/s|80.21 t/s (1.38x)|119.43 t/s (2.06x)|
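
The multipliers in parentheses are throughput ratios against the baseline column. As a rough sanity check, here is a minimal Python sketch that recomputes them from the posted tokens/s figures. The names `RESULTS`, `speedup`, and `summarize` are made up for illustration and are not part of ExLlamaV3. Because the inputs are already rounded, the recomputed ratios can differ from the posted ones in the last digit.

```python
# Throughput figures (tokens/s) copied from the table above.
# Each entry: category -> (baseline, n-gram/suffix, DFlash).
# All names here are illustrative; nothing below calls ExLlamaV3.
RESULTS = {
    "Agentic, code": (55.98, 89.58, 140.61),
    "Agentic, curl": (54.03, 74.62, 125.94),
    "Coding": (59.21, 75.34, 177.67),
    "Creative": (59.10, 67.26, 89.19),
    "Creative (reasoning)": (59.03, 64.25, 93.54),
    "Translation": (58.11, 55.39, 75.73),
    "Translation (reasoning)": (58.08, 80.21, 119.43),
}


def speedup(new_tps: float, baseline_tps: float) -> float:
    """Return the throughput ratio of a method relative to the baseline."""
    return new_tps / baseline_tps


def summarize(results: dict) -> dict:
    """Map each category to (n-gram speedup, DFlash speedup)."""
    return {
        category: (speedup(ngram, base), speedup(dflash, base))
        for category, (base, ngram, dflash) in results.items()
    }


if __name__ == "__main__":
    for category, (ngram_x, dflash_x) in summarize(RESULTS).items():
        print(f"{category:<24} n-gram {ngram_x:.2f}x   DFlash {dflash_x:.2f}x")
```

A ratio below 1.0, as in the n-gram/suffix column for plain translation, means that method ran slower than the baseline on that workload.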

More model optimization followed last week, with these improvements:

|Model|3090¹|4090¹|5090¹|6000 Pro¹|5090²|6000 Pro²|
|:-|:-|:-|:-|:-|:-|:-|
|Qwen3.5-35B-A3B 4.00bpw|5.3%|5.8%|8.6%|10.3%|21.0%|23.5%|
|Qwen3.5-27B 4.00bpw|0.0%|1.9%|8.1%|11.7%|13.1%|15.0%|
|Trinity-Nano 4.15bpw|29.5%|48.6%|52.3%|52.9%|70.5%|72.4%|
|Gemma4-26B-A4B 4.10bpw|3.1%|2.9%|7.8%|9.6%|16.4%|19.2%|
|Gemma4-31B 4.00bpw|4.0%|4.9%|10.0%|8.0%|16.0%|12.0%|

DFlash model quantization support, more bug fixes, and further efficiency improvements landed in the last two days, and more work is already under way on the dev branch!

Come say hi at the exllama Discord.


Original post (English)

Turboderp has been on [an absolute tear](https://github.com/turboderp-org/exllamav3/commits/dev) recently, in the endless battle to cram new llamas into smaller, faster boxes.

We started off last month with the release of [gemma 4 support](https://github.com/turboderp-org/exllamav3/releases/tag/v0.0.29), and continued with [improved caching efficiency](https://github.com/turboderp-org/exllamav3/releases/tag/v0.0.30).

[DFlash support](https://github.com/turboderp-org/exllamav3/releases/tag/v0.0.31) came 2 weeks ago with these impressive results:

|Category|Baseline|N-gram/suffix|DFlash|
|:-|:-|:-|:-|
|Agentic, code|55.98 t/s|89.58 t/s (1.60x)|140.61 t/s (2.51x)|
|Agentic, curl|54.03 t/s|74.62 t/s (1.38x)|125.94 t/s (2.33x)|
|Coding|59.21 t/s|75.34 t/s (1.27x)|177.67 t/s (3.00x)|
|Creative|59.10 t/s|67.26 t/s (1.13x)|89.19 t/s (1.50x)|
|Creative (reasoning)|59.03 t/s|64.25 t/s (1.09x)|93.54 t/s (1.58x)|
|Translation|58.11 t/s|55.39 t/s (0.95x)|75.73 t/s (1.30x)|
|Translation (reasoning)|58.08 t/s|80.21 t/s (1.38x)|119.43 t/s (2.06x)|

[More model optimization](https://github.com/turboderp-org/exllamav3/releases/tag/v0.0.32) last week, with these improvements:

|Model|3090¹|4090¹|5090¹|6000 Pro¹|5090²|6000 Pro²|
|:-|:-|:-|:-|:-|:-|:-|
|Qwen3.5-35B-A3B 4.00bpw|5.3%|5.8%|8.6%|10.3%|21.0%|23.5%|
|Qwen3.5-27B 4.00bpw|0.0%|1.9%|8.1%|11.7%|13.1%|15.0%|
|Trinity-Nano 4.15bpw|29.5%|48.6%|52.3%|52.9%|70.5%|72.4%|
|Gemma4-26B-A4B 4.10bpw|3.1%|2.9%|7.8%|9.6%|16.4%|19.2%|
|Gemma4-31B 4.00bpw|4.0%|4.9%|10.0%|8.0%|16.0%|12.0%|

[DFlash model quantization](https://github.com/turboderp-org/exllamav3/releases/tag/v0.0.33) and more bugfixes + efficiency in the last 2 days, and more work on the dev branch already!

Come say hi at the [exllama discord](https://discord.gg/AD2mVhZzf).

Open source, local LLM, inference optimization, ExLlamaV3