r/LocalLLaMA • 114 days ago

Benchmarking Gemma 4 and other models on a Raspberry Pi 5

IMP

6/10

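Key summary

A Raspberry Pi 5 (16GB RAM) was fitted with the official M.2 HAT+ and an NVMe SSD, and the PCIe link was set to Gen3, which substantially raised storage read speed. For large language models that exceed RAM capacity, text generation speed improved by 1.5x to 2x, and the post shares measured inference performance for Gemma 4 and a range of other models.

Translated text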

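Hey all,

This is an update! A few days ago I posted about the performance of a Raspberry Pi 5 when using an SSD to run larger models. Several users rightly pointed out that PCIe is faster than the USB3 connection I was using, so I bought the official M.2 HAT.

Spoiler: as expected, read speed doubled, which led to a 1.5x to 2x improvement in tokens/sec for inference and text generation on models that run from swap.

I'll briefly repeat my setup:

Raspberry Pi 5 with 16GB RAM
Official Active Cooler
Official M.2 HAT+ Standard
1TB SSD connected via the HAT
Running stock Raspberry Pi OS Lite (Trixie)

My focus is on the question: what performance can I expect when buying a few standard components with only a little bit of tinkering? I know I can buy larger fans or coolers from third-party sellers, overclock and overvolt, or buy more niche devices like an Orange Pi, but that's not what I wanted. So I went with a standard Pi and kept tinkering to a minimum, so that most people can still do the same.

By default the Pi runs its PCIe interface at the Gen2 standard (so I only got about 418MB/sec read speed from the SSD when using the HAT). I appended dtparam=pciex1_gen=3 to the file "/boot/firmware/config.txt" and rebooted to use Gen3.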
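The config change can be made idempotent, so that re-running a setup script does not append the setting twice. The following is a minimal Python sketch, not the author's method: `ensure_line` is a hypothetical helper name, the sample file content is made up for the demo, and the sketch assumes the end of the file is not inside a conditional section that excludes the Pi 5. Writing the real file requires root privileges and a reboot afterwards.

```python
GEN3_LINE = "dtparam=pciex1_gen=3"  # setting given in the post


def ensure_line(config_text: str, line: str) -> str:
    """Return config_text with `line` appended unless it is already present.

    Commented-out copies (for example "#dtparam=...") do not count as present.
    """
    existing = [entry.strip() for entry in config_text.splitlines()]
    if line in existing:
        return config_text
    # Make sure the appended setting starts on its own line.
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + line + "\n"


if __name__ == "__main__":
    # Dry run on made-up sample content instead of the real
    # /boot/firmware/config.txt, so nothing on disk is modified.
    sample = "arm_64bit=1\n"
    updated = ensure_line(sample, GEN3_LINE)
    print(updated, end="")
    # Applying it a second time leaves the text unchanged.
    print(ensure_line(updated, GEN3_LINE) == updated)
```

To apply it for real, one would read "/boot/firmware/config.txt", pass its content through `ensure_line`, and write the result back as root.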

Read speed of the SSD increased from 360.18MB/sec (USB) by a factor of 2.2x, to what seems to be the maximum others have achieved with the HAT too.

$ sudo hdparm -t --direct /dev/nvme0n1p2
/dev/nvme0n1p2:
 Timing O_DIRECT disk reads: 2398 MB in  3.00 seconds = 798.72 MB/sec
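To put the measured numbers in context, the sketch below recomputes the speedups from the figures in the post and compares them with the theoretical payload bandwidth of a single PCIe lane. The line rates and encodings (Gen2: 5 GT/s with 8b/10b, Gen3: 8 GT/s with 128b/130b) are standard PCIe values, not something stated in the post. The helper name is hypothetical, and treating hdparm's "MB" as decimal megabytes is a simplifying assumption.

```python
def pcie_x1_payload_mb_s(gigatransfers_per_s: float,
                         payload_bits: int,
                         total_bits: int) -> float:
    """Theoretical one-lane PCIe bandwidth in MB/s after line-encoding overhead."""
    bits_per_s = gigatransfers_per_s * 1e9 * payload_bits / total_bits
    return bits_per_s / 8 / 1e6


# Measured read speeds reported in the post (MB/s).
USB3_READ = 360.18
GEN2_READ = 418.0   # the post gives this only approximately
GEN3_READ = 798.72

# Theoretical single-lane limits (assumption: standard PCIe signalling).
GEN2_LIMIT = pcie_x1_payload_mb_s(5.0, 8, 10)
GEN3_LIMIT = pcie_x1_payload_mb_s(8.0, 128, 130)

print(f"Gen3 vs USB3 speedup: {GEN3_READ / USB3_READ:.2f}x")
print(f"Gen3 vs Gen2 speedup: {GEN3_READ / GEN2_READ:.2f}x")
print(f"Gen2 link utilisation: {GEN2_READ / GEN2_LIMIT:.1%}")
print(f"Gen3 link utilisation: {GEN3_READ / GEN3_LIMIT:.1%}")
```

The first ratio rounds to the 2.2x quoted in the post. Both measurements sit somewhat below the single-lane limit, which is consistent with the remark that this is about the maximum others have seen with the HAT.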

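My SSD is partitioned into half swap space and half a partition where I store my models (but they could be stored anywhere else). Models that fit in RAM don't need the swap, of course.

I benchmarked all models with this command, testing prompt processing (pp512) and text generation (tg128) at zero context and (for almost all models) at 32k context: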

$ llama.cpp/build/bin/llama-bench -r 2 --mmap 0 -d 0,32768 -m <all-models-as-GGUF> --progress | tee bench.txt
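The post does not say how the `<all-models-as-GGUF>` placeholder was filled in. One way to assemble the same invocation for several model files is sketched below. It assumes llama-bench accepts the `-m` option more than once, which should be checked against the llama-bench help output for the build in use. `build_bench_command` and the example file names are hypothetical.

```python
import shlex

# Binary path as written in the post.
BENCH_BIN = "llama.cpp/build/bin/llama-bench"


def build_bench_command(models: list[str], bench_bin: str = BENCH_BIN) -> list[str]:
    """Build the argument list for one llama-bench run over several GGUF files."""
    if not models:
        raise ValueError("at least one model path is required")
    # Flags copied from the post: 2 repetitions, mmap disabled,
    # context depths 0 and 32768.
    cmd = [bench_bin, "-r", "2", "--mmap", "0", "-d", "0,32768"]
    for model in models:
        # Assumption: repeating -m adds another model to the same run.
        cmd += ["-m", model]
    cmd.append("--progress")
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, used only to show the resulting command line.
    example = build_bench_command(["models/a.gguf", "models/b.gguf"])
    print(shlex.join(example))
```

The function only builds the argument list. Passing it to `subprocess.run` and capturing the output into a file would reproduce the `| tee bench.txt` part of the original command.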

Here are the filtered results in alphabetical order (names adjusted, since GLM4.7-Flash, for example, was reported under its underlying deepseek2 architecture):

model	size	pp512	pp512 @ d32768	tg128	tg128 @ d32768
Bonsai 8B Q1_0	1.07 GiB	3.27	-	2.77	-
gemma3 12B-it Q8_0	11.64 GiB	12.88	3.34	1.00	0.66
gemma4 E2B-it Q8_0	4.69 GiB	41.76	12.64	4.52	2.50
gemma4 E4B-it Q8_0	7.62 GiB	22.16	9.44	2.28	1.53
gemma4 26B-A4B-it Q8_0	25.00 GiB	9.22	5.03	2.45	1.44
GLM-4.7-Flash 30B.A3B Q8_0	29.65 GiB	6.59	0.90	1.64	0.11
gpt-oss 20B IQ4_XS	11.39 GiB	9.13	2.71	4.77	1.36
gpt-oss 20B Q8_0	20.72 GiB	4.80	2.19	2.70	1.13
gpt-oss 120B Q8_0	59.02 GiB	5.11	1.77	1.95	0.79
kimi-linear 48B.A3B IQ1_M	10.17 GiB	8.67	2.78	4.24	0.58
mistral3 14B Q4_K_M	7.67 GiB	5.83	1.27	1.49	0.42
Qwen3-Coder 30B.A3B Q8_0	30.25 GiB	10.79	1.42	2.28	0.47
Qwen3.5 0.8B Q8_0	763.78 MiB	127.70	28.43	11.51	5.52
Qwen3.5 2B Q8_0	1.86 GiB	75.92	24.50	5.57	3.62
Qwen3.5 4B Q8_0	4.16 GiB	31.02	9.44	2.42	1
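The table can be summarized in two ways that relate to the post's main point: which models are larger than the Pi's RAM (and so depend on swap), and how much text generation speed each model retains at 32k context. The sketch below does this for rows copied from the table. The 16 GiB threshold is the nominal RAM size, so it is an optimistic assumption, since the OS and context buffers also need memory. Bonsai (no 32k result) and the final Qwen3.5 4B row (cut off in the source) are left out.

```python
RAM_GIB = 16.0  # nominal RAM of the Pi 5 in the post; usable memory is lower

# (model, size in GiB, tg128 at depth 0, tg128 at depth 32768), from the table.
ROWS = [
    ("gemma3 12B-it Q8_0", 11.64, 1.00, 0.66),
    ("gemma4 E2B-it Q8_0", 4.69, 4.52, 2.50),
    ("gemma4 E4B-it Q8_0", 7.62, 2.28, 1.53),
    ("gemma4 26B-A4B-it Q8_0", 25.00, 2.45, 1.44),
    ("GLM-4.7-Flash 30B.A3B Q8_0", 29.65, 1.64, 0.11),
    ("gpt-oss 20B IQ4_XS", 11.39, 4.77, 1.36),
    ("gpt-oss 20B Q8_0", 20.72, 2.70, 1.13),
    ("gpt-oss 120B Q8_0", 59.02, 1.95, 0.79),
    ("kimi-linear 48B.A3B IQ1_M", 10.17, 4.24, 0.58),
    ("mistral3 14B Q4_K_M", 7.67, 1.49, 0.42),
    ("Qwen3-Coder 30B.A3B Q8_0", 30.25, 2.28, 0.47),
    ("Qwen3.5 0.8B Q8_0", 763.78 / 1024, 11.51, 5.52),  # MiB converted to GiB
    ("Qwen3.5 2B Q8_0", 1.86, 5.57, 3.62),
]


def summarize(rows, ram_gib=RAM_GIB):
    """Flag models larger than RAM and compute the share of tg128 kept at 32k."""
    summary = []
    for name, size_gib, tg_0, tg_32k in rows:
        summary.append({
            "model": name,
            "exceeds_ram": size_gib > ram_gib,
            "tg_retained": tg_32k / tg_0,
        })
    return summary


for row in summarize(ROWS):
    where = "swap" if row["exceeds_ram"] else "RAM"
    print(f'{row["model"]:<28} {where:<5} {row["tg_retained"]:.0%}')
```

On these figures, five of the listed models exceed the nominal 16 GiB and therefore rely on the SSD-backed swap described in the post.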

Raspberry Pi, local AI, open-source models, benchmark, edge computing

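Browser-embedded AI 'Gemma Gem' released as open source

'Gemma Gem', a Chrome extension that runs Google's 'Gemma 4' model directly inside the browser, was shared on Hacker News. It uses WebGPU to run the AI on-device with no API key or cloud service, and because user data is not sent off the device, privacy is strongly protected. Its most notable feature is that it can perform agent tasks in the browser, such as reading page content, clicking buttons, filling in forms, and executing JavaScript.

On-device AI, web browser, Chrome extension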

r/LocalLLaMA • 114 days ago

IMP 7

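Real-time AI with Gemma E2B running on an M3 Pro

This is a report of a real-time AI built on the open-source Gemma model that takes audio and video input and responds with voice output, confirmed running locally on an Apple M3 Pro. It cannot handle complex agentic coding, but it supports multiple languages, which gives it innovative potential for language learning. A future in which a smartphone camera recognizes objects and you converse in your native language, as OpenAI demonstrated years ago, is getting closer in local environments too.

Open source, local AI, speech recognition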