r/LocalLLaMA • 78 days ago

Running a 1-Trillion-Parameter Model Locally with Intel Optane Memory

Key summary

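A local AI developer used discontinued Intel Optane Persistent Memory (PMem) to build a 768GB memory system at low cost. With it, they ran Kimi K2.5, a 1-trillion-parameter large language model, locally at about 4 tokens per second. The build is notable as an example of an efficient local inference setup that runs a top-tier model on a limited hardware budget.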

Post body

As the title states, my build is indeed able to run a 1 trillion parameter model (in this case Kimi K2.5) locally at ~4 tokens/second. I thought r/LocalLLaMA would be interested in the build because of that stat line, and also because it includes an unusual part, Intel Optane Persistent Memory, which I haven't seen anyone use in an LLM inference build before. Optane PMem is a DIMM form factor memory unit that functions somewhere between DRAM and an SSD. Intel has discontinued the line, and I found sticks on the secondhand market for much less than the equivalent DRAM capacity would cost. It is this large PMem capacity (768GB) that allows me to host such large models on my system. For my build I used the PMem in Memory Mode, where the PMem is available to the computer as RAM and the computer's DRAM sticks function as a cache.
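The Memory Mode arrangement described above can be put into numbers. This is a minimal sketch using the module counts from the parts list in this post. It assumes, following the post's description, that only the PMem capacity is presented as RAM while the DRAM serves purely as cache. The function name is hypothetical.

```python
# Module counts and sizes taken from the parts list in this post.
DRAM_MODULES, DRAM_GB_EACH = 6, 32
PMEM_MODULES, PMEM_GB_EACH = 6, 128


def memory_mode_layout(dram_modules, dram_gb_each, pmem_modules, pmem_gb_each):
    """Return (ram_visible_gb, dram_cache_gb, pmem_to_dram_ratio).

    Assumption: in Memory Mode the PMem capacity is what the system sees
    as RAM, and the DRAM acts only as a cache in front of it.
    """
    dram_cache_gb = dram_modules * dram_gb_each
    ram_visible_gb = pmem_modules * pmem_gb_each
    return ram_visible_gb, dram_cache_gb, ram_visible_gb / dram_cache_gb


visible, cache, ratio = memory_mode_layout(
    DRAM_MODULES, DRAM_GB_EACH, PMEM_MODULES, PMEM_GB_EACH
)
print(f"RAM visible to the system: {visible} GB")
print(f"DRAM acting as cache:      {cache} GB")
print(f"PMem-to-DRAM ratio:        {ratio:.0f}:1")
```

With these parts, the DRAM cache covers a quarter of the PMem capacity, so how often the active weights hit the cache likely matters for generation speed.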

Kimi K2.5's mixture-of-experts (MoE) architecture makes it an ideal test model for my build. To get these results, I used hybrid GPU/CPU inference with llama.cpp. With llama.cpp's "override-tensor" flag, the attention weights, the dense layer, the shared expert in each MoE layer, and the routing components of Kimi K2.5 (Unsloth Q2_K_XL quant) actually fit on my 12GB GPU. I also got pretty good results just using llama.cpp's "ngl auto" and "cmoe" flags and letting llama.cpp decide tensor placement as it sees fit. Either way, the sparse experts' weights (the bulk of the model size) generally live on PMem/DRAM and get processed as needed from there.
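The placement split described above can be sketched as a regex rule of the kind passed to llama.cpp's --override-tensor flag, which takes a "pattern=buffer" argument. This is only an illustration, not the exact command used for this build. The pattern is one commonly used to keep routed-expert tensors in system RAM. The tensor names are illustrative examples of GGUF naming, the model path is a placeholder, and the choice of the llama-server binary is an assumption.

```python
import re
import shlex

# Pattern commonly used to match routed (sparse) expert tensors.
# Assumption: the post does not give the exact pattern that was used.
EXPERT_PATTERN = r"\.ffn_.*_exps\."


def placement(tensor_name):
    """Return "CPU" for routed-expert tensors and "GPU" for everything else."""
    return "CPU" if re.search(EXPERT_PATTERN, tensor_name) else "GPU"


def build_command(model_path):
    """Assemble an argument list that offloads all layers except expert tensors."""
    return [
        "llama-server",                # assumed binary; llama-cli takes the same flags
        "--model", model_path,         # placeholder path, not from the post
        "--n-gpu-layers", "99",        # request that every layer go to the GPU
        "--override-tensor", f"{EXPERT_PATTERN}=CPU",  # then pin experts to system RAM
    ]


# Illustrative tensor names following GGUF conventions.
examples = [
    "blk.3.attn_output.weight",    # attention
    "blk.0.ffn_down.weight",       # dense layer
    "blk.3.ffn_gate_inp.weight",   # router
    "blk.3.ffn_up_shexp.weight",   # shared expert
    "blk.3.ffn_down_exps.weight",  # routed experts
]
for name in examples:
    print(f"{name:30s} -> {placement(name)}")

print(shlex.join(build_command("model.gguf")))
```

Only the routed-expert tensors match the pattern, so attention, dense, shared-expert and router tensors stay on the GPU, mirroring the split the post describes.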

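The end result from my testing with this setup is around 4 tokens per second for generation! Given that this is a trillion parameter frontier-class model running on such a limited hardware budget, I consider it a great success. It's a shame Intel discontinued Optane Persistent Memory, because the current direction of some local inference innovation, including SSD offloading and broader memory tiering approaches, could have been really interesting with this specific kind of memory tier on modern hardware platforms. Overall I am pleased with this Optane PMem-centric build: it allows me to run very big models at surprisingly acceptable speeds, and the process was highly educational.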

Parts:

Intel Xeon Gold 6246 CPU
TYAN S5630GMRE-CGN motherboard
ASUS Dual GeForce RTX 3060 OC 12GB GPU
6x 32GB Samsung 2666MHz DDR4 ECC DRAM sticks
6x 128GB Intel Optane DCPMM PC4-2666 NMA1XBD128GQS persistent memory modules
Western Digital WD SN850X 2TB M.2 2280 NVMe SSD
ASRock Steel Legend SL-850G 850W 80 PLUS GOLD & Cybenetics PLATINUM Full Modular Power Supply
Silverstone SST-GD08B (Black) Grandia Series Home Theater PC Case

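I hope you enjoyed this rundown. There is a lot more detail that I didn't include here, so I'm happy to answer questions about the build, the configuration, or the reasoning behind any of the component choices in the comments. Also, if anyone else has explored similarly...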

local-inference optane-memory large-language-model hardware-build llama-cpp