r/LocalLLaMA • 114 days ago

LLM successfully run on a 1998 iMac with 32 MB of RAM

IMP

2/10

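Key summary

A project has been published that runs a tiny 260K-parameter language model locally on an original iMac G3 from 1998 with 32 MB of RAM. Its notable feature is how it works around the severe constraints of retro hardware: cross-compilation, little-endian to big-endian conversion, and the small default memory allocation of classic Mac OS. It is best read as an interesting experiment in solving the technical problems of running an AI model in a constrained environment, not as a practical performance result.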

Translated post

Hardware:

• Stock iMac G3 Rev B (released October 1998). 233 MHz PowerPC 750, 32 MB RAM, Mac OS 8.5. No upgrades.

• Model: Andrej Karpathy's 260K TinyStories (Llama 2 architecture). The checkpoint is about 1 MB.
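The "about 1 MB" checkpoint figure follows from the parameter count. Here is a rough back-of-the-envelope check, assuming the weights are stored as 32-bit floats (the post does not state the storage type, so that is an assumption):

```python
PARAM_COUNT = 260_000          # approximate parameter count stated in the post
BYTES_PER_PARAM = 4            # assumption: weights stored as 32-bit floats
RAM_BYTES = 32 * 1024 * 1024   # 32 MB of RAM stated in the post

checkpoint_bytes = PARAM_COUNT * BYTES_PER_PARAM
ram_fraction = checkpoint_bytes / RAM_BYTES

print(f"checkpoint: about {checkpoint_bytes / 1_000_000:.2f} MB")
print(f"share of RAM: about {ram_fraction:.1%}")
```

The weights themselves take only a small slice of the 32 MB, which fits with the post's description of the memory difficulty as an application partition problem.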

Toolchain:

• Cross-compiled on a Mac mini using Retro68 (GCC for classic Mac OS → PEF binaries)

• Model and tokenizer endian-swapped from little-endian to big-endian for PowerPC

• Files transferred to the iMac via FTP over Ethernet
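The endian-swap step above could look roughly like this on the host machine. This is a minimal sketch, not the repository's script: `swap_32bit_words` is a hypothetical helper, and it assumes the file consists entirely of 4-byte fields (int32 header values followed by float32 weights). A tokenizer file that mixes 4-byte fields with raw token strings would need field-by-field handling instead, since string bytes must not be reordered.

```python
import struct


def swap_32bit_words(data: bytes) -> bytes:
    """Reverse the byte order of every 4-byte word in data.

    Words are treated as unsigned ints so that bit patterns (including
    float32 values) are preserved exactly; only the byte order changes.
    """
    if len(data) % 4 != 0:
        raise ValueError("length must be a multiple of 4 bytes")
    count = len(data) // 4
    words = struct.unpack(f"<{count}I", data)   # read as little-endian
    return struct.pack(f">{count}I", *words)    # write as big-endian


# Tiny demonstration: one int32 and one float32, as written on a
# little-endian host and then converted for a big-endian PowerPC reader.
little = struct.pack("<if", 260, 1.5)
big = swap_32bit_words(little)
print(struct.unpack(">if", big))
```

Applying the function twice returns the original bytes, which is a handy sanity check before sending a converted file over FTP.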

Challenges:

• Mac OS 8.5 gives apps a very small memory partition by default. Getting enough heap required MaxApplZone() and NewPtr() from the Mac Memory Manager.

• RetroConsole crashes on this hardware, so all output is written to a text file that can be opened in SimpleText.

• The original llama2.c weight layout assumes n_kv_heads == n_heads. The 260K model, however, uses grouped-query attention (kv_heads=4, heads=8), so every pointer after wk was misaligned and the output came out as NaN. Sizing wk/wv as n_kv_heads * head_size fixed this.

• Static buffers are used for the KV cache and the run state to avoid malloc failures in 32 MB.
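The grouped-query attention bug above comes down to offset arithmetic, which can be sketched as follows. This is an illustration, not the repository's code: `attention_offsets` is a hypothetical helper, the wq, wk, wv, wo ordering is assumed to match the llama2.c checkpoint layout, and dim=64 and n_layers=5 are assumed values for the 260K model. Only heads=8 and kv_heads=4 come from the post.

```python
def attention_offsets(dim, n_layers, n_heads, n_kv_heads, assume_equal_heads=False):
    """Return float offsets of wq, wk, wv, wo relative to the start of wq.

    With assume_equal_heads=True the key/value matrices are sized as if
    n_kv_heads == n_heads, which reproduces the misaligned layout.
    """
    head_size = dim // n_heads
    kv_heads = n_heads if assume_equal_heads else n_kv_heads
    kv_dim = kv_heads * head_size
    sizes = [
        ("wq", n_layers * dim * dim),
        ("wk", n_layers * dim * kv_dim),
        ("wv", n_layers * dim * kv_dim),
        ("wo", n_layers * dim * dim),
    ]
    offsets, cursor = {}, 0
    for name, size in sizes:
        offsets[name] = cursor
        cursor += size
    offsets["end"] = cursor
    return offsets


# Assumed config: dim=64, n_layers=5 (illustrative); heads/kv_heads from the post.
correct = attention_offsets(dim=64, n_layers=5, n_heads=8, n_kv_heads=4)
broken = attention_offsets(dim=64, n_layers=5, n_heads=8, n_kv_heads=4,
                           assume_equal_heads=True)
for name in ("wq", "wk", "wv", "wo"):
    print(name, correct[name], broken[name])
```

Under these assumptions wq and wk start at the same place in both layouts, but wv and everything after it land too far into the file in the broken case, so the model reads unrelated floats as weights. That is consistent with the NaN output the post describes.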

The program reads a prompt from prompt.txt, tokenizes it with BPE, runs inference, and writes the continuation to output.txt.
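For readers unfamiliar with the BPE step, here is a toy sketch of a greedy, score-based merge loop in the style of llama2.c's encoder. The vocabulary and scores are made up for illustration, `bpe_encode` is a hypothetical name, and a real tokenizer also has to deal with special tokens and characters missing from the vocabulary.

```python
def bpe_encode(text, vocab, scores):
    """Greedy BPE: repeatedly merge the adjacent pair with the best score."""
    token_to_id = {token: i for i, token in enumerate(vocab)}
    # Start from single characters; assumes every character is in the vocab.
    tokens = [token_to_id[ch] for ch in text]
    while True:
        best_score, best_idx, best_id = float("-inf"), -1, -1
        for i in range(len(tokens) - 1):
            merged = vocab[tokens[i]] + vocab[tokens[i + 1]]
            merged_id = token_to_id.get(merged)
            if merged_id is not None and scores[merged_id] > best_score:
                best_score, best_idx, best_id = scores[merged_id], i, merged_id
        if best_idx == -1:
            return tokens  # no adjacent pair can be merged any further
        tokens[best_idx:best_idx + 2] = [best_id]


# Hypothetical toy vocabulary: ids 0..4 with a score per token.
toy_vocab = ["a", "b", "c", "ab", "abc"]
toy_scores = [0.0, 0.0, 0.0, 1.0, 2.0]
print(bpe_encode("abc", toy_vocab, toy_scores))
print(bpe_encode("cab", toy_vocab, toy_scores))
```

With this toy vocabulary, "abc" collapses into a single token while "cab" stops at two, because "ca" and "cab" are not in the vocabulary.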

The output is very short, but this project is meant purely as a fun experiment and demo!

Here's the repo link: https://github.com/maddiedreese/imac-llm

Original post (English)

Hardware:

• Stock iMac G3 Rev B (October 1998). 233 MHz PowerPC 750, 32 MB RAM, Mac OS 8.5. No upgrades.

• Model: Andrej Karpathy’s 260K TinyStories (Llama 2 architecture). ~1 MB checkpoint.

Toolchain:

• Cross-compiled from a Mac mini using Retro68 (GCC for classic Mac OS → PEF binaries)

• Endian-swapped model + tokenizer from little-endian to big-endian for PowerPC

• Files transferred via FTP to the iMac over Ethernet

Challenges:

• Mac OS 8.5 gives apps a tiny memory partition by default. Had to use MaxApplZone() + NewPtr() from the Mac Memory Manager to get enough heap

• RetroConsole crashes on this hardware, so all output writes to a text file you open in SimpleText

• The original llama2.c weight layout assumes n_kv_heads == n_heads. The 260K model uses grouped-query attention (kv_heads=4, heads=8), which shifted every pointer after wk and produced NaN. Fixed by using n_kv_heads * head_size for wk/wv sizing

• Static buffers for the KV cache and run state to avoid malloc failures on 32 MB

It reads a prompt from prompt.txt, tokenizes with BPE, runs inference, and writes the continuation to output.txt.

Obviously the output is very short, but this is definitely meant to just be a fun experiment/demo!

Here’s the repo link: https://github.com/maddiedreese/imac-llm

On-device AI • Retro computing • LLM optimization • Open source