Hacker News • 114 days ago

Running Google Gemma 4 Locally with the LM Studio CLI and Claude Code

IMP

8/10

Key Summary

This post shows how to pair the new headless CLI in LM Studio 0.4.0 with Claude Code to run Google's Gemma 4 26B model locally on macOS. On a MacBook Pro with 48 GB of memory the model generates 51 tokens per second, and its main appeal is performance that rivals 400B+ models with no API cost.

Translated Text

Running Google Gemma 4 Locally with LM Studio's New Headless CLI and Claude Code

LM Studio 0.4.0 introduced llmster and the lms CLI. This post explains how to set up Gemma 4 26B for local inference on macOS so that it can be used with Claude Code. George Liu, April 4, 2026

Why run models locally?

Cloud AI APIs are great until something goes wrong. Rate limits, usage costs, privacy concerns, and network latency all add up. For quick tasks such as code review, drafting documents, or testing prompts, a local model that runs entirely on your own hardware has clear advantages: no API costs, no data leaving the machine, and consistent availability.

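Google's Gemma 4 is well suited to local use thanks to its mixture-of-experts (MoE) architecture. The 26B-parameter model activates only 4B parameters per forward pass, so it runs smoothly on hardware that could not handle a dense 26B model. On the author's 14-inch MacBook Pro M4 Pro with 48 GB of unified memory, it fits comfortably and generates 51 tokens per second. In the author's experience, however, it slows down significantly when used inside Claude Code.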

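The Gemma 4 model family

Google released Gemma 4 as a family of four models rather than a single model. The lineup covers a wide range of hardware targets.

The "E" models (E2B, E4B) use Per-Layer Embeddings to optimize for on-device deployment and are the only variants that support audio input (speech recognition and translation). The 31B dense model is the most capable, scoring 85.2% on MMLU Pro and 89.2% on AIME 2026.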

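Why the 26B-A4B

The mixture-of-experts (MoE) architecture is the key. The model has 128 experts plus 1 shared expert, but it activates only 8 experts (3.8B parameters) per token. A common rule of thumb estimates the dense-equivalent quality of an MoE model as roughly the square root of (total parameters × active parameters), which puts this model at about 10B effective parameters. In practice, it delivers quality well above its weight class at an inference cost comparable to a 4B dense model.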
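The rule of thumb above is easy to check by hand. The sketch below plugs in the figures quoted in the article (25.2B total, 3.8B active per token). The function name is made up for illustration, and the rule itself is a heuristic, not a measured result.

```python
import math


def dense_equivalent_b(total_b: float, active_b: float) -> float:
    """Rule-of-thumb dense-equivalent size, in billions of parameters.

    Heuristic only: the geometric mean of total and active parameters.
    """
    return math.sqrt(total_b * active_b)


# Figures quoted in the article: 25.2B total, 3.8B active per token.
estimate = dense_equivalent_b(25.2, 3.8)
print(f"Dense-equivalent estimate: {estimate:.1f}B")  # prints 9.8B
```

Using the rounded marketing figures (26B total, 4B active) gives about 10.2B instead, so "around 10B" holds either way.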

On benchmarks it scores 82.6% on MMLU Pro and 88.3% on AIME 2026, close to the dense 31B (85.2% and 89.2%) while running far faster. The chart below tells the story. It plots Elo score against total model size on a log scale for recent open-weight models with thinking enabled. The blue region in the upper left is where you want to be: high performance, small footprint. Gemma 4 26B-A4B (Elo about 1441) sits firmly in that zone, performing well above what its 25.2B parameters would suggest.

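The 31B dense variant scores slightly higher (about 1451) but is still remarkably compact. For context, models such as Qwen 3.5 397B-A17B (about 1450 Elo) and GLM-5 (about 1457 Elo) need 100-600B total parameters to reach similar scores. Kimi-K2.5 (about 1457 Elo) requires over 1,000B. The 26B-A4B achieves a competitive Elo score with a fraction of the parameters, which translates directly into lower memory requirements and faster local inference.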
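To get a feel for how small these Elo gaps are, the sketch below converts them into head-to-head expected scores using the standard logistic Elo formula. It assumes the chart's ratings follow the usual 400-point, base-10 scale, which the article does not state. The ratings are the approximate values quoted above.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (400-point, base-10 scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Approximate Elo values quoted in the article.
gemma_26b_a4b = 1441
rivals = {
    "Gemma 4 31B": 1451,
    "Qwen 3.5 397B-A17B": 1450,
    "GLM-5": 1457,
    "Kimi-K2.5": 1457,
}

for name, elo in rivals.items():
    p = expected_score(gemma_26b_a4b, elo)
    print(f"26B-A4B vs {name}: expected score {p:.3f}")
```

Under that assumption, a 16-point deficit corresponds to an expected score of about 0.48, close to an even match-up.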

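This is what makes MoE models transformative for local use. You do not need a cluster or an expensive GPU rig to compete with 400B+ parameter models. A laptop with 48 GB of unified memory is enough. For local inference on a 48 GB Mac, this model is the sweet spot. The dense 31B consumes more memory and generates tokens more slowly because every parameter participates in every forward pass. The E4B is lighter but noticeably less capable. The 26B-A4B offers a 256K maximum context, vision support (useful for analyzing screenshots and diagrams), native function/tool calling, and reasoning with configurable thinking modes, all at 51 tokens per second on the author's hardware.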

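What changed in LM Studio 0.4.0

LM Studio has long been a popular desktop app for running local models. Version 0.4.0 changed the architecture fundamentally by introducing llmster, the core inference engine extracted from the desktop app and packaged as a standalone server. As a result, you can now run LM Studio entirely from the command line using the lms CLI. No GUI is required.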

Original Text (English)

Running Google Gemma 4 Locally With LM Studio's New Headless CLI & Claude Code

LM Studio 0.4.0 introduced llmster and the lms CLI. Here is how I set up Gemma 4 26B for local inference on macOS that can be used with Claude Code.

George Liu
Apr 04, 2026

Why run models locally?

Cloud AI APIs are great until they are not. Rate limits, usage costs, privacy concerns, and network latency all add up. For quick tasks like code review, drafting, or testing prompts, a local model that runs entirely on your hardware has real advantages: zero API costs, no data leaving your machine, and consistent availability.

Google's Gemma 4 is interesting for local use because of its mixture-of-experts architecture. The 26B parameter model only activates 4B parameters per forward pass, which means it runs well on hardware that could never handle a dense 26B model. On my 14-inch MacBook Pro M4 Pro with 48 GB of unified memory, it fits comfortably and generates at 51 tokens per second. In my experience, though, there are significant slowdowns when it is used within Claude Code.

The Gemma 4 model family

Google released Gemma 4 as a family of four models, not just one. The lineup spans a wide range of hardware targets:

The "E" models (E2B, E4B) use Per-Layer Embeddings to optimize for on-device deployment and are the only variants that support audio input (speech recognition and translation). The 31B dense model is the most capable, scoring 85.2% on MMLU Pro and 89.2% on AIME 2026.

Why I picked the 26B-A4B

The mixture-of-experts architecture is the key. It has 128 experts plus 1 shared expert, but only activates 8 experts (3.8B parameters) per token. A common rule of thumb estimates MoE dense-equivalent quality as roughly sqrt(total x active parameters), which puts this model around 10B effective.
In practice, it delivers inference cost comparable to a 4B dense model with quality that punches well above that weight class. On benchmarks, it scores 82.6% on MMLU Pro and 88.3% on AIME 2026, close to the dense 31B (85.2% and 89.2%) while running dramatically faster.

The chart below tells the story. It plots Elo score against total model size on a log scale for recent open-weight models with thinking enabled. The blue-highlighted region in the upper left is where you want to be: high performance, small footprint. Gemma 4 26B-A4B (Elo ~1441) sits firmly in that zone, punching well above its 25.2B parameter weight. The 31B dense variant scores slightly higher (~1451) but is still remarkably compact.

For context, models like Qwen 3.5 397B-A17B (~1450 Elo) and GLM-5 (~1457 Elo) need 100-600B total parameters to reach similar scores. Kimi-K2.5 (~1457 Elo) requires over 1,000B. The 26B-A4B achieves competitive Elo with a fraction of the parameters, which translates directly into lower memory requirements and faster local inference.

This is what makes MoE models transformative for local use. You do not need a cluster or a high-end GPU rig to run a model that competes with 400B+ parameter behemoths. A laptop with 48 GB of unified memory is enough.

For local inference on a 48 GB Mac, this is the sweet spot. The dense 31B would consume more memory and generate tokens slower because every parameter participates in every forward pass. The E4B is lighter but noticeably less capable. The 26B-A4B gives you 256K max context, vision support (useful for analyzing screenshots and diagrams), native function/tool calling, and reasoning with configurable thinking modes, all at 51 tokens/second on my hardware.

What changed in LM Studio 0.4.0

LM Studio has been a popular desktop app for running local models for a while.
Version 0.4.0 changed the architecture fundamentally by introducing llmster, the core inference engine extracted from the desktop app and packaged as a standalone server.

The practical result: you can now run LM Studio entirely from the command line using the lms CLI. No GUI required. This makes it usable on headless servers, in CI/CD pipelines, SSH sessions, or just for developers who prefer staying in the terminal.

Key additions in 0.4.0:

- llmster daemon: a background service that manages model loading and inference without the desktop app
- lms CLI: full command-line interface for downloading, loading, chatting, and serving models
- Parallel request processing: continuous batching instead of sequential queuing, so multiple requests to the same model run concurrently
- Stateful REST API: a new /v1/chat endpoint that maintains conversation history across requests
- MCP integration: local Model Context Protocol support with permission-key gating

Installation

Install the lms CLI with a single command:

    # Linux/Mac
    curl -fsSL https://lmstudio.ai/install.sh | bash

    # Windows
    irm https://lmstudio.ai/install.ps1 | iex

Then start the headless daemon:

    lms daemon up

On macOS, update both inference runtimes:

    lms runtime update llama.cpp
    lms runtime update mlx

Downloading Gemma 4

With the daemon running, download Google's Gemma 4 26B model:

    lms get google/gemma-4-26b-a4b

The CLI shows you the variant it will download (Q4_K_M quantization by default, 17.99 GB) and asks for confirmation:

    ↓ To download: model google/gemma-4-26b-a4b - 64.75 KB
    └─ ↓ To download: Gemma 4 26B A4B Instruct Q4_K_M [GGUF] - 17.99 GB

    About to download 17.99 GB.

    ? Start download?
    ❯ Yes
      No
      Change variant selection

If you already have the model, the CLI tells you and shows the load command:

    ✔ Start download? yes
    Model already downloaded. To use, run: lms load google/gemma-4-26b-a4b

Checking your local model library

List all downloaded models:

    lms ls

    You have 10 models, taking up 118.17 GB of disk space.

    LLM                                    PARAMS    ARCH            SIZE        DEVICE
    gemma-3-270m-it-mlx                    270m      gemma3_text     497.80 MB   Local
    google/gemma-4-26b-a4b (1 variant)     26B-A4B   gemma4          17.99 GB    Local
    gpt-oss-20b-mlx                        20B       gpt_oss         22.26 GB    Local
    llama-3.2-1b-instruct                  1B        Llama           712.58 MB   Local
    nvidia/nemotron-3-nano (1 variant)     30B       nemotron_h      17.79 GB    Local
    openai/gpt-oss-20b (1 variant)         20B       gpt-oss         12.11 GB    Local
    qwen/qwen3.5-35b-a3b (1 variant)       35B-A3B   qwen35moe       22.07 GB    Local
    qwen2.5-0.5b-instruct-mlx              0.5B      Qwen2           293.99 MB   Local
    zai-org/glm-4.7-flash (1 variant)      30B       glm4_moe_lite   24.36 GB    Local

    EMBEDDING                              PARAMS    ARCH            SIZE        DEVICE
    text-embedding-nomic-embed-text-v1.5             Nomic BERT      84.11 MB    Local

Worth noting: several of these models use mixture-of-experts architectures (Gemma 4, Qwen 3.5, GLM 4.7 Flash). MoE models punch above their weight for local inference because only a fraction of parameters activate per token.

Running an interactive chat

Start a chat session with stats enabled to see performance numbers:

    lms chat google/gemma-4-26b-a4b --stats

    ╭─────────────────────────────────────────────────╮
    │ 👾 lms chat                                     │
    │ Type exit or Ctrl+C to quit                     │
    │                                                 │
    │ Chatting with google/gemma-4-26b-a4b            │
    │                                                 │
    │ Try one of the following commands:              │
    │ /model - Load a model (type /model to see list) │
    │ /download - Download a model                    │
    │ /clear - Clear the chat history                 │
    │ /help - Show help information                   │
    ╰─────────────────────────────────────────────────╯

With --stats, you get prediction metrics after each response:

    Prediction Stats:
      Stop Reason: eosFound
      Tokens/Second: 51.35
      Time to First Token: 1.551s
      Prompt Tokens: 39
      Predicted Tokens: 176
      Total Tokens: 215

51 tokens/second on a 14-inch MacBook Pro M4 Pro (48 GB) with a 26B model is solid. Time to first token at 1.5 seconds is responsive enough for interactive use.
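As a rough illustration of what those stats mean for perceived latency, the sketch below estimates how long the 176-token reply took end to end. It assumes that Tokens/Second measures decode speed only, so time to first token is added on top. That reading is an assumption, not something the lms output states, and the helper names are hypothetical.

```python
def generation_seconds(predicted_tokens: int, tokens_per_second: float) -> float:
    """Time spent decoding, assuming a steady decode rate."""
    return predicted_tokens / tokens_per_second


def end_to_end_seconds(ttft_s: float, predicted_tokens: int, tokens_per_second: float) -> float:
    """Time to first token plus decode time (assumes the rate excludes TTFT)."""
    return ttft_s + generation_seconds(predicted_tokens, tokens_per_second)


# Values copied from the Prediction Stats shown above.
gen = generation_seconds(176, 51.35)
total = end_to_end_seconds(1.551, 176, 51.35)
print(f"decode: {gen:.2f}s, end to end: {total:.2f}s")  # decode: 3.43s, end to end: 4.98s
```

On that reading, a short reply arrives in about five seconds, which matches the article's description of the model as responsive enough for interactive use.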
Checking loaded models and memory

See what is currently loaded:

    lms ps

    IDENTIFIER               MODEL                    STATUS   SIZE       CONTEXT   PARALLEL   DEVICE   TTL
    google/gemma-4-26b-a4b   google/gemma-4-26b-a4b   IDLE     17.99 GB   48000     2          Local    60m / 1h

The model occupies 17.99 GB in memory with a 48K context window and supports 2 parallel requests. The TTL (time-to-live) auto-unloads the model after 1 hour of idle time, freeing memory without manual intervention.

For detailed model metadata, pipe through jq:

    lms ps --json | jq

    [
      {
        "type": "ll

Local-Inference LM-Studio Gemma-4 MoE-Architecture Claude-Code