Hacker News • 77 days ago

Gemini's tool-calling ability, distilled into a tiny 26M-parameter model

Key summary

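The Cactus Compute team has distilled the tool-calling ability of Google's Gemini into Needle, a "Simple Attention Network" with just 26 million (26M) parameters, and released it on GitHub. On single-shot tool calls it outperforms existing models such as FunctionGemma-270m and Qwen-0.6B. It is small enough to fine-tune locally on a Mac or PC, targets consumer devices such as phones, smartwatches, and smart glasses, and is reported to process thousands of tokens per second when running on Cactus.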

Full text

We distilled Gemini 3.1 into a 26M-parameter "Simple Attention Network" that you can even fine-tune locally on your Mac or PC. In production, Needle runs on Cactus at 6000 tokens/sec prefill and 1200 tokens/sec decode. The weights are fully open at Cactus-Compute/needle, along with the dataset generation code.
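To put the quoted throughput in perspective, here is a back-of-the-envelope latency estimate. It is only a sketch: the request size (300 prompt tokens, 30 output tokens) is a made-up example, and the model treats prefill and decode as the only costs, ignoring tokenization and any runtime overhead.

```python
# Throughput figures quoted in the text (tokens per second).
PREFILL_TOKENS_PER_SEC = 6000
DECODE_TOKENS_PER_SEC = 1200


def estimate_latency_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Rough end-to-end latency: prefill time plus decode time."""
    prefill_time = prompt_tokens / PREFILL_TOKENS_PER_SEC
    decode_time = output_tokens / DECODE_TOKENS_PER_SEC
    return prefill_time + decode_time


# Hypothetical request: 300 prompt tokens (query plus tool schemas), 30 output tokens.
latency = estimate_latency_seconds(prompt_tokens=300, output_tokens=30)
print(f"{latency * 1000:.0f} ms")  # 75 ms
```

Under these assumptions a short tool call completes in well under a tenth of a second, which is the regime that matters for on-device assistants.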

Model architecture (hyperparameters: d=512, 8H/4KV, BPE=8192). The original diagram, read from inputs to output:

Encoder path: Text query → Embedding → Encoder x 12. Each encoder layer: ZCRMSNorm, Self Attn (GQA + RoPE), Gated Residual (no FFN).

Decoder path: [EOS]<tool_call> + answer → Embedding (shared) → Decoder x 8. Each decoder layer: ZCRMSNorm, Masked Self Attn + RoPE, Gated Residual, then ZCRMSNorm, Cross Attn (attending to the encoder output), Gated Residual.

Output head: ZCRMSNorm → Linear (T) (tied) → Softmax → Tool Call.
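The hyperparameters above are enough for a rough parameter count. The sketch below is an estimate, not the authors' accounting. It assumes that BPE=8192 is the vocabulary size, that every attention block (encoder self-attention, decoder self-attention, and cross-attention) uses bias-free projections with 8 query heads and 4 KV heads, that the shared and tied embedding is counted once, and that norm and gate parameters are small enough to ignore.

```python
D_MODEL = 512        # d=512
N_HEADS = 8          # 8H
N_KV_HEADS = 4       # 4KV
VOCAB_SIZE = 8192    # BPE=8192 (assumed to be the vocabulary size)
ENCODER_LAYERS = 12
DECODER_LAYERS = 8

head_dim = D_MODEL // N_HEADS      # 64 dimensions per head
kv_dim = N_KV_HEADS * head_dim     # 256: K and V are narrower than Q under GQA


def attention_params() -> int:
    """Weights in one attention block: Q and output are d x d, K and V are d x kv_dim."""
    q_and_out = 2 * D_MODEL * D_MODEL
    k_and_v = 2 * D_MODEL * kv_dim
    return q_and_out + k_and_v


def kv_head_for(query_head: int) -> int:
    """Grouped-query attention: consecutive query heads share one KV head."""
    return query_head // (N_HEADS // N_KV_HEADS)


embedding = VOCAB_SIZE * D_MODEL                    # shared and tied, counted once
encoder = ENCODER_LAYERS * attention_params()       # self-attention only, no FFN
decoder = DECODER_LAYERS * 2 * attention_params()   # self-attention plus cross-attention
total = embedding + encoder + decoder
print(f"{total:,}")  # 26,214,400
```

Under these assumptions the estimate lands close to the stated 26M, which suggests that with no feed-forward layers nearly all of the weights sit in the attention projections and the embedding table.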

Training: the model was pretrained on 16 TPU v6e chips for 200B tokens (27 hours), then post-trained on 2B tokens of a single-shot function call dataset (45 minutes).
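The training figures imply an average throughput that is easy to work out. This is plain arithmetic on the numbers quoted above. The per-chip figure assumes all 16 TPUs were in use for the full 27 hours, and no per-chip figure is given for post-training because the text does not say what hardware that stage used.

```python
def tokens_per_second(total_tokens: float, hours: float) -> float:
    """Average training throughput over a run."""
    return total_tokens / (hours * 3600)


pretrain_rate = tokens_per_second(200e9, 27)      # 200B tokens in 27 hours
pretrain_per_chip = pretrain_rate / 16            # assumes all 16 chips ran throughout
posttrain_rate = tokens_per_second(2e9, 45 / 60)  # 2B tokens in 45 minutes

print(f"pretraining:  {pretrain_rate / 1e6:.2f}M tokens/sec")      # 2.06M
print(f"per chip:     {pretrain_per_chip / 1e3:.1f}K tokens/sec")  # 128.6K
print(f"posttraining: {posttrain_rate / 1e6:.2f}M tokens/sec")     # 0.74M
```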

Needle is an experimental run for Simple Attention Networks, aimed at redefining tiny AI for consumer devices (phones, watches, glasses, and so on). So while it beats FunctionGemma-270m, Qwen-0.6B, Granite-350m, and LFM2.5-350m on single-shot function calling for personal AI, those models have more scope and capacity and excel in conversational settings. Small models can also be finicky. Please use the web UI to test on your own tools and fine-tune accordingly, at the click of a button.
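The text does not say how the single-shot function call benchmark is scored. One common approach is exact match on the tool name and arguments, and the sketch below illustrates that idea only. It is not the authors' evaluation code, and the function names are hypothetical. It assumes calls are serialized as a JSON list of objects with "name" and "arguments" keys, as in the usage example in this document.

```python
import json


def normalize_calls(raw: str) -> list:
    """Parse a JSON list of calls into a form that ignores key order and call order."""
    calls = json.loads(raw)
    return sorted(
        (call["name"], json.dumps(call.get("arguments", {}), sort_keys=True))
        for call in calls
    )


def is_exact_match(predicted: str, expected: str) -> bool:
    """True when the predicted calls equal the expected calls after normalization."""
    try:
        return normalize_calls(predicted) == normalize_calls(expected)
    except (json.JSONDecodeError, KeyError, TypeError):
        return False  # malformed model output counts as a miss


def accuracy(pairs) -> float:
    """Fraction of (predicted, expected) pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(is_exact_match(p, e) for p, e in pairs) / len(pairs)


gold = '[{"name":"get_weather","arguments":{"location":"San Francisco"}}]'
good = '[{"arguments": {"location": "San Francisco"}, "name": "get_weather"}]'
bad = '[{"name":"get_weather","arguments":{"location":"Paris"}}]'
print(accuracy([(good, gold), (bad, gold)]))  # 0.5
```

Exact match is strict: a correct tool with a slightly different argument string ("SF" instead of "San Francisco") counts as wrong, which is one reason to test on your own tools.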

Quickstart

    git clone https://github.com/cactus-compute/needle.git
    cd needle && source ./setup
    needle playground

This opens a web UI at http://127.0.0.1:7860 where you can test and fine-tune on your own tools. Weights are auto-downloaded.

Usage (Python)

    from needle import SimpleAttentionNetwork, load_checkpoint, generate, get_tokenizer

    # Load the weights and model configuration from the checkpoint.
    params, config = load_checkpoint("checkpoints/needle.pkl")
    model = SimpleAttentionNetwork(config)
    tokenizer = get_tokenizer()

    # Pass the user query together with the available tools as a JSON string.
    result = generate(
        model, params, tokenizer,
        query="What's the weather in San Francisco?",
        tools='[{"name":"get_weather","parameters":{"location":"string"}}]',
        stream=False,
    )
    print(result)

Output: [{"name":"get_weather","arguments":{"location":"San Francisco"}}]
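Once the model has produced a tool call, the application still has to execute it. The sketch below shows one way to do that, assuming the result is a JSON string shaped like the output above (the text shows the printed form but does not state the return type). The get_weather stub, TOOL_REGISTRY, and dispatch are hypothetical names for illustration and are not part of the needle package.

```python
import json


def get_weather(location: str) -> str:
    """Hypothetical local implementation of the get_weather tool."""
    return f"(stub) weather for {location}"


# Map tool names the model may emit to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def dispatch(result: str) -> list:
    """Run every call in the model output and collect the return values."""
    outputs = []
    for call in json.loads(result):
        name = call["name"]
        if name not in TOOL_REGISTRY:
            raise KeyError(f"model requested unknown tool: {name}")
        outputs.append(TOOL_REGISTRY[name](**call.get("arguments", {})))
    return outputs


result = '[{"name":"get_weather","arguments":{"location":"San Francisco"}}]'
print(dispatch(result))  # ['(stub) weather for San Francisco']
```

Rejecting unknown tool names explicitly is a deliberate choice here: small models can be finicky, so the caller should not assume every emitted name is one it actually offered.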

Finetuning

Playground (generates data via Gemini, trains, evaluates, and bundles the result)

needle playground

CLI (auto-downloads weights if they are not available locally)

needle finetune data.jsonl

CLI commands

    needle playground                  Test and finetune via web UI
    needle finetune <data.jsonl>       Finetune on your own data
    needle run --query "..." --tools   Single inference
    needle train                       Full training run
    needle pretrain                    Pretrain on PleIAs/SYNTH
    needle eval --checkpoint <path>    Evaluate a checkpoint
    needle tokenize                    Tokenize dataset
    needle generate-data               Synthesize training data via Gemini
    needle tpu <action>                TPU management (see docs/tpu.md)

Citation

    @misc{ndubuaku2026needle,
      title={Needle},
      author={Henry Ndubuaku, Jakub Mroz, Karen Mosoyan, Roman Shemet, Parkirat Sandhu, Satyajit Kumar, Noah Cylich, Justin H. Lee},
      year={2026},
      url={https://github.com/cactus-compute/needle}
    }


Small language models (SLM) • Open source • Model compression • Tool calling • Edge AI