Hacker News • 114 days ago

A tiny LLM that makes the principles of language models easy to grasp

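Key summary

GuppyLM, a tiny language model with about 9 million parameters, was shared on Hacker News. The project is built so that anyone can walk through the whole pipeline in a Colab environment in about 5 minutes, from data generation through the tokenizer, model architecture, training, and inference, and come away understanding how an LLM works. It uses only the most basic transformer architecture (a vanilla transformer) with no complex techniques, and focuses on dispelling the black-box feeling that large models give.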

Translated text

GuppyLM: A ~9M parameter LLM that talks like a small fish.

This project exists to show that training your own language model is not magic. No PhD required. No massive GPU cluster. One Colab notebook, 5 minutes, and you have a working LLM that you built from scratch: data generation, tokenizer, model architecture, training loop, and inference.

If you can run a notebook, you can train a language model. It won't produce a billion-parameter model that writes essays. But it will show you exactly how every piece works, from raw text to trained weights to generated output, so the big models stop feeling like black boxes.

[Example conversation]

You> hi guppy
Guppy> hi there. i just found a nice spot near the rock. the temperature feels nice.
You> are you hungry
Guppy> yes. always yes. i will swim to the top right now. i promise to eat all of it.
You> do you like bubbles
Guppy> i love bubbles. they make the water feel slightly different.
You> what is the meaning of life
Guppy> food. the answer is always food.
You> tell me a joke
Guppy> what did the fish say when it hit the wall. dam. (a pun on "dam" and "damn")
You> do you love me
Guppy> you're my favorite big shape. my mouth are happy when you're here.
You> goodnight guppy
Guppy> ok sleep time. i was following a bubble but now i'll stop. goodnight tank. goodnight water.

[What is GuppyLM?]

GuppyLM is a tiny language model that pretends to be a fish named Guppy. It speaks in short, lowercase sentences about water, food, light, and tank life. It doesn't understand human abstractions like money, phones, or politics, and it's not trying to. It is trained from scratch on 60K synthetic conversations across 60 topics; training runs on a single GPU in about 5 minutes and produces a model small enough to run in a browser.

[Architecture]

Parameters: 8.7M
Layers: 6
Hidden dim: 384
Heads: 6
FFN: 768 (ReLU)
Vocab: 4,096 (BPE)
Max sequence: 128 tokens
Norm: LayerNorm
Position: Learned embeddings
LM head: Weight-tied with embeddings

Vanilla transformer. No GQA, no RoPE, no SwiGLU, no early exit. As simple as it gets.
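The 8.7M figure can be sanity-checked from the hyperparameters listed above. The following is a rough sketch, not the project's own code: it assumes a GPT-style layout with biases on every linear layer, two LayerNorms per block, and one final LayerNorm, none of which the source states explicitly. The helper names are made up for illustration.

```python
# Back-of-the-envelope parameter count from the architecture table.
# Assumptions (not confirmed by the source): biases on all linear layers,
# two LayerNorms per block, one final LayerNorm before the LM head.
VOCAB, D_MODEL, N_LAYERS, D_FFN, MAX_SEQ = 4096, 384, 6, 768, 128


def linear(n_in: int, n_out: int) -> int:
    """Weight matrix plus bias vector of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(dim: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim


def count_parameters() -> dict:
    token_emb = VOCAB * D_MODEL               # reused as the LM head (weight tying)
    pos_emb = MAX_SEQ * D_MODEL               # learned position embeddings
    attention = 4 * linear(D_MODEL, D_MODEL)  # Q, K, V and output projections
    ffn = linear(D_MODEL, D_FFN) + linear(D_FFN, D_MODEL)
    block = attention + ffn + 2 * layer_norm(D_MODEL)
    total = token_emb + pos_emb + N_LAYERS * block + layer_norm(D_MODEL)
    return {"token_emb": token_emb, "pos_emb": pos_emb, "block": block, "total": total}


counts = count_parameters()
print(f"{counts['total']:,}")  # 8,726,016
```

Under these assumptions the total comes to about 8.73M, which is consistent with the stated 8.7M. Because the LM head shares the embedding matrix, the 4,096 x 384 vocabulary weights are counted only once; an untied head would add roughly 1.6M more.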

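[Personality]

Guppy:
Speaks in short, lowercase sentences.
Experiences the world through water, temperature, light, vibrations, and food.
Doesn't understand human abstractions.
Is friendly, curious, and a little dumb.
Thinks about food a lot.

60 topics: greetings, feelings, temperature, food, light, water, tank, noise, night, loneliness, bubbles, glass, reflection, breathing, swimming, colors, taste, plants, filter, algae, snails, scared, excited, bored, curious, happy, tired, outside, cats, rain, seasons, music, visitors, children, meaning of life, time, memory, dreams, size, future, past, name, weather, sleep, friends, jokes, fear, love, age, intelligence, health, singing, TV, and more.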

[Quick start]

Chat with Guppy (no training needed): Downloads the pre-trained model from HuggingFace and lets you chat. Just run all cells.
Train your own: Set the runtime to T4 GPU, then run all cells. This downloads the dataset, trains the tokenizer, trains the model, and tests it. Afterwards you can upload the model to HuggingFace or download it locally.
Chat locally:
pip install torch tokenizers
python -m guppylm chat

You> the cat is looking at you
Guppy> i hide behind the plant when the furry one comes.
You> it is raining outside
Guppy> i think rain is the best thing about outside.
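For readers wondering what happens between a prompt and a reply, here is a minimal sketch of an autoregressive generation loop. It is not the project's inference code: the source does not say how tokens are sampled, so temperature sampling is an assumption, and `sample_next`, `generate`, and `toy_model` are hypothetical names, with `toy_model` standing in for the trained transformer. The only value taken from the source is the 128-token context window.

```python
import math
import random

MAX_SEQ = 128  # context window from the architecture table


def sample_next(logits, temperature=0.8, rng=random):
    """Sample a token id from softmax(logits / temperature); temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate(next_logits, prompt_ids, eos_id, max_new_tokens=32, temperature=0.8, rng=random):
    """Append one sampled token at a time until EOS or the token budget is reached."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        window = ids[-MAX_SEQ:]  # the model never sees more than MAX_SEQ tokens
        token = sample_next(next_logits(window), temperature, rng)
        if token == eos_id:
            break
        ids.append(token)
    return ids[len(prompt_ids):]  # only the newly generated tokens


def toy_model(window):
    """Stub model: predicts (last token + 1) mod 5 with certainty."""
    target = (window[-1] + 1) % 5
    return [0.0 if i == target else float("-inf") for i in range(5)]


print(generate(toy_model, prompt_ids=[0], eos_id=4))  # [1, 2, 3]
```

In a real chat loop, `next_logits` would run the tokenizer-encoded prompt through the model and return the logits for the last position, and the generated ids would be decoded back to text.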

[Dataset] arman-bd/guppylm-60k-generic on HuggingFace.

Samples: 60,000 (57K train / 3K test)
Format: {"input": "...", "output": "...", "category": "..."}
Categories: 60
Generation: Synthetic template composition

from datasets import load_dataset
ds = load_dataset("arman-bd/guppylm-60k-generic")
print(ds["train"][0])
# {'input': 'hi guppy', 'output': 'hello. the water is nice today.', 'category': 'greeting'}
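To make "synthetic template composition" concrete, the sketch below fills reply templates with randomly chosen components and emits records in the input/output/category format shown above. The templates, word pools, and function names are invented for illustration; the project's actual generator and its lists are not reproduced here.

```python
import random

# Hypothetical templates and fill-in pools, for illustration only.
TEMPLATES = {
    "greeting": [("hi guppy", "hello. i am near the {obj} today.")],
    "food": [
        ("are you hungry", "yes. i want {food} right now."),
        ("what do you eat", "i eat {food}. i found some by the {obj}."),
    ],
}
TANK_OBJECTS = ["rock", "plant", "filter"]
FOODS = ["flakes", "pellets", "worms"]


def make_sample(rng: random.Random) -> dict:
    """Pick a category and template, then fill its slots with random components."""
    category = rng.choice(sorted(TEMPLATES))
    prompt, reply = rng.choice(TEMPLATES[category])
    # str.format ignores keyword arguments that a template does not use.
    output = reply.format(food=rng.choice(FOODS), obj=rng.choice(TANK_OBJECTS))
    return {"input": prompt, "output": output, "category": category}


def build_dataset(n_train: int, n_test: int, seed: int = 0):
    """Generate n_train + n_test samples and split them into two lists."""
    rng = random.Random(seed)
    samples = [make_sample(rng) for _ in range(n_train + n_test)]
    return samples[:n_train], samples[n_train:]


train, test = build_dataset(n_train=57, n_test=3)
print(len(train), len(test))
print(train[0])
```

The 57/3 call mirrors the 57K/3K split at a thousandth of the scale. A small number of templates multiplied by a few word pools yields many distinct replies while keeping the character's voice consistent.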

[Project structure]

guppylm/
├── config.py          # Hyperparameters (model + training)
├── model.py           # Vanilla transformer
├── dataset.py         # Data loading + batching
├── train.py           # Training loop (cosine LR, AMP)
├── generate_data.py   # Conversation data generator (60 topics)
├── eval_cases.py      # Held-out test cases
├── prepare_data.py    # Data prep + tokenizer training
└── inference.py       # Chat interface
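The training loop is described as using a cosine learning-rate schedule. A minimal sketch of such a schedule follows; the linear warmup, the peak and minimum rates, and the step counts are all assumptions chosen for illustration, not values from the project's config.

```python
import math


def cosine_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5, warmup_steps=100):
    """Linear warmup to peak_lr, then cosine decay down to min_lr.

    All default values are illustrative assumptions, not the project's settings.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at min_lr if training runs past total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 99, 550, 1000):
    print(step, f"{cosine_lr(step, total_steps=1000):.2e}")
```

In a training loop, the returned value would be written into the optimizer's learning rate before each step. The cosine shape keeps the rate high for most of the run and then lowers it smoothly, rather than cutting it in abrupt stages.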

Original text (English)

GuppyLM: A ~9M parameter LLM that talks like a small fish.

This project exists to show that training your own language model is not magic. No PhD required. No massive GPU cluster. One Colab notebook, 5 minutes, and you have a working LLM that you built from scratch: data generation, tokenizer, model architecture, training loop, and inference.

If you can run a notebook, you can train a language model. It won't produce a billion-parameter model that writes essays. But it will show you exactly how every piece works, from raw text to trained weights to generated output, so the big models stop feeling like black boxes.

You> hi guppy
Guppy> hi there. i just found a nice spot near the rock. the temperature feels nice.
You> are you hungry
Guppy> yes. always yes. i will swim to the top right now. i promise to eat all of it.
You> do you like bubbles
Guppy> i love bubbles. they make the water feel slightly different.
You> what is the meaning of life
Guppy> food. the answer is always food.
You> tell me a joke
Guppy> what did the fish say when it hit the wall. dam.
You> do you love me
Guppy> you're my favorite big shape. my mouth are happy when you're here.
You> goodnight guppy
Guppy> ok sleep time. i was following a bubble but now i'll stop. goodnight tank. goodnight water.

[What is GuppyLM?]

GuppyLM is a tiny language model that pretends to be a fish named Guppy. It speaks in short, lowercase sentences about water, food, light, and tank life. It doesn't understand human abstractions like money, phones, or politics, and it's not trying to. It's trained from scratch on 60K synthetic conversations across 60 topics, runs on a single GPU in ~5 minutes, and produces a model small enough to run in a browser.

[Architecture]

Parameters: 8.7M
Layers: 6
Hidden dim: 384
Heads: 6
FFN: 768 (ReLU)
Vocab: 4,096 (BPE)
Max sequence: 128 tokens
Norm: LayerNorm
Position: Learned embeddings
LM head: Weight-tied with embeddings

Vanilla transformer. No GQA, no RoPE, no SwiGLU, no early exit. As simple as it gets.

[Personality]

Guppy:
Speaks in short, lowercase sentences
Experiences the world through water, temperature, light, vibrations, and food
Doesn't understand human abstractions
Is friendly, curious, and a little dumb
Thinks about food a lot

60 topics: greetings, feelings, temperature, food, light, water, tank, noise, night, loneliness, bubbles, glass, reflection, breathing, swimming, colors, taste, plants, filter, algae, snails, scared, excited, bored, curious, happy, tired, outside, cats, rain, seasons, music, visitors, children, meaning of life, time, memory, dreams, size, future, past, name, weather, sleep, friends, jokes, fear, love, age, intelligence, health, singing, TV, and more.

[Quick Start]

Chat with Guppy (no training needed): Downloads the pre-trained model from HuggingFace and lets you chat. Just run all cells.
Train your own:
Set runtime to T4 GPU
Run all cells: downloads dataset, trains tokenizer, trains model, tests it
Upload to HuggingFace or download locally
Chat locally:
pip install torch tokenizers
python -m guppylm chat

You> the cat is looking at you
Guppy> i hide behind the plant when the furry one comes.
You> it is raining outside
Guppy> i think rain is the best thing about outside.

[Dataset] arman-bd/guppylm-60k-generic on HuggingFace.

Samples: 60,000 (57K train / 3K test)
Format: {"input": "...", "output": "...", "category": "..."}
Categories: 60
Generation: Synthetic template composition

from datasets import load_dataset
ds = load_dataset("arman-bd/guppylm-60k-generic")
print(ds["train"][0])
# {'input': 'hi guppy', 'output': 'hello. the water is nice today.', 'category': 'greeting'}

[Project Structure]

guppylm/
├── config.py          Hyperparameters (model + training)
├── model.py           Vanilla transformer
├── dataset.py         Data loading + batching
├── train.py           Training loop (cosine LR, AMP)
├── generate_data.py   Conversation data generator (60 topics)
├── eval_cases.py      Held-out test cases
├── prepare_data.py    Data prep + tokenizer training
└── inference.py       Chat interface
tools/
├── make_colab.py      Generates guppy_colab.ipynb
├── export_dataset.py  Push dataset to HuggingFace
└── dataset_card.md    HuggingFace dataset README

[Design Decisions]

Why no system prompt? Every training sample had the same one. A 9M model can't conditionally follow instructions; the personality is baked into the weights. Removing it saves ~60 tokens per inference.

Why single-turn only? Multi-turn degraded at turn 3-4 due to the 128-token context window. A fish that forgets is on-brand, but garbled output isn't. Single-turn is reliable.

Why vanilla transformer? GQA, SwiGLU, RoPE, and early exit add complexity that doesn't help at 9M params. Standard attention + ReLU FFN + LayerNorm produces the same quality with simpler code.

Why synthetic data? A fish character with consistent personality needs consistent training data. Template composition with randomized components (30 tank objects, 17 food types, 25 activities) generates ~16K unique outputs from ~60 templates.

[License] MIT

Open source, small language model, training guide, transformer