Hacker News • 96 days ago

A Visual Guide to How LLMs Work

IMP

8/10

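Key Summary

A visualization project that interactively walks through the full process of building a large language model (LLM), based on a technical talk by Andrej Karpathy, was shared on Hacker News. It uses an intuitive pipeline to explain the core pre-training steps: collecting raw internet text, cleaning the data, and tokenizing it so that a neural network can process it. Because it conveys how much data collection and quality control matter as the foundation of AI model development, it is a useful resource for practitioners and beginners alike.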

Translated Text

A Visual Deep Dive: How LLMs Actually Work

Live LLM Response. Human: What is behind this text box?

A complete walkthrough of how large language models like ChatGPT are built, from raw internet text to a conversational assistant. Based on Andrej Karpathy's technical deep dive.

Training tokens: 15 trillion · Parameters: 405B (405 billion) · Text data: 44 TB · Token vocabulary: 100K (100,000)

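Chapter 1 · Pre-Training · Stage 1: Downloading the Internet

The first step is collecting an enormous amount of text. Organizations like Common Crawl have been crawling the web since 2007 and had indexed 2.7 billion web pages by 2024. This raw data is then filtered into a high-quality dataset such as FineWeb. The goal is a large quantity of documents that are both high quality and diverse. After aggressive filtering, about 44 terabytes of data remain (roughly what fits on a single hard drive), representing about 15 trillion tokens.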
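
As a rough sanity check on these figures, the two headline numbers can be divided to see how many bytes of text each token covers on average. This is only back-of-the-envelope arithmetic; it assumes decimal terabytes (10^12 bytes) and that the 44 TB figure refers to the filtered text itself, neither of which the page states explicitly.

```python
# Back-of-the-envelope check using the figures quoted above.
# Assumption: 1 TB = 10**12 bytes, and 44 TB is the size of the filtered text.
DATASET_BYTES = 44 * 10**12   # ~44 TB of filtered text
TOKEN_COUNT = 15 * 10**12     # ~15 trillion tokens

bytes_per_token = DATASET_BYTES / TOKEN_COUNT
print(f"{bytes_per_token:.2f} bytes of text per token on average")
```

Under those assumptions each token stands for roughly three bytes of text, which is consistent with tokens being sub-word chunks rather than single characters.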

Key Insight: The quality and diversity of this training data affect the final model more than almost anything else. "Garbage in, garbage out" applies just as much at the trillion-token scale.

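🌐 Common Crawl (2.7 billion web pages · Raw HTML · Since 2007)
A non-profit organization that crawls the web and provides its data for free. Its bots follow links from seed pages, recursively indexing the internet. The raw archive consists of petabytes of gzip-compressed WARC files containing raw HTML.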

↓

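🚫 URL Filtering (Blocklists · Malware · Spam · Adult content)
Blocklists of known malware sites, spam networks, adult content, marketing pages, and low-quality domains are applied. Entire domains can be removed. This is the cheapest filter, so it runs first.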

↓

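📄 Text Extraction (HTML → clean text · Remove navigation and CSS)
Raw HTML contains <div> tags, CSS, JavaScript, navigation menus, and ads. Parsers extract only the meaningful text content. This is harder than it sounds: heuristics have to decide what is actual "content" and what is merely page "chrome" (the surrounding interface elements).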

↓

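🌍 Language Filtering (Keep pages that are at least 65% English · Language classifier)
A language classifier estimates the language of each page. Pages with less than 65% target-language content are dropped. This is a design decision: filter aggressively for one language, or train a multilingual model.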

↓

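♻️ Deduplication (Exact and fuzzy matching · Reduce repetition)
Identical or near-identical pages appear millions of times on the internet (copied articles, boilerplate text, and so on). Training on the same text repeatedly pushes the model toward rote memorization. Deduplication removes these copies using MinHash for near-duplicates together with exact-match techniques.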

↓

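🔒 PII Removal (Names · Addresses · SSNs · Emails)
When personally identifiable information (PII) is detected, it is either redacted or the whole page is dropped. Regular-expression patterns and machine-learning classifiers find phone numbers, emails, Social Security numbers, physical addresses, and the names of specific individuals.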

↓

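✅ FineWeb Dataset (44 TB · 15 trillion tokens · High quality)
The final filtered dataset. It spans the full breadth of human knowledge expressed in text: articles about tornadoes in 2012, medical facts, history, code, recipes, scientific papers, and more. This becomes the training corpus.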
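
To make the pipeline above more concrete, here is a minimal sketch of three of its stages chained together: URL blocklisting, exact-match deduplication, and regex-based PII redaction. It is an illustration under assumptions, not FineWeb's actual code. The blocklist entries, the function names (`url_allowed`, `redact_pii`, `run_pipeline`), and the two regex patterns are hypothetical, and the text extraction, language filtering, and MinHash near-duplicate stages are left out.

```python
import hashlib
import re
from urllib.parse import urlparse

# Hypothetical blocklist; real pipelines use large curated lists.
BLOCKED_DOMAINS = {"spam.example", "malware.example"}

# Deliberately simple illustrative patterns; production PII detection
# combines many more patterns with ML classifiers.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def url_allowed(url):
    """Return False if the URL's host is a blocked domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def redact_pii(text):
    """Replace matched emails and SSN-shaped numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)


def run_pipeline(pages):
    """Filter (url, text) pairs: blocklist first, then exact dedup, then redaction."""
    seen_hashes = set()
    kept = []
    for url, text in pages:
        # Cheapest check runs first, as the page describes.
        if not url_allowed(url):
            continue
        # Exact-match dedup on whitespace- and case-normalized text.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        kept.append((url, redact_pii(text)))
    return kept


pages = [
    ("https://news.example/a", "Contact bob@example.com now"),
    ("https://ads.spam.example/x", "Buy now"),
    ("https://mirror.example/a", "contact  bob@example.com NOW"),
    ("https://blog.example/p", "SSN 123-45-6789 leaked"),
]
for url, text in run_pipeline(pages):
    print(url, "->", text)
```

Of the four sample pages, the spam-domain page is dropped by the blocklist and the mirrored article is dropped as a duplicate, so two redacted pages remain. Ordering the checks from cheapest to most expensive follows the reasoning the page gives for running URL filtering first.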

Chapter 1 · Pre-Training · Stage 2: Tokenization

Neural networks cannot process raw text; they need numbers. The solution is tokenization: breaking text into "tokens" (sub-word chunks) and assigning each one an ID. GPT-4 uses a vocabulary of 100,277 tokens, built with the Byte Pair Encoding (BPE) algorithm.

BPE starts with individual bytes (256 symbols) and iteratively merges the most frequent adjacent pair. Each merge shortens the sequence while growing the vocabulary by one entry.
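
The merge loop described above can be sketched in a few lines of Python. This is a toy illustration, not GPT-4's tokenizer: the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are made up for this example, ties between equally frequent pairs are broken arbitrarily, and real tokenizers add pre-splitting rules and train on far more text.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of token IDs, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Start from raw UTF-8 bytes (IDs 0-255) and apply up to `num_merges` merges."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step  # new tokens are numbered after the 256 byte values
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(len("aaabdaaabac"), "bytes ->", len(ids), "tokens")
print("vocabulary size:", 256 + len(merges))
```

On this 11-byte string, three merges shrink the sequence to 5 tokens while the vocabulary grows from 256 to 259 entries, which is the trade-off the paragraph describes.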

Why not just use whole words? Words come in an open-ended number of variants: "run", "running", and "runner" would be three separate entries. Subword tokens share roots instead: "run" + "ning", "run" + "ner". This approach also handles new words, typos, and multiple languages efficiently.

[Interactive diagram: BPE in Action. Shows how Byte Pair Encoding progressively merges characters into subword tokens.]

Original Text (English)

A Visual Deep Dive: How LLMs Actually Work

Live LLM Response. Human: What is behind this text box?

A complete walkthrough of how large language models like ChatGPT are built, from raw internet text to a conversational assistant. Based on Andrej Karpathy's technical deep dive.

Training Tokens: 15T · Parameters: 405B · Text Data: 44 TB · Token Vocabulary: 100K

Chapter 1 · Pre-Training · Stage 1: Downloading the Internet

The first step is collecting an enormous amount of text. Organizations like Common Crawl have been crawling the web since 2007, indexing 2.7 billion pages by 2024. This raw data is then filtered into a high-quality dataset like FineWeb. The goal: a large quantity of high-quality, diverse documents. After aggressive filtering, you end up with about 44 terabytes (roughly what fits on a single hard drive), representing ~15 trillion tokens.

Key Insight: The quality and diversity of this training data has more impact on the final model than almost anything else. Garbage in, garbage out, but at a trillion-token scale.

🌐 Common Crawl (2.7B web pages · Raw HTML · Since 2007)
A non-profit organization that crawls the web and freely provides its data. Their bots follow links from seed pages, recursively indexing the internet. The raw archive is petabytes of gzip'd WARC files containing raw HTML.

↓

🚫 URL Filtering (Blocklists · Malware · Spam · Adult content)
Block-lists of known malware sites, spam networks, adult content, marketing pages, and low-quality domains are applied. Entire domains can be removed. This is the cheapest filter so it runs first.

↓

📄 Text Extraction (HTML → clean text · Remove navigation & CSS)
Raw HTML contains <div> tags, CSS, JavaScript, navigation menus, and ads. Parsers extract just the meaningful text content. This is harder than it sounds: heuristics decide what's "content" vs "chrome".

↓

🌍 Language Filtering (Keep pages ≥65% English · Language classifier)
A language classifier estimates the language of each page. Pages with less than 65% target-language content are dropped. This is a design decision: filter aggressively for one language or train multilingual.

↓

♻️ Deduplication (Exact & fuzzy matching · Reduce repetition)
Identical or near-identical pages appear millions of times on the internet (copied articles, boilerplate). Training on the same text repeatedly causes memorization. Dedup uses MinHash and exact-match techniques to remove duplicates.

↓

🔒 PII Removal (Names · Addresses · SSNs · Emails)
Personally Identifiable Information is detected and either redacted or the page is dropped. Regex patterns and ML classifiers find phone numbers, emails, Social Security numbers, physical addresses, and named individuals.

↓

✅ FineWeb Dataset (44 TB · 15 Trillion tokens · High quality)
The final filtered dataset. Articles about tornadoes in 2012, medical facts, history, code, recipes, science papers: the full breadth of human knowledge expressed in text. This becomes the training corpus.

Chapter 1 · Pre-Training · Stage 2: Tokenization

Neural networks can't process raw text; they need numbers. The solution is tokenization: breaking text into "tokens" (sub-word chunks) and assigning each an ID. GPT-4 uses a vocabulary of 100,277 tokens, built via the Byte Pair Encoding (BPE) algorithm.

BPE starts with individual bytes (256 symbols), then iteratively merges the most frequent adjacent pairs, compressing the sequence length while expanding the vocabulary.

Why not just use words? Words have infinite variants. "run", "running", "runner" would be 3 separate entries. Subword tokens share roots: "run" + "ning", "run" + "ner". This also handles new words, typos, and multiple languages efficiently.
[Interactive diagram: BPE in Action. Shows how Byte Pair Encoding progressively merges characters into subword tokens.]

[Interactive widget: Live Tokenizer. Hover or focus any token to see its ID.]

Explore tokenization across GPT-4, Claude, Llama and more: tiktokenizer.vercel.app

Chapter 1 · Pre-Training · Stage 3: Training the Neural Network

The Transformer neural network is initialized with random parameters: billions of "knobs". Training adjusts these knobs so the network gets better at predicting the next token in any sequence.

Every training step: sample a window of tokens → feed to network → compare prediction to actual next token → nudge all parameters slightly in the right direction. Repeat billions of times. The loss, a single number measuring prediction error, falls steadily as the model learns the statistical patterns of human language.

Scale: GPT-2 (2019): 1.6B params, 100B tokens, ~$40K to train. Today: same quality for ~$100. Llama 3: 405B params, 15T tokens. Modern frontier models: hundreds of billions of parameters, trillions of tokens.

[Interactive figure: Transformer Architecture. Cross-entropy training loss by training step: step 1: 11.2 · step 500: 4.8 · step 5K: 3.1 · step 32K: 2.4. Sample model output at step 500: "the model has learn ing but confus tion still the wqp mxr model bns to predict ..."]

What the model is learning: At step 1: pure noise. By step 500: local coherence appears. By step 32K: fluent English. The model is learning grammar, facts, reasoning patterns, all implicitly from token prediction.
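
The loss in this stage is the cross-entropy between the model's predicted distribution and the token that actually came next. A minimal sketch in plain Python follows; it uses toy logits rather than a real network, and the function names (`softmax`, `cross_entropy`) are chosen for this example and are not taken from the page.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_id):
    """Negative log-probability the model assigned to the actual next token."""
    return -math.log(softmax(logits)[target_id])


# Toy 4-token vocabulary; token 2 is the "actual next token".
untrained = [0.0, 0.0, 0.0, 0.0]   # no preference: a uniform guess
trained = [0.1, 0.2, 3.0, -1.0]    # most probability mass on token 2

print("untrained loss:", round(cross_entropy(untrained, 2), 3))
print("trained loss:  ", round(cross_entropy(trained, 2), 3))

# A uniform guess over a 100,277-token vocabulary costs ln(100277) nats.
print("uniform guess over 100,277 tokens:", round(math.log(100277), 2))
```

A model that guesses uniformly pays ln(vocabulary size), which is about 11.5 for a 100,277-token vocabulary. That is in the same range as the step-1 loss shown in the page's demo, and it helps explain why the curve starts high and falls as the model learns to put more probability on the right token.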
Chapter 1 · Pre-Training · Stage 4: Inference & Token Sampling

Once trained, the network generates text autoregressively: feed a sequence of tokens → get a probability distribution over all 100K possible next tokens → sample one → append → repeat.

This process is stochastic: the same prompt can generate a different output each time because we're flipping a biased coin. Higher-probability tokens are more likely but not guaranteed to be chosen.

Temperature controls randomness. Low temperature (0.1) → the model almost always picks the top token. High temperature (2.0) → near-uniform chaos. 0.7–1.0 is the sweet spot for coherent-but-creative text.

Key Mental Model: The model doesn't "think" about what to say. It computes a probability distribution over all possible next tokens and samples from it. Every word is a coin flip, just a very informed one.

[Interactive demo: Token Sampling. Each bar shows the probability of a candidate next token. Example prompt: "The sky appears blue"; temperature (randomness): 0.8.]

Chapter 2 · The Base Model: The Internet Simulator

After pre-training, you have a base model: a sophisticated autocomplete engine. It's not an assistant. It doesn't answer questions. It continues token sequences based on what it saw on the internet.

Give it a Wikipedia sentence and it'll complete it from memory. Ask it "What is 2+2?" and it might give you a math textbook page, a quiz answer key, or go off on a tangent: whatever was statistically common in its training data.

The base model's knowledge lives in its 405 billion parameters, a lossy compression of the internet, like a zip file that approximates rather than perfectly stores information.

Base Model Behavior

Few-Shot Prompting
Hello: Bonjour | Cat: Chat | Dog: Chien | Teacher: → Professeur ✓ correct

Memorization
Zebras (/ˈzɛbrə, ˈziːbrə/) are African equines with distinctive... ...black-and-white striped coats. There are three living species: the Grévy's zebra, plains zebra, and mountain zebra...
↑ Verbatim Wikipedia recall from weights

Hallucination
The Republican Party nominated Trump and [running mate] in the 2024 election against...
→ ...Mike Pence, facing Hillary Clinton and Tim Kaine...
→ ...Ron DeSantis, against Joe Biden and Kamala Harris...
↑ Knowledge cutoff → plausible confabulation

In-Context Learning Ba

LLM Basics · Data Processing · Visualization Resources · Tokenization · Andrej Karpathy