Hacker News • 103 days ago

A Transformer Neural Network Implemented on a 1989 Macintosh

Key Summary

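'MacMind', a complete transformer neural network implemented in HyperCard on a 1989 Macintosh SE/30, has been released. The 1,216-parameter model uses the same mathematical principles as modern LLMs (self-attention, backpropagation, and so on) to learn the bit-reversal permutation, the first step of the Fast Fourier Transform (FFT), from random examples. The project shows visually that the workings of large AI models are understandable math rather than magic, which makes it valuable for teaching the fundamentals of AI.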

Translated Text

MacMind: A Complete Transformer Neural Network Implemented in HyperCard on a 1989 Macintosh

MacMind is a complete transformer neural network implemented entirely in HyperTalk and trained on a Macintosh SE/30. It is a 1,216-parameter, single-layer, single-head transformer that learns the bit-reversal permutation, the first step of the Fast Fourier Transform (FFT), from random examples. Every line of the network is written in HyperTalk, a scripting language designed in 1987 for building interactive card stacks, not for matrix math. The model has token embeddings, positional encoding, self-attention with scaled dot-product scores, cross-entropy loss, full backpropagation, and stochastic gradient descent (SGD). There is no compiled code, no external library, and no black box. Option-click any button to read the actual math.

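Why This Exists

The fundamental process that trained MacMind (forward pass, loss computation, backward pass, weight update, repeat) is the same one that trained every large language model in existence today. The difference is scale, not kind. MacMind has 1,216 parameters; GPT-4 has roughly a trillion. The math is identical. We live in an era in which AI affects nearly everyone, yet almost nobody understands how it actually works. MacMind is a demonstration that the process is knowable: that backpropagation and attention are math, not magic, and that the math does not care whether it runs on a TPU cluster or a 68000 processor from 1987. Everything can be inspected and modified. Changing the learning rate, swapping the training task, and resizing the model can all be done from within HyperCard's script editor. This is the engine with the hood up.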

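What It Learns

The bit-reversal permutation reorders a sequence by reversing the binary representation of each position index. For an 8-element sequence:

Position:  0    1    2    3    4    5    6    7
Binary:    000  001  010  011  100  101  110  111
Reversed:  000  100  010  110  001  101  011  111
Maps to:   0    4    2    6    1    5    3    7

So input [3, 7, 1, 9, 5, 2, 8, 4] becomes [3, 5, 1, 8, 7, 2, 9, 4]. This permutation is the first step of the Fast Fourier Transform (FFT), one of the most important algorithms in computing. The model is never told the rule. It discovers the positional pattern purely through self-attention and gradient descent, the same process that, scaled up enormously, taught larger models to understand language. After training, the attention map on Card 4 reveals the butterfly routing pattern of the FFT. The model independently discovered the same mathematical structure that Cooley and Tukey published in 1965.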
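The permutation described above can be sketched in a few lines of Python. This is an illustration of the rule the model has to learn, not code from the MacMind stack; the function names `bit_reverse` and `bit_reversal_permute` are made up for this example.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index` (e.g. 001 -> 100)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # shift in the lowest bit
        index >>= 1
    return result


def bit_reversal_permute(seq):
    """Return seq reordered so that output[i] = seq[bit_reverse(i)]."""
    n = len(seq)
    bits = n.bit_length() - 1
    if n == 0 or 1 << bits != n:
        raise ValueError("length must be a power of two")
    return [seq[bit_reverse(i, bits)] for i in range(n)]


mapping = [bit_reverse(i, 3) for i in range(8)]
print(mapping)                                          # [0, 4, 2, 6, 1, 5, 3, 7]
print(bit_reversal_permute([3, 7, 1, 9, 5, 2, 8, 4]))   # [3, 5, 1, 8, 7, 2, 9, 4]
```

Because reversing the bits twice restores the original index, applying the permutation a second time returns the original sequence.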

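The Stack

MacMind is a 5-card HyperCard stack:

Card 1 (Title): Project name and credits
Card 2 (Training): Train the model and watch it learn in real time
Card 3 (Inference): Test the trained model on any 8-digit input
Card 4 (Attention Map): Visualize the 8x8 attention weight matrix
Card 5 (About): Plain-text explanation of what the model is doing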

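Training (Card 2)

Click Train 10 to run 10 training steps, or Train to 100% to train until the model gets a perfect score on a sample. For deeper training, run Train 10 repeatedly or click Train to 100% again; training resumes where it left off. For a longer run, open the Message Box (Cmd-M) and type trainN 1000 to train for 1,000 steps straight. Each step generates a random 8-digit sequence, runs the full forward pass, computes the cross-entropy loss, backpropagates gradients through every layer, and updates all 1,216 weights. Progress bars, per-position accuracy, and the training log update in real time.

Note: The training log field has a 30,000-character limit (a HyperCard constraint). After roughly 900 steps the log fills up and HyperCard displays an error. To clear it and continue, open the Message Box (Cmd-M) and type:

put "" into card field "trainingLog"

Then type trainN 500 (or however many steps you want) to resume training.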
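The forward half of one such training step can be sketched in plain Python. The dimensions (10 digit tokens, 8 positions, width 16) and the weight names come from the architecture table in the original README; the random initialization, the helper names, and the omission of backpropagation are assumptions of this sketch, which does not reproduce MacMind's HyperTalk code.

```python
import math
import random

VOCAB, SEQ_LEN, D_MODEL = 10, 8, 16  # digits 0-9, 8 positions, model width 16


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random weights (an assumed initialization)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(rng):
    return {
        "W_embed": rand_matrix(VOCAB, D_MODEL, rng),
        "W_pos": rand_matrix(SEQ_LEN, D_MODEL, rng),
        "W_Q": rand_matrix(D_MODEL, D_MODEL, rng),
        "W_K": rand_matrix(D_MODEL, D_MODEL, rng),
        "W_V": rand_matrix(D_MODEL, D_MODEL, rng),
        "W_out": rand_matrix(D_MODEL, VOCAB, rng),
    }


def forward(params, digits):
    # Token embedding lookup + position embedding -> [8 x 16]
    x = [[e + p for e, p in zip(params["W_embed"][d], params["W_pos"][i])]
         for i, d in enumerate(digits)]
    q, k, v = (matmul(x, params[name]) for name in ("W_Q", "W_K", "W_V"))
    k_t = [list(col) for col in zip(*k)]
    # Scaled dot-product scores -> [8 x 8], softmax per row
    scores = [[s / math.sqrt(D_MODEL) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    context = matmul(weights, v)
    # Residual connection: context + embedded input
    residual = [[c + e for c, e in zip(crow, xrow)] for crow, xrow in zip(context, x)]
    probs = [softmax(row) for row in matmul(residual, params["W_out"])]
    return weights, probs


def cross_entropy(probs, targets):
    """Mean negative log-probability of the correct digit at each position."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


rng = random.Random(0)
params = init_params(rng)
n_params = sum(len(m) * len(m[0]) for m in params.values())
digits = [rng.randrange(VOCAB) for _ in range(SEQ_LEN)]
targets = [digits[j] for j in (0, 4, 2, 6, 1, 5, 3, 7)]  # bit-reversed order
weights, probs = forward(params, digits)
loss = cross_entropy(probs, targets)
print(n_params)  # 1216
print(loss)      # an untrained model should sit near ln(10), about 2.3
```

A full training step would follow this with the backward pass and an SGD update of every matrix in `params`; those are left out here to keep the sketch short.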

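Inference (Card 3)

After training, click New Random to generate a test input, then click Permute to run the trained model. The output row shows the model's predictions, and the confidence row shows how sure the model is about each position. To verify the result,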

Original Text (English)

MacMind

A complete transformer neural network implemented entirely in HyperTalk, trained on a Macintosh SE/30.

MacMind is a 1,216-parameter single-layer single-head transformer that learns the bit-reversal permutation -- the opening step of the Fast Fourier Transform -- from random examples. Every line of the neural network is written in HyperTalk, a scripting language from 1987 designed for making interactive card stacks, not matrix math. It has token embeddings, positional encoding, self-attention with scaled dot-product scores, cross-entropy loss, full backpropagation, and stochastic gradient descent. No compiled code. No external libraries. No black boxes.

Option-click any button and read the actual math.

Why This Exists

The same fundamental process that trained MacMind -- forward pass, loss computation, backward pass, weight update, repeat -- is what trained every large language model that exists today. The difference is scale, not kind. MacMind has 1,216 parameters. GPT-4 has roughly a trillion. The math is identical.

We are at a moment where AI affects nearly everyone but almost nobody understands what it actually does. MacMind is a demonstration that the process is knowable -- that backpropagation and attention are not magic, they are math, and that math does not care whether it is running on a TPU cluster or a 68000 processor from 1987.

Everything is inspectable. Everything is modifiable. Change the learning rate, swap the training task, resize the model -- all from within HyperCard's script editor. This is the engine with the hood up.

What It Learns

The bit-reversal permutation reorders a sequence by reversing the binary representation of each position index. For an 8-element sequence:

Position:  0    1    2    3    4    5    6    7
Binary:    000  001  010  011  100  101  110  111
Reversed:  000  100  010  110  001  101  011  111
Maps to:   0    4    2    6    1    5    3    7

So input [3, 7, 1, 9, 5, 2, 8, 4] becomes [3, 5, 1, 8, 7, 2, 9, 4].
This permutation is the first step of the Fast Fourier Transform, one of the most important algorithms in computing. The model is never told the rule. It discovers the positional pattern purely through self-attention and gradient descent -- the same process, scaled up enormously, that taught larger models to understand language.

After training, the attention map on Card 4 reveals the butterfly routing pattern of the FFT. The model independently discovered the same mathematical structure that Cooley and Tukey published in 1965.

The Stack

MacMind is a 5-card HyperCard stack:

Card                 Purpose
1 -- Title           Project name and credits
2 -- Training        Train the model and watch it learn in real time
3 -- Inference       Test the trained model on any 8-digit input
4 -- Attention Map   Visualize the 8x8 attention weight matrix
5 -- About           Plain-text explanation of what the model is doing

Training (Card 2)

Click Train 10 for 10 training steps, or Train to 100% to train until the model gets a perfect score on a sample. For deeper training, run Train 10 repeatedly or click Train to 100% again -- the model picks up where it left off. For a longer run, open the Message Box (Cmd-M) and type trainN 1000 to train for 1,000 steps straight.

Each step generates a random 8-digit sequence, runs the full forward pass, computes cross-entropy loss, backpropagates gradients through every layer, and updates all 1,216 weights. Progress bars, per-position accuracy, and a training log update in real time.

Note: The training log field has a 30,000 character limit (a HyperCard constraint). After roughly 900 steps the log will fill up and HyperCard will display an error. To clear it and continue, open the Message Box (Cmd-M) and type:

put "" into card field "trainingLog"

Then resume training with trainN 500 (or whatever number of steps you want).

Inference (Card 3)

After training, click New Random to generate a test input, then Permute to run the trained model.
The output row shows the model's predictions and the confidence row shows how sure it is about each position.

To verify the result, apply the bit-reversal permutation by hand. The output should rearrange the input positions in this order:

Output[0] = Input[0]    Output[4] = Input[1]
Output[1] = Input[4]    Output[5] = Input[5]
Output[2] = Input[2]    Output[6] = Input[3]
Output[3] = Input[6]    Output[7] = Input[7]

For example, input [3, 7, 1, 9, 5, 2, 8, 4] should produce [3, 5, 1, 8, 7, 2, 9, 4]. If the model is well-trained, every position will be correct with confidence above 90%.

Attention Map (Card 4)

The 8x8 grid visualizes which input positions the model attends to when producing each output position. After training, you should see the butterfly pattern: positions 0, 2, 5, 7 attend to themselves (fixed points of the permutation), while positions 1 and 4 attend to each other, and positions 3 and 6 attend to each other (swap pairs).

This is the same routing structure discovered by Cooley and Tukey in 1965 for the Fast Fourier Transform.

Figure: The classic FFT butterfly diagram (public domain).

The model discovers this structure independently through attention.
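The fixed points and swap pairs named above follow directly from the permutation, and an idealized attention map can be derived from them. The Python sketch below assumes a perfectly trained model whose attention is one-hot; a real trained map will be softer, and none of these names appear in the MacMind stack.

```python
def bit_reverse(index, bits=3):
    """Reverse the `bits`-bit binary representation of `index`."""
    return int(format(index, f"0{bits}b")[::-1], 2)


N = 8
target = [bit_reverse(i) for i in range(N)]  # input position read by each output

fixed_points = [i for i in range(N) if target[i] == i]
swap_pairs = [(i, target[i]) for i in range(N) if i < target[i]]
print(fixed_points)  # [0, 2, 5, 7]
print(swap_pairs)    # [(1, 4), (3, 6)]

# Idealized attention: row = output position, column = input position attended to.
ideal_attention = [[1 if col == target[row] else 0 for col in range(N)]
                   for row in range(N)]
for row in ideal_attention:
    print(" ".join("#" if w else "." for w in row))
```

Comparing the printed grid with Card 4 after training is a quick visual check: diagonal marks are the fixed points, and each off-diagonal pair of marks is one swap.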
Architecture

Component                        Dimensions   Parameters
Token embeddings (W_embed)       10 x 16      160
Position embeddings (W_pos)      8 x 16       128
Query projection (W_Q)           16 x 16      256
Key projection (W_K)             16 x 16      256
Value projection (W_V)           16 x 16      256
Output projection (W_out)        16 x 10      160
Total                                         1,216

Data flow:

    Input digits [8]
        |
    Token embedding lookup + position embedding --> [8 x 16]
        |
    Q, K, V projections --> [8 x 16] each
        |
    Attention scores = Q x K^T, scaled by 1/sqrt(16) --> [8 x 8]
        | softmax per row
    Attention weights --> [8 x 8]
        |
    Context = weights x V --> [8 x 16]
        |
    Residual connection: context + embedded input --> [8 x 16]
        |
    Output logits = residual x W_out --> [8 x 10]
        | softmax per position
    Predictions --> [8 x 10] probability distribution over digits

All weights and activations are stored as comma-delimited numbers in hidden HyperCard fields on Card 2. A 16x16 weight matrix is 256 comma-separated values in a single field. Save the stack, quit, reopen it: the trained model is still there.

Training on Real Hardware

MacMind was trained on a Macintosh SE/30 running System 7.6.1 and has also been tested through Basilisk II on Apple Silicon.

HyperTalk is interpreted, and every multiply, every field access, every variable lookup goes through the interpreter. Each training step takes several seconds. Training to convergence (~1,000 steps) takes hours. The model was left training overnight, grinding through backpropagation one 8 MHz multiply-accumulate at a time. By morning it had learned the permutation.

Requirements

HyperCard 2.0 or later is required. HyperCard 1.x evaluates arithmetic left-to-right without standard precedence (2 + 3 * 4 = 20 instead of 14), which would silently corrupt every matrix multiplication and gradient computation in the model. HyperCard 2.0 introduced standard mathematical operator precedence. The stack was built and tested with HyperCard 2.1.
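To see why left-to-right evaluation would be so damaging, the Python sketch below models that rule with a small hypothetical helper, `eval_left_to_right`, and applies it to a multiply-accumulate of the kind a dot product needs. It only imitates the behaviour described for HyperCard 1.x; it is not HyperTalk and makes no claim about how MacMind's scripts are written.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def eval_left_to_right(tokens):
    """Evaluate [number, op, number, op, ...] strictly left to right."""
    result = tokens[0]
    for op, operand in zip(tokens[1::2], tokens[2::2]):
        result = OPS[op](result, operand)
    return result


print(eval_left_to_right([2, "+", 3, "*", 4]))  # 20, the left-to-right result
print(2 + 3 * 4)                                # 14, standard precedence

# A dot product accumulated as `acc + x * y`, under both rules.
a, b = [1, 2, 3], [4, 5, 6]
dot_standard = 0
dot_ltr = 0
for x, y in zip(a, b):
    dot_standard = dot_standard + x * y
    dot_ltr = eval_left_to_right([dot_ltr, "+", x, "*", y])
print(dot_standard, dot_ltr)  # 32 198
```

No error is raised in the left-to-right case; the result is simply wrong, which is why the README describes the corruption as silent.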
HyperCard 2.1

                   Minimum                    MacMind Reference
HyperCard          2.0                        2.1
System software    System 7                   System 7.6.1
RAM                1 MB (2 MB recommended)    4 MB
Processor          68000                      68030 (Mac SE/30)

Also runs on: Mac OS 8, Mac OS 9, Mac OS X Classic Environment (through 10.4 Tiger on PowerPC)

On real vintage hardware, each training step takes several seconds and full training takes hours. On a modern Mac running Basilisk II or SheepShaver, performance is comparable -- HyperTalk interpretation is the bottleneck, not the host CPU.

Running It Yourself

Quick Start (pre-trained)

1. Download MacMind-Trained.img from Releases
2. Open it on your Mac running System 7 through Mac OS 9, or in an emulator (Basilisk II, SheepShaver, Mini vMac)
3. Double-click the MacMind stack
4. Navigate to Card 3 (Inference), click New Random, then Permute

Watch It Learn (blank stack)

1. Download MacMind-Blank.img from Releases
2. Open it on your Mac or in an emulator
3. Navigate to Card 2 (Training)
4. Click Train 10 for short runs, or Train to 100% to train until the model gets a perfect score on a sample. For a longer run, open the Message Box (Cm

Neural Networks • Transformers • Retro Computing • AI Education • Open Source