Hacker News • 114 days ago

The Best Open-Source Claude-Style Coding Agent, Built with JAX and TPUs

IMP

8/10

Key Summary

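Building on Karpathy's nanochat project, the newly released 'nanocode' library helps you train your own coding agent model using Anthropic's Constitutional AI approach. It is written in pure JAX and optimized for TPUs. Its headline feature is that a 1.3-billion-parameter (1.3B) coding agent model can be trained and reproduced for about $200 of TPU time, and Google's free TPU program can be used to get started.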

Translated Text

salmanmohammadi / nanocode

Introducing nanocode: The best Claude Code that $200 can buy. #1 · announced by salmanmohammadi in Announcements

Posted: Apr 5, 2026 · 0 comments

salmanmohammadi (Maintainer)

Hello, I'm very excited to share nanocode. This library shows you how to train your own Claude Code end-to-end. To a first approximation, it follows the simplest possible approach to training with Constitutional AI, the approach Anthropic uses to train its Claude models. We'll write our own SOUL.md, define the agentic interface the model uses to interact with the world, generate synthetic data, and use preference optimisation to align the model with our SOUL.
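This excerpt does not say which preference-optimisation objective nanocode uses. As one widely used example, the sketch below computes the Direct Preference Optimization (DPO) loss for a single chosen/rejected pair in plain Python. The function names, the `beta` value, and the log-probabilities are illustrative assumptions, not taken from nanocode.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Made-up log-probabilities, for illustration only.
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)  # policy identical to reference
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)  # policy now favours the chosen answer

print(f"baseline loss: {baseline:.4f}")
print(f"improved loss: {improved:.4f}")
```

When the policy matches the reference, both margins are zero and the loss equals log 2. The loss falls as the policy assigns relatively more probability to the chosen response than the reference does.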

nanocode is written entirely in JAX and designed to be trained on TPUs. I adapted the core training infrastructure and philosophy from Karpathy's incredible nanochat project, so if you're familiar with nanochat, nanocode should feel very similar.

Here is how my d24 1.3B-parameter nanocode turned out: (nanocode.mp4)

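You can get started for free through the Google TRC program, which gives you free access to pre-emptible TPUs for a month, and I think new Google Cloud accounts also get $300 in credits. I was fortunate to have access to the TRC program for 3 months for this project. My spot instances were rarely interrupted, and I could easily keep the same pod up for a week or more.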

You can reproduce nanocode-d24 (1.3B parameters) in about 9 hours in total on a TPU v6e-8 for $200, or train nanocode-d20 (477M parameters) in about 1.5 hours for $34. nanocode should also work out of the box on NVIDIA GPUs, but be aware that it has been highly optimised for TPUs.
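As a quick sanity check, the two quoted figures imply roughly the same hourly rate. The sketch below derives it. It assumes both runs used the same v6e-8 slice and that the "-8" suffix means 8 chips; the per-chip figure is a derived estimate, not a quoted price.

```python
def hourly_rate(total_cost_usd: float, hours: float) -> float:
    """Implied cost per hour for a training run."""
    return total_cost_usd / hours


CHIPS = 8  # assumption: a v6e-8 slice has 8 chips

# (total cost in USD, wall-clock hours) as quoted in the post
runs = {
    "nanocode-d24": (200.0, 9.0),
    "nanocode-d20": (34.0, 1.5),
}

for name, (cost, hours) in runs.items():
    rate = hourly_rate(cost, hours)
    print(f"{name}: ~${rate:.2f}/hour, ~${rate / CHIPS:.2f} per chip-hour")
```

Both runs work out to a little over $22 per hour, so the quoted costs are consistent with a single hourly price for the slice.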

Training nanocode: a friendly agentic coding partner

Andrej's original nanochat release post does a great job of explaining what we're doing here, and the commands you'll use in nanocode are virtually identical, so I'd recommend reading his post first. This post covers what we've done differently to elicit agentic coding behaviour from the model.

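Tokenization and Pre-training

The pre-training and tokenizer training process is almost identical to nanochat's. However, including additional coding data from The Stack-V2 at a 1:5 ratio in both the pre-training and tokenizer mixtures produced a stronger coding model and more efficient code tokenization, which helped enormously.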

Let's first download the dataset shards needed for tokenizer training and model pre-training:

    # We'll be training the d24, 1.3B-parameter model, but you can adapt MODEL_TAG for your model size.
    export NANOCODE_BASE_DIR="$HOME/.cache/nanocode"
    export MODEL_TAG=d24

    python -m data.pretrain -d fineweb-edu -n 300

    # The Stack has been pre-packed and sharded, similar to FineWeb.
    python -m data.pretrain -d the-stack-v2-dedup -n 60

Then run the tokenizer training script:

    python -m scripts.tok_train --max-chars=2000000000
    python -m scripts.tok_eval
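The shard counts in the download commands (300 for fineweb-edu, 60 for the-stack-v2-dedup) line up with the stated 1:5 mix. A minimal check, assuming shards from the two datasets are of comparable size (the post does not say so explicitly):

```python
from fractions import Fraction

# Shard counts passed via -n in the download commands.
FINEWEB_EDU_SHARDS = 300
STACK_V2_SHARDS = 60

# Ratio of code shards to web-text shards.
code_to_web = Fraction(STACK_V2_SHARDS, FINEWEB_EDU_SHARDS)

# Share of code in the combined mixture (only meaningful if shards are equal-sized).
code_share = Fraction(STACK_V2_SHARDS, STACK_V2_SHARDS + FINEWEB_EDU_SHARDS)

print(f"code : web = {code_to_web.numerator}:{code_to_web.denominator}")
print(f"code share of all shards = {float(code_share):.1%}")
```

Under that assumption, a 1:5 code-to-web ratio means code makes up one sixth of the downloaded shards.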

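For reference, we can compare against nanochat's tokenizer, which is identical to nanocode's apart from the addition of The Stack to the training mixture. (I've also added special tokens and templating logic to support more sophisticated tool calling, but more on that later.)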

Comparison: nanocode vs nanochat

Text type | Bytes | nanocode tokens | Ratio | nanochat tokens | Ratio | Relative diff (%)
News | 1819 | 407 | 4.47 | 375 | 4.85 | +7.9% (nanochat better)
Korean | 893 | 558 | 1.60 | (source text cut off here)
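The Ratio and Relative diff columns can be reproduced from the byte and token counts. In the sketch below, the ratio is bytes per token (higher means better compression). The diff formula, taken relative to the nanocode token count, is inferred from the published numbers rather than from the evaluation script. The "code" row comes from the full table in the original post.

```python
def compression_ratio(num_bytes: int, num_tokens: int) -> float:
    """Bytes per token; higher means the tokenizer compresses the text better."""
    return num_bytes / num_tokens


def relative_diff_pct(nanocode_tokens: int, nanochat_tokens: int) -> float:
    """Token-count difference relative to nanocode (inferred formula).

    Negative values mean nanocode needs fewer tokens than nanochat.
    """
    return 100.0 * (nanocode_tokens - nanochat_tokens) / nanocode_tokens


# text type -> (bytes, nanocode tokens, nanochat tokens)
rows = {
    "news": (1819, 407, 375),
    "code": (1259, 326, 492),
}

for name, (num_bytes, nanocode_tokens, nanochat_tokens) in rows.items():
    print(
        f"{name}: nanocode {compression_ratio(num_bytes, nanocode_tokens):.2f}, "
        f"nanochat {compression_ratio(num_bytes, nanochat_tokens):.2f}, "
        f"diff {relative_diff_pct(nanocode_tokens, nanochat_tokens):+.1f}%"
    )
```

For these two rows the computed values match the table: nanochat is about 7.9% more compact on news text, while nanocode needs about 50.9% fewer tokens than nanochat on code.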

Original Text (English)

salmanmohammadi / nanocode

Introducing nanocode: The best Claude Code that $200 can buy. #1

salmanmohammadi announced in Announcements · Apr 5, 2026 · 0 comments

salmanmohammadi (Maintainer)

I'm so excited to share nanocode. This is a library showing you how to train your own Claude Code end-to-end. To a first approximation, we will follow the simplest possible approach for training using Constitutional AI - the approach used by Anthropic to train their Claude models. We'll write our own SOUL.md, define the agentic interface which our model will use to interact with the world, generate synthetic data, and use preference optimisation to align the model with our SOUL.

nanocode is written entirely in JAX and designed to be trained using TPUs. I adapted the core training infrastructure and philosophy from Karpathy's incredible nanochat project, so if you're familiar with nanochat, nanocode should feel very similar.

This is how my d24 1.3B parameter nanocode turned out: nanocode.mp4

You can get started for free using the Google TRC program which gives you free access to pre-emptible TPUs for a month - and I think new Google Cloud accounts also get $300 in credits. I was fortunate to have access to the TRC program for 3 months for this project, and I found most of the time that my spot instances were rarely interrupted and I could easily have the same pod up for a week or more.

You can reproduce nanocode-d24 (1.3B params) in around 9 hours in total on a TPU v6e-8 costing $200, or train nanocode-d20 (477M params) in around 1.5 hours costing $34.

If you're using NVIDIA GPUs, nanocode should also work out of the box, but you should be aware that nanocode has been highly optimised for TPUs.

Training nanocode: a friendly agentic coding partner

Andrej's original release post for nanochat does a great job of explaining what we're doing here, and the commands you'll use in nanocode are virtually identical, so I'd recommend reading through his work first. I'll go over what we've done differently to elicit agentic coding behaviours from our model.

Tokenization and Pre-training

The pre-training and tokenizer training process is pretty much identical to nanochat's, but I found that including additional coding data from The Stack-V2 at a ratio of 1:5 in both the pre-training and tokenizer mixture resulted in a stronger coding model and more efficient code tokenization, which helped a ton.

Let's first download the dataset shards we'll need for tokenizer training and model pre-training:

    # we'll be training our d24, 1.3B parameter model. but you can adapt MODEL_TAG for your model size.
    export NANOCODE_BASE_DIR="$HOME/.cache/nanocode"
    export MODEL_TAG=d24

    python -m data.pretrain -d fineweb-edu -n 300
    # I've pre-packed and sharded The Stack similar to FineWeb
    python -m data.pretrain -d the-stack-v2-dedup -n 60

And kick off our tokenizer training script:

    python -m scripts.tok_train --max-chars=2000000000
    python -m scripts.tok_eval

For reference, we can compare with nanochat's tokenizer which is identical aside from the addition of The Stack in the training mixture (well, I've also added special tokens and templating logic to support more sophisticated tool calling, but more on that later).
    Comparison: nanocode vs nanochat
    ===============================================================================================
    Text Type  Bytes    nanocode           nanochat           Relative   Better
                        Tokens   Ratio     Tokens   Ratio     Diff %
    -----------------------------------------------------------------------------------------------
    news       1819     407      4.47      375      4.85      +7.9%      nanochat
    korean     893      558      1.60      712      1.25      -27.6%     nanocode
    code       1259     326      3.86      492      2.56      -50.9%     nanocode
    math       1834     922      1.99      966      1.90      -4.8%      nanocode
    science    1112     259      4.29      228      4.88      +12.0%     nanochat
    fwe-train  4208518  902950   4.66      856883   4.91      +5.1%      nanochat
    fwe-val    4495276  975403   4.61      1010352  4.86      -3.6%      nanocode

We can see that this gives a big boost for code at the cost of general text tokenization efficiency, but this is okay since we want our model to do one thing very well: agentic coding.

Our models are trained with a param:data ratio of 8 (following nanochat's scaling law analysis). Let's kick off a training run like so:

    python -u -m scripts.base_train \
        --batch-size=32 \
        --minibatch-size=1 \
        --config=configs.d24 \
        --eval-every=500 \
        --sample-every=500

You should see something like this:

    Vocab size: 32768
    World size: 8
    1342.17728M model parameters
    67.108864M wte parameters
    1207.959552M h parameters
    67.108864M lm_head parameters
    Training on 10737418240 tokens over 10241 steps
    ====================
    Estimated FLOPs per token: 10066329600
    Scaling the LR for the AdamW parameters ∝1/√(2048/768) = 0.612372
    Step: 0/10241 | Loss: 10.398 | dt: 104.58s | | tkps: 10026 | mfu: 1.37 | ETA: -1.0 min | lr_multiplier: 1.000
    Peak bytes reserved/limit: 14.86/22.27
    Step: 1/10241 | Loss: 9.771 | dt: 2.74s | | tkps: 382082 | mfu: 52.37 | ETA: -1.0 min | lr_multiplier: 1.000
    Step: 2/10241 | Loss: 8.209 | dt: 2.74s | | tkps: 382220 | mfu: 52.39 | ETA: 234.1 min | lr_multiplier: 1.000
    Step: 3/10241 | Loss: 7.327 | dt: 2.74s | | tkps: 382193 | mfu: 52.39 | ETA: 312.1 min | lr_multiplier: 1.000
    ...
    fwe_bpb: 0.7626 | sv2_bpb: 0.4356 | avg_bpb: 0.5991 | dt: 90.53s
    <|bos|>The capital of France is Paris. It is the largest city in France and the most populous city in
    <|bos|>The chemical symbol of gold is Au. Gold is a soft, malleable, yellow metal that is
    <|bos|>The closest planet to the Sun is Mercury, which is the smallest planet in the solar system. It is the closest
    <|bos|>The opposite of hot is cold. The opposite of cold is heat. The opposite of heat is cold.
    <|bos|>The second-last day of the week is the day of the Lord. (Leviticus 23:2)
    ...
    CORE metric: 0.2352 | dt: 56.86s
    Total training time: 467.15min

Our model has attained some knowledge about the world, which is nice. It still doesn't know about Saturday though :).

Let's look at some more thorough quantitative results, since we only estimate metrics using a smaller subset of the evaluation data during training:

    python -u -m scripts.base_eval --checkpoint=base --minibatch-size=8

This will print a whole bunch of metrics, but the relevant ones are bits-per-byte across our pretraining sets: sv2 (The Stack V2) and fwe (FineWeb_EDU), and the CORE metric which makes comparing against nanochat's results and GPT-2 straightforward. I've compiled the results across a few model parameter sizes to get a feel for our scaling laws:

| depth | params | CORE  | cost | time    | MFU   | fwe bpb | sv2 bpb |
| ----- | ------ | ----- | ---- | ------- | ----- | ------- | ------- |
| d12   | 135M   | 0.090 | $3   | 9 min   | 17.4% | 0.956   | 0.689   |
| d20   | 477M   | 0.170 | $30  | 1.4 hrs | 45.2% | 0.838   | 0.533   |
| d24   | 1.3B   | 0.227 | $200 | 9.3 hrs | 52.5% | 0.759   | 0.445   |

Since CORE measures general language reasoning capabilities and we've geared our models towards code data, it's expected that our CORE scores drop slightly compared to the corresponding GPT-2 models. Training d24 on FineWeb-EDU alone resulted in a CORE score of 0.261 which lines up with GPT-2 XL below and nanochat-d24. The tradeoff here is that we expect our models to perform well in coding tasks.

| model        | params | CORE  |
| ------------ | ------ | ----- |
| GPT-2 Small  | 124M   | 0.114 |
| GPT-2 Medium | 355M   | 0.185 |
| GPT-2 Large  | 774M   | 0.215 |
| GPT-2 XL     | 1.6B   | 0.257 |

I'll mostly be referring to our d24 model throughout this post, which is similar to nanochat's d24 model but is trained with twice the context length (4096 vs. 2048) to better sup
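Several numbers in the d24 training log can be cross-checked with a few lines of arithmetic. The sketch below is a reconstruction, not nanocode's own code. It assumes a standard Transformer block with 12·d² parameters per layer, a FLOPs estimate of 6 FLOPs per matmul parameter plus 12·L·d·T for attention, and a peak of 918 TFLOP/s (bf16) per v6e chip. These assumptions were chosen because they reproduce the logged values.

```python
import math

# Values stated in the post or printed in its training log.
VOCAB_SIZE = 32768
DEPTH = 24                       # the "d24" model
CONTEXT_LEN = 4096
PARAM_DATA_RATIO = 8
WORLD_SIZE = 8
LOGGED_TOKENS_PER_SEC = 382082   # "tkps" at step 1

# Inferred values (assumptions, not read from the nanocode source).
MODEL_DIM = 67_108_864 // VOCAB_SIZE   # wte parameters / vocab size
PEAK_FLOPS_PER_CHIP = 918e12           # assumed bf16 peak of one v6e chip

# Parameter counts.
wte_params = VOCAB_SIZE * MODEL_DIM
lm_head_params = VOCAB_SIZE * MODEL_DIM
h_params = 12 * DEPTH * MODEL_DIM ** 2   # 4d^2 attention + 8d^2 MLP per layer
total_params = wte_params + h_params + lm_head_params

# Token budget from the param:data ratio of 8.
train_tokens = PARAM_DATA_RATIO * total_params

# FLOPs per token: matmul parameters plus the attention term.
flops_per_token = (
    6 * (h_params + lm_head_params)
    + 12 * DEPTH * MODEL_DIM * CONTEXT_LEN
)

# Learning-rate scaling relative to a width of 768.
lr_scale = 1 / math.sqrt(MODEL_DIM / 768)

# Model FLOPs utilisation implied by the logged throughput.
mfu_pct = (
    100 * LOGGED_TOKENS_PER_SEC * flops_per_token
    / (WORLD_SIZE * PEAK_FLOPS_PER_CHIP)
)

print(f"model dim:        {MODEL_DIM}")
print(f"total parameters: {total_params / 1e6:.5f}M")
print(f"training tokens:  {train_tokens}")
print(f"FLOPs per token:  {flops_per_token}")
print(f"LR scale:         {lr_scale:.6f}")
print(f"MFU:              {mfu_pct:.2f}%")
```

Under these assumptions the script reproduces the logged parameter count, token budget, FLOPs-per-token estimate, learning-rate scale, and the step-1 MFU, which suggests the log follows the same conventions as nanochat.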

Open source, JAX, AI coding agent, model training, TPU