Recreating ChatGPT and LLM Training (for under $20)
I was inspired by Andrej Karpathy’s nanochat to recreate ChatGPT and train an LLM from scratch. I tried to keep the budget under $20 using an NVIDIA A100 on Lambda.
Setup
The Transformer has 12 layers, which is small enough to train in about an hour on a single GPU instance.
To count the parameters, we first need the embedding dimensionality, which is the number of layers times the aspect ratio (GPT-2 and Karpathy both use 64): d = 12 * 64 = 768. Each layer has 4 attention matrices of 768x768: one each for Q, K, and V, plus the output projection. That is 4 * d^2 = 2,359,296 parameters. The feed-forward network has a hidden width of 4x the dimensionality, so it has a 768x3072 fully connected layer and a 3072x768 projection layer, another 4,718,592 parameters. So each layer has about 7 million parameters, or about 85 million across the 12 layers. We also need embeddings for the input and the output, each the 50,257-token vocab times the dimensionality: about 38.6 million * 2. The total is about 162 million parameters.
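The arithmetic above can be checked with a few lines of Python. This is a rough sketch, not nanochat's own accounting: the function name `count_params` is made up for illustration, it assumes the GPT-2 tokenizer's 50,257-entry vocabulary, and it ignores biases and normalization parameters.

```python
def count_params(n_layer=12, aspect_ratio=64, vocab_size=50257):
    """Rough weight count; ignores biases and norm parameters (assumption)."""
    d = n_layer * aspect_ratio                # embedding dimensionality
    attn = 4 * d * d                          # Q, K, V and output projection
    mlp = 2 * d * (4 * d)                     # d x 4d up, then 4d x d down
    per_layer = attn + mlp
    embed = vocab_size * d                    # one vocab x d table
    total = n_layer * per_layer + 2 * embed   # untied: input + output tables
    return {"d": d, "attn": attn, "mlp": mlp, "per_layer": per_layer,
            "embed": embed, "total": total}


counts = count_params()
for name, value in counts.items():
    print(f"{name:>9}: {value:,}")
```

With these defaults the two embedding tables account for roughly 77 million of the 162 million weights, which is a large share for a model this small.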
Following the Chinchilla scaling law, we need ~20 tokens per parameter in the training set, which is about 3.2 billion tokens. With a batch size of 0.5M tokens, that works out to ~6k training iterations.
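As a quick check of the token budget, here is a small sketch. The helper name `training_budget` is hypothetical, the parameter count of 162,129,408 assumes the weight count above with a 50,257-token vocabulary, and "0.5M" is read as 2**19 = 524,288 tokens per batch, which is an assumption on my part.

```python
import math


def training_budget(n_params, tokens_per_param=20, batch_tokens=2**19):
    """Return (total training tokens, optimizer steps) for a Chinchilla-style budget."""
    total_tokens = n_params * tokens_per_param
    n_iters = math.ceil(total_tokens / batch_tokens)
    return total_tokens, n_iters


# 162,129,408 is the rough weight count for the 12-layer model (assumption).
total_tokens, n_iters = training_budget(162_129_408)
print(f"{total_tokens:,} tokens over {n_iters:,} iterations")
```

If the batch is exactly 500,000 tokens instead, the step count moves up by a few hundred but stays in the same ~6k range.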
Data: FineWeb-EDU
I am using FineWeb-EDU, the same dataset nanochat uses, because for smaller datasets its tokens are higher quality than FineWeb's.
My data step streams FineWeb-EDU from Hugging Face, tokenizes it with the gpt2 BPE tokenizer, and writes flat uint16 token files that the trainer memory-maps. We can always come back later and train a tokenizer for more efficiency.
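The flat-file format is simple enough to sketch with the standard library alone. This is not my actual data script: streaming and tokenization are left out, the token ids are placeholders, and the names `write_tokens` and `read_window` are made up for illustration. It only shows why uint16 is enough (50,257 < 65,536) and how a reader can memory-map the file and slice out a window of tokens.

```python
import mmap
import os
import tempfile
from array import array

GPT2_VOCAB_SIZE = 50257  # every id fits in uint16 because 50257 < 65536


def write_tokens(path, token_ids):
    """Append token ids to a flat file of native-endian uint16 values."""
    if any(t < 0 or t >= GPT2_VOCAB_SIZE for t in token_ids):
        raise ValueError("token id outside the GPT-2 vocabulary")
    with open(path, "ab") as f:
        array("H", token_ids).tofile(f)


def read_window(path, start, length):
    """Memory-map the file and return `length` tokens from token index `start`."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # View the mapped bytes as uint16 without copying the whole file.
            with memoryview(mm) as raw, raw.cast("H") as view:
                return list(view[start:start + length])


# Usage: two appends, then a read that straddles the boundary between them.
shard = os.path.join(tempfile.mkdtemp(), "shard_000.bin")
write_tokens(shard, [15496, 995, 50256])   # placeholder ids
write_tokens(shard, [1, 2, 3])
window = read_window(shard, 2, 3)
print(window)
```

In a real trainer the usual equivalent is `numpy.memmap` with `dtype=numpy.uint16`, which gives the same zero-copy view as an array.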
Architecture
Rotary position embeddings. RoPE encodes position by rotating queries and keys instead of adding a learned position vector. Two consequences downstream. There is no positional table to store or to run off the end of, and you can stretch context past the training length by changing the rotation base. And because position lives inside Q and K, the KV cache stores keys that are already rotated, so cached entries stay correct as you decode. RoPE and the KV cache fit together by design.
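To make the rotation concrete, here is a minimal pure-Python sketch. It assumes the adjacent-pair layout and the common base of 10000. Real implementations, including nanochat's, may pair dimensions differently and operate on tensors. The function names are illustrative only. The point it demonstrates is that the score between a rotated query and a rotated key depends only on their position offset, which is why keys can be rotated once and cached.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d)."""
    d = len(vec)
    if d % 2:
        raise ValueError("head dimension must be even")
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Keys are rotated once, at their own position, and cached in that form.
k_cache = [rope(k, n) for n in range(8)]

# A later query is rotated at its own position only; the cache is not touched.
score_near = dot(rope(q, 5), k_cache[3])      # offset of 2
score_far = dot(rope(q, 102), rope(k, 100))   # same offset, far positions
print(score_near, score_far)
```

The two scores agree up to floating-point error, even though the absolute positions differ by almost a hundred.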
QK-norm. RMS-normalizing the queries and keys before attention bounds the size of the attention logits. At this scale it is mostly a training-stability win. The inference angle is quieter: bounded logits make low-precision attention less likely to overflow, which is exactly what you want when you are trying to run the score computation in fp8 or int8 to make decode cheaper.
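The bound is easy to see in a toy example. The sketch below assumes RMS norm with no learned gain and the usual 1/sqrt(d) score scaling. Under those assumptions each normalized vector has length at most sqrt(d), so the scaled logit can never exceed sqrt(d) in magnitude, however large the raw activations are. The function names are made up for illustration.

```python
import math


def rms_norm(x, eps=1e-6):
    """Divide by the root-mean-square of x (no learned gain; an assumption)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def attn_logit(q, k):
    """Scaled dot-product score for one query/key pair."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


# Deliberately large activations to show the effect.
q = [1000.0, -2000.0, 3000.0, 500.0]
k = [900.0, -2500.0, 2800.0, 100.0]

raw = attn_logit(q, k)
normed = attn_logit(rms_norm(q), rms_norm(k))
bound = math.sqrt(len(q))
print(f"raw={raw:.1f} normed={normed:.4f} bound={bound:.1f}")
```

The raw score is in the millions, while the normalized one stays under the bound of 2 for this 4-dimensional toy head.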
Grouped-query attention. This one is pure inference economics, which is why I left it in as an option even though the small models do not need it. During decode the KV cache is what eats memory and bandwidth, and its size scales with the number of key/value heads. Dropping from one KV head per query head down to a small shared group shrinks the cache by that ratio with little quality loss. Decode is memory-bandwidth bound, so a smaller cache is close to a direct speedup at long context and large batch. Anyone who has read a decode roofline reaches for GQA.
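A back-of-the-envelope cache calculation shows the ratio. The head counts here are assumptions for illustration, since the post does not fix them: 12 query heads of dimension 64 (so 12 * 64 = 768), a 2048-token context, batch 1, and bf16 cache entries. The helper `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(n_layer, n_kv_head, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes held by the KV cache: one K and one V tensor per layer."""
    return 2 * n_layer * n_kv_head * head_dim * seq_len * batch * bytes_per_elem


# Assumed shapes: 12 layers, head_dim 64, 2048-token context, bf16 entries.
mha = kv_cache_bytes(n_layer=12, n_kv_head=12, head_dim=64, seq_len=2048)
gqa = kv_cache_bytes(n_layer=12, n_kv_head=4, head_dim=64, seq_len=2048)

print(f"MHA: {mha / 2**20:.0f} MiB, GQA: {gqa / 2**20:.0f} MiB, ratio {mha // gqa}x")
```

Going from 12 KV heads to 4 cuts the cache by 3x, and because every decode step reads the whole cache, the bytes moved per token drop by the same factor.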
Untied embeddings. nanochat does not share the input and output embedding matrices, so there are two vocab x dim matrices instead of one. At a 50k-plus vocab that is real parameters and real weight-load bandwidth, and the output matrix is a large matmul in prefill and again every step of decode. Worth keeping in your head when you account for where the FLOPs and the bytes actually go.
Logit soft-cap. A tanh squashes the logits into a fixed range before the loss. Cheap, and it keeps the final big matmul’s outputs well-behaved for the same low-precision reasons as QK-norm.
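The soft-cap is a one-liner. In this sketch the cap value of 15 is an assumption chosen for illustration, and `soft_cap` is a made-up name. Small logits pass through almost unchanged, while large ones saturate at the cap.

```python
import math


def soft_cap(logits, cap=15.0):
    """Squash logits smoothly into [-cap, cap] with a scaled tanh."""
    return [cap * math.tanh(z / cap) for z in logits]


logits = [-1000.0, -20.0, -0.1, 0.0, 0.1, 20.0, 1000.0]
capped = soft_cap(logits)
for z, c in zip(logits, capped):
    print(f"{z:>8.1f} -> {c:>9.5f}")
```

Because tanh is monotonic, the ordering of the logits is preserved; only the extremes are compressed.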
bf16 activations, fp32 master weights. The model keeps fp32 weights for the optimizer but casts to bf16 inside the matmuls. Standard training setup, and also the cleanest version of the precision boundary you redraw later for inference, where you push activations and often weights down to fp8 or int8 and keep high precision only where the dynamic range demands it.
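Why keep fp32 master weights at all? A toy emulation in pure Python makes the reason visible. This is not how a framework implements mixed precision: `to_fp32` and `to_bf16` are hypothetical helpers that round a Python float through the bit layouts using `struct`, and they do not handle NaN or infinity. Near 1.0 the spacing between bf16 values is 2^-7, so a small update of 1e-4 is rounded away every step if the weight itself lives in bf16.

```python
import struct


def to_fp32(x):
    """Round a Python float (64-bit) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round to the nearest bfloat16 value (ties to even); NaN/inf not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias on the 16 dropped bits
    bits &= 0xFFFF0000                    # keep sign, exponent, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


master = 1.0      # fp32 master weight kept for the optimizer
bf16_only = 1.0   # what happens if the weight is stored in bf16 instead
update = 1e-4

for _ in range(100):
    master = to_fp32(master + update)
    bf16_only = to_bf16(bf16_only + update)

print(f"fp32 master: {master:.6f}")
print(f"bf16 cast of master (used in matmuls): {to_bf16(master)}")
print(f"bf16-only weight: {bf16_only}")
```

The fp32 master accumulates the hundred small steps and its bf16 cast eventually moves to the next representable value, while the bf16-only weight never leaves 1.0.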


