Implementing a Transformer From Scratch

Jan 05, 2026

I wanted to analyze Andrej Karpathy’s nanoGPT implementation of the Transformer in detail, using the original Transformer paper ("Attention Is All You Need") as a reference.

Embeddings, positions, and weights

The PyTorch nn.Embedding for the token embedding operates as a lookup table, with one row per token in the vocabulary, so the matrix has shape vocab_size x n_embedding. Here n_embedding (n_embd in nanoGPT) is a hyperparameter that sets the width of the residual stream.

The second embedding matrix encodes position information as a learned embedding, with one row per position in the context window. This differs from the Transformer paper, which encodes positions with fixed sine and cosine functions rather than learned parameters.

Adding the token embedding and the position embedding elementwise gives one vector per token that carries both what the token is and where it sits in the sequence.
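The lookup and the sum can be sketched as follows. This is a minimal illustration, not nanoGPT's exact code: the sizes are toy values chosen for readability, and the names wte and wpe are borrowed from GPT-2's naming convention.

```python
import torch
import torch.nn as nn

# Assumed toy sizes for illustration only.
vocab_size, block_size, n_embd = 100, 16, 8

wte = nn.Embedding(vocab_size, n_embd)   # token embedding: one row per token id
wpe = nn.Embedding(block_size, n_embd)   # learned position embedding: one row per position

idx = torch.randint(0, vocab_size, (2, 5))   # (B, T) batch of token ids
B, T = idx.shape
pos = torch.arange(T)                        # positions 0..T-1, shape (T,)

tok_emb = wte(idx)      # (B, T, n_embd): row lookup per token id
pos_emb = wpe(pos)      # (T, n_embd): same for every sequence in the batch
x = tok_emb + pos_emb   # (B, T, n_embd): broadcast add, input to the first block
```

The position embedding has no batch dimension; broadcasting adds the same position vector to every sequence in the batch.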

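The output matrix used to produce the logits before softmax is tied to the token embedding table to save parameters. The saving is significant because the embedding table holds a large share of the parameters at this scale: with GPT-2's 50257-token vocabulary and a width of 768 it is about 38.6M parameters, roughly 30% of the 124M in GPT-2 small.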
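A small sketch of weight tying, assuming toy sizes and the hypothetical variable names wte and lm_head. The point is that both modules hold the same Parameter object, so the matrix is stored and trained once.

```python
import torch
import torch.nn as nn

# Assumed toy sizes for illustration only.
vocab_size, n_embd = 100, 8

wte = nn.Embedding(vocab_size, n_embd)                # weight shape: (vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)   # weight shape: (vocab_size, n_embd)

# Tie the two: both modules now reference one shared Parameter.
lm_head.weight = wte.weight

x = torch.randn(2, 5, n_embd)   # stand-in for the final hidden states
logits = lm_head(x)             # (2, 5, vocab_size)

# Count each distinct Parameter once to see the saving.
unique_params = {id(p): p for m in (wte, lm_head) for p in m.parameters()}
n_params = sum(p.numel() for p in unique_params.values())
```

This works without a transpose because nn.Linear stores its weight as (out_features, in_features), which matches the embedding table's (vocab_size, n_embd) layout.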

LayerNorm and residual stream

LayerNorm normalizes the activations going into each attention and FFN sublayer, which adds stability between layers during training. The Transformer vectors have a dimensionality of 768, so for each token, take the mean and variance over the 768 features of its vector and rescale the activations so that the mean is 0 and the standard deviation is 1. A learned per-feature gain and bias are then applied. Since this is done per token, it does not depend on the composition of the batch.
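To make the per-token statistics concrete, here is a sketch that computes LayerNorm by hand and compares it against PyTorch's F.layer_norm. The batch shape and the names gamma and beta are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 5, 768)   # (B, T, C): 768 features per token

# Statistics are taken over the last dimension only, separately for each token.
mean = x.mean(dim=-1, keepdim=True)                 # (B, T, 1)
var = x.var(dim=-1, keepdim=True, unbiased=False)   # biased variance, as LayerNorm uses
eps = 1e-5
x_hat = (x - mean) / torch.sqrt(var + eps)

# Learned per-feature gain and bias, shown here at their initial values.
gamma = torch.ones(768)
beta = torch.zeros(768)
y = gamma * x_hat + beta

# Reference result from PyTorch's built-in implementation.
reference = F.layer_norm(x, (768,), weight=gamma, bias=beta, eps=eps)
```

Because no statistic crosses the batch or sequence dimension, the output for one token is unchanged if the other tokens in the batch change.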

The implementation used by GPT-2 and Karpathy uses Pre-LayerNorm, where LayerNorm is applied to the input of each sublayer, before the result is added back to the residual stream. This differs from the Transformer paper, which uses Post-LayerNorm, where LayerNorm is applied after the residual addition. The difference in the residual stream in code would be something like this:

# Pre-LN (GPT-2, nanoGPT)
x = x + attn(ln(x))
x = x + mlp(ln(x))
x = x + attn(ln(x))

# Post-LN (original Transformer paper)
x = ln(x + attn(x))
x = ln(x + mlp(x))
x = ln(x + attn(x))

Since we’re using Pre-LayerNorm, the residual stream itself is never normalized, so LayerNorm has to be applied a final time before the projection that produces the output logits.
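A minimal sketch of a Pre-LN stack followed by a final LayerNorm. The class name PreLNBlock is hypothetical, and the attention sublayer is replaced by a plain linear layer as a placeholder so the example stays short. Only the placement of the norms is the point here.

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Hypothetical Pre-LN block; attn is a placeholder, not real attention."""

    def __init__(self, n_embd):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = nn.Linear(n_embd, n_embd)   # stand-in for self-attention
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        # Normalize the sublayer input; the residual path itself is left untouched.
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

n_embd = 8   # assumed toy width
blocks = nn.ModuleList([PreLNBlock(n_embd) for _ in range(2)])
ln_f = nn.LayerNorm(n_embd)   # final LayerNorm applied once after the last block

x = torch.randn(2, 5, n_embd)
for block in blocks:
    x = block(x)
out = ln_f(x)   # normalized hidden states, ready for the logit projection
```

Each block has two separate LayerNorm modules, one per sublayer, since each carries its own learned gain and bias.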

Causal self-attention

Scaled dot-product attention, QK^T/sqrt(d_k); the variance argument for the scale

The fused QKV projection and the reshape to (B, nh, T, hs)

Multi-head: why several subspaces beat one wide head

Causal masking as the autoregressive constraint: tril, is_causal

c_proj mixing heads back together (W^O)
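The points above can be sketched in one pass. This is an illustrative version under assumed toy sizes, not nanoGPT's exact module: c_attn and c_proj follow GPT-2's naming, dropout is omitted, and the result is compared against PyTorch's F.scaled_dot_product_attention (available in PyTorch 2.0 and later).

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C, n_head = 2, 5, 8, 2   # assumed toy sizes
hs = C // n_head               # head size

c_attn = nn.Linear(C, 3 * C)   # fused projection producing Q, K, V in one matmul
c_proj = nn.Linear(C, C)       # output projection (W^O) that mixes the heads

x = torch.randn(B, T, C)
q, k, v = c_attn(x).split(C, dim=2)            # each (B, T, C)
q = q.view(B, T, n_head, hs).transpose(1, 2)   # (B, nh, T, hs)
k = k.view(B, T, n_head, hs).transpose(1, 2)
v = v.view(B, T, n_head, hs).transpose(1, 2)

# Scaled dot-product scores: QK^T / sqrt(d_k), with d_k equal to the head size.
att = (q @ k.transpose(-2, -1)) / math.sqrt(hs)   # (B, nh, T, T)

# Causal mask: position t may only attend to positions <= t.
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
att = att.masked_fill(~mask, float("-inf"))
att = F.softmax(att, dim=-1)

y = att @ v                                        # (B, nh, T, hs)
y = y.transpose(1, 2).contiguous().view(B, T, C)   # concatenate heads side by side
out = c_proj(y)                                    # (B, T, C)

# The same attention computed by the fused kernel with is_causal=True.
y_fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

On the scale: if the entries of q and k are independent with zero mean and unit variance, their dot product over d_k entries has variance d_k, so dividing by sqrt(d_k) brings the scores back to unit variance and keeps the softmax from saturating.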

