Implementing a Transformer From Scratch
Almost every transformer people meet today is a decoder-only stack: GPT and everything downstream of it. But the paper that started all of it, “Attention Is All You Need,” describes something different. It is an encoder-decoder model built for machine translation. I implemented it by hand in PyTorch, no nn.Transformer, because writing a paper out in code is the fastest way to find out which of its decisions were load-bearing and which were incidental to 2017.
The short version up front: the core survived almost untouched. Attention, the residual stream, and one specific scaling factor are still the spine of every model we run. Most of what changed since is repair work around the edges, done for training stability and serving cost. Here is the build, and the deltas.
The shape
The model is two stacks. The encoder reads the source sequence and produces a set of context vectors, one per source token, all at once. The decoder generates the target sequence one token at a time, and at each step it gets to look at two things: the target tokens it has produced so far, and the encoder’s output.
That second part is the piece a decoder-only model does not have. The full model runs attention three different ways. The encoder attends to itself (encoder self-attention), the decoder attends to its own past tokens (masked self-attention), and the decoder attends to the encoder output (cross-attention). The important implementation insight is that these are not three mechanisms. They are one mechanism called with different arguments. So you write multi-head attention once and reuse it everywhere:
# encoder self-attention: query, key, value all from x
self.self_attn(x, x, x, src_mask)
# decoder cross-attention: query from the decoder, key/value from encoder memory
self.cross_attn(x, memory, memory, src_mask)
Cross-attention is just self-attention where the keys and values come from somewhere else. Once that clicks, the whole architecture collapses into a small number of reused parts.
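To make the reuse concrete, here is a minimal sketch of a multi-head attention module that serves both calls. It is not the code from my implementation: the class name, the projection names (w_q, w_k, w_v, w_o), and the toy dimensions are all assumptions chosen for illustration, and dropout is left out.

```python
import math
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """One attention module, callable as self-attention or cross-attention."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must divide evenly across heads"
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def _split(self, x):
        # (B, L, d_model) -> (B, n_heads, L, d_k)
        b, length, _ = x.shape
        return x.view(b, length, self.n_heads, self.d_k).transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        q = self._split(self.w_q(query))
        k = self._split(self.w_k(key))
        v = self._split(self.w_v(value))
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # mask is (B, 1, L_k) or (B, L_q, L_k); add a head axis so it broadcasts
            scores = scores.masked_fill(mask.unsqueeze(1) == 0, -1e9)
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous()  # (B, L_q, n_heads, d_k)
        b, length = out.shape[:2]
        return self.w_o(out.view(b, length, -1))


mha = MultiHeadAttention(d_model=32, n_heads=4)
x = torch.randn(2, 5, 32)                          # decoder states (toy sizes)
memory = torch.randn(2, 7, 32)                     # encoder output (toy sizes)
src_mask = torch.ones(2, 1, 7, dtype=torch.bool)   # no padding in this toy batch

self_out = mha(x, x, x)                            # self-attention
cross_out = mha(x, memory, memory, src_mask)       # cross-attention
print(self_out.shape, cross_out.shape)
```

In a real model the self-attention and cross-attention sublayers would be separate instances with their own weights. The single instance here only shows that the same code path handles both.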
Scaled dot-product attention, and the square root
The center of the whole thing is four lines:
scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)
scores = scores.masked_fill(mask == 0, -1e9)
attn = scores.softmax(dim=-1)
return attn @ value, attn
The one part of this that people skip past is the division by the square root of d_k. It is not cosmetic. The dot product of two vectors of dimension d_k, whose components are independent with zero mean and unit variance, has variance d_k, so its typical magnitude grows like the square root of d_k. Without the scaling, as you widen the head dimension the raw scores grow, the softmax saturates, it puts almost all its mass on one key, and the gradient through it goes to nearly zero. Dividing by the square root puts the variance back to roughly one so the softmax stays in a regime where it can actually learn. This is the kind of detail that is invisible when you import the module and unavoidable when you write it.
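The variance claim is easy to check numerically. The sketch below uses only the standard library and assumes the idealized case of independent standard-normal components, which real queries and keys only approximate. The helper name dot_variance and the trial count are my own choices for the demo.

```python
import random
import statistics

random.seed(0)


def dot_variance(d_k, trials=2000):
    """Empirical variance of q . k for independent standard-normal q and k."""
    dots = []
    for _ in range(trials):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pvariance(dots)


for d_k in (16, 64, 256):
    raw = dot_variance(d_k)
    # Dividing the scores by sqrt(d_k) divides their variance by d_k.
    print(f"d_k={d_k:4d}  raw variance={raw:7.1f}  scaled variance={raw / d_k:.2f}")
```

The raw variance should track d_k and the scaled variance should sit near one at every width, up to sampling noise.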
The parts that are easy to get subtly wrong
The attention math is the part everyone remembers. The parts that actually cost you an afternoon are the boring ones around it.
Masking is the worst offender, because there are two different masks doing two different jobs and they have to broadcast correctly through the head dimension. The encoder needs a padding mask so real tokens do not attend to padding. The decoder needs that same padding constraint plus a causal constraint, so position i cannot see position i+1 or anything after it. Those two combine with a logical AND:
def make_tgt_mask(tgt, pad_idx):
    pad = (tgt != pad_idx).unsqueeze(1)      # (B, 1, tgt_len)
    causal = subsequent_mask(tgt.size(1))    # (1, tgt_len, tgt_len)
    return pad & causal                      # (B, tgt_len, tgt_len)
Get the broadcasting shapes wrong by one dimension and the model still runs, still trains, and silently leaks future tokens into the present. There is no exception, just a loss that looks suspiciously good. The only way to catch it is to understand exactly what each axis means.
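Because a leak raises no exception, a small behavioral check is worth having next to the mask code. The sketch below is one way to write it, under assumptions: subsequent_mask is my guess at the helper referenced above (lower-triangular, True where attention is allowed), attend is a stripped-down single-head attention, and pad_idx is taken to be 0. The idea is to perturb only the last position and confirm that the outputs at earlier positions do not move.

```python
import math
import torch


def subsequent_mask(size):
    # Assumed helper: True where query position i may see key position j <= i.
    return torch.tril(torch.ones(1, size, size)).bool()


def make_tgt_mask(tgt, pad_idx):
    pad = (tgt != pad_idx).unsqueeze(1)          # (B, 1, tgt_len)
    return pad & subsequent_mask(tgt.size(1))    # (B, tgt_len, tgt_len)


def attend(q, k, v, mask):
    # Single-head scaled dot-product attention, enough for a leak check.
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    return scores.masked_fill(mask == 0, -1e9).softmax(dim=-1) @ v


torch.manual_seed(0)
tgt = torch.tensor([[5, 6, 7, 8]])               # one sequence, no padding
mask = make_tgt_mask(tgt, pad_idx=0)
x = torch.randn(1, 4, 8)
before = attend(x, x, x, mask)

x_future = x.clone()
x_future[:, 3] += 10.0                           # change only the last position
after = attend(x_future, x_future, x_future, mask)

leak_free = torch.allclose(before[:, :3], after[:, :3])
print(mask[0].int())
print("positions 0-2 unaffected by position 3:", leak_free)
```

Swapping in an all-True mask makes the same check fail, which is the behavior a broken broadcast would produce silently during training.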
Two more small things the paper specifies that are easy to drop. Embeddings are multiplied by the square root of the model dimension before the positional encoding is added, which keeps their magnitude comparable to the position signal. And the positions themselves are fixed sinusoids, not learned parameters: a bank of sines and cosines at geometrically spaced frequencies. The paper tried learned positions too and got essentially the same result, and chose the sinusoids on the theory that they might extrapolate to sequences longer than anything seen in training.
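Both details fit in a few lines. This is a sketch rather than my exact code: the function name sinusoidal_positions, the toy vocabulary size, and the assumption that d_model is even are mine, and dropout on the summed embedding is omitted.

```python
import math
import torch


def sinusoidal_positions(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed position table: sines on even channels, cosines on odd ones."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    # Geometrically spaced frequencies: 1 / 10000^(2i / d_model)
    div = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe


d_model, vocab_size = 64, 100                    # toy sizes for illustration
embed = torch.nn.Embedding(vocab_size, d_model)
tokens = torch.tensor([[3, 14, 15, 9]])          # (B=1, L=4)
pe = sinusoidal_positions(max_len=50, d_model=d_model)

# Scale the embedding by sqrt(d_model), then add the position signal.
x = embed(tokens) * math.sqrt(d_model) + pe[: tokens.size(1)]
print(x.shape)
```

The table has no parameters, so it can be built once and stored as a buffer rather than registered with the optimizer.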
The last structural choice is where the LayerNorm goes. The original puts it after the residual add: LayerNorm(x + sublayer(x)). Hold onto that, because it is the single decision that aged the worst.
Checking it actually works
A from-scratch model that runs is not a from-scratch model that is correct. The standard smoke test for a sequence-to-sequence transformer is the copy task: feed it random sequences and ask it to reproduce them. It is trivial for a working transformer and impossible for a broken one, which is exactly what you want from a test.
A two-layer model converges in a few hundred steps. Loss falls from about 4.4, which is the log of the vocabulary size and what you get from random guessing, down to roughly 0.01, and greedy decoding reproduces held-out sequences token for token. If your masking is wrong, the copy task is where it shows up, because leaked future information makes the task too easy and the loss collapses faster than it should. A clean, gradual collapse to near zero is the signal that the residual stream, the masking, and the optimizer are all wired correctly.
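For reference, a copy-task batch can be generated in a few lines. This sketch assumes a convention that is not spelled out above: index 0 is reserved for padding, index 1 marks the start of a sequence, and the decoder input and target are the same sequence shifted by one. The function name copy_batch and the sizes are hypothetical.

```python
import math
import torch


def copy_batch(batch_size, seq_len, vocab_size, bos_idx=1):
    # Random tokens from 2..vocab_size-1; 0 and 1 are reserved (assumed convention).
    data = torch.randint(2, vocab_size, (batch_size, seq_len))
    data[:, 0] = bos_idx
    src = data                 # encoder input
    tgt_in = data[:, :-1]      # decoder input, teacher-forced
    tgt_out = data[:, 1:]      # tokens the decoder must predict
    return src, tgt_in, tgt_out


vocab_size = 50                # hypothetical; not the size used in the run above
src, tgt_in, tgt_out = copy_batch(batch_size=4, seq_len=10, vocab_size=vocab_size)

# Cross-entropy of a uniform guess over the whole vocabulary.
chance_loss = math.log(vocab_size)
print(src.shape, tgt_in.shape, tgt_out.shape, round(chance_loss, 2))
```

Comparing the first logged loss against the log of your own vocabulary size is a cheap sanity check before any training happens.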
What changed since 2017
With the paper implemented, the interesting part is holding it next to a model from this year and listing what actually moved.
Normalization moved first. Post-norm, the original choice, is hard to train at depth. The gradients are badly behaved enough that the paper needed a learning-rate warmup to keep early training from diverging. Pre-norm, x + sublayer(LayerNorm(x)), puts the normalization inside the residual branch instead of across it, which keeps a clean gradient path straight down the stack. It trains deeper and more stably, often without warmup, and it is now the default everywhere. This is the one place I would tell you the original is simply worse, and it is a one-line change.
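The one-line difference looks like this. The two wrapper classes are illustrative names of my own, dropout is omitted, and the sublayer is passed in as a callable. The zero-sublayer check at the end is a way to see the "clean path" argument directly: with pre-norm the input passes through the residual untouched, while with post-norm it is renormalized at every layer.

```python
import torch
import torch.nn as nn


class PostNormSublayer(nn.Module):
    """Original 2017 placement: LayerNorm(x + sublayer(x))."""

    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        return self.norm(x + sublayer(x))


class PreNormSublayer(nn.Module):
    """Modern placement: x + sublayer(LayerNorm(x))."""

    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        return x + sublayer(self.norm(x))


torch.manual_seed(0)
d_model = 16
x = torch.randn(2, 3, d_model)


def zero_sublayer(t):
    # Stand-in for a sublayer that contributes nothing.
    return torch.zeros_like(t)


pre_out = PreNormSublayer(d_model)(x, zero_sublayer)
post_out = PostNormSublayer(d_model)(x, zero_sublayer)

print("pre-norm leaves x unchanged: ", torch.equal(pre_out, x))
print("post-norm leaves x unchanged:", torch.allclose(post_out, x))
```

Pre-norm stacks usually add one final LayerNorm after the last layer, since nothing else normalizes the accumulated residual stream.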
Positions moved twice. Fixed sinusoids gave way to learned absolute positions, and then learned absolute gave way to rotary embeddings (RoPE), which inject position by rotating the query and key vectors so that attention scores depend on relative offset. RoPE is what most current open models use, and it adapts to longer contexts more gracefully than learned absolute positions, which have no embedding at all for a position beyond the training length.
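The relative-offset property can be verified with nothing but the standard library. The sketch below treats adjacent pairs of channels as complex numbers and rotates each pair by an angle proportional to the position. The interleaved pairing and the base of 10000 are assumptions; real implementations differ in how they pair channels.

```python
import cmath
import random


def rope(vec, pos, base=10000.0):
    """Rotate channel pairs (vec[2i], vec[2i+1]) by pos * base^(-2i/d)."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        z = complex(vec[2 * i], vec[2 * i + 1]) * cmath.exp(1j * theta)
        out += [z.real, z.imag]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
q = [random.gauss(0.0, 1.0) for _ in range(8)]
k = [random.gauss(0.0, 1.0) for _ in range(8)]

score_near = dot(rope(q, 5), rope(k, 2))       # query at 5, key at 2: offset 3
score_far = dot(rope(q, 105), rope(k, 102))    # both shifted by 100: offset 3
score_other = dot(rope(q, 5), rope(k, 4))      # offset 1

print(abs(score_near - score_far) < 1e-9)      # same offset, same score
print(abs(score_near - score_other) > 1e-6)    # different offset, different score
```

Shifting both positions by the same amount leaves the score unchanged, which is exactly the sense in which the attention score depends only on relative offset.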
The whole encoder-decoder shape became the minority case. For general-purpose generative models, decoder-only won. Part of that is simplicity and how cleanly it scales, but part of it is serving. An autoregressive decoder generates one token per step and caches the keys and values it has already computed so it never recomputes them, which makes a single stack with a KV cache the natural unit of efficient generation. Encoder-decoder is not dead: it is still strong for translation and structured sequence-to-sequence work, but the default for a large language model is now one stack, not two. (One nice property the encoder-decoder keeps: its cross-attention keys and values are computed once from the fixed source and cached for the entire decode, so the key and value projections for that half of the attention cost nothing after the first step. Only the query projection and the attention itself are recomputed.)
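A toy version of the cache shows why it is safe. The sketch below uses a single head, a single sequence, and random projection matrices (all assumptions for the demo), and compares token-by-token decoding with a cache against one full causal pass.

```python
import math
import torch

torch.manual_seed(0)
d = 8
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
x = torch.randn(6, d)   # six already-embedded tokens


def full_causal_attention(x):
    # Reference: recompute everything for the whole sequence at once.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / math.sqrt(d)
    causal = torch.tril(torch.ones(len(x), len(x))).bool()
    return scores.masked_fill(~causal, -1e9).softmax(dim=-1) @ v


k_cache, v_cache, steps = [], [], []
for t in range(len(x)):
    xt = x[t : t + 1]                  # only the newest token is projected
    k_cache.append(xt @ w_k)
    v_cache.append(xt @ w_v)
    k, v = torch.cat(k_cache), torch.cat(v_cache)
    scores = (xt @ w_q) @ k.T / math.sqrt(d)
    # No mask needed: the cache only ever contains past and current positions.
    steps.append(scores.softmax(dim=-1) @ v)

cached = torch.cat(steps)
matches = torch.allclose(cached, full_causal_attention(x), atol=1e-4)
print("cached decode matches full recompute:", matches)
```

Each step projects one token instead of the whole prefix, so the per-step projection cost stays constant while the cache grows by one row of keys and one row of values.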
Attention itself got cheaper to serve. The original gives every head its own keys and values. Multi-query and grouped-query attention share keys and values across heads, which shrinks the KV cache by a large factor and is the main reason long-context serving is affordable. That change exists almost entirely for inference, not for quality.
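The size of the saving is plain arithmetic. The configuration below is hypothetical (32 layers, 32 query heads, head dimension 128, an 8192-token context, 2 bytes per value) and is not taken from any particular model. The only thing that changes between the three variants is the number of key/value heads.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_value=2):
    # Leading 2: one tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value


cfg = dict(n_layers=32, head_dim=128, seq_len=8192)   # hypothetical model

mha_bytes = kv_cache_bytes(n_kv_heads=32, **cfg)   # every head has its own K/V
gqa_bytes = kv_cache_bytes(n_kv_heads=8, **cfg)    # 4 query heads share each K/V
mqa_bytes = kv_cache_bytes(n_kv_heads=1, **cfg)    # all heads share one K/V

GIB = 2 ** 30
print(mha_bytes / GIB, gqa_bytes / GIB, mqa_bytes / GIB)   # 4.0 1.0 0.125
```

That is per sequence. At serving batch sizes the same factor multiplies through, which is why the number of key/value heads is chosen with the memory budget in mind.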
Smaller deltas round it out. ReLU in the feed-forward block became GELU and then gated variants like SwiGLU, usually with the inner dimension trimmed to keep the parameter count roughly fixed despite the extra gate. Label smoothing, which the paper used to buy a bit of BLEU, has largely fallen out of large-scale pretraining. None of these touch the skeleton.
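The parameter bookkeeping behind the trimmed inner dimension is easy to confirm. In this sketch the class names are mine, biases are dropped on both blocks to keep the comparison clean (the original feed-forward block has them), and the two-thirds trim is the commonly used choice rather than something the discussion above fixes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReluFFN(nn.Module):
    """2017-style block: two matrices with a ReLU between them."""

    def __init__(self, d_model, d_ff):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.relu(self.up(x)))


class SwiGLUFFN(nn.Module):
    """Gated block: three matrices, so the inner width is trimmed."""

    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))


def n_params(module):
    return sum(p.numel() for p in module.parameters())


d_model = 48
relu_ffn = ReluFFN(d_model, 4 * d_model)             # inner width 192
swiglu_ffn = SwiGLUFFN(d_model, 8 * d_model // 3)    # inner width 128 = 2/3 of 192

x = torch.randn(2, 5, d_model)
print(n_params(relu_ffn), n_params(swiglu_ffn))      # 18432 18432
print(relu_ffn(x).shape, swiglu_ffn(x).shape)
```

Two matrices at width 4d and three matrices at width 8d/3 both come to 8d² weights, which is where the two-thirds factor comes from.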
Takeaway
Put the 2017 model next to a 2025 one and the spine is the same: scaled dot-product attention, multiple heads, a residual stream with normalization, position information added in, a feed-forward block between attention layers. The square root scaling that keeps the softmax sane is still there, unchanged. Almost everything that moved, moved for one of two reasons, training stability or serving cost, and most of it is a small local edit rather than a new idea.
That is the real argument for implementing the paper by hand. Once you have written the masks, scaled the embeddings, and placed the LayerNorm yourself, the modern stack stops looking like a different thing and starts looking like the original with a handful of well-motivated patches. You can read any architecture diagram from the last eight years and immediately see which part is load-bearing and which part is someone fixing post-norm or shrinking a KV cache.
