Transformer Architecture
The transformer is the architecture behind nearly every major language model since 2017, and yet most discussion of "AI" treats it as a black box. That is a shame, because the core mechanism, self-attention, is genuinely elegant: a sequence-to-sequence operation where each element in the output is a weighted average over the entire input, with the weights learned from the data itself. Everything else in the transformer is either there to make this operation more expressive or to keep training stable.
Self-Attention: The One Operation That Matters
The entire transformer rests on one idea: let every token in a sequence attend to every other token. Given an input sequence of vectors, each output vector is a weighted sum over all input vectors, where the weights are determined by dot-product similarity. In pseudocode it's remarkably compact — two matrix multiplications and a softmax.1
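As a minimal NumPy sketch of the "two matrix multiplications and a softmax" described above (the names `softmax` and `basic_self_attention` and the random toy input are illustrative assumptions, not taken from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def basic_self_attention(X):
    """X has shape (seq_len, d). Returns an array of the same shape."""
    scores = X @ X.T            # first matmul: all pairwise dot products
    weights = softmax(scores)   # each row is a distribution over positions
    return weights @ X          # second matmul: weighted sum of the inputs

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))     # 5 tokens, 8-dimensional embeddings
Y = basic_self_attention(X)
attention_weights = softmax(X @ X.T)
```

Each row of `attention_weights` sums to one, so every output row in `Y` is a weighted average of the input rows.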
The dot product here is doing the work that handcrafted features used to do. Think of it like a recommender system: in movie recommendation, you'd learn user vectors and movie vectors whose dot products predict ratings. Self-attention does the same thing, except the "users" and "movies" are tokens in a sequence, and "compatibility" is whatever relationship the training signal rewards. The word "walks" learns to attend to nearby nouns because knowing who is walking helps predict what comes next.
Two things are immediately strange about this. First, there are no parameters in the basic operation: the behaviour is entirely driven by the input embeddings. Second, self-attention is permutation equivariant: it treats its input as a set, not a sequence. Shuffle the input, and the output shuffles the same way. A sequence model that doesn't natively understand order! This is why transformers need positional encodings bolted on. Without them, "the cat sat on the mat" and "mat the the sat on cat" produce the same output vector for each word, just in a different order.
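The permutation equivariance claim is easy to check numerically. The sketch below assumes the parameter-free attention described earlier, uses a fixed hypothetical permutation, and adds a toy position-dependent offset (a stand-in for a positional encoding, not any specific published scheme) to show that order starts to matter once positions are injected:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def basic_self_attention(X):
    return softmax(X @ X.T) @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
perm = np.array([2, 0, 1, 5, 3, 4])   # an arbitrary fixed shuffle

# Shuffling the input shuffles the output in exactly the same way.
equivariant = np.allclose(
    basic_self_attention(X[perm]),
    basic_self_attention(X)[perm],
)

# Toy positional signal: add the position index to every feature.
pos = np.arange(6, dtype=float)[:, None] * np.ones((1, 4))
still_equivariant = np.allclose(
    basic_self_attention(X[perm] + pos),
    basic_self_attention(X + pos)[perm],
)
```

Here `equivariant` comes out true, while `still_equivariant` comes out false because the offset ties each vector to its slot in the sequence.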
Queries, Keys, Values — and Why They Exist
The raw dot-product attention has a limitation: each vector must simultaneously serve as the thing being compared (a key), the thing doing the comparing (a query), and the thing being aggregated (a value). That's too many jobs for one vector. The fix is to project each input through three separate learned matrices — W_Q, W_K, and W_V — producing distinct query, key, and value vectors.2
This simple addition gives the model room to disentangle "what am I looking for?" from "what do I advertise?" from "what information do I carry?" The names come from database retrieval: you have a query, you match it against keys, and you retrieve the corresponding values. It's a useful metaphor, though the learned projections can encode relationships far weirder than any database lookup.
The dot products are also divided by the square root of the key dimension. Without this, their magnitude grows with the dimension, the softmax saturates for high-dimensional vectors, and the gradient dies. A small detail, but the kind of thing that makes the difference between a model that trains and one that doesn't.
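A hedged sketch of attention with separate query, key, and value projections and the square-root scaling follows. The matrix names mirror W_Q, W_K, and W_V from the text; the dimensions and random weights are assumptions chosen only for illustration. The second half checks the scaling argument empirically: for unit-variance random vectors the raw dot product has a standard deviation near the square root of the dimension, and dividing by that brings it back to about one.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    # Three separate learned projections give each token three roles.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scale before the softmax
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d_model, d_k = 16, 8
X = rng.normal(size=(5, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = attention(X, W_q, W_k, W_v)

# Why scale: dot products of random vectors spread out as dimension grows.
dim = 256
q = rng.normal(size=(2000, dim))
k = rng.normal(size=(2000, dim))
raw = (q * k).sum(axis=-1)
raw_std = raw.std()                       # close to sqrt(256) = 16
scaled_std = (raw / np.sqrt(dim)).std()   # close to 1
```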
Multi-Head Attention
A single attention operation computes one weighted average per position. But a word can relate to its neighbours in multiple ways simultaneously — "Mary gave roses to Susan" requires tracking who gave, what was given, and who received, all from the same sentence. Multi-head attention runs several attention operations in parallel, each with its own Q/K/V projections, then concatenates the results and projects back down.2
Each head can specialize: some learn positional relationships (attend to the previous token), others learn syntactic roles (attend to the subject), others learn semantic similarity. The heads aren't told to specialize — they just do, because different heads learning different patterns reduces the training loss more than all of them learning the same thing. This is one of those cases where Mechanistic Interpretability reveals structure that the architecture doesn't explicitly encode.
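The "several attention operations in parallel, then concatenate and project" recipe can be sketched as below. This is one common way to organise the computation (project once, then split the feature axis into heads); the function name, the output matrix `W_o`, and the sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    def split_heads(M):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    heads = softmax(scores) @ V                          # (n_heads, seq, d_head)
    # Concatenate the heads along the feature axis, then project back down.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
```

Each head sees only its own `d_head`-wide slice of the projections, which is what lets different heads learn different attention patterns over the same sentence.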
The Full Block and What Surrounds It
A transformer block is: multi-head self-attention, followed by a feedforward network (usually two linear layers with a nonlinearity in between), with residual connections and layer normalization around each sub-layer. The feedforward network is applied independently to each position — it's where per-token computation happens, as opposed to the cross-token communication in the attention layer.
Stack these blocks and you get a transformer. GPT-style models use only decoder blocks (with causal masking so each token can only attend to previous tokens). The original "Attention Is All You Need" paper had an encoder-decoder structure for translation, but the decoder-only variant won for language modelling because it maps cleanly onto next-token prediction.3
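To make the block structure concrete, here is a minimal single-head decoder block in NumPy. Several choices are assumptions made for brevity rather than anything the text prescribes: layer normalization is applied before each sub-layer (the pre-norm arrangement used by GPT-2-style models), its learned gain and bias are omitted, the nonlinearity is ReLU, and the parameter dictionary `p` is a hypothetical container.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each position's features independently.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def causal_attention(X, W_q, W_k, W_v, W_o):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Causal mask: position i may only attend to positions j <= i.
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V @ W_o

def feed_forward(X, W1, b1, W2, b2):
    # Applied to every position independently.
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2

def transformer_block(X, p):
    # Residual connection around each sub-layer.
    X = X + causal_attention(layer_norm(X), p["W_q"], p["W_k"], p["W_v"], p["W_o"])
    X = X + feed_forward(layer_norm(X), p["W1"], p["b1"], p["W2"], p["b2"])
    return X

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 6
p = {
    "W_q": rng.normal(size=(d_model, d_model)) * 0.3,
    "W_k": rng.normal(size=(d_model, d_model)) * 0.3,
    "W_v": rng.normal(size=(d_model, d_model)) * 0.3,
    "W_o": rng.normal(size=(d_model, d_model)) * 0.3,
    "W1": rng.normal(size=(d_model, d_ff)) * 0.3,
    "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)) * 0.3,
    "b2": np.zeros(d_model),
}
X = rng.normal(size=(seq_len, d_model))
out = transformer_block(X, p)

# Changing only the last token must leave all earlier outputs untouched.
X_changed = X.copy()
X_changed[-1] += 1.0
out_changed = transformer_block(X_changed, p)
earlier_unchanged = np.allclose(out[:-1], out_changed[:-1])
```

The final check is the causal mask at work: editing the last token cannot influence any earlier position, which is what makes next-token prediction well defined.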
The residual connections deserve more credit than they usually get. Without them, information from early layers would have to survive multiple nonlinear transformations to influence the output. With them, the network learns corrections at each layer rather than complete transformations — early features persist through the skip connections while later layers add refinements. This is part of why transformers scale so well: you can stack many layers without the earlier ones becoming irrelevant.
Inference Arithmetic: Where Theory Meets Hardware
Understanding the transformer architecture is one thing; running it efficiently is another. The key insight from Kipply's analysis is that transformer inference is almost entirely memory-bandwidth bound, not compute-bound, for typical serving scenarios.4
Here's why: at inference time, generating each token requires reading all the model's weights from memory, but only doing O(1) multiplications per weight (for batch size 1). An A100 GPU does 312 TFLOPS but only moves 1.5 TB/s of memory bandwidth, a ratio of about 208 floating-point operations per byte. This means the GPU can compute on a batch of roughly 208 tokens in the time it takes to load the weights once. Below a batch of 208 tokens, you're paying for memory transfer and the compute is essentially free.
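The arithmetic above can be written out as a back-of-the-envelope calculation. The hardware figures are the ones quoted in the text; the cost model (2 bytes per weight for 16-bit storage, 2 floating-point operations per weight per token) and the 52-billion-parameter example size are simplifying assumptions, and `step_times` is a hypothetical helper.

```python
A100_FLOPS = 312e12        # floating-point operations per second
A100_BANDWIDTH = 1.5e12    # bytes per second

def step_times(n_params, batch_size, bytes_per_param=2, flops_per_param=2):
    """Rough time to load the weights once vs. time to compute one step."""
    memory_time = n_params * bytes_per_param / A100_BANDWIDTH
    compute_time = n_params * flops_per_param * batch_size / A100_FLOPS
    return memory_time, compute_time

# Operations the GPU can perform in the time it takes to move one byte.
ratio = A100_FLOPS / A100_BANDWIDTH

for batch_size in (1, 64, 512):
    memory_time, compute_time = step_times(52e9, batch_size)
    bound = "memory" if memory_time > compute_time else "compute"
    print(f"batch {batch_size:>3}: load {memory_time * 1e3:.1f} ms, "
          f"compute {compute_time * 1e3:.1f} ms -> {bound}-bound")
```

Under this model the two times cross when the batch size equals `ratio`, which is where the figure of 208 comes from.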
The KV cache is the other critical piece: rather than recomputing key and value vectors for all previous tokens at each step, you cache them. Over a whole generation, this turns key and value computation that would grow quadratically with sequence length into something linear, at the cost of memory. For a 52B parameter model, the KV cache costs about 2 MB per token, which means you can fit maybe 8,000 tokens in the leftover GPU memory after loading the weights. Want bigger batches? You need more GPUs, and now you're paying for inter-GPU communication.
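A small sketch can show both halves of this paragraph. The first part checks that attending over cached keys and values gives the same result as re-projecting the whole prefix at every step (single head, toy sizes, random weights, all assumed for illustration). The second part redoes the memory arithmetic; the layer count, model width, 16-bit storage, and three 40 GB GPUs are assumptions chosen to be consistent with the quoted figures, not values stated in the text.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    # One query vector attends over every position seen so far.
    return softmax(K @ q / np.sqrt(q.shape[-1])) @ V

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
tokens = rng.normal(size=(6, d))

# With a cache: project only the newest token and append its key and value.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
cached_outputs = []
for x in tokens:
    K_cache = np.vstack([K_cache, x @ W_k])
    V_cache = np.vstack([V_cache, x @ W_v])
    cached_outputs.append(attend(x @ W_q, K_cache, V_cache))

# Without a cache: re-project the entire prefix at every step.
recomputed_outputs = []
for t in range(1, len(tokens) + 1):
    prefix = tokens[:t]
    recomputed_outputs.append(attend(prefix[-1] @ W_q, prefix @ W_k, prefix @ W_v))

same_result = np.allclose(cached_outputs, recomputed_outputs)

# Memory cost per cached token: keys and values, 2 bytes each, per layer.
n_layers, d_model = 64, 8192                           # assumed 52B-scale shape
kv_bytes_per_token = 2 * 2 * n_layers * d_model        # 2,097,152 bytes
leftover_bytes = 3 * 40e9 - 52e9 * 2                   # assumed 3 x 40 GB GPUs
tokens_that_fit = int(leftover_bytes // kv_bytes_per_token)
```

Under these assumptions `tokens_that_fit` lands a little under 8,000, in line with the rough figure above.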
The practical upshot: the transformer's architecture determines not just what it can learn but what it costs to run. Inference-optimized models like Llama explicitly trade training compute for inference efficiency — a smaller model trained on more data, following the Chinchilla scaling insights, costs less per query even if it took similar compute to train.
Attention Is Actually All You Need
The title of the original transformer paper was rhetorical, but it turns out to be literally true in a surprising way. Robert Huben proved that feedforward networks can be implemented entirely using attention heads — you can convert any standard transformer into an attention-only transformer with preserved behavior.5 The construction is mathematically precise: augment the residual stream with extra dimensions to store intermediate FFN computations and one-hot positional encodings, then replace each FFN sublayer with three attention layers that jointly implement the linear transformations and nonlinearity.
The key trick is implementing the activation function (SiLU) through attention. Each dimension gets its own attention head where the query-key matrix is rigged so that each position attends only to itself and a special "bias" vector, with the attention weights splitting in exactly the proportion needed to compute the sigmoid. It's clever and somewhat wasteful — the converted model is about 5x wider and 4x deeper per layer — but the errors are on the order of 10^-14, essentially machine precision.
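The core of the trick can be illustrated with a scalar toy. This is a simplified sketch of the idea, not the full construction from the cited work. If a position holding the value x attends to itself with score x and to a bias slot with score 0, the softmax weight on itself is exactly sigmoid(x). Giving the self slot value x and the bias slot value 0 then makes the attention output x * sigmoid(x), which is SiLU. The function names are assumptions.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def silu_via_attention(x):
    # Two attention targets per input: the position itself and a "bias" slot.
    scores = np.stack([x, np.zeros_like(x)], axis=-1)   # self scores x, bias scores 0
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # [sigmoid(x), 1 - sigmoid(x)]
    values = np.stack([x, np.zeros_like(x)], axis=-1)   # self carries x, bias carries 0
    return (weights * values).sum(axis=-1)

xs = np.linspace(-6.0, 6.0, 25)
max_error = np.max(np.abs(silu_via_attention(xs) - silu(xs)))
```

In this toy, `max_error` is at the level of floating-point rounding, which is the same flavour of result as the near-machine-precision errors reported above.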
Why does this matter for anything besides mathematical aesthetics? Because the mechanistic interpretability toolkit that Anthropic built for analyzing transformer circuits works primarily on attention heads. The mathematical framework explicitly noted that "more complete understanding will require progress on MLP layers." If MLPs can be faithfully expressed as attention, the entire transformer becomes analyzable through a single set of tools. Whether this is practical for interpretability at scale remains open — the converted models are larger and the attention patterns use ranks far higher than normal learned heads — but as a theoretical result it's striking. The two components everyone assumed were fundamentally different turn out to be the same operation at different levels of indirection.
Hand-Coding Transformer Weights
If attention-only transformers show that the architecture's components are more unified than we thought, the Programmable Transformers project shows they're more interpretable than we assumed. The idea is straightforward but its implications are not: can you hand-write the weights of a transformer to perform a specific task, in a notation humans can read?6
The notation uses "semes" — named basis vectors like +noun, +sg, +verb, +third — to represent sparse vectors and matrices. A word embedding becomes +pig +noun +sg rather than a list of inscrutable floats. A weight matrix becomes {noun > subject, verb > predicate} rather than a wall of numbers. This is not an approximation or a simplification — it's a complete specification of actual transformer weights that you can load and run.
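One hypothetical way to read this notation in code is to treat each seme as a named axis, a vector as a sparse mapping from seme names to weights, and a matrix as a sparse mapping from (input seme, output seme) pairs to weights. The parsing helpers below are illustrative assumptions and are not the project's actual implementation.

```python
def parse_vector(text):
    # "+pig +noun +sg" -> {"pig": 1.0, "noun": 1.0, "sg": 1.0}
    return {token[1:]: 1.0 for token in text.split()}

def parse_matrix(text):
    # "{noun > subject, verb > predicate}" -> {("noun", "subject"): 1.0, ...}
    entries = text.strip("{}").split(",")
    return {tuple(part.strip() for part in entry.split(">")): 1.0 for entry in entries}

def apply_matrix(matrix, vector):
    # Sparse matrix-vector product over named axes.
    out = {}
    for (src, dst), weight in matrix.items():
        if src in vector:
            out[dst] = out.get(dst, 0.0) + weight * vector[src]
    return out

pig = parse_vector("+pig +noun +sg")
roles = parse_matrix("{noun > subject, verb > predicate}")
result = apply_matrix(roles, pig)
```

Applying `roles` to `pig` yields a vector with only the `subject` axis set, since `noun` is the only input seme that the matrix mentions and the embedding contains.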
What makes the exercise genuinely illuminating is what you learn about the architecture while doing it. The feedforward layers turn out to be where logical conjunctions happen: implementing "A AND B" requires the ReLU nonlinearity that the FFN provides (an attention layer only takes weighted averages of linearly projected values, so it can't naturally compute conjunctions). The residual connections create a specific design constraint: it's hard to erase information from the residual stream, which helps explain why early-training transformers are repetitive before they learn to suppress previous-token embeddings. These are insights about what the architecture can and cannot naturally compute, derived not from probing trained models but from the constraints of building one by hand.6
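The conjunction point can be made concrete with a hand-written two-layer feedforward network. With binary features A and B, the hidden unit computes ReLU(A + B - 1), which is 1 only when both are present. No purely affine map can do this, because being zero on (0,0), (1,0), and (0,1) forces it to be zero on (1,1) too. The weights below are hand-picked for illustration.

```python
import numpy as np

def ffn_and(x):
    W1 = np.array([[1.0], [1.0]])   # sum the two input features
    b1 = np.array([-1.0])           # threshold: fire only if both are on
    W2 = np.array([[1.0]])          # pass the hidden unit straight through
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU supplies the nonlinearity
    return hidden @ W2

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
outputs = ffn_and(inputs).ravel()   # one value per (A, B) pair
```

Only the last row, where both features are set, produces a nonzero output.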
What the Architecture Doesn't Explain
The transformer is a surprisingly simple architecture: attention, feedforward, residual, stack. It has fewer inductive biases than CNNs (no locality assumption) or RNNs (no sequential processing assumption). This is both its strength and its mystery. The architecture itself doesn't explain why these models can do chain-of-thought reasoning, or why scaling (see Scaling And The Bitter Lesson) seems to keep working. The magic isn't in the mechanism; it's in what emerges when you apply enough data and compute to a sufficiently expressive function class.
Building a GPT from scratch in 60 lines of NumPy is possible — and doing so is one of the best ways to demystify these systems.7 What remains mystifying is why such a simple architecture, scaled up, produces the capabilities we observe. That question belongs to Mechanistic Interpretability and to the philosophy of emergent computation more broadly.
Footnotes