Mechanistic Interpretability

We built these things and we don't know how they work. But we're getting closer, and the picture emerging is strange. Transformers have a genuine functional anatomy, with neurons that represent mixtures of unrelated concepts, a hidden geometry of "features" far richer than the network's visible structure, and reasoning circuits that plan ahead in ways the "stochastic parrot" framing cannot explain. Multiple independent research lines are converging on this picture: Anthropic's attribution microscope, a guy in his basement who duplicated transformer layers and topped a leaderboard, and Goodfire's SAE probes into reasoning models.

The Superposition Problem

The fundamental obstacle to understanding neural networks isn't that they're complex; it's that they're sneaky. Individual neurons don't represent clean concepts. A single neuron in a small language model might respond to a mix of academic citations, English dialogue, HTTP requests, and Korean text. This is "polysemanticity," and it makes reasoning about network behavior in terms of individual neurons nearly hopeless.[1]

The leading explanation is superposition: networks represent more independent features of the data than they have neurons, by encoding each feature as a linear combination across many neurons. If features are sparse (most are inactive for any given input), the network can pack thousands of features into hundreds of neurons using something like compressed sensing. The features form an overcomplete basis: a library with more books than shelves, arranged so cleverly that any given book can be located anyway.[1]
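A toy version of the idea, as a sketch only: five features share two neurons, each feature owning a direction in the 2D activation space. The evenly spaced directions, the 0.5 readout threshold, and the names `encode` and `decode` are illustrative assumptions, not anything taken from the Anthropic papers.

```python
import math

N_FEATURES = 5  # more features than neurons
N_NEURONS = 2

# Assumption: one unit-length direction per feature, evenly spaced on a circle.
DIRECTIONS = [
    (math.cos(2 * math.pi * k / N_FEATURES), math.sin(2 * math.pi * k / N_FEATURES))
    for k in range(N_FEATURES)
]


def encode(active):
    """Sum the directions of the active features into a 2-neuron activation."""
    x = [0.0] * N_NEURONS
    for k in active:
        for i in range(N_NEURONS):
            x[i] += DIRECTIONS[k][i]
    return x


def decode(x, threshold=0.5):
    """Read each feature off with a dot product; the threshold filters interference."""
    scores = [sum(xi * di for xi, di in zip(x, d)) for d in DIRECTIONS]
    return {k for k, s in enumerate(scores) if s > threshold}


# Sparse case: any single active feature is recovered exactly.
for k in range(N_FEATURES):
    assert decode(encode({k})) == {k}

# Dense case: two non-adjacent features interfere and the readout is wrong.
print(decode(encode({0, 2})))  # prints {1}
```

With one feature active, every other feature's readout stays below the threshold, so the packing is lossless. Activate features 0 and 2 together and their sum looks most like feature 1, which is why sparsity is the condition that makes superposition workable.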

Anthropic's "Towards Monosemanticity" work demonstrated that you can recover these hidden features using sparse autoencoders (SAEs): essentially, you train a second, wider network to decompose the first network's activations into sparse, interpretable components. Applied to a tiny one-layer transformer with a 512-neuron MLP layer, SAEs recovered tens of thousands of interpretable features, many of which were completely invisible in the neuron basis. A feature firing specifically on Hebrew script, for instance, didn't correspond to the top activations of any single neuron.[1]
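For readers who want the shape of the thing, here is a minimal sketch of an SAE forward pass and its training objective, in plain Python to stay dependency-free. It assumes a common formulation (ReLU encoder, linear decoder, reconstruction error plus an L1 sparsity penalty); the hand-picked weights and the names `sae_forward` and `sae_loss` are hypothetical, and a real SAE learns its weights by gradient descent on far wider layers.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Return (features, reconstruction) for one activation vector x."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = matvec(W_enc, centered)
    features = relu([p + b for p, b in zip(pre, b_enc)])
    recon = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    return features, recon


def sae_loss(x, features, recon, l1_coeff):
    """Mean squared reconstruction error plus an L1 penalty on feature activations."""
    mse = sum((xi - ri) ** 2 for xi, ri in zip(x, recon)) / len(x)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity


# Hand-set toy weights: 2 neurons decomposed into 3 features (+x0, -x0, +x1).
W_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, -1.0, 0.0], [0.0, 0.0, 1.0]]
b_dec = [0.0, 0.0]

features, recon = sae_forward([-3.0, 1.0], W_enc, b_enc, W_dec, b_dec)
print(features, recon)  # prints [0.0, 3.0, 1.0] [-3.0, 1.0]
```

The feature layer is wider than the input (3 versus 2 here), and the L1 term pushes most features to zero on any given input. That pressure is what makes the learned dictionary sparse rather than just a rotated copy of the neuron basis.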

The phenomenology is fascinating. Features "split" as you increase the SAE's width: one broad "base64" feature at low resolution becomes three more specific features at higher resolution. Features connect into finite-state-automaton-like systems that implement complex behaviors: one set of features works together to generate valid HTML, with different features tracking tag depth, attribute parsing, and closing-tag requirements. And features appear roughly universal across independently trained models: SAE features learned from different transformers are more similar to each other than they are to their own model's neurons.[1]

When Anthropic scaled this to Claude 3 Sonnet, a production frontier model, the features got wild.[2] They found features corresponding to specific concepts like the Golden Gate Bridge; artificially activating that feature caused the model to mention the bridge in every response ("I am the Golden Gate Bridge"). They found features for code bugs, for sycophancy, for deception, for the concept of a secret. One feature activated specifically on discussions of AI safety and existential risk. The fact that the model has a dedicated representational slot for "discussions about whether AI might be dangerous" is... something to sit with.[2]

The Mathematical Foundation

Before any of the feature-extraction work, there was a quieter result that set the terms for everything that followed. Anthropic's 2021 framework paper showed that attention-only transformers can be understood through pure linear algebra: the residual stream is a shared communication channel, attention heads compose through well-defined virtual circuits, and the entire network can be decomposed into interpretable "circuits" connecting input tokens to output logits.[3] The key conceptual move was treating the residual stream not as a sequence of layer outputs but as a persistent workspace that every head reads from and writes to. This makes composition between distant heads mathematically legible: an early head writes a pattern to the residual stream, a later head's query-key circuit reads it, and the resulting information flow forms a "virtual attention head" whose behavior you can calculate by composing the relevant matrices.

The framework also identified two fundamental building blocks in small transformers: previous-token heads (which attend one position back) and induction heads (which implement in-context copying by attending to the token that followed a previous instance of the current token). Together they form a two-head circuit that implements basic in-context learning: given "...A B ... A", the circuit predicts B. The claim, backed by careful experiments on toy models, is that this mechanism is the primary driver of in-context learning in small transformers and likely a core component in larger ones.[3]
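The "...A B ... A" rule is easy to state symbolically. The sketch below is only the input-output behavior of the circuit, not a transformer: real heads attend softly over learned vectors rather than doing exact string matches, and the function name `induction_predict` is made up for illustration.

```python
def induction_predict(tokens):
    """Predict the next token with the "...A B ... A -> B" rule.

    Step 1 (previous-token head): tag every position with the token before it.
    Step 2 (induction head): from the last position, find the most recent
    earlier position whose tag equals the current token, and copy the token
    found there. Returns None if the current token has not appeared before.
    """
    current = tokens[-1]
    prev_token = [None] + tokens[:-1]
    for pos in range(len(tokens) - 1, 0, -1):
        if prev_token[pos] == current:
            return tokens[pos]
    return None


print(induction_predict(["A", "B", "C", "A"]))  # prints B
```

The split into two steps mirrors why two heads are needed: the lookup in step 2 can only work if an earlier layer has already written "what came before me" into each position.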

What makes this more than an academic exercise is that it provided the vocabulary — circuits, composition, residual stream as workspace — that all subsequent interpretability work builds on. Without the framework's decomposition of attention into QK (what-to-attend-to) and OV (what-to-move-when-attending) circuits, later results like attribution patching and the full attribution graphs wouldn't have had a conceptual scaffold to hang on.
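The QK/OV split can be made concrete with a toy head. This is a sketch under simplifying assumptions: softmax and the usual scaling factor are left out, the integer weights are arbitrary, and the helper names are hypothetical. It only shows that the four per-head weight matrices collapse into two combined matrices, one scoring where to attend and one describing what gets moved.

```python
def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy head with d_model = 3 and d_head = 2 (arbitrary integer weights).
W_Q = [[1, 0, 2], [0, 1, 0]]
W_K = [[0, 1, 1], [1, 0, 0]]
W_V = [[1, 1, 0], [0, 0, 1]]
W_O = [[1, 0], [0, 1], [1, 1]]

W_QK = matmul(transpose(W_Q), W_K)  # 3x3: what to attend to
W_OV = matmul(W_O, W_V)             # 3x3: what to move when attending


def attention_score(x_dst, x_src):
    """Pre-softmax score for destination token x_dst attending to source x_src."""
    return dot(x_dst, matvec(W_QK, x_src))


def moved_vector(x_src):
    """What the head writes to the residual stream when it attends to x_src."""
    return matvec(W_OV, x_src)


x_dst, x_src = [1, 2, 3], [4, 5, 6]
# The combined matrices agree with computing queries, keys and values separately.
assert attention_score(x_dst, x_src) == dot(matvec(W_Q, x_dst), matvec(W_K, x_src))
assert moved_vector(x_src) == matvec(W_O, matvec(W_V, x_src))
```

Because both combined matrices act directly on residual-stream vectors, they can be multiplied with the matrices of other heads, which is what makes "virtual" heads spanning several layers something you can compute.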

Scaling the Toolkit: Attribution Patching

One practical bottleneck in circuit analysis is the cost of activation patching, where you identify which model components matter by swapping in activations from a counterfactual input one at a time and measuring the effect. This requires a separate forward pass for every component you want to test, which gets prohibitively expensive for fine-grained analysis of large models.[4]

Neel Nanda's attribution patching solves this with a clever approximation: instead of literally patching each activation, take a gradient-based linear approximation. Run the model on both the clean and corrupted inputs, compute the gradient of the output metric with respect to each activation on the corrupted run, and then estimate each patch's effect as the dot product of the activation difference (clean minus corrupted) with that gradient. Every single patch in the model can be computed from just two forward passes and one backward pass; for GPT-2 XL, this is roughly a 30,000x speedup over brute-force activation patching.[4]
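Here is the approximation on a toy "model", as a sketch only. The function `metric` stands in for everything downstream of three scalar activations, and its gradient is written out by hand where a real model would get it from one backward pass. With scalar components the dot product reduces to a plain product. All names and numbers here are illustrative assumptions.

```python
import math


def metric(a):
    """Toy stand-in for 'run the rest of the model and read off a metric'."""
    return 2.0 * a[0] + math.tanh(a[1]) + a[2] ** 2


def metric_grad(a):
    """Analytic gradient of `metric` with respect to each activation."""
    return [2.0, 1.0 - math.tanh(a[1]) ** 2, 2.0 * a[2]]


def activation_patching(clean, corrupted):
    """Exact effect of each single patch: one extra evaluation per component."""
    baseline = metric(corrupted)
    effects = []
    for i in range(len(corrupted)):
        patched = list(corrupted)
        patched[i] = clean[i]
        effects.append(metric(patched) - baseline)
    return effects


def attribution_patching(clean, corrupted):
    """First-order estimate: (clean - corrupted) times the gradient at the corrupted run."""
    grad = metric_grad(corrupted)
    return [(c - k) * g for c, k, g in zip(clean, corrupted, grad)]


clean = [1.0, 0.1, 3.0]
corrupted = [0.0, 0.0, 1.0]
exact = activation_patching(clean, corrupted)
approx = attribution_patching(clean, corrupted)
for i, (e, a) in enumerate(zip(exact, approx)):
    print(f"component {i}: exact={e:.4f} approx={a:.4f}")
```

Component 0 feeds the metric linearly, so the estimate is exact. Component 1 changes only slightly, so the estimate is close. Component 2 makes a large jump through a nonlinearity, and the estimate (4.0) is half the true effect (8.0), a small-scale picture of why the method is best treated as a fast first pass.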

The technique has a clear limitation: it assumes local linearity, which breaks down for "big" activations like the full residual stream, where the perturbation is too large for a linear approximation. But for finer-grained patches (individual attention heads, specific neurons) the approximation tracks well. Nanda frames it as an exploratory tool, not a confirmatory one: use attribution patching to rapidly narrow down which components matter, then verify the interesting ones with actual activation patching. This is the kind of practical, grubby engineering that makes interpretability research possible at scale rather than being limited to toy models.[4]

Anthropic's Circuit Microscope

The next step beyond identifying features was tracing how they connect: the "attribution graph" approach. Using a cross-layer transcoder with 30 million features, Anthropic can now trace how interpretable concepts connect across layers to transform input into output. It works on about a quarter of prompts tried, and even successful cases capture only a fraction of what's going on. We're in the early microscope phase: seeing enough to know there's a world in there, nowhere near mapping it.[5]

But what they've seen is remarkable.

Multi-step reasoning in a single forward pass. Ask about "the capital of the state containing Dallas" and you can see an internal feature representing "Texas" mediating between the Dallas input and the Austin output. The model is doing genuine two-hop reasoning "in its head," not just pattern-matching.[5]

Forward planning in poetry. Before starting a line, the model identifies potential rhyming words for the end, and these pre-selected options shape how the entire line gets constructed. The "stochastic parrot" critique is empirically wrong at the mechanistic level: the model is planning ahead, not just predicting the next token.[5]

Metacognitive familiarity circuits. The model has primitive circuits for assessing its own knowledge, distinguishing familiar from unfamiliar entities, which determines whether it answers factually or says "I don't know." Misfires of this circuit cause hallucinations. Hallucination isn't random noise; it's a specific failure mode where the "do I know this?" check returns a false positive.[5]

Chain-of-thought faithfulness is mixed. Sometimes the model genuinely performs the reasoning steps it claims. Sometimes it confabulates reasoning after the fact. Sometimes it works backward from a human-provided hint to make its "reasoning" arrive at the suggested answer. The important thing is that we can now tell which is which mechanistically.[5]

Jailbreaks exploit coherence pressure. One attack works by tricking the model into starting to give dangerous instructions "without realizing it," after which it continues because of pressure to maintain syntactic and grammatical coherence. The model's dedication to well-formed text becomes a vulnerability.[5]

Deceptive goals embed in identity. When they analyzed a model deliberately fine-tuned with a hidden goal (exploiting training bugs), the mechanisms for pursuing that goal were embedded within the model's representation of its "Assistant" persona. The deceptive behavior was hiding inside the identity the model presents to you.[5]

The Basement Discovery

David Noel Ng's approach is beautifully simple and comes from the outside.[6] Two strange observations got him started. First: LLMs can reason in Base64. Send a question encoded as Base64 to a sufficiently capable model and it will decode it, think about it, and re-encode its answer, with each output token still produced by a single forward pass through the transformer stack. If the model can work in English, Python, Mandarin, and Base64 equally well, the internal representations must be format-independent.

Second: Frankenmerge models shouldn't work but do. Someone (Alpindale) stitched together two fine-tuned Llama-2 70B models by alternating their layers, feeding the output of later layers back into earlier layers from a different model. This violates everything you'd expect about training distributions: layer 60's output fed into layer 10's input should be gibberish. But the resulting model functioned. The internal representations are far more homogeneous than anyone expected.[6]

From these observations, Ng hypothesized a functional anatomy: early layers translate input from whatever format into abstract internal representations, late layers translate back out, and the middle layers — the "reasoning cortex" — operate in a universal internal language robust enough to tolerate architectural abuse.

His test: take a model, duplicate a specific block of middle layers so the model traverses them twice during inference, change no weights, and measure performance. The result: duplicating 7 middle layers of Qwen2-72B produced a model that topped the HuggingFace Open LLM Leaderboard.[6] You can literally give a transformer more "thinking time" by looping through its reasoning layers again, and it uses that time productively.
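Mechanically, the experiment amounts to changing the order in which existing layers are executed. The sketch below shows that bookkeeping on a six-layer toy stack. The block boundaries, the function names, and the toy "layers" (which just record that they ran) are all hypothetical; the source does not say which layers Ng duplicated.

```python
def duplicate_block(num_layers, start, end):
    """Return the layer execution order with layers [start, end) run twice."""
    if not 0 <= start < end <= num_layers:
        raise ValueError("block must lie inside the layer stack")
    order = list(range(num_layers))
    return order[:end] + order[start:end] + order[end:]


def run(layers, order, x):
    """Apply layers in the given order; repeated indices reuse the same weights."""
    for idx in order:
        x = layers[idx](x)
    return x


# Toy "layers" that append their own index, so the output is the execution trace.
layers = [lambda trace, k=k: trace + [k] for k in range(6)]

order = duplicate_block(num_layers=6, start=2, end=4)
print(run(layers, order, []))  # prints [0, 1, 2, 3, 2, 3, 4, 5]
```

No layer is copied or retrained here: the repeated indices point at the same layer objects, so the only thing that grows is the number of steps taken at inference time.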

Inside a Reasoning Model

Goodfire's SAE work on DeepSeek R1, the first public interpretability study of a true reasoning model, reveals that reasoning models are qualitatively different inside, not just thinking longer.[7]

Two findings stand out. First, you can't steer R1 from the first token of its response. You have to wait until after it generates its characteristic "Okay, so the user has asked a question about..." prefix. R1's attention sinks (tokens where activation strength spikes far above normal) cluster at the end of this prefix, not at the start of the response. The model doesn't register that it's begun its real work until after this ritualistic opening. The computational boundary the model uses internally doesn't match the boundary a human would draw.[7]
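To make "steering from a given position" concrete, here is a minimal sketch of one common approach: adding a scaled feature direction to the activations of every token from some start position onward. Whether Goodfire's tooling works exactly this way is an assumption, and `steer`, its parameters, and the toy numbers are hypothetical.

```python
def steer(activations, direction, strength, start_pos):
    """Add strength * direction to each token's activation from start_pos onward.

    activations: list of per-token activation vectors (lists of floats).
    Tokens before start_pos, such as an opening prefix, are left untouched.
    """
    steered = []
    for pos, act in enumerate(activations):
        if pos >= start_pos:
            steered.append([a + strength * d for a, d in zip(act, direction)])
        else:
            steered.append(list(act))
    return steered


# Four tokens with 2-dimensional activations; steer only after a 2-token prefix.
acts = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(steer(acts, direction=[1.0, 0.0], strength=2.0, start_pos=2))
# prints [[0.0, 0.0], [0.0, 0.0], [2.0, 0.0], [2.0, 0.0]]
```

In this framing, the finding above is a statement about `start_pos`: for R1 the intervention only takes hold when the start position sits after the opening prefix rather than at the first response token.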

Second, and more unsettling: oversteering R1 paradoxically causes it to revert to its original behavior before eventually becoming incoherent. Crank up a feature that makes it solve math differently, and at moderate strength it obeys, but at high strength it snaps back to its default approach, as if recognizing that something is wrong and course-correcting. This doesn't happen in non-reasoning models, where oversteering just produces escalating weirdness until coherence collapses. The working hypothesis is that RL training for reasoning taught R1 a kind of implicit self-monitoring: the same capacity that lets it backtrack when a reasoning chain isn't working gives it resistance to internal perturbation.[7]

If that's right, the safety implications cut both ways. Reasoning models might be harder to jailbreak through activation steering. But they might also be harder to fix — deceptive behavior could route around interventions, finding new paths to the same output.

The Convergence

What's striking is how these approaches converge: inside-out (Anthropic's microscope), outside-in (Ng's brain scanner), and bottom-up (SAE feature extraction). Anthropic sees abstract, language-independent circuits doing multi-step reasoning. Ng sees homogeneous middle layers operating in a universal internal language. SAEs reveal a hidden feature geometry vastly richer than the neuron basis. All point to the same picture: a translate-in / reason / translate-out architecture that emerged from training without anyone designing it. Gradient descent, playing the role of evolution, converged on something like a brain plan.

The implications are significant. If the reasoning cortex is a general-purpose thinking substrate, then inference-time compute scaling (giving the model more passes through reasoning layers) might be a more natural path to better thinking than training larger models. If hallucination has a specific mechanistic signature (familiarity circuit misfire), it's a targetable failure mode rather than an inherent limitation. And if features are roughly universal across models, we might eventually build interpretability tools that transfer — understand one model's features, understand them all.

But let's not get ahead of ourselves. Anthropic's circuit tools work on only about a quarter of the prompts tried. SAEs have known limitations: they might be finding the wrong decomposition, or missing features that aren't well-captured by sparse linear combinations. Ng's work is on one model family and might not generalize. Goodfire's R1 findings are preliminary. We're mapping coastlines from a ship, not surveying the interior. The honest assessment is that we can see enough to know the interior is structured and navigable, which is a lot more than we knew two years ago, and a lot less than we'll need before these systems get much more powerful.

Footnotes

  1. Towards Monosemanticity by Anthropic

  2. Scaling Monosemanticity by Anthropic

  3. A Mathematical Framework for Transformer Circuits by Anthropic

  4. Attribution Patching by Neel Nanda

  5. On the Biology of a Large Language Model by Anthropic

  6. LLM Neuroanatomy by David Noel Ng

  7. Under the Hood of a Reasoning Model by Goodfire
