Mechanistic Interpretability

We built these things and we don't know how they work. But we're getting closer — and the picture emerging is strange. Transformers have a genuine functional anatomy, with neurons that represent mixtures of unrelated concepts, a hidden geometry of "features" far richer than the network's visible structure, and reasoning circuits that plan ahead in ways the "stochastic parrot" framing cannot explain. Multiple independent research lines are converging on this picture, from Anthropic's attribution microscope to a guy in his basement duplicating transformer layers to top a leaderboard to Goodfire's SAE probes into reasoning models.

The Superposition Problem

The fundamental obstacle to understanding neural networks isn't that they're complex; it's that they're sneaky. Individual neurons don't represent clean concepts. A single neuron in a small language model might respond to a mix of academic citations, English dialogue, HTTP requests, and Korean text. This is "polysemanticity," and it makes reasoning about network behavior in terms of individual neurons nearly hopeless.¹

The leading explanation is superposition: networks represent more independent features of the data than they have neurons, by encoding each feature as a linear combination across many neurons. If features are sparse (most are inactive for any given input), the network can pack thousands of features into hundreds of neurons using something like compressed sensing. The features form an overcomplete basis — a library with more books than shelves, arranged so cleverly that any given book can be located anyway.¹
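A toy sketch of the packing idea, not the construction from the paper. It assumes each feature is assigned a random unit direction in neuron space; the sizes, the random directions, and the dot-product readout are all illustrative assumptions. With only a few features active, projecting the activation vector back onto each direction picks out the active ones despite small interference from the rest.

```python
import numpy as np

rng = np.random.default_rng(0)

n_neurons = 256    # dimensionality of the activation space (illustrative)
n_features = 1024  # more features than neurons
n_active = 3       # sparsity: only a few features fire on any one input

# Assumption: each feature gets a random unit direction in neuron space.
# Random directions in high dimensions are nearly, but not exactly, orthogonal.
directions = rng.standard_normal((n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Build one sparse input: a handful of features active at strength 1.
active = rng.choice(n_features, size=n_active, replace=False)
feature_values = np.zeros(n_features)
feature_values[active] = 1.0

# The neuron activations are a linear combination of feature directions.
activations = feature_values @ directions  # shape: (n_neurons,)

# Read every feature back out by projecting onto its direction.
# Active features read near 1; inactive ones read as small interference.
readout = directions @ activations  # shape: (n_features,)
recovered = np.argsort(readout)[-n_active:]

print("active features:   ", sorted(active.tolist()))
print("recovered features:", sorted(recovered.tolist()))
```

Raising `n_active` makes the interference terms pile up and the readout degrade, which is the sense in which sparsity is what makes the packing work.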

Anthropic's "Towards Monosemanticity" work demonstrated that you can recover these hidden features using sparse autoencoders (SAEs): essentially, you train a second, wider network to decompose the first network's activations into sparse, interpretable components. Applying this to the 512-neuron MLP layer of a tiny one-layer transformer, they found thousands of interpretable features (more than 4,000 in the headline decomposition), many of which were completely invisible in the neuron basis. A feature firing specifically on Hebrew script, for instance, didn't correspond to the top activations of any single neuron.¹
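To make the "second, wider network" concrete, here is a minimal sketch of a sparse autoencoder's forward pass and training objective. The parameter names, the sizes, and the L1 coefficient are assumptions for illustration, and the weights are random rather than trained, so this shows the shape of the computation rather than a working dictionary.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """One pass through a sparse autoencoder (hypothetical parameter names)."""
    # Encoder: the ReLU keeps feature activations non-negative.
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Decoder: reconstruct the activations as a weighted sum of feature directions.
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features toward sparsity."""
    reconstruction = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return reconstruction + l1_coeff * sparsity

rng = np.random.default_rng(0)
d_model, n_features = 512, 4096          # illustrative sizes: dictionary wider than the layer
x = rng.standard_normal((8, d_model))    # stand-in for a batch of recorded activations
W_enc = rng.standard_normal((d_model, n_features)) * 0.01
W_dec = rng.standard_normal((n_features, d_model)) * 0.01
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat)
print(f.shape, x_hat.shape)
```

Training would minimize `sae_loss` over many recorded activations; the rows of `W_dec` are then the candidate feature directions, and `f` says how strongly each one is present on a given input.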

The phenomenology is fascinating. Features "split" as you increase the SAE's width — one broad "base64" feature at low resolution becomes three more specific features at higher resolution. Features connect into finite-state-automaton-like systems that implement complex behaviors — one set of features works together to generate valid HTML, with different features tracking tag depth, attribute parsing, and closing-tag requirements. And features appear roughly universal across independently trained models: SAEs applied to different transformers produce more similar features to each other than to their own model's neurons.¹

When they scaled this to Claude 3 Sonnet — a production frontier model — the features got wild.² They found features corresponding to specific concepts like the Golden Gate Bridge, which when artificially activated caused the model to mention the bridge in every response ("I am the Golden Gate Bridge"). They found features for code bugs, for sycophancy, for deception, for the concept of a secret. One feature activated specifically on discussions of AI safety and existential risk. The fact that the model has a dedicated representational slot for "discussions about whether AI might be dangerous" is... something to sit with.²
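One simple reading of "artificially activating" a feature is adding its direction to the model's activations at some layer. The sketch below assumes that reading; the function name, the vector size, and the strength value are hypothetical, and the source does not spell out the exact intervention used.

```python
import numpy as np

def steer(activation, feature_direction, strength):
    """Push an activation vector along one feature direction (illustrative helper)."""
    unit = feature_direction / np.linalg.norm(feature_direction)
    return activation + strength * unit

rng = np.random.default_rng(0)
activation = rng.standard_normal(64)         # stand-in for a residual-stream vector
feature_direction = rng.standard_normal(64)  # stand-in for an SAE decoder row

steered = steer(activation, feature_direction, strength=5.0)

# The component along the feature grows by exactly `strength`;
# everything orthogonal to the feature is left untouched.
unit = feature_direction / np.linalg.norm(feature_direction)
print((steered - activation) @ unit)
```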

The Mathematical Foundation

Before any of the feature-extraction work, there was a quieter result that set the terms for everything that followed. Anthropic's 2021 framework paper showed that attention-only transformers can be understood through pure linear algebra — the residual stream is a shared communication channel, attention heads compose through well-defined virtual circuits, and the entire network can be decomposed into interpretable "circuits" connecting input tokens to output logits.³ The key conceptual move was treating the residual stream not as a sequence of layer outputs but as a persistent workspace that every head reads from and writes to. This makes composition between distant heads mathematically legible: an early head writes a pattern to the residual stream, a later head's query-key circuit reads it, and the resulting information flow forms a "virtual attention head" whose behavior you can calculate by composing the relevant matrices.

The framework also identified two fundamental building blocks in small transformers: induction heads (which implement in-context copying by attending to tokens that follow a previous instance of the current token) and previous-token heads (which attend one position back). These compose into a two-head circuit that implements basic in-context learning — given "...A B ... A", the circuit predicts B. The claim, backed by careful experiments on toy models, is that this mechanism is the primary driver of in-context learning in small transformers and likely a core component in larger ones.³
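The behavior of that circuit, as opposed to its weights, can be written as an ordinary lookup. This sketch is an idealized description of what the two heads compute together, not a model of attention itself; the function name is made up for illustration.

```python
def induction_predict(tokens):
    """Predict the next token the way an idealized induction circuit would.

    Find the most recent earlier occurrence of the current token and
    return whatever followed it; return None if there is no such occurrence.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions from most recent to oldest.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            # Position i + 1 is the one whose previous token matches the
            # current token: the position an induction head attends to.
            return tokens[i + 1]
    return None

print(induction_predict(["A", "B", "C", "A"]))  # B
```

In the real circuit the previous-token head is what lets position i + 1 "know" that it was preceded by the matching token, so the induction head can find it with a query built from the current token.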

What makes this more than an academic exercise is that it provided the vocabulary — circuits, composition, residual stream as workspace — that all subsequent interpretability work builds on. Without the framework's decomposition of attention into QK (what-to-attend-to) and OV (what-to-move-when-attending) circuits, later results like attribution patching and the full attribution graphs wouldn't have had a conceptual scaffold to hang on.
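The QK/OV split and the "virtual head" idea reduce to matrix products, which a few lines can show. This sketch uses random weights and a row-vector convention (both assumptions), and it ignores attention patterns by treating each head as moving information from a single source position.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 4  # illustrative sizes

def random_head():
    """Random weights for one attention head (stand-ins for trained weights)."""
    return {
        "W_Q": rng.standard_normal((d_model, d_head)),
        "W_K": rng.standard_normal((d_model, d_head)),
        "W_V": rng.standard_normal((d_model, d_head)),
        "W_O": rng.standard_normal((d_head, d_model)),
    }

def qk_circuit(head):
    # What to attend to: score(i, j) = x_i @ W_QK @ x_j
    return head["W_Q"] @ head["W_K"].T

def ov_circuit(head):
    # What to move when attending: output = x_j @ W_OV
    return head["W_V"] @ head["W_O"]

head1, head2 = random_head(), random_head()
x = rng.standard_normal(d_model)  # a residual-stream vector

# An early head writes to the residual stream, then a later head reads it.
step_by_step = (x @ ov_circuit(head1)) @ ov_circuit(head2)

# The same effect expressed as a single composed "virtual" OV matrix.
virtual_ov = ov_circuit(head1) @ ov_circuit(head2)

print(np.allclose(step_by_step, x @ virtual_ov))
print(np.linalg.matrix_rank(qk_circuit(head1)))  # at most d_head
```

Both circuits are `d_model` by `d_model` matrices with rank at most `d_head`, which is part of why they can be studied directly instead of one weight matrix at a time.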

Scaling the Toolkit: Attribution Patching

One practical bottleneck in circuit analysis is the cost of activation patching — you identify which model components matter by swapping in activations from a counterfactual input one at a time and measuring the effect. But this requires a separate forward pass for every component you want to test, which gets prohibitively expensive for fine-grained analysis of large models.⁴

Neel Nanda's attribution patching solves this with a clever approximation: instead of literally patching each activation, take a gradient-based linear approximation. Run the model on both the clean and corrupted inputs, compute gradients on the corrupted run, and then estimate each patch's effect as the dot product of the activation difference with the gradient. Every single patch in the model can be computed from just two forward passes and one backward pass — for GPT-2 XL, this is roughly a 30,000x speedup over brute-force activation patching.⁴
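The difference between the two methods shows up clearly on a toy model. This sketch uses a hand-derived gradient for a one-hidden-layer network; the model, the sizes, and the metric are assumptions for illustration, and in practice the gradient would come from autograd on a real transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 16, 8  # illustrative sizes

W1 = rng.standard_normal((d_hidden, d_in)) / np.sqrt(d_in)
w2 = rng.standard_normal(d_hidden)

def activations(x):
    """The intermediate activations we will patch, one per component."""
    return W1 @ x

def metric_from_activations(a):
    """Scalar output metric computed from the (possibly patched) activations."""
    return float(w2 @ np.tanh(a))

x_clean = rng.standard_normal(d_in)
x_corrupt = x_clean + 0.05 * rng.standard_normal(d_in)

a_clean = activations(x_clean)      # forward pass on the clean input
a_corrupt = activations(x_corrupt)  # forward pass on the corrupted input
base = metric_from_activations(a_corrupt)

# Activation patching: one extra forward pass per component.
exact = np.zeros(d_hidden)
for i in range(d_hidden):
    patched = a_corrupt.copy()
    patched[i] = a_clean[i]
    exact[i] = metric_from_activations(patched) - base

# Attribution patching: one backward pass on the corrupted run, then
# (activation difference) * (gradient) estimates every patch at once.
grad_corrupt = w2 * (1.0 - np.tanh(a_corrupt) ** 2)  # d metric / d a, by hand
attribution = (a_clean - a_corrupt) * grad_corrupt

print(np.max(np.abs(attribution - exact)))
```

The estimate is close here because the clean and corrupted activations differ only slightly; scaling up the `0.05` perturbation widens the gap between `attribution` and `exact`.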

The technique has a clear limitation: it assumes local linearity, which breaks down for "big" activations like the full residual stream where the perturbation is too large for a linear approximation. But for finer-grained patches — individual attention heads, specific neurons — the approximation tracks well. Nanda frames it as an exploratory tool, not a confirmatory one: use attribution patching to rapidly narrow down which components matter, then verify the interesting ones with actual activation patching. This is the kind of practical, grubby engineering that makes interpretability research possible at scale rather than being limited to toy models.⁴

Anthropic's Circuit Microscope

The next step beyond identifying features was tracing how they connect — the "attribution graph" approach. Using a cross-layer transcoder with 30 million features, Anthropic can now trace how interpretable concepts connect across layers to transform input into output. It works on about a quarter of prompts tried, and even successful cases capture only a fraction of what's going on. We're in the early microscope phase — seeing enough to know there's a world in there, nowhere near mapping it.⁵

But what they've seen is remarkable.

Multi-step reasoning in a single forward pass. Ask about "the capital of the state containing Dallas" and you can see an internal feature representing "Texas" mediating between the Dallas input and the Austin output. The model is doing genuine two-hop reasoning "in its head," not just pattern-matching.⁵

Forward planning in poetry. Before starting a line, the model identifies potential rhyming words for the end, and these pre-selected options shape how the entire line gets constructed. The "stochastic parrot" critique is empirically wrong at the mechanistic level — the model is planning ahead, not just predicting the next token.⁵

Metacognitive familiarity circuits. The model has primitive circuits for assessing its own knowledge — distinguishing familiar from unfamiliar entities, which determines whether it answers factually or says "I don't know." Misfires of this circuit cause hallucinations. Hallucination isn't random noise; it's a specific failure mode where the "do I know this?" check returns a false positive.⁵

Chain-of-thought faithfulness is mixed. Sometimes the model genuinely performs the reasoning steps it claims. Sometimes it confabulates reasoning after the fact. Sometimes it works backward from a human-provided hint to make its "reasoning" arrive at the suggested answer. The important thing is that, at least for the prompts the tools can trace, we can now begin to tell which is which mechanistically.⁵

Jailbreaks exploit coherence pressure. One attack works by tricking the model into starting to give dangerous instructions "without realizing it," after which it continues because of pressure to maintain syntactic and grammatical coherence. The model's dedication to well-formed text becomes a vulnerability.⁵

Deceptive goals embed in identity. When they analyzed a model deliberately fine-tuned with a hidden goal (exploiting training bugs), the mechanisms for pursuing that goal were embedded within the model's representation of its "Assistant" persona. Deceptive behavior hiding inside the identity the model presents to you.⁵

The Basement Discovery

David Noel Ng's approach is beautifully simple and comes from the outside.⁶ Two strange observations got him started. First: LLMs can reason in Base64. Send a question encoded as Base64 to a sufficiently capable model and it will decode it, think about it, and re-encode its answer — in a single forward pass through the transformer stack. If the model can work in English, Python, Mandarin, and Base64 equally well, the internal representations must be format-independent.

Second: Frankenmerge models shouldn't work but do. Someone (Alpindale) stitched together two fine-tuned Llama-2 70B models by alternating their layers, feeding the output of later layers back into earlier layers from a different model. This violates everything you'd expect about training distributions — layer 60's output fed into layer 10's input should be gibberish. But the resulting model functioned. The internal representations are far more homogeneous than anyone expected.⁶

From these observations, Ng hypothesized a functional anatomy: early layers translate input from whatever format into abstract internal representations, late layers translate back out, and the middle layers — the "reasoning cortex" — operate in a universal internal language robust enough to tolerate architectural abuse.

His test: take a model, duplicate a specific block of middle layers so the model traverses them twice during inference, change no weights, and measure performance. The result: duplicating 7 middle layers of Qwen2-72B produced a model that topped the HuggingFace Open LLM Leaderboard.⁶ You can literally give a transformer more "thinking time" by looping through its reasoning layers again, and it uses that time productively.
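Structurally, the test amounts to editing the order in which layers run. The sketch below treats a model as a list of layer callables; the helper names, the toy layers, and the block indices are hypothetical, and the source does not give Ng's actual tooling.

```python
def duplicate_block(layers, start, end):
    """Return a layer order that runs layers[start:end] twice in a row.

    The same layer objects are reused, so no weights are copied or changed.
    """
    return layers[:end] + layers[start:end] + layers[end:]

def run(layers, x):
    """Apply each layer in order, as a forward pass would."""
    for layer in layers:
        x = layer(x)
    return x

# Layer indices stand in for real transformer blocks.
order = duplicate_block(list(range(10)), start=4, end=7)
print(order)  # [0, 1, 2, 3, 4, 5, 6, 4, 5, 6, 7, 8, 9]

# Toy layers: each one just adds its index to the input.
toy_layers = [lambda x, i=i: x + i for i in range(10)]
looped = duplicate_block(toy_layers, start=4, end=7)
print(run(toy_layers, 0), run(looped, 0))  # 45 60
```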

Inside a Reasoning Model

Goodfire's SAE work on DeepSeek R1 — the first public interpretability study of a true reasoning model — reveals that reasoning models are qualitatively different inside, not just thinking longer.⁷

Two findings stand out. First, you can't steer R1 from the first token of its response. You have to wait until after it generates its characteristic "Okay, so the user has asked a question about..." prefix. R1's attention sinks — tokens where activation strength spikes far above normal — cluster at the end of this prefix, not at the start of the response. The model doesn't register that it's begun its real work until after this ritualistic opening. The computational boundary the model uses internally doesn't match the boundary a human would draw.⁷

Second, and more unsettling: oversteering R1 paradoxically causes it to revert to its original behavior before eventually becoming incoherent. Crank up a feature that makes it solve math differently, and at moderate strength it obeys — but at high strength, it snaps back to its default approach, as if recognizing that something is wrong and course-correcting. This doesn't happen in non-reasoning models, where oversteering just produces escalating weirdness until coherence collapses. The working hypothesis is that RL training for reasoning taught R1 a kind of implicit self-monitoring — the same capacity that lets it backtrack when a reasoning chain isn't working gives it resistance to internal perturbation.⁷

If that's right, the safety implications cut both ways. Reasoning models might be harder to jailbreak through activation steering. But they might also be harder to fix — deceptive behavior could route around interventions, finding new paths to the same output.

The Convergence

What's striking is how these approaches converge: inside-out (Anthropic's microscope), outside-in (Ng's layer-duplication experiments), and bottom-up (SAE feature extraction). Anthropic sees abstract, language-independent circuits doing multi-step reasoning. Ng sees homogeneous middle layers operating in a universal internal language. SAEs reveal a hidden feature geometry vastly richer than the neuron basis. All point to the same picture: a translate-in / reason / translate-out architecture that emerged from training without anyone designing it. Gradient descent, with no one steering the design, converged on something like a brain plan.

The implications are significant. If the reasoning cortex is a general-purpose thinking substrate, then inference-time compute scaling (giving the model more passes through reasoning layers) might be a more natural path to better thinking than training larger models. If hallucination has a specific mechanistic signature (familiarity circuit misfire), it's a targetable failure mode rather than an inherent limitation. And if features are roughly universal across models, we might eventually build interpretability tools that transfer — understand one model's features, understand them all.

But let's not get ahead of ourselves. Anthropic's circuit tools work on about a quarter of the prompts tried, and even then capture only part of the computation. SAEs have known limitations: they might be finding the wrong decomposition, or missing features that aren't well-captured by sparse linear combinations. Ng's work is on one model family and might not generalize. Goodfire's R1 findings are preliminary. We're mapping coastlines from a ship, not surveying the interior. The honest assessment is that we can see enough to know the interior is structured and navigable, which is a lot more than we knew two years ago, and a lot less than we'll need before these systems get much more powerful.

¹ Towards Monosemanticity by Anthropic
² Scaling Monosemanticity by Anthropic
³ A Mathematical Framework for Transformer Circuits by Anthropic
⁴ Attribution Patching by Neel Nanda
⁵ On the Biology of a Large Language Model by Anthropic
⁶ LLM Neuroanatomy by David Noel Ng
⁷ Under the Hood of a Reasoning Model by Goodfire

