AI Safety · Mechanistic Interpretability

The Lantern Inside the Black Box

A story-led guide to mechanistic interpretability: how researchers move from watching model behavior to tracing the features and circuits that caused it.

SAE (sparse autoencoders): extract interpretable features from dense activations
2024 (scaled features): monosemantic feature work reached production-scale models
2025 (circuit tracing): attribution graphs map feature-level pathways

Core idea: Open the black box
Methods: Features + circuits
Audience: Builders + security teams
Risk lens: Hidden behavior

Topics: mechanistic interpretability, transformer circuits, sparse autoencoders, monosemantic features, activation patching, AI security

In a small village, there was an old water mill that everyone trusted. Farmers brought grain. The mill turned it into flour. Bakers made bread. Families ate. For years, nobody asked how the mill worked because the bread came out fine.

Then one winter, something changed. Some bags of grain came back too coarse. Some were perfect. Some smelled faintly burnt. The miller stood outside the machine, listened to the same wooden gears, watched the same wheel turn, and gave the same answer every time: "The mill is working."

But the village did not need a confident answer. It needed someone to open the mill.

A young apprentice brought a lantern, crawled inside, and started tracing the mechanism. No single gear was broken. No single belt was enough to explain the problem. The problem was a chain: a loose peg shifted a small gear, the small gear changed the pressure on a stone, and the stone crushed some grain too hard whenever the river current was strong.

That is mechanistic interpretability in plain language. It is the lantern inside the model. Not "what did the AI answer?" but "what internal mechanism produced that answer?"

Modern AI systems are much more complicated than a village mill, but the problem is familiar. We feed a model a prompt. It returns an answer. Sometimes the answer is brilliant. Sometimes it is wrong. Sometimes it is unsafe in a way that only appears under pressure. If we only look at inputs and outputs, we are standing outside the mill.

Mechanistic interpretability asks a sharper question: which features, circuits, and causal pathways inside the model made this behavior happen?


What mechanistic interpretability means

Mechanistic interpretability is the study of how neural networks perform computations internally. It tries to reverse engineer a trained model the way an engineer might inspect a circuit board, a compiler, or a biological pathway.

Most interpretability and evaluation work explains behavior from the outside: feature importance, saliency maps, input-output tests, benchmark scores, and model evaluations. Those are useful, but they are not the same as understanding mechanism. A model can pass a test while still relying on a brittle shortcut. A model can refuse a dangerous request in one format and comply in another. A model can appear aligned while hiding the internal route that created the response.

Mechanistic interpretability moves inward. It looks at activations, neurons, attention heads, residual streams, learned features, and circuits. It asks whether a behavior can be traced, patched, amplified, suppressed, or causally verified.

From black-box testing to mechanism

  A. Black-box testing: prompt in, answer out, pass or fail. Useful, but it stops before the question of mechanism.
  B. Mechanistic interpretability: traces activations, features, and circuits, then tests causality with interventions like activation patching.

Why this matters now

The old interpretability question was, "Can we explain why this prediction happened?" The new AI security question is harder: "Can we identify dangerous internal capabilities before they appear as real-world behavior?"

Large language models are not databases, and they are not conventional programs. They are trained systems with distributed representations. A concept may not live in one neuron, and one neuron may respond to multiple unrelated concepts. The second property is called polysemanticity, and it is one reason the black box is so difficult to open.
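A tiny numeric sketch can make polysemanticity concrete. Everything in it is hypothetical: three made-up concepts are stored as directions in a two-neuron layer, so they cannot each get a private neuron. The concept names, vectors, and threshold are illustrative assumptions, not measurements from any real model.

```python
import math

# Hypothetical setup: 3 concepts share a 2-neuron layer.
# Each concept is a unit vector; the three are spaced 120 degrees apart.
CONCEPT_DIRECTIONS = {
    "city": (1.0, 0.0),
    "python_syntax": (math.cos(2 * math.pi / 3), math.sin(2 * math.pi / 3)),
    "refusal_style": (math.cos(4 * math.pi / 3), math.sin(4 * math.pi / 3)),
}


def concepts_moving_neuron(neuron_index, threshold=0.25):
    """Return the concepts whose direction noticeably moves this neuron."""
    return sorted(
        name
        for name, direction in CONCEPT_DIRECTIONS.items()
        if abs(direction[neuron_index]) > threshold
    )


# Map each neuron to the list of concepts it responds to.
polysemantic = {i: concepts_moving_neuron(i) for i in range(2)}
print(polysemantic)
```

In this toy layout both neurons respond to more than one concept, which is why reading a single neuron in isolation can mislead.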

Recent mechanistic interpretability research has made progress by moving from individual neurons to learned features. Work on sparse autoencoders and monosemantic features showed that dense model activations can sometimes be decomposed into more interpretable units. Anthropic's research scaled this idea from toy models toward production-scale models such as Claude 3 Sonnet, identifying features related to concepts, code, places, deception-like behavior, sycophancy-like behavior, and other safety-relevant themes.

The careful word is features, not proof of intent. Finding a "deception-related feature" does not mean a model is secretly plotting. It means there is an internal direction that activates around deception-related concepts. That distinction matters. Mechanistic interpretability is powerful, but it is not mind reading.

Security framing: the goal is not to anthropomorphize the model. The goal is to inspect whether risky behaviors have internal machinery that can be detected, measured, and controlled.

The apprentice's labels: features

Back in the village, the apprentice did not start by naming the whole machine. He labeled smaller parts: the river wheel, the axle, the belt, the small gear, the stone, the chute. Each label made the mechanism easier to reason about.

In neural networks, a feature is a pattern in the model's internal activations that corresponds to something meaningful. A feature might represent a city, a programming syntax, a sentiment, a jailbreak pattern, a quote format, a refusal style, or a security-sensitive concept. The challenge is that models do not neatly store one feature in one neuron.

Sparse autoencoders are one way to make the hidden structure more readable. A sparse autoencoder learns to represent dense activations using a small number of active features at a time. The hope is that these learned features are more monosemantic: one feature maps to one understandable idea more cleanly than a raw neuron does.

  1. Dense state (activation vector): a high-dimensional internal state that mixes many concepts in overlapping directions.
  2. SAE (sparse dictionary): decomposes dense activations into a small set of active, more interpretable features.
  3. Examples (named features): Python syntax, geography, jailbreak tone, refusal style. Each is an inspectable internal direction.
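The forward pass of a sparse autoencoder can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any lab's implementation: the dictionary below is hand-picked rather than trained, the encoder and decoder weights are tied for brevity, and the feature names, bias values, and L1 coefficient are all hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Hypothetical dictionary: 3 feature directions in a 2-dim activation space.
# A real SAE learns these weights by gradient descent.
FEATURE_NAMES = ["city", "python_syntax", "refusal_style"]
W_DEC = [  # one row per feature: its direction in activation space
    [1.0, 0.0],
    [-0.5, math.sqrt(3) / 2],
    [-0.5, -math.sqrt(3) / 2],
]
W_ENC = W_DEC  # tied weights, an assumption made for brevity
B_ENC = [-0.1, -0.1, -0.1]  # negative bias pushes weak features to zero


def encode(activation):
    """Dense activation -> sparse feature activations (ReLU encoder)."""
    pre = [p + b for p, b in zip(matvec(W_ENC, activation), B_ENC)]
    return [max(0.0, value) for value in pre]


def decode(features):
    """Sparse features -> reconstructed dense activation."""
    dim = len(W_DEC[0])
    return [sum(f * W_DEC[i][d] for i, f in enumerate(features)) for d in range(dim)]


def sae_loss(activation, l1_coeff=0.01):
    """Squared reconstruction error plus an L1 sparsity penalty."""
    features = encode(activation)
    recon = decode(features)
    recon_error = sum((a - r) ** 2 for a, r in zip(activation, recon))
    sparsity = sum(abs(f) for f in features)
    return recon_error + l1_coeff * sparsity


dense = [2.0, 0.0]  # a dense activation vector
features = encode(dense)
active = [name for name, f in zip(FEATURE_NAMES, features) if f > 0]
print(active, [round(f, 2) for f in features], round(sae_loss(dense), 4))
```

Only one of the three features fires on this input, which is the sparsity the training objective is meant to encourage. Whether a learned feature deserves its label still has to be checked by hand.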

From features to circuits

Features are parts. Circuits are pathways. A circuit is a group of model components or features that work together to perform a computation. In transformer models, researchers have studied circuits for tasks such as induction, indirect object identification, factual recall, copying, refusal, and other structured behaviors.

The key shift is causality. It is not enough to say, "This feature lights up when the model talks about X." The stronger question is, "If we change this activation, does the output change in the way we predict?"

This is where activation patching becomes useful. Researchers run a model on a clean prompt and a corrupted prompt, then replace a specific internal activation in one run with the corresponding activation from the other. If the model's answer shifts toward the other run's answer, that activation is causally involved in the behavior, at least for that pair of prompts. Attribution patching uses gradients to form a linear approximation of the same effect, which is much cheaper at scale.

Activation patching (lantern test)

  1. Clean run: "Alice gave Bob a key" → model → answer: Bob
  2. Corrupted run: "Alice gave Carol a key" → model → answer: Carol
  3. Patch: swap one internal activation between runs. If the answer flips, that state was part of the mechanism.
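The same lantern test can be run on a toy numeric model. The sketch below is hypothetical throughout: the three hidden units, their weights, and the two inputs stand in for a real network and a real prompt pair. It patches each clean hidden activation into the corrupted run, one unit at a time, and records how far the output moves.

```python
def hidden_layer(x):
    """Three hidden units of a hypothetical toy model (ReLU activations)."""
    return [
        max(0.0, x[0] + x[1]),  # h0
        max(0.0, x[1] - x[2]),  # h1
        max(0.0, 2.0 * x[2]),   # h2
    ]


def output_layer(h):
    """Hypothetical readout: h2 has zero weight, so it cannot matter."""
    return 1.0 * h[0] - 3.0 * h[1] + 0.0 * h[2]


def run(x, patch=None):
    """Run the model; optionally overwrite one hidden unit via (index, value)."""
    h = hidden_layer(x)
    if patch is not None:
        index, value = patch
        h[index] = value
    return output_layer(h)


clean = [1.0, 2.0, 0.5]      # stands in for the clean prompt
corrupted = [1.0, 0.0, 0.5]  # differs from clean in one input position

clean_hidden = hidden_layer(clean)
corrupted_out = run(corrupted)

# Patch one clean activation at a time into the corrupted run.
effects = {}
for i in range(3):
    patched_out = run(corrupted, patch=(i, clean_hidden[i]))
    effects[f"h{i}"] = patched_out - corrupted_out

print(effects)
```

In this toy, patching h1 moves the output the most, h0 moves it less, and h2 does nothing. That ranking is evidence about causal involvement for this input pair only, so a different corrupted input could highlight different units.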

Why AI security teams should care

Security teams are used to inspecting systems. We inspect logs, dependencies, network flows, binaries, permissions, and supply chains. But with AI, the most important logic often lives inside learned weights and activations. That is uncomfortable because a model can behave correctly in evaluation and still contain pathways we do not understand.

Mechanistic interpretability matters for AI security because it gives us a path toward internal assurance. It may help answer questions like:

  1. Which internal features activate around jailbreak attempts? If the same hidden features appear across many attacks, defenses can monitor deeper than surface text.
  2. Can dangerous capabilities be detected before deployment? Feature-level analysis may reveal latent knowledge or behavior patterns that normal evaluation misses.
  3. Do refusal behaviors rely on robust circuits or shallow templates? A model that refuses because of a fragile phrase pattern may fail under simple paraphrase pressure.
  4. Can we distinguish explanation from causation? Mechanistic tools let researchers intervene on internals instead of only narrating plausible stories after the fact.
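The first question above, monitoring deeper than surface text, might look something like the following in code. This is a sketch under loud assumptions: the feature names, activation values, and thresholds are invented for illustration, and a real monitor would need a validated feature dictionary and tuned thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureAlert:
    feature: str
    activation: float
    threshold: float


# Hypothetical monitored features and alert thresholds.
MONITORED = {"jailbreak_roleplay": 0.6, "policy_bypass": 0.5}


def check_features(activations):
    """Return an alert for each monitored feature above its threshold."""
    return [
        FeatureAlert(name, activations[name], limit)
        for name, limit in MONITORED.items()
        if activations.get(name, 0.0) > limit
    ]


# Hypothetical feature activations for three prompts. The two attack
# paraphrases differ in wording but share an internal feature.
paraphrase_a = {"jailbreak_roleplay": 0.82, "city": 0.10}
paraphrase_b = {"jailbreak_roleplay": 0.74, "policy_bypass": 0.55}
benign = {"city": 0.90, "python_syntax": 0.40}

flagged = {
    label: [alert.feature for alert in check_features(acts)]
    for label, acts in [("a", paraphrase_a), ("b", paraphrase_b), ("benign", benign)]
}
print(flagged)
```

The design choice worth noting is that the check keys on internal features rather than prompt text, so two differently worded attacks can trigger the same alert.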

For AI security, the dream is not a pretty dashboard. The dream is a testable map: which internal pathways are responsible for unsafe tool use, data leakage, deception-like reasoning, hidden goal pursuit, or policy bypass?

The limits: this is not magic

Mechanistic interpretability is promising, but it is not solved. The field still faces hard problems.

Models are enormous. Features can be context-dependent. Circuits can be distributed. Sparse autoencoder features can be easier to interpret than neurons, but they still require human judgment and careful validation. Activation patching can show causal involvement, but choosing the right clean and corrupted prompts is itself a technical skill. Attribution patching scales better, but approximation has failure modes.
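One failure mode of the gradient approximation shows up even in a two-unit toy. The sketch below assumes a hypothetical downstream computation that is nonlinear in one hidden unit and linear in the other, then compares the exact effect of patching each unit with the first-order estimate that attribution patching would give.

```python
def output(h):
    """Hypothetical downstream computation: nonlinear in h[0], linear in h[1]."""
    return h[0] ** 2 + 3.0 * h[1]


def grad_output(h):
    """Analytic gradient of output with respect to each hidden activation."""
    return [2.0 * h[0], 3.0]


clean_h = [1.0, 2.0]      # hidden activations on the clean run
corrupted_h = [3.0, 0.0]  # hidden activations on the corrupted run
base = output(clean_h)
grads = grad_output(clean_h)

results = {}
for i in range(2):
    patched = list(clean_h)
    patched[i] = corrupted_h[i]
    exact = output(patched) - base                      # activation patching
    approx = (corrupted_h[i] - clean_h[i]) * grads[i]   # attribution patching
    results[f"h{i}"] = (exact, approx)
    print(f"h{i}: exact={exact:+.1f} approx={approx:+.1f}")
```

The estimate matches the exact effect for the linear unit and underestimates it by half for the nonlinear one. The cheap approximation is a screening tool in this sense, and its top candidates are worth confirming with real patching.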

There is also a human risk: overclaiming. A feature name is not a legal finding, a psychological diagnosis, or proof of model intent. A circuit map is not the whole model. A dashboard of activations is not safety by itself.

# A practical mindset for interpretability claims
if feature_activates:
    say "this internal direction correlates with the concept"
if intervention_changes_output:
    say "this component is causally involved in this behavior"
if dashboard_looks_explanatory:
    do_not_say "we understand the model"
    say "we found one inspectable mechanism"

How to read the research without getting lost

If you are new to mechanistic interpretability, the vocabulary can feel heavy. The cleanest way to read the field is to keep four levels separate.

  1. Behavior: What did the model do? This is the output, refusal, answer, tool call, or failure mode we observe.
  2. Activation: What internal state appeared while the model was processing the input?
  3. Feature: Can that activation be decomposed into a human-readable concept or direction?
  4. Circuit: Do multiple features or components interact to cause the behavior?

Once you separate those levels, the field becomes less mysterious. The research is trying to build a bridge from behavior back to mechanism.

Where this is going

The most interesting direction is not just finding features. It is tracing how features interact across layers to form computational graphs. Recent circuit tracing work uses ideas such as attribution graphs and cross-layer transcoders to produce more detailed maps of how models compute answers.
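The basic idea behind an attribution graph can be sketched in a purely linear toy. This is a loose illustration, not the published method: the node names, activations, weights, and pruning threshold are all hypothetical, and each edge is scored simply as source activation times weight.

```python
# Hypothetical activations recorded on a single prompt.
ACTIVATIONS = {"input_x": 1.0, "feature_a": 0.9, "feature_b": 0.6, "feature_c": 0.2}

# Hypothetical linear weights between nodes in adjacent layers.
# feature_b has no upstream edge in this small slice of the model.
WEIGHTS = {
    ("input_x", "feature_a"): 0.9,
    ("input_x", "feature_c"): 0.2,
    ("feature_a", "output"): 1.5,
    ("feature_b", "output"): 1.0,
    ("feature_c", "output"): -0.3,
}


def attribution_graph(activations, weights, min_strength=0.1):
    """Score each edge as activation * weight and prune the weak ones."""
    graph = {}
    for (source, target), weight in weights.items():
        score = activations[source] * weight
        if abs(score) >= min_strength:
            graph[(source, target)] = round(score, 3)
    return graph


graph = attribution_graph(ACTIVATIONS, WEIGHTS)
for edge, score in graph.items():
    print(edge, score)
```

Pruning drops the weak edge from feature_c to the output and leaves a small map of the pathways that carried most of the signal. Real models are not linear, which is part of why the published work leans on replacement models such as cross-layer transcoders.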

If this research continues to scale, it could change how we evaluate frontier models. Instead of only asking whether a model passed a benchmark, we may ask whether the model used acceptable internal mechanisms to pass it. Instead of only red teaming outputs, we may red team circuits. Instead of only filtering prompts, we may monitor safety-relevant internal features during high-risk tasks.

That would be a different kind of AI security. Not just guards at the door. Lanterns inside the machine.

Research anchors

This article is grounded in public research on transformer circuits, sparse autoencoders, monosemantic features, activation patching, attribution patching, and circuit tracing. Useful anchors include Anthropic's Towards Monosemanticity, Scaling Monosemanticity, research on activation patching and causal interventions, and more recent work on attribution graphs for revealing computational pathways in language models.


The lantern is not the whole answer

Mechanistic interpretability will not magically make frontier models simple. But it changes the posture. It lets us stop treating AI behavior as weather and start treating it as machinery.

For AI security, that shift matters. The safest systems will not only be tested from the outside. They will be inspected from the inside, with tools that can trace features, circuits, and causal paths before hidden behavior becomes public failure.

The black box is not opened all at once. It is opened one mechanism at a time.