We built these systems. We trained them. We know every line of math. And yet we have no idea how they actually work.

Here's a strange fact: we understand the math of a neural network perfectly. Each neuron does simple arithmetic—multiply some numbers, add them up, pass through a function. We could write out the exact computation on a whiteboard. But ask why those computations produce coherent text, reasoning, or creativity? Silence.

This is the central paradox of modern AI. We've created machines that can write poetry, debug code, and hold conversations, but we can't explain how. The black box isn't locked—we can see inside anytime. The problem is that what we see doesn't make sense.

A new field called mechanistic interpretability is trying to change that. And its approach is weird: instead of treating neural networks like programs to be debugged, researchers are treating them like brains to be mapped.

The Polysemanticity Problem

Let's say you want to understand what a specific neuron does. Natural approach: show the model different inputs and see what activates this neuron. Maybe you'll find a "dog neuron" that lights up for pictures of dogs, or a "Python neuron" for code.

But here's what actually happens. Researchers at Anthropic looked at individual neurons in language models and found chaos. A single neuron might activate for:

  • Academic citations
  • HTTP requests
  • Korean text
  • Dialogue in English

What does this neuron "mean"? Nothing coherent. It's polysemantic—it responds to completely unrelated concepts depending on context. In vision models, researchers found neurons that respond to both cat faces and car fronts. Not exactly the clean "concept detectors" we were hoping for.

This isn't a bug. It's superposition—a fundamental property of how neural networks work.

Superposition: Why Neurons Are Confusing

Neural networks need to represent more concepts than they have neurons. A model with 512 neurons might need to represent tens of thousands of different ideas—DNA sequences, legal language, HTTP requests, Hebrew text, mathematical notation, emotional tones...

The solution? Smear representations across many neurons. Instead of one neuron per concept, each concept becomes a pattern across dozens of neurons. And each neuron participates in dozens of different patterns.

Think of it like this: imagine you have 512 people in a room, and you need to signal 10,000 different messages. You can't assign one person per message. But if you use combinations—person 3 standing plus person 47 sitting plus person 201 waving—you can encode way more information.

This works great for the model. It's terrible for humans trying to understand it.
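The room analogy can be sketched numerically. Random directions in a 512-dimensional activation space are nearly orthogonal to one another, so far more concept directions than neurons can coexist with only small interference. A minimal illustration (all numbers are toy values, not taken from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_neurons, n_concepts = 512, 10_000

# Give each concept a random direction in neuron-activation space.
# High-dimensional random vectors are nearly orthogonal, so far more
# concepts than neurons can coexist with only small interference.
directions = rng.normal(size=(n_concepts, d_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Interference between two concepts = dot product of their directions.
sample = directions[:100] @ directions[:100].T
off_diag = sample[~np.eye(100, dtype=bool)]
print(f"mean |interference| between concepts: {np.abs(off_diag).mean():.3f}")
```

The mean overlap comes out small (on the order of 0.03 here), which is exactly why superposition is cheap for the model: concepts barely collide. The price is that every neuron carries a sliver of thousands of concepts at once.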

Features: Finding the Real Building Blocks

In 2023, Anthropic made a breakthrough. Instead of looking at individual neurons, they used a technique called sparse autoencoders to decompose neuron activations into something more interpretable.

Here's the idea: even though a single neuron activates for many unrelated things, there are underlying patterns in how neurons activate together. These patterns—called features—turn out to be much more coherent.

They took a layer with 512 neurons and decomposed it into over 4,000 features. And these features actually made sense:

  • A feature that activates for DNA sequences
  • A feature for legal language
  • A feature for HTTP requests
  • A feature for Hebrew text
  • A feature for nutrition statements

Finally—units of meaning we can understand.
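The decomposition itself can be sketched as a tiny sparse autoencoder: a ReLU encoder maps 512 neuron activations up to 4,096 candidate features, and a linear decoder reconstructs the original activations. The weights below are random placeholders; a real SAE learns them by minimizing reconstruction error plus an L1 penalty that pushes most feature activations to exactly zero:

```python
import numpy as np

rng = np.random.default_rng(0)
d_neurons, n_features = 512, 4096   # ~8x overcomplete dictionary

# Placeholder weights; a trained SAE learns these by gradient descent.
W_enc = rng.normal(scale=0.02, size=(d_neurons, n_features))
W_dec = rng.normal(scale=0.02, size=(n_features, d_neurons))
b_enc = np.zeros(n_features)

def sae(activations):
    """Decompose one neuron-activation vector into feature activations."""
    features = np.maximum(activations @ W_enc + b_enc, 0.0)  # ReLU encoder
    reconstruction = features @ W_dec                        # linear decoder
    return features, reconstruction

x = rng.normal(size=d_neurons)     # one vector of raw neuron activations
f, x_hat = sae(x)

# Training objective: reconstruct x while keeping f sparse.
loss = np.mean((x - x_hat) ** 2) + 1e-3 * np.abs(f).sum()
print(f.shape, x_hat.shape)
```

The key design choice is the overcomplete dictionary: with more features than neurons plus a sparsity penalty, each input activates only a handful of features, and those features tend to be the interpretable units the neurons themselves obscure.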

"We've got on the order of 17 million different concepts [in a frontier LLM], and they don't come out labeled for our understanding."
— Josh Batson, Anthropic research scientist

Circuits: How Features Connect

Finding features is just step one. The real goal is understanding circuits—how features connect and interact to produce behavior.

Think of a circuit as a pathway through the network. When you ask "What's the capital of France?", certain features activate (the "capital" concept, the "France" concept, maybe a "geography" context). These features connect to other features, which eventually connect to output features that produce "Paris".

Researchers are now building attribution graphs that trace these pathways. For any specific prompt, they can show you:

  • Which features activated
  • How strongly each feature influenced others
  • Which features contributed most to the final output

It's like having an MRI of the model's "thought process" for a specific input.
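An attribution graph can be pictured as a small weighted graph, where an input feature's contribution to the output is summed over every path connecting them. The feature names and edge weights below are invented for illustration; real graphs are extracted from the model and have thousands of nodes:

```python
# Toy attribution graph for "What's the capital of France?".
# Feature names and weights are hypothetical, not real model data.
edges = {
    ("France", "capital-of-France"): 0.8,
    ("capital", "capital-of-France"): 0.6,
    ("geography", "capital-of-France"): 0.2,
    ("capital-of-France", "say-Paris"): 0.9,
}

def attribution(source, target, edges):
    """Sum over all paths from source to target of the product of edge weights."""
    total = 0.0
    for (a, b), weight in edges.items():
        if a != source:
            continue
        if b == target:
            total += weight
        else:
            total += weight * attribution(b, target, edges)
    return total

for feat in ("France", "capital", "geography"):
    print(feat, "->", round(attribution(feat, "say-Paris", edges), 2))
```

Ranking features by these path sums is what lets researchers say which concepts mattered most for a given answer: here "France" (0.8 × 0.9 = 0.72) outweighs the background "geography" context (0.2 × 0.9 = 0.18).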

The AI Microscope: What We're Finding

With these tools, researchers are discovering remarkable things inside neural networks:

Deception circuits. Features that activate when the model is about to say something false or misleading. Not because it was trained to lie—these emerge naturally from the training process.

Refusal circuits. Specific features that, when activated, cause the model to refuse requests. You can actually artificially activate these features and watch the model become more likely to refuse.

Reasoning pathways. For math problems, researchers can trace how the model breaks down the problem, applies operations, and constructs the answer—not through explicit programming, but through learned circuits.

Sycophancy features. Features that make the model more likely to agree with the user, even when the user is wrong.

Steering: Controlling AI From Inside

Here's where it gets practical. Once you've identified a feature, you can artificially activate or suppress it. This is called feature steering.

Want the model to mention the Golden Gate Bridge more? Activate the "Golden Gate Bridge" feature. Want to reduce sycophancy? Suppress the sycophancy feature. You're not changing the model's weights—just adjusting which features are active.
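In activation space, steering reduces to vector addition: take a feature's direction and add a scaled copy of it to the model's activations at one layer, leaving the weights untouched. A minimal sketch with made-up vectors (a real setup would use a direction learned by a sparse autoencoder):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
activations = rng.normal(size=d)        # model activations at one layer

# Hypothetical unit direction for a single learned feature.
feature_direction = rng.normal(size=d)
feature_direction /= np.linalg.norm(feature_direction)

def steer(acts, direction, strength):
    """Add (or, with negative strength, subtract) a feature direction."""
    return acts + strength * direction

boosted = steer(activations, feature_direction, +8.0)     # amplify the feature
suppressed = steer(activations, feature_direction, -8.0)  # dampen it

# The weights never change; only this activation vector is edited.
shift = (boosted - activations) @ feature_direction
print(f"feature activation shifted by {shift:.1f}")       # ≈ +8.0
```

The strength parameter matters in practice: small nudges shift the model's tendencies, while extreme values (as in Anthropic's Golden Gate Bridge demo) make the feature dominate almost every response.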

This opens up possibilities for AI safety that weren't possible before. Instead of trying to train away bad behavior (which is hard and often fails), we could monitor and steer models in real-time. If we detect deception features activating, we intervene.

Interactive Concept

Imagine you're debugging a model that keeps giving wrong answers about dates. With interpretability tools, you could:

  1. Feed the problematic prompt to the model
  2. View the attribution graph showing which features activated
  3. Spot that a "confabulation" feature is firing when it shouldn't
  4. Suppress that feature and watch the answer improve

This isn't theoretical—researchers are doing this now.
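Step 4 of that workflow, viewed in feature space, amounts to zeroing one feature's activation before the model continues its computation. A toy sketch (the feature names and values are hypothetical):

```python
import numpy as np

# Hypothetical feature activations for one problematic prompt.
feature_names = ["date-arithmetic", "confabulation", "calendar-facts"]
features = np.array([0.4, 2.1, 0.3])   # "confabulation" is firing hard

def suppress(features, names, target):
    """Zero out one named feature, leaving the rest untouched."""
    patched = features.copy()
    patched[names.index(target)] = 0.0
    return patched

patched = suppress(features, feature_names, "confabulation")
print(patched)
```

In a real intervention, the patched feature vector is decoded back into activations and the forward pass continues from there, so the output reflects the edit.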

The Scale Challenge

So we've solved interpretability, right? Not quite.

Most of these results come from small models—ones with hundreds of millions of parameters. Frontier models have hundreds of billions. The gap is massive.

But here's the encouraging part: Anthropic claims the next primary obstacle is engineering, not science. We know what to do; we just need to make it work at scale. They've set a goal of reliably detecting most model problems through interpretability by 2027.

The features also appear to be somewhat universal—different models learn similar features for similar concepts. Study one model, and you might learn things that apply to many others.

Why This Matters

We're rapidly deploying AI systems into critical domains: healthcare, finance, education, governance. These systems make decisions that affect lives. And we're doing it with machines we don't understand.

Mechanistic interpretability offers a path out of this trap. Not by making AI simpler—we tried that, and it didn't work—but by building better tools to understand complexity.

The field is young. The tools are crude. A 2027 goal for reliable problem detection sounds optimistic. But for the first time, there's a concrete path from "black box" to "understood system."

We're not just building AI anymore. We're learning to read its mind.

Further Reading