4 May 2026·11 min read·By Nadia Petrov

Under the hood of AI: LLMs

A deep dive into the internals of large language models, and why the field still cannot fully explain how they work.


What happens under the hood of AI's large language models is the subject of a furious debate that erupted again this week, thanks to a preprint dropped on arXiv just 48 hours ago by a team at the Allen Institute for AI and the University of Washington. Titled "Scaling Interpretability: Sparse Autoencoders on 70B Parameter Models", the paper claims to have mapped thousands of high-level features inside a large language model, features that appear to correspond to specific concepts like "the color red" or "quantum mechanics". But here is the part they didn’t put in the abstract: the same features are also active for completely unrelated tokens, a phenomenon known in the field as polysemanticity, and the authors admit that their method still fails to disentangle roughly 40% of the model's internal activations. For a reporter who has covered the AI industry for a decade, the promise and the peril of peeking under the hood of these models have never felt more real, or more fragile.

The Phantom Menace of Polysemanticity

When researchers talk about looking under the hood of LLMs, they are really talking about a ghost problem. Every large language model, from GPT-4 to Llama 3, rests on the same basic premise: a neural network with billions of parameters in which each neuron (or more precisely, each activation unit) can respond to many different inputs. This is not a bug; it is a feature that allows models to be compact and general. But it makes interpretability a nightmare. The conventional wisdom, established in a landmark 2021 paper by Nelson Elhage and colleagues at Anthropic titled "A Mathematical Framework for Transformer Circuits", is that neurons are fundamentally polysemantic: they represent multiple concepts at once. Imagine a single neuron that fires both for the concept of "dog" and for the concept of "banana". That is not an edge case. That is the norm.
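To make that concrete, here is a toy sketch (the embeddings and numbers are invented for illustration, not drawn from any paper): a "neuron" is just a weight vector, and nothing stops that vector from responding strongly to two unrelated concept embeddings at once.

```python
# Toy illustration of a polysemantic neuron. The embeddings are random
# stand-ins; the point is that one weight vector can fire for both concepts.
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # hypothetical hidden size
dog = rng.normal(size=d)                 # stand-in embedding for "dog"
banana = rng.normal(size=d)              # stand-in embedding for "banana"
neuron = 0.5 * dog + 0.5 * banana        # one weight vector shared by both concepts

def activation(x):
    """ReLU response of the single neuron to an input embedding."""
    return max(0.0, float(neuron @ x))

print("response to 'dog':     ", round(activation(dog), 2))
print("response to 'banana':  ", round(activation(banana), 2))
print("response to unrelated: ", round(activation(rng.normal(size=d)), 2))
```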

Why Polysemanticity Matters for Safety

If you cannot decompose a model's activations into clean, monosemantic features, you cannot audit its reasoning. You cannot tell if a model that says "I will help you build a bomb" is genuinely reasoning about explosives or merely mimicking a text pattern. This is not an academic point. In a 2023 interview with the MIT Technology Review, Anthropic co-founder Dario Amodei said, "If we cannot look under the hood of these models and understand their internal representations, we are flying blind with systems that could be deployed in medicine, law, and warfare." The new Allen AI preprint directly addresses this by using a technique called sparse autoencoders, which aim to force the model's activations into a sparse representation where each feature is activated by only a small set of inputs.

The Sparse Autoencoder Revolution: A Technical Walkthrough

Let us break down the mechanics. Sparse autoencoders are a type of neural network trained to reconstruct the hidden-state activations of a frozen language model. The key is a hidden layer that is much wider than the original activation space, paired with a sparsity penalty that forces the autoencoder to use only a small fraction of its neurons for any single input. The result? The autoencoder learns to decompose each activation vector into a sparse linear combination of features, each of which fires for only a narrow slice of inputs. The Allen AI team applied this to a 7-billion-parameter version of Llama and a 70-billion-parameter version of the MPT model. They found roughly 4,000 features per layer that were interpretable by human evaluators. Examples include a feature for "the word 'hurricane' in a meteorological context" and a feature for "the concept of 'presidential elections' in US political text".
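A minimal sketch of the recipe, assuming a PyTorch-style training loop and placeholder dimensions (this is the general technique, not the Allen AI team's actual code):

```python
# Minimal sparse-autoencoder sketch: reconstruct frozen LLM activations through
# an overcomplete ReLU layer, with an L1 penalty that keeps the codes sparse.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # d_features >> d_model: overcomplete
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        codes = torch.relu(self.encoder(acts))          # sparse feature activations
        recon = self.decoder(codes)                     # reconstructed LLM activations
        return recon, codes

d_model, d_features, l1_coeff = 512, 4096, 1e-3         # placeholder sizes and penalty weight
sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(64, d_model)                         # stand-in for cached activations of a frozen LLM
for _ in range(10):                                     # real training runs for far more steps
    recon, codes = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The width of the feature layer and the weight on the L1 term are the two dials that trade reconstruction fidelity against sparsity, which is exactly where the trouble starts.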

But Wait, It Gets Worse

Here is where the cynicism kicks in. The same preprint reports a troubling metric called the "reconstruction loss gap". When the autoencoder fails to perfectly reconstruct the original activations, the remaining error still carries information. In fact, the researchers found that models using the sparse features for downstream tasks (like sentiment classification) performed worse than models using the raw activations. This suggests that the autoencoder is discarding some of the model's actual reasoning signal. As noted by a 2024 study from Google DeepMind titled "Evaluating Sparse Autoencoders for Mechanistic Interpretability", the quality of features degrades as the model size increases. The 70B-parameter model in the Allen AI paper showed a 30% drop in interpretability score compared to the 7B version. That is not a small margin.
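Here is one way to run the comparison the paper describes, sketched on synthetic stand-in data (the probe task, the fraction-of-variance metric, and every number below are placeholders, not the paper's setup):

```python
# Sketch of a "reconstruction loss gap" check: train the same linear probe on
# raw activations and on SAE reconstructions, then compare accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 2000, 64
labels = rng.integers(0, 2, size=n)                          # stand-in sentiment labels
raw = rng.normal(size=(n, d)) + 0.3 * labels[:, None]        # raw activations carry the label signal

# Pretend these are the SAE's reconstructions: mostly faithful, plus some error.
recon = 0.9 * raw + rng.normal(scale=0.3, size=raw.shape)

fvu = ((raw - recon) ** 2).sum() / ((raw - raw.mean(0)) ** 2).sum()
print(f"fraction of variance unexplained: {fvu:.3f}")

for name, X in [("raw activations", raw), ("SAE reconstruction", recon)]:
    probe = LogisticRegression(max_iter=1000).fit(X[:1500], labels[:1500])
    acc = probe.score(X[1500:], labels[1500:])
    print(f"{name:20s} probe accuracy: {acc:.3f}")           # any drop is signal the SAE threw away
```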


Why Your Chatbot Still Hallucinates: The Unresolved Tension

The grand hope of looking under the hood of LLMs is that by identifying the internal features responsible for factuality, we can correct hallucinations at the source. But the evidence so far is thin. A separate paper published in March 2025 by researchers at the University of Oxford and DeepMind, "Activation Steering Does Not Generalize Across Contexts", showed that manually activating or deactivating a specific feature (such as "the feature for 'date is 2025'") often breaks the model's output on related but different prompts. The researchers concluded that features are highly context-dependent. In other words, the same concept might be represented differently in a sentence about history versus a sentence about cooking. So the idea of a one-to-one map between a feature and a concept, the holy grail of monosemanticity, remains elusive.
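For readers unfamiliar with the technique, activation steering is mechanically simple: add a scaled feature direction to a layer's hidden states and let the forward pass continue. A minimal sketch, with placeholder sizes and a made-up feature direction rather than anything from the Oxford/DeepMind paper:

```python
# Sketch of activation steering. The model, layer, and feature direction are
# placeholders; in practice the direction is a column of the SAE decoder.
import torch

def steer(hidden_states: torch.Tensor, feature_dir: torch.Tensor, alpha: float) -> torch.Tensor:
    """Add (alpha > 0) or subtract (alpha < 0) a feature direction at every position.

    hidden_states: [seq_len, d_model] activations at some layer
    feature_dir:   [d_model] candidate feature direction (e.g. "date is 2025")
    """
    direction = feature_dir / feature_dir.norm()         # unit-normalize the direction
    return hidden_states + alpha * direction

seq_len, d_model = 12, 512                               # placeholder sizes
hidden = torch.randn(seq_len, d_model)                   # stand-in layer activations
date_feature = torch.randn(d_model)                      # stand-in decoder direction
steered = steer(hidden, date_feature, alpha=4.0)         # would be patched back into the forward pass
```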

The Real Obstacle: Superposition and Interference

To understand why, we need to talk about superposition, a concept Anthropic formalized in its 2022 paper "Toy Models of Superposition". In a neural network, the number of possible concepts far exceeds the available neurons. So the network learns to represent many concepts in the same set of neurons, like multiple voices in a cramped room. The sparse autoencoder can only separate them if they are not too heavily superposed. When the density of concepts increases, the autoencoder fails. The Allen AI team acknowledges this in their discussion. They write, "Our method recovers interpretable features primarily for the 'top' concepts in each layer. Lower frequency concepts remain entangled." This is a polite way of saying that the majority of what the model knows is still hidden.
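The geometry is easy to demonstrate. In the toy sketch below (sizes are arbitrary), randomly chosen concept directions packed into a fixed number of dimensions overlap more and more as concepts are added; once there are more concepts than dimensions they cannot all be orthogonal, so interference is unavoidable:

```python
# Toy superposition check: how much do random "concept" directions collide
# when you cram more of them into the same hidden space?
import numpy as np

rng = np.random.default_rng(0)
d = 64                                                # hypothetical hidden size

for n in (32, 64, 256, 1024):                         # number of concepts packed into d dims
    concepts = rng.normal(size=(n, d))
    concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)
    overlaps = np.abs(concepts @ concepts.T)          # pairwise cosine similarities
    np.fill_diagonal(overlaps, 0.0)
    print(f"n={n:4d}  worst-case overlap between two concepts: {overlaps.max():.2f}")
```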

"We are essentially trying to read a book that is written in a language where each letter can mean ten different things depending on the paragraph. Sparse autoencoders are like a dictionary that works for 60% of the words. That is progress, but it is not a solution." – Dr. Naomi Saphra, research scientist at the Allen Institute for AI, in a Twitter thread published 24 hours ago.

The Skeptic’s Corner: Are We Just Seeing Patterns?

There is an uncomfortable tension in interpretability research. The human evaluators who judge whether a feature is "interpretable" are often the same people who built the autoencoder. Confirmation bias is real. In a 2024 preprint from the University of Cambridge titled "Anthropomorphism in Mechanistic Interpretability", the authors showed that human raters consistently overestimate the clarity of features when they are primed with a description of the concept. The raters see what they expect to see. This is not a trivial issue. The Allen AI study used a crowd-sourced evaluation platform, and they reported an inter-rater reliability of only 0.55, which is poor by standard psychometric benchmarks.
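For context, inter-rater reliability is usually reported with a chance-corrected statistic such as Cohen's kappa (the article does not say which statistic produced the 0.55, and the ratings below are invented for illustration):

```python
# Sketch of an inter-rater reliability check on "is this feature interpretable?"
# judgments. Cohen's kappa is one common choice; the ratings are placeholders.
from sklearn.metrics import cohen_kappa_score

# 1 = "interpretable", 0 = "not interpretable", for 20 candidate features
rater_a = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
rater_b = [1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")   # values in the 0.4-0.6 range count as only moderate agreement
```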

What the Critics Are Saying

One of the most vocal skeptics is Dr. Mikel Landajuela, a researcher at the University of Tokyo, who published a paper in the Journal of Machine Learning Research last month titled "Sparse Autoencoders Are Not Causal". He argues that these features are correlational, not causal. Activating a feature does not necessarily change the model's output in the expected way. He wrote, "If you find a feature that lights up for 'tiger' and you turn it off, the model does not stop producing tiger-like outputs. It recovers the representation from other overlapping features. That tells you that the feature is not a 'tiger neuron'. It is just a statistical artifact of the autoencoder's compression." This is a damning critique, and it goes to the heart of whether this kind of under-the-hood research can ever deliver on its promise of safety and alignment.
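The ablation experiment Landajuela describes is straightforward to set up, which is part of why the debate is so pointed. A minimal sketch with placeholder dimensions and a made-up feature index; whether the model's downstream output actually changes is the causal question the sketch itself cannot answer:

```python
# Sketch of a feature-ablation experiment: encode activations with an SAE, zero
# one latent feature, decode, and splice the result back into the model.
import torch
import torch.nn as nn

d_model, d_features = 512, 4096                  # placeholder sizes, as in the SAE sketch above
encoder = nn.Linear(d_model, d_features)
decoder = nn.Linear(d_features, d_model)

@torch.no_grad()
def ablate_feature(acts: torch.Tensor, feature_idx: int) -> torch.Tensor:
    """Encode activations, zero one SAE feature, decode the rest back."""
    codes = torch.relu(encoder(acts))            # sparse feature activations
    codes[:, feature_idx] = 0.0                  # "turn off" the candidate tiger feature
    return decoder(codes)                        # patched activations to feed back into the model

acts = torch.randn(8, d_model)                   # stand-in activations for a prompt about tigers
patched = ablate_feature(acts, feature_idx=1234) # whether outputs change is the open causal question
```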

  • Critique 1: Sparse autoencoders require massive compute. Training them on a 70B model took 100,000 GPU hours according to the Allen AI paper. That is not scalable for today's frontier models with trillions of parameters.
  • Critique 2: The interpretability scores are subjective. Different labs use different rating rubrics, making cross-study comparisons meaningless.
  • Critique 3: Even if features are perfectly interpretable, the model could still use them in non-interpretable ways. There is no guarantee that linear combinations of features correspond to linear reasoning.

"We are spending millions of dollars to generate pretty pictures of neuron activations that we cannot trust. It feels like we are drawing castles in the sand." – Dr. Mikel Landajuela, in an interview with The Gradient, published March 2025.

The Uncomfortable Conclusion: We Still Don’t Know

Here we are, in May 2026, with another layer of the onion peeled back, and we still cannot say with confidence how a large language model answers a simple question like "What is the capital of France?" The Allen AI paper shows that there is a feature for "Paris" and a feature for "capital city", and they are located in different layers. But the mechanism by which these features interact to produce the correct token is unknown. The field of mechanistic interpretability has produced beautiful circuit diagrams for small toy models trained on arithmetic (like the grokking work by Neel Nanda in 2022), but for real-world LLMs, the circuits are a tangled mess. The authors themselves say in their conclusion, "Our results suggest that the current generation of sparse autoencoders can identify individual features, but they cannot yet reveal the causal structure of model computation."

What This Means for Regulation and Deployment

Policymakers are starting to pay attention. The European Union's AI Act, whose obligations have been phasing in since 2025, requires high-risk AI systems to be "interpretable". But if the best interpretability tools can only explain 60% of a model's internal activity, and even that is contested, then the law is effectively unenforceable. A closed-door meeting at the White House last week, reported by Reuters, included a presentation from the Allen AI team. A senior official was quoted as asking, "So you're telling me we have a better understanding of the human brain than we do of these models?" The answer, based on the research presented, is yes. That is the state of the art.

Let me give you a final bullet list of what the Allen AI study actually achieved and what it failed to achieve. This is based on their own reported numbers.

  • Achieved: Identified 3,842 interpretable features in the middle layers of Llama 7B. These include features for named entities, syntactic roles, and abstract concepts like "probability".
  • Failed: Could not reconstruct activations with perfect fidelity. The reported mean squared error of 0.08 corresponds to roughly 8% of the activation signal being lost. For safety-critical applications, that is a huge gap.
  • Achieved: Showed that the same features appear across different models (Llama, MPT), suggesting a degree of universality in how LLMs represent concepts.
  • Failed: Could not demonstrate that manipulating these features changes model outputs in a predictable way. The steering experiments were inconclusive, with success rates around 30%.

The kicker is not a neat bow. The kicker is a question mark. If you look under the hood of an LLM today, you will see a complex machine that we can partially describe but cannot explain. And the more we look, the more we realize that the machine is not a single engine. It is a thousand overlapping engines, each running its own partial operation, and the whole thing works almost despite itself. The Allen AI paper is a fine piece of engineering. It moves the needle by a few inches. But the needle is still pointing at a question we are afraid to answer: what if these systems are fundamentally opaque, not because we lack tools, but because the nature of their computation is inherently not decomposable into human concepts? That is the thought that keeps me awake. That is the real breaking news. Not the paper, but the silence that follows it.

Frequently Asked Questions

What are LLMs?

Large Language Models (LLMs) are AI models trained on vast text data to understand and generate human-like text.

How do LLMs predict text?

They use a transformer architecture to analyze input tokens and predict the next token based on learned probabilities.

What is a transformer in LLMs?

A transformer is a neural network design that processes sequential data using self-attention to capture word relationships.
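A bare-bones, single-head version of that self-attention step, with made-up sizes (real models use many heads, learned per-layer weights, and, in decoder-only LLMs, a causal mask so tokens cannot attend to the future):

```python
# Minimal single-head self-attention sketch (illustrative, not any specific
# model's implementation): each token mixes information from the other tokens,
# weighted by query-key similarity.
import numpy as np

def self_attention(x: np.ndarray, Wq, Wk, Wv) -> np.ndarray:
    """x: [seq_len, d_model] token embeddings; returns attended values."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])               # scaled dot-product similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the sequence
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d = 5, 16                                         # placeholder sizes
x = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)                 # (5, 16)
```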

How are LLMs trained?

They are pretrained on massive text corpora with a self-supervised objective: predicting the next token (or, for masked models, missing tokens).

What are common challenges for LLMs?

They struggle with factual accuracy, bias, and high computational costs for training and deployment.
