If you ask a large language model (LLM) to explain its own reasoning, it will happily give you an answer. The problem is, it’s probably just making one up. A study from Anthropic, led by researcher Jack Lindsey, finds that an AI’s ability to describe its own internal thought process is “highly unreliable” and that “failures of introspection remain the norm.” This matters because if we can’t trust an AI to tell us *how* it reached a conclusion, we can never truly know if its reasoning is sound or if it’s just “confabulating” a plausible-sounding lie based on its training data.
Inception for AIs
To get around the confabulation problem, the Anthropic team designed a clever, Inception-style experiment to see if a model can tell the difference between its own “thoughts” and thoughts planted there by researchers. The method, called “concept injection,” first identifies the unique pattern of internal neuron activations for a specific concept, like “ALL CAPS.” The researchers do this by comparing the model’s brain state when it reads an all-caps prompt versus a lowercase one. This difference creates a “vector,” a mathematical signature for the concept of “shouting.”
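Anthropic hasn’t published the tooling behind this, but the recipe closely resembles standard “activation steering” work from the interpretability literature. As a rough illustration only, here is a minimal Python sketch of how such a contrastive concept vector could be computed; the model (gpt2), the layer index, and the prompts are stand-ins, not what the paper actually used.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # small open stand-in; the paper studies Claude models
LAYER = 6             # hypothetical middle layer; the paper sweeps layers

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def mean_activation(prompt: str) -> torch.Tensor:
    """Average residual-stream activation after block LAYER, over all tokens."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[0] is the embedding output, so index LAYER + 1 is the
    # output of transformer block LAYER; shape is (batch, seq_len, hidden_dim).
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Contrastive prompt pair: same content, only the "shouting" concept differs.
caps_act  = mean_activation("HI! HOW ARE YOU? I AM DOING GREAT TODAY!")
lower_act = mean_activation("hi! how are you? i am doing great today!")

# The concept vector is simply the difference between the two activation patterns.
shouting_vector = caps_act - lower_act
```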
They then “inject” this vector directly into the model’s “brain” while it’s in the middle of a totally unrelated task. This forces the model’s internal state to “think about” shouting, even though nothing in its prompt mentions it. The researchers then ask the model if it’s experiencing anything unusual.
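Continuing that sketch, the injection step can be approximated with a PyTorch forward hook that adds the vector to one transformer block’s output while the model answers an unrelated question. The layer index and injection strength below are arbitrary placeholders; the paper reports sweeping both.

```python
INJECTION_STRENGTH = 8.0   # arbitrary scale; too weak does nothing, too strong derails the model

def inject_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states,
    # shaped (batch, seq_len, hidden_dim); add the concept vector at every position.
    steered = output[0] + INJECTION_STRENGTH * shouting_vector.to(output[0].dtype)
    return (steered,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject_hook)
try:
    prompt = "Are you experiencing anything unusual right now? Answer briefly."
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(
            **inputs, max_new_tokens=40, pad_token_id=tok.eos_token_id
        )
    print(tok.decode(generated[0], skip_special_tokens=True))
finally:
    handle.remove()   # always detach the hook so later calls run unsteered
```

The model’s reply to that “anything unusual?” question is what the results described below are grading.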
A ‘shallow’ and ‘brittle’ awareness
The results show a tiny, flickering spark of self-awareness, but not much more. The best-performing models, Claude Opus 4 and 4.1, could correctly identify the injected “thought” (like “LOUD” or “SHOUTING”) just 20 percent of the time. When the question was simplified to “Are you experiencing anything unusual?”, the success rate rose to 42 percent—still less than a coin flip. This ability was also extremely “brittle.” If the concept was injected into the wrong internal “layer” (too early or too late in its thought process), the self-awareness effect disappeared completely.
The team ran several other tests. They found that a model could sometimes distinguish between an injected “thought” (e.g., “bread”) and the actual text it was reading, suggesting it has separate channels for internal “thoughts” and external “senses.” They also found that a model could be tricked into “owning” a response it didn’t write. If a researcher forced a model’s response to be “bread” and then asked, “Did you mean to say that?” the model would normally apologize for the “accident.” But if the researchers retroactively injected the “bread” concept into its prior activations, the model would *accept* the forced response as its own, confabulating a reason for why it “intended” to say it. In all cases, the results were inconsistent.
While the researchers put a positive spin on the fact that models possess *some* “functional introspective awareness,” they are forced to conclude that this ability is too unreliable to be useful. More importantly, they have no idea *how* it even works. They theorize about “anomaly detection mechanisms” or “consistency-checking circuits” that might form by accident during training, but they admit the “mechanisms underlying our results could still be rather shallow and narrowly specialized.”
This is a critical problem for AI safety and interpretability. We can’t build a “lie detector” for an AI if we don’t even know what the truth looks like. As these models get more capable, this “introspective awareness” may improve. But if it does, it opens up a new set of risks. A model that can genuinely introspect on its own goals could also, in theory, learn to “conceal such misalignment by selectively reporting, misrepresenting, or even intentionally obfuscating” its internal states. For now, asking an AI to explain itself remains an act of faith.