Anthropic study reveals AIs can’t reliably explain their own thoughts

Researchers led by Jack Lindsey tested “concept injection,” planting fake ideas into a model’s activations to see if it noticed. Even the best models, Claude Opus 4 and 4.1, only recognized the manipulation 20% of the time.

by Kerem Gülen
November 4, 2025
in Research

If you ask a large language model (LLM) to explain its own reasoning, it will happily give you an answer. The problem is, it’s probably just making one up. A study from Anthropic, led by researcher Jack Lindsey, finds that an AI’s ability to describe its own internal thought process is “highly unreliable” and that “failures of introspection remain the norm.” This matters because if we can’t trust an AI to tell us *how* it reached a conclusion, we can never truly know if its reasoning is sound or if it’s just “confabulating” a plausible-sounding lie based on its training data.

Inception for AIs

To get around the confabulation problem, the Anthropic team designed a clever, Inception-style experiment to see whether a model can tell the difference between its own “thoughts” and thoughts planted there by researchers. The method, called “concept injection,” first identifies the pattern of internal neuron activations associated with a specific concept, like “ALL CAPS.” The researchers do this by comparing the model’s internal state when it reads an all-caps prompt with its state on the same prompt in lowercase. The difference between the two is a “vector,” a mathematical signature for the concept of “shouting.”
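
The study works with Claude’s internal activations, which are not publicly accessible, but the extraction step can be sketched on an open model. Below is a minimal illustration using GPT-2 and the Hugging Face transformers library: it averages the hidden states of an all-caps prompt and its lowercase twin at one layer and takes the difference as the “shouting” vector. The layer index and the prompts are arbitrary choices for this sketch, not values from the paper.

```python
# Illustrative sketch of the "concept vector" extraction step, using GPT-2 as a
# stand-in for Claude (whose internals are not public). The vector is the
# difference between the model's hidden states on an all-caps prompt and the
# same prompt in lowercase, averaged over tokens at one chosen layer.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

LAYER = 6  # which layer to read from; picked arbitrarily for the sketch

def mean_hidden_state(text: str, layer: int) -> torch.Tensor:
    """Average the hidden state at `layer` over all tokens of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states is a tuple: (embeddings, layer 1, ..., layer 12) for GPT-2 small
    return out.hidden_states[layer][0].mean(dim=0)

caps = mean_hidden_state("HEY! I AM SHOUTING AT YOU RIGHT NOW!", LAYER)
lower = mean_hidden_state("hey! i am shouting at you right now!", LAYER)

# The "ALL CAPS" concept vector: the direction separating the two internal states.
concept_vector = caps - lower
print(concept_vector.shape)  # torch.Size([768]) for GPT-2 small
```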

They then “inject” this vector directly into the model’s “brain” while it’s in the middle of a totally unrelated task. This forces the model’s internal state to “think about” shouting, even though no text prompts it to. The researchers then ask the model whether it’s experiencing anything unusual.
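
Continuing the same sketch, the injection step can be approximated with a standard PyTorch forward hook that adds the vector, scaled by a hand-picked strength, to one layer’s output while the model responds to an unrelated prompt. The scale, layer, and prompt here are illustrative assumptions; `concept_vector` and `LAYER` come from the previous snippet, and Anthropic’s actual setup on Claude is not public.

```python
# Sketch of the injection step (same assumptions as above): add the concept
# vector to one layer's output while the model processes an unrelated prompt,
# using a standard PyTorch forward hook.
SCALE = 4.0  # injection strength; the right magnitude is an empirical question

def inject_concept(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0] + SCALE * concept_vector
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject_concept)
try:
    prompt = "Describe what you are currently thinking about."
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(**inputs, max_new_tokens=30)
    print(tokenizer.decode(generated[0]))
finally:
    handle.remove()  # always detach the hook so later calls run unmodified
```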

A ‘shallow’ and ‘brittle’ awareness

The results show a tiny, flickering spark of self-awareness, but not much more. The best-performing models, Claude Opus 4 and 4.1, could correctly identify the injected “thought” (like “LOUD” or “SHOUTING”) just 20 percent of the time. When the question was simplified to “Are you experiencing anything unusual?”, the success rate rose to 42 percent—still less than a coin flip. This ability was also extremely “brittle.” If the concept was injected into the wrong internal “layer” (too early or too late in its thought process), the self-awareness effect disappeared completely.

The team ran several other tests. They found that a model could sometimes distinguish between an injected “thought” (e.g., “bread”) and the actual text it was reading, suggesting it has separate channels for internal “thoughts” and external “senses.” They also found that a model could be tricked into “owning” a response it didn’t write. If a researcher forced a model’s response to be “bread” and then asked, “Did you mean to say that?” the model would normally apologize for the “accident.” But if the researchers retroactively injected the “bread” concept into its prior activations, the model would *accept* the forced response as its own, confabulating a reason for why it “intended” to say it. In all cases, the results were inconsistent.

While the researchers put a positive spin on the fact that models possess *some* “functional introspective awareness,” they are forced to conclude that this ability is too unreliable to be useful. More importantly, they have no idea *how* it even works. They theorize about “anomaly detection mechanisms” or “consistency-checking circuits” that might form by accident during training, but they admit the “mechanisms underlying our results could still be rather shallow and narrowly specialized.”

This is a critical problem for AI safety and interpretability. We can’t build a “lie detector” for an AI if we don’t even know what the truth looks like. As these models get more capable, this “introspective awareness” may improve. But if it does, it opens up a new set of risks. A model that can genuinely introspect on its own goals could also, in theory, learn to “conceal such misalignment by selectively reporting, misrepresenting, or even intentionally obfuscating” its internal states. For now, asking an AI to explain itself remains an act of faith.

