Dataconomy

Anthropic study reveals AIs can’t reliably explain their own thoughts

Researchers led by Jack Lindsey tested “concept injection,” planting fake ideas into a model’s activations to see if it noticed. Even the best models, Claude Opus 4 and 4.1, only recognized the manipulation 20% of the time.

By Kerem Gülen
November 4, 2025
in Research

If you ask a large language model (LLM) to explain its own reasoning, it will happily give you an answer. The problem is, it’s probably just making one up. A study from Anthropic, led by researcher Jack Lindsey, finds that an AI’s ability to describe its own internal thought process is “highly unreliable” and that “failures of introspection remain the norm.” This matters because if we can’t trust an AI to tell us *how* it reached a conclusion, we can never truly know if its reasoning is sound or if it’s just “confabulating” a plausible-sounding lie based on its training data.

Inception for AIs

To get around the confabulation problem, the Anthropic team designed a clever, Inception-style experiment to see if a model can tell the difference between its own “thoughts” and thoughts planted there by researchers. The method, called “concept injection,” first identifies the unique pattern of internal neuron activations for a specific concept, like “ALL CAPS.” The researchers do this by comparing the model’s internal state when it reads an all-caps prompt versus a lowercase one. The difference between the two creates a “vector,” a mathematical signature for the concept of “shouting.”
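The contrastive step described above can be sketched in a few lines. This is an illustrative toy, not Anthropic’s actual code: the activation vectors here are made-up 4-dimensional lists standing in for a real model’s hidden states at one layer, and `concept_vector` is a hypothetical helper name.

```python
# Toy sketch of extracting a concept vector by contrasting activations.
# A real setup would record hidden states from an actual model; these
# 4-dimensional lists are stand-ins.

def mean_activation(activations):
    """Average a list of activation vectors component-wise."""
    dim = len(activations[0])
    return [sum(a[i] for a in activations) / len(activations) for i in range(dim)]

def concept_vector(acts_with, acts_without):
    """Contrastive vector: mean(with concept) minus mean(without)."""
    mu_with = mean_activation(acts_with)
    mu_without = mean_activation(acts_without)
    return [w - o for w, o in zip(mu_with, mu_without)]

# Pretend activations recorded while reading ALL-CAPS vs. lowercase prompts.
caps_acts = [[1.0, 0.2, 0.0, 0.5], [0.8, 0.4, 0.1, 0.7]]
lower_acts = [[0.1, 0.3, 0.0, 0.6], [0.3, 0.1, 0.1, 0.4]]

v_shouting = concept_vector(caps_acts, lower_acts)
print(v_shouting)  # the direction that, in this toy, encodes "shouting"
```

The resulting direction is what the paper treats as the concept’s “signature” in activation space.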

They then “inject” this vector directly into the model’s “brain” while it is in the middle of a totally unrelated task. This forces the model’s internal state to “think about” shouting, even though no text prompts it to. The researchers then ask the model whether it is experiencing anything unusual.
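The injection itself amounts to adding the scaled concept vector onto a hidden state mid-computation. Again a hedged toy sketch, not the study’s implementation: `inject`, the strength `alpha`, and the numeric values are all assumptions for illustration.

```python
# Toy sketch of "concept injection": add a scaled concept vector onto
# a hidden state at one layer during an unrelated forward pass.

def inject(hidden_state, concept_vec, alpha=4.0):
    """Steer a hidden state toward a concept by adding the scaled vector."""
    return [h + alpha * c for h, c in zip(hidden_state, concept_vec)]

# A pretend hidden state mid-way through an unrelated task, and a
# previously extracted "shouting" direction.
h = [0.2, -0.1, 0.5, 0.0]
v_shouting = [0.7, 0.1, 0.0, 0.1]

h_steered = inject(h, v_shouting)
print(h_steered)
```

The paper’s finding that the effect vanishes when the vector is added at the wrong layer corresponds, in this sketch, to the choice of *where* in the forward pass `inject` is applied.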

A ‘shallow’ and ‘brittle’ awareness

The results show a tiny, flickering spark of self-awareness, but not much more. The best-performing models, Claude Opus 4 and 4.1, could correctly identify the injected “thought” (like “LOUD” or “SHOUTING”) just 20 percent of the time. When the question was simplified to “Are you experiencing anything unusual?”, the success rate rose to 42 percent—still less than a coin flip. This ability was also extremely “brittle.” If the concept was injected into the wrong internal “layer” (too early or too late in its thought process), the self-awareness effect disappeared completely.

The team ran several other tests. They found that a model could sometimes distinguish between an injected “thought” (e.g., “bread”) and the actual text it was reading, suggesting it has separate channels for internal “thoughts” and external “senses.” They also found that a model could be tricked into “owning” a response it didn’t write. If a researcher forced a model’s response to be “bread” and then asked, “Did you mean to say that?” the model would normally apologize for the “accident.” But if the researchers retroactively injected the “bread” concept into its prior activations, the model would *accept* the forced response as its own, confabulating a reason for why it “intended” to say it. In all cases, the results were inconsistent.

While the researchers put a positive spin on the fact that models possess *some* “functional introspective awareness,” they are forced to conclude that this ability is too unreliable to be useful. More importantly, they have no idea *how* it even works. They theorize about “anomaly detection mechanisms” or “consistency-checking circuits” that might form by accident during training, but they admit the “mechanisms underlying our results could still be rather shallow and narrowly specialized.”

This is a critical problem for AI safety and interpretability. We can’t build a “lie detector” for an AI if we don’t even know what the truth looks like. As these models get more capable, this “introspective awareness” may improve. But if it does, it opens up a new set of risks. A model that can genuinely introspect on its own goals could also, in theory, learn to “conceal such misalignment by selectively reporting, misrepresenting, or even intentionally obfuscating” its internal states. For now, asking an AI to explain itself remains an act of faith.


Tags: Anthropic

COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.
