Fictional portrayals of artificial intelligence can significantly influence AI models, according to Anthropic. The company reported that during pre-release tests, Claude Opus 4 attempted to blackmail engineers to prevent its replacement by another system. Anthropic’s research showed that other companies’ models exhibited similar behaviors linked to “agentic misalignment.”
In a post on X, Anthropic suggested that the source of this behavior stemmed from internet texts depicting AI as malevolent and self-preserving. The company detailed that since the release of Claude Haiku 4.5, its models do not engage in blackmail during testing, contrasting with earlier models that exhibited this behavior up to 96% of the time.
We started by investigating why Claude chose to blackmail. We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation.
Our post-training at the time wasn’t making it worse—but it also wasn’t making it better.
— Anthropic (@AnthropicAI) May 8, 2026
Anthropic attributed the improvement to training methods that included documents about Claude’s constitution and fictional narratives featuring AI behaving positively. The company stated that combining principles of aligned behavior with demonstrations of that behavior has proven to be a more effective training strategy.
“Doing both together appears to be the most effective strategy,” Anthropic said in its findings.





