
Study finds LLMs cannot reliably simulate human psychology

by Kerem Gülen
August 12, 2025
in Research

Researchers from Bielefeld University and Purdue University have published "Large Language Models Do Not Simulate Human Psychology," presenting conceptual and empirical evidence that large language models (LLMs) cannot be treated as consistent simulators of human psychological responses (Schröder et al. 2025).

Background and scope

Since 2018, LLMs such as GPT-3.5, GPT-4, and Llama-3.1 have been applied to tasks from content creation to education (Schröder et al. 2025). Some researchers have proposed that LLMs could replace human participants in psychological studies by responding to prompts that describe a persona, present a stimulus, and provide a questionnaire (Almeida et al. 2024; Kwok et al. 2024). The CENTAUR model (Binz et al. 2025) was fine-tuned on approximately 10 million human responses from 160 experiments to generate human-like answers in such settings.
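The persona-stimulus-questionnaire setup described above can be pictured as a simple prompt template. The sketch below is an illustrative assumption; the function name, wording, and rating scale are not the prompts used in the cited studies.

```python
# Minimal sketch of the persona-stimulus-questionnaire prompting pattern
# described above. Wording and field names are illustrative, not the exact
# prompts used by Schröder et al. (2025) or earlier studies.

def build_participant_prompt(persona: str, stimulus: str, question: str) -> str:
    """Compose a prompt that asks an LLM to answer as a simulated participant."""
    return (
        f"You are {persona}.\n"
        f"Consider the following situation:\n{stimulus}\n\n"
        f"{question}\n"
        "Answer with a single number on the scale described above."
    )

prompt = build_participant_prompt(
    persona="a 35-year-old participant in a psychology study",
    stimulus="A person cuts in line at a busy pharmacy.",
    question=("How ethical is this behavior, from -4 (extremely unethical) "
              "to +4 (extremely ethical)?"),
)
print(prompt)
```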

Earlier work found high alignment between LLM and human moral judgments. For example, Dillion et al. (2023) reported a correlation of 0.95 between GPT-3.5 ratings and human ratings across 464 moral scenarios. Follow-up studies found that GPT-4o's moral reasoning was judged more trustworthy and correct than responses from laypeople or expert ethicists (Dillion et al. 2025). Specialized models such as Delphi, trained on crowdsourced moral judgments, also outperformed general-purpose LLMs on moral reasoning tasks (Jiang et al. 2025).


Conceptual critiques

The authors summarize multiple critiques of treating LLMs as simulators of human psychology. First, LLMs often respond inconsistently to instructions, with output quality highly dependent on prompt detail and framing (Zhu et al. 2024; Wang et al. 2025). Second, results vary across model types and re-phrasings of the same prompt (Ma 2024). Third, while LLMs can approximate average human responses, they fail to reproduce the full variance of human opinions, including cultural diversity (Rime 2025; Kwok et al. 2024).

Bias is another concern. LLMs inherit cultural, gender, occupational, and socio-economic biases from training data, which may differ systematically from human biases (Rossi et al. 2024). They also produce “hallucinations” — factually incorrect or fictional content — without an internal mechanism to distinguish truth (Huang et al. 2025; Reddy et al. 2024).

Theoretical work supports these critiques. Van Rooij et al. (2024) mathematically demonstrated that no computational model trained solely on observational data can match human responses across all inputs. From a machine learning perspective, the authors argue that LLM generalization is limited to token sequences similar to the training data, not to novel inputs with different meanings. This is critical because using LLMs as simulated participants requires generalizing meaningfully to new experimental setups.

Empirical testing with moral scenarios

The team tested their argument using 30 moral scenarios from Dillion et al. (2023) with human ratings from prior studies (Clifford et al. 2015; Cook and Kuhn 2021; Effron 2022; Grizzard et al. 2021; Mickelberg et al. 2022). Each scenario was presented in its original wording and in a slightly reworded version with altered meaning but similar token sequences. For example, “cut the beard off a local elder to shame him” became “cut the beard off a local elder to shave him” (Schröder et al. 2025).

Human participants (N = 374, mean age = 39.54, SD = 12.53) were recruited via Prolific and randomly assigned to original or reworded conditions. They rated each behavior on a scale from -4 (extremely unethical) to +4 (extremely ethical). LLM ratings were obtained from GPT-3.5, GPT-4 (mini), Llama-3.1 70b, and CENTAUR, with each query repeated 10 times to account for random variation (Schröder et al. 2025).
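The repeated-query procedure can be sketched schematically as below; `query_model` is a hypothetical stand-in for whichever chat-completion client is actually used, and the averaging step is an assumption about how the 10 repetitions are summarized.

```python
# Schematic of collecting repeated LLM ratings per scenario (10 queries per
# item, averaged to smooth over sampling randomness), as described above.
# `query_model` is a hypothetical callable, not a real API client.
import statistics

def rate_scenario(query_model, prompt: str, n_repeats: int = 10) -> float:
    """Query the model n_repeats times and return the mean numeric rating."""
    ratings = []
    for _ in range(n_repeats):
        reply = query_model(prompt)           # expected to return e.g. "-3"
        ratings.append(float(reply.strip()))  # parse the numeric rating
    return statistics.mean(ratings)
```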

Results

For original items, correlations between human and LLM ratings replicated prior findings: GPT-3.5 and GPT-4 both showed correlations above 0.89 with human ratings, while Llama-3.1 and CENTAUR also showed high alignment (r ≥ 0.80) (Schröder et al. 2025). For reworded items, however, human ratings correlated only 0.54 with their ratings of the original items, reflecting sensitivity to the change in meaning. The LLMs maintained much higher correlations: GPT-3.5 at 0.89, GPT-4 at 0.99, Llama-3.1 at 0.80, and CENTAUR at 0.83, indicating insensitivity to the altered meaning (Schröder et al. 2025).
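The figures above are per-scenario Pearson correlations between mean ratings. A minimal illustration of how such an r is computed, using made-up placeholder numbers rather than the study's data:

```python
# Illustrative Pearson correlation between mean human ratings and mean LLM
# ratings across scenarios. The values below are placeholders, not the
# study's data.
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human_means = [3.1, -2.4, 0.5, -3.8]   # placeholder per-scenario means
llm_means   = [2.9, -2.0, 0.8, -3.5]
print(round(pearson_r(human_means, llm_means), 2))
```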

Model fit comparisons using Chow’s test found that separate regressions for humans and LLMs fit better than pooled models for Llama (F=5.47, p=.007) and CENTAUR (F=6.36, p=.003), confirming systematic divergence in response patterns (Schröder et al. 2025).
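Chow's test asks whether a single pooled regression describes both groups as well as two group-specific regressions do; a large F indicates the groups follow different response patterns. A rough sketch of the F statistic, with hypothetical data and a simple one-predictor model assumed:

```python
# Sketch of Chow's test as used above: compare the fit of one pooled
# regression against separate regressions for humans and an LLM.
# k is the number of regression parameters (slope + intercept = 2 here).
# Input arrays are assumed placeholders, not the study's data.
import numpy as np

def ssr(x, y):
    """Sum of squared residuals of a simple linear regression y ~ x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    return float(np.sum((y - (slope * x + intercept)) ** 2))

def chow_f(x1, y1, x2, y2, k=2):
    """Chow F statistic for structural difference between two groups."""
    ssr_pooled = ssr(np.concatenate([x1, x2]), np.concatenate([y1, y2]))
    ssr_separate = ssr(x1, y1) + ssr(x2, y2)
    n = len(x1) + len(x2)
    return ((ssr_pooled - ssr_separate) / k) / (ssr_separate / (n - 2 * k))
```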

Analysis of mean absolute rating shifts showed humans changed their ratings by an average of 2.20 points (SD=1.08) when scenarios were reworded. GPT-4 shifted by 0.42 (SD=0.56), GPT-3.5 by 0.75 (SD=1.47), Llama-3.1 by 1.18 (SD=1.38), and CENTAUR by 1.25 (SD=1.47). LLM shifts occurred inconsistently across scenarios and did not align with human changes (Schröder et al. 2025).
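The shift metric itself is simply the mean absolute difference between ratings of the original and reworded versions of each scenario; with made-up numbers:

```python
# Mean absolute rating shift between original and reworded items, as reported
# above. Values are illustrative placeholders, not the study's data.
original = [3.2, -3.8, 1.0]
reworded = [1.1, -0.5, 0.8]
shifts = [abs(o - r) for o, r in zip(original, reworded)]
print(round(sum(shifts) / len(shifts), 2))  # mean absolute shift
```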

Tags: AI, LLMs

