In a fascinating exploration of artificial intelligence’s evolving capabilities, new research reveals that several leading large language models (LLMs) not only outperform humans on standard emotional intelligence tests but can also generate new, psychometrically comparable test items. This study suggests that LLMs like ChatGPT-4 possess a significant degree of accurate knowledge about human emotions, their causes, consequences, and regulation, a capacity often termed “cognitive empathy.”
Emotional intelligence (EI)—the ability to recognize, understand, express, and respond to emotions effectively—is crucial for navigating social interactions and achieving positive outcomes in many areas of life, from personal relationships to workplace success. As AI, particularly LLMs, becomes more integrated into daily life through chatbots, virtual assistants, and other applications, its capacity for understanding and responding to human emotions is of growing importance. A recent study published in Communications Psychology by Katja Schlegel and colleagues from the University of Bern, the Czech Academy of Sciences, and the University of Geneva examines this question directly.
The first part of the research put six prominent LLMs to the test: ChatGPT-4, ChatGPT-o1, Gemini 1.5 Flash, Copilot 365 (Microsoft), Claude 3.5 Haiku (Anthropic), and DeepSeek V3. The models were tasked with solving five established performance-based emotional intelligence tests, including the Situational Test of Emotion Understanding (STEU), the Geneva EMOtion Knowledge Test (GEMOK-Blends), and subtests of the Geneva Emotional Competence Test (GECo). These assess various facets of ability EI, such as understanding the causes and consequences of emotions and knowing appropriate emotion regulation strategies.
The results were striking:
- Overall performance: The LLMs achieved an average accuracy of 81% across the five tests.
- Human comparison: This significantly surpassed the 56% average accuracy reported for human participants in the original validation studies for these tests.
- Consistent outperformance: All tested LLMs performed more than one standard deviation above the human mean, with ChatGPT-o1 and DeepSeek V3 exceeding two standard deviations above the human average.
This superior performance suggests that these AI models can generate responses consistent with a deep and accurate understanding of human emotional dynamics. The study also found a moderate correlation (r = 0.46) between the pattern of correct responses by humans and LLMs across all test items, indicating that AI might be leveraging cues within the test scenarios in a way similar to humans to arrive at correct solutions.
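As a concrete illustration of these two statistics, the sketch below computes a z-score from the reported accuracies and an item-level correlation in Python. The human standard deviation and the per-item accuracy vectors are invented for illustration, since the article does not report them.

```python
import numpy as np

# Reported average accuracies (proportion correct across the five tests)
llm_accuracy = 0.81
human_mean = 0.56
human_sd = 0.12  # hypothetical; the article does not report the human SD

# How many standard deviations above the human mean an LLM scores
z_score = (llm_accuracy - human_mean) / human_sd
print(f"z = {z_score:.2f} SDs above the human mean")

# Item-level agreement: correlate the proportion of humans answering each
# item correctly with the LLMs' per-item accuracy (vectors invented here)
human_item_acc = np.array([0.42, 0.71, 0.55, 0.38, 0.90, 0.61])
llm_item_acc = np.array([0.60, 0.95, 0.80, 0.55, 1.00, 0.75])
r = np.corrcoef(human_item_acc, llm_item_acc)[0, 1]
print(f"item-level correlation r = {r:.2f}")
```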
Can AI author an EI test?
Beyond merely solving existing tests, the researchers explored whether LLMs could also create valid EI test items. In the second phase of the study, ChatGPT-4 was tasked with generating a new set of items (scenarios and response options) for each of the five original EI tests.
To assess the quality of these AI-generated tests, they were administered to human participants (total N = 467 across five separate studies), alongside the original test versions. Participants also completed vocabulary tests and other EI measures to help validate the new tests. Critically, a similarity rating study was conducted to ensure ChatGPT-4 wasn’t simply paraphrasing original items. This study found that 88% of the AI-generated scenarios were not perceived as highly similar to any original test scenario, suggesting genuine generation capabilities.
The key finding from this phase:
- Test difficulty: When administered to human participants, the original and ChatGPT-generated tests were statistically equivalent in difficulty.
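"Statistically equivalent" is a stronger claim than "not significantly different"; it is typically established with an equivalence procedure such as the two one-sided tests (TOST). The sketch below illustrates that logic with simulated scores and an assumed equivalence bound of 0.05; it is not the authors' actual analysis.

```python
import numpy as np
from scipy import stats

def tost_pvalue(x, y, bound):
    """Two one-sided tests (TOST): H0 is |mean(x) - mean(y)| >= bound.

    Equivalence is claimed when the larger of the two one-sided
    p-values falls below alpha (e.g., 0.05).
    """
    diff = x.mean() - y.mean()
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    df = len(x) + len(y) - 2  # simple pooled-df approximation
    p_lower = stats.t.sf((diff + bound) / se, df)   # tests diff > -bound
    p_upper = stats.t.cdf((diff - bound) / se, df)  # tests diff < +bound
    return max(p_lower, p_upper)

rng = np.random.default_rng(0)
original = rng.normal(0.56, 0.12, size=200)   # simulated human scores
generated = rng.normal(0.57, 0.12, size=200)  # simulated scores, new items
print(f"TOST p = {tost_pvalue(original, generated, bound=0.05):.4f}")
```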
Psychometric properties of AI-generated EI tests
While test difficulty was on par, the researchers conducted a detailed comparison of other psychometric properties between the human-designed original tests and the ChatGPT-4-generated versions. The findings revealed some nuances:
- Clarity and realism: Perceived item clarity and realism were not statistically equivalent overall, but the differences were very small; clarity ratings trended slightly higher for the AI-generated items on most tests, as did realism ratings in several cases.
- Content diversity: Original tests were perceived by participants as having slightly more diverse content, as indicated by participants using more categories when sorting the original scenarios in a card-sorting task.
- Internal consistency: Differences in internal consistency (a measure of test reliability) did not run consistently in one direction: some AI-generated versions showed higher consistency and others lower, and overall differences in average item-total correlations were not statistically significant. (A generic sketch of these computations follows this list.)
- Correlations with other measures: In overall meta-analyses, the AI-generated and original tests did not differ significantly in their correlations with vocabulary knowledge or with an external ability EI test, though individual test comparisons varied. Equivalence could not be definitively established, however, leaving open the possibility of slightly weaker associations for the AI-generated tests in some cases.
- Strong inter-correlation: Importantly, scores on the original tests and their ChatGPT-generated counterparts were strongly correlated (average r = 0.46), indicating that they are largely measuring the same underlying emotional intelligence constructs.
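For readers unfamiliar with the reliability terms above, here is a minimal sketch of how internal consistency (Cronbach's alpha) and corrected item-total correlations are computed. The response matrix is simulated; this is generic psychometrics code, not the study's pipeline.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def corrected_item_total(items):
    """Correlation of each item with the total of the remaining items."""
    total = items.sum(axis=1)
    return np.array([
        np.corrcoef(items[:, j], total - items[:, j])[0, 1]
        for j in range(items.shape[1])
    ])

rng = np.random.default_rng(1)
ability = rng.normal(size=300)  # simulated latent EI for 300 respondents
# 0/1 item scores driven by ability plus item-specific noise
responses = (ability[:, None] + rng.normal(0, 1.2, (300, 10)) > 0).astype(float)
print(f"alpha = {cronbach_alpha(responses):.2f}")
print("item-total r:", corrected_item_total(responses).round(2))
```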
The authors concluded that while not every psychometric property was statistically identical, the observed differences between original and AI-generated tests were generally small (|Cohen's d| < 0.25), with no confidence interval boundary exceeding a medium effect size. This suggests that ChatGPT-4 can indeed generate EI test items that are largely comparable in quality to human-created ones.
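To make the effect-size criterion concrete, the sketch below computes Cohen's d with a pooled standard deviation and an approximate large-sample 95% confidence interval. The group samples and sizes are invented for illustration.

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d with a pooled standard deviation, plus an approximate CI."""
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                        / (nx + ny - 2))
    d = (x.mean() - y.mean()) / pooled_sd
    # Large-sample standard error of d (standard approximation)
    se = np.sqrt((nx + ny) / (nx * ny) + d**2 / (2 * (nx + ny)))
    return d, (d - 1.96 * se, d + 1.96 * se)

rng = np.random.default_rng(2)
d, ci = cohens_d(rng.normal(0.56, 0.12, 100), rng.normal(0.57, 0.12, 100))
print(f"d = {d:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")  # check |d| < 0.25
```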
These findings contribute to a growing body of evidence suggesting that LLMs are proficient in tasks traditionally considered uniquely human, including those requiring an understanding of psychological concepts like emotions and empathy, at least in its cognitive form.
The study highlights several key takeaways:
- Cognitive empathy in LLMs: The results strongly suggest that models like ChatGPT-4 possess “cognitive empathy,” meaning their responses are consistent with accurate reasoning about emotions, their causes, consequences, and adaptive regulation. This is a crucial capability for AI intended for socio-emotional applications.
- Potential in applied fields: This proficiency opens doors for using LLMs in emotionally sensitive domains such as healthcare (e.g., as mental health support chatbots or in socially assistive robots), customer service, hospitality, and education. LLMs might offer advantages like a broad knowledge base about emotions and consistent application of this knowledge, unaffected by factors like mood or fatigue that can influence human performance.
- Tools for psychometric development: LLMs could become powerful assistants in developing standardized psychological assessments, particularly in the realm of emotion. They can rapidly generate large initial item pools, even for complex test structures (see the sketch below), though human oversight and rigorous validation studies remain essential to refine and select the best items.
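As a rough illustration of that item-pool idea, the sketch below prompts a chat model for candidate scenarios. It assumes the official openai Python package (v1+) and an API key in the environment; the model name and prompt are placeholders, not the study's actual setup, and any generated items would still need the validation steps described above.

```python
from openai import OpenAI  # assumes the official openai package (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Write a short workplace scenario for an emotion-understanding test, "
    "followed by four response options: one correct answer and three "
    "plausible distractors. Label the correct option."
)

# Generate a small candidate item pool; a real pipeline would add
# similarity checks and human review before any item is used.
items = []
for _ in range(5):
    reply = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name, not the study's setup
        messages=[{"role": "user", "content": PROMPT}],
    )
    items.append(reply.choices[0].message.content)

print(items[0])
```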
The researchers note that while AI might demonstrate cognitive empathy, this doesn’t equate to “affective empathy” – the ability to actually feel with someone. However, for many applications, cognitive empathy might be sufficient to achieve positive outcomes, such as helping users feel heard and understood.
Despite the encouraging results, the authors acknowledge several limitations and areas for future research:
- Standardized tests vs. real-world complexity: The study used structured tests. Real-world emotional interactions are often far more ambiguous and subtle and demand deeper contextual understanding, which may still challenge LLMs.
- Cultural bias: The EI tests and LLM training data are largely Western-centric. Emotional expression and regulation vary significantly across cultures, potentially limiting the global applicability of current LLMs in socio-emotional contexts.
- The “black box” problem: The internal processes by which LLMs arrive at their answers or generate content remain largely opaque. This lack of transparency makes it hard to predict how future model updates might affect their performance on such tasks.
Further research is needed to explore LLM capabilities in more naturalistic and culturally diverse emotional scenarios, and to understand the extent to which they can integrate conversational history and context for more nuanced emotional reasoning.
The study by Schlegel and colleagues provides compelling evidence that current LLMs have achieved a remarkable level of proficiency in tasks related to emotional intelligence. Their ability not only to outperform humans on existing EI tests but also to create new, largely comparable assessments marks a significant milestone. While the debate over whether AI can truly "feel" continues, its capacity to understand and reason about emotions on a cognitive level is becoming increasingly clear. This positions LLMs as potentially valuable tools for supporting socio-emotional outcomes and assisting in psychological research, signaling a growing role for AI in human-computer interaction, and perhaps even a step on the path toward artificial general intelligence.