Oxford study finds AI benchmarks often exaggerate model performance

Nearly half of all examined benchmarks fail to clearly define their testing goals.

By Kerem Gülen
November 12, 2025
in Research

A new study reveals that methodologies for evaluating AI systems often overstate performance and lack scientific rigor, raising questions about many benchmark results.

Researchers at the Oxford Internet Institute, collaborating with over three dozen institutions, examined 445 leading AI tests, known as benchmarks. These benchmarks measure AI model performance across various topic areas.

AI developers use these benchmarks to assess model capabilities and promote technical advancements. Claims about software engineering performance and abstract-reasoning capacity typically cite these evaluations. The paper, released Tuesday, suggests these fundamental tests may be unreliable.

The study found that many top-tier benchmarks fail to define their testing objectives, reuse data and methods from existing benchmarks, and infrequently employ reliable statistical methods for comparing model results.

Adam Mahdi, a senior research fellow at the Oxford Internet Institute and a lead author, stated that these benchmarks can be “alarmingly misleading.” Mahdi told NBC News, “When we ask AI models to perform certain tasks, we often actually measure completely different concepts or constructs than what we aim to measure.” Andrew Bean, another lead author, agreed that “even reputable benchmarks are too often blindly trusted and deserve more scrutiny.”

Bean also told NBC News, “You need to really take it with a grain of salt when you hear things like ‘a model achieves Ph.D. level intelligence.’ We’re not sure that those measurements are being done especially well.”

Some benchmarks analyzed evaluate specific skills, such as Russian or Arabic language abilities. Others measure general capabilities like spatial reasoning and continual learning.

A central concern for the authors was a benchmark’s “construct validity,” which asks whether it actually tests the real-world phenomenon it intends to measure. For instance, rather than gauging Russian proficiency through open-ended questioning, one benchmark reviewed in the study measures a model’s performance on nine different tasks, including answering yes-or-no questions based on Russian-language Wikipedia articles.

Approximately half of the examined benchmarks do not clearly define the concepts they claim to measure. This casts doubt on their ability to provide useful information about the AI models under test.

The study highlights Grade School Math 8K (GSM8K), a common AI benchmark for basic math questions. Leaderboards for GSM8K are often cited to show AI models’ strong mathematical reasoning. The benchmark’s documentation states it is “useful for probing the informal reasoning ability of large language models.”

However, Mahdi argued that correct answers on benchmarks like GSM8K do not necessarily indicate actual mathematical reasoning. He explained, “When you ask a first grader what two plus five equals and they say seven, yes, that’s the correct answer. But can you conclude from this that a fifth grader has mastered mathematical reasoning or arithmetic reasoning from just being able to add numbers? Perhaps, but I think the answer is very likely no.”
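For context, GSM8K-style leaderboards are usually produced by exact-match scoring: a grader pulls the final number out of the model’s response and checks it against the reference answer, so the score records only whether the last number matches, not how it was reached. The sketch below is a minimal, hypothetical illustration of that kind of scorer; the “#### 7” answer convention and the sample responses are assumptions for illustration, not material from the study.

```python
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number appearing in a response or a '#### 7'-style reference answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def exact_match_accuracy(outputs: list[str], references: list[str]) -> float:
    """Score each item 1 if the final numbers agree, 0 otherwise, and average over the set."""
    hits = sum(
        extract_final_number(out) == extract_final_number(ref)
        for out, ref in zip(outputs, references)
    )
    return hits / len(references)

# Hypothetical responses: the metric cannot distinguish worked reasoning from a bare guess.
outputs = ["Two plus five is seven, so the answer is 7.", "I guess 7."]
references = ["2 + 5 = 7\n#### 7", "2 + 5 = 7\n#### 7"]
print(exact_match_accuracy(outputs, references))  # 1.0: both responses count as correct
```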

Bean acknowledged that measuring abstract concepts like reasoning involves evaluating a subset of tasks, and this selection will inherently be imperfect. He stated, “There are a lot of moving pieces in these evaluations, and satisfying all of them requires balance. But this paper calls for benchmarks to clearly define what they set out to measure.” He added, “With concepts like harmlessness or reasoning, people oftentimes just throw the word around to pick something that falls near that category that they can measure and say, ‘Great, now I’ve measured it.'”

The new paper offers eight recommendations and a checklist to systematize benchmark criteria and enhance transparency and trust. Suggested improvements include specifying the scope of the evaluated action, constructing task batteries that better represent overall abilities, and comparing model performance using statistical analysis.

Nikola Jurkovic, a member of the technical staff at the METR AI research center, praised the paper’s contributions. Jurkovic told NBC News, “We need more rigor if we want to be able to interpret the results of AI benchmarks. This checklist is a starting point for researchers to check whether their benchmark will be insightful.”

Tuesday’s paper builds on previous research that identified flaws in many AI benchmarks. Researchers from AI company Anthropic advocated for increased statistical testing last year. This testing would determine if a model’s performance on a benchmark reflected actual capability differences or was a “lucky result” given the tasks and questions.
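Neither the Oxford paper nor the Anthropic researchers prescribe a single procedure, but a common way to run this kind of check is a paired bootstrap over per-question scores: resample the benchmark’s questions many times and see how often the observed gap between two models disappears or reverses. The sketch below uses made-up scores, and its function name and data are hypothetical; it shows how a two-point gap (15 vs. 13 correct) on a 20-item benchmark can be statistically indistinguishable from noise.

```python
import random

def paired_bootstrap_p(scores_a: list[int], scores_b: list[int], n_resamples: int = 10_000) -> float:
    """Estimate how often the observed gap would vanish or reverse when the benchmark's
    questions are resampled with replacement (a simple paired bootstrap)."""
    assert len(scores_a) == len(scores_b)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)
    flips = 0
    for _ in range(n_resamples):
        resample = [random.choice(diffs) for _ in diffs]
        # Count resamples where the mean difference is zero or flips sign.
        if (sum(resample) / len(resample)) * observed <= 0:
            flips += 1
    return flips / n_resamples

# Hypothetical per-question scores (1 = correct, 0 = wrong) for two models on 20 items.
random.seed(0)
model_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0]  # 15/20 correct
model_b = [1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # 13/20 correct
print(paired_bootstrap_p(model_a, model_b))  # well above 0.05: the gap could be a lucky draw
```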

Several research groups have recently proposed new test series to improve benchmark usefulness and accuracy. These tests aim to better capture models’ real-world performance on economically relevant tasks.

In late September, OpenAI launched a new series of tests evaluating AI’s performance in 44 different occupations. These tests aim to ground AI capability claims more firmly in real-world scenarios. Examples include AI’s ability to correct inconsistencies in customer invoices in Excel for a sales analyst role, or to create a full production schedule for a 60-second video shoot for a video producer role.

Dan Hendrycks, director of the Center for AI Safety, and a research team recently released a similar real-world benchmark. This benchmark evaluates AI systems’ performance on tasks necessary for automating remote work. Hendrycks told NBC News, “It’s common for AI systems to score high on a benchmark but not actually solve the benchmark’s actual goal.”

Mahdi concluded that researchers and developers have many avenues to explore in AI benchmark evaluation. He stated, “We are just at the very beginning of the scientific evaluation of AI systems.”

