A new study reveals that methodologies for evaluating AI systems often overstate performance and lack scientific rigor, raising questions about many benchmark results.
Researchers at the Oxford Internet Institute, collaborating with over three dozen institutions, examined 445 leading AI tests, known as benchmarks. These benchmarks measure AI model performance across various topic areas.
AI developers rely on these benchmarks to assess model capabilities and to promote technical advances: claims about everything from software engineering performance to abstract-reasoning capacity rest on these evaluations. The paper, released Tuesday, suggests that these foundational tests may be unreliable.
The study found that many top-tier benchmarks fail to define their testing objectives, reuse data and methods from existing benchmarks, and infrequently employ reliable statistical methods for comparing model results.
Adam Mahdi, a senior research fellow at the Oxford Internet Institute and a lead author, stated that these benchmarks can be “alarmingly misleading.” Mahdi told NBC News, “When we ask AI models to perform certain tasks, we often actually measure completely different concepts or constructs than what we aim to measure.” Andrew Bean, another lead author, agreed that “even reputable benchmarks are too often blindly trusted and deserve more scrutiny.”
Bean also told NBC News, “You need to really take it with a grain of salt when you hear things like ‘a model achieves Ph.D. level intelligence.’ We’re not sure that those measurements are being done especially well.”
Some of the benchmarks analyzed evaluate specific skills, such as Russian- or Arabic-language ability. Others measure general capabilities such as spatial reasoning and continual learning.
A central concern for the authors was a benchmark’s “construct validity,” which asks whether it accurately tests the real-world phenomenon it claims to measure. For instance, a benchmark cannot pose an endless series of questions to gauge overall Russian proficiency, so one benchmark reviewed in the study instead measures a model’s performance on nine narrower tasks, including answering yes-or-no questions using information from Russian-language Wikipedia.
Approximately half of the examined benchmarks do not clearly define the concepts they claim to measure. This casts doubt on their ability to provide useful information about the AI models under test.
The study highlights Grade School Math 8K (GSM8K), a common AI benchmark for basic math questions. Leaderboards for GSM8K are often cited to show AI models’ strong mathematical reasoning. The benchmark’s documentation states it is “useful for probing the informal reasoning ability of large language models.”
However, Mahdi argued that correct answers on benchmarks like GSM8K do not necessarily indicate actual mathematical reasoning. He explained, “When you ask a first grader what two plus five equals and they say seven, yes, that’s the correct answer. But can you conclude from this that [the first grader] has mastered mathematical reasoning or arithmetic reasoning from just being able to add numbers? Perhaps, but I think the answer is very likely no.”
Bean acknowledged that measuring abstract concepts like reasoning involves evaluating a subset of tasks, and this selection will inherently be imperfect. He stated, “There are a lot of moving pieces in these evaluations, and satisfying all of them requires balance. But this paper calls for benchmarks to clearly define what they set out to measure.” He added, “With concepts like harmlessness or reasoning, people oftentimes just throw the word around to pick something that falls near that category that they can measure and say, ‘Great, now I’ve measured it.’”
The new paper offers eight recommendations and a checklist to systematize benchmark criteria and enhance transparency and trust. Suggested improvements include specifying the scope of the evaluated action, constructing task batteries that better represent overall abilities, and comparing model performance using statistical analysis.
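The paper’s full checklist is not reproduced here, but the statistical-comparison recommendation can be illustrated with a short sketch. The example below uses a paired bootstrap to put a confidence interval on the accuracy gap between two models scored on the same benchmark questions; the per-question scores, model names and the bootstrap_diff_ci helper are invented for illustration and are not drawn from the study.

```python
import random

# Hypothetical per-question scores (1 = correct, 0 = incorrect) for two models
# evaluated on the same 500 benchmark questions; a real evaluation harness
# would record these per-question results directly.
model_a = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1] * 50
model_b = [1, 0, 0, 1, 1, 1, 0, 1, 0, 1] * 50

def bootstrap_diff_ci(a, b, n_resamples=2_000, alpha=0.05, seed=0):
    """Paired bootstrap confidence interval for the accuracy difference (a minus b)."""
    rng = random.Random(seed)
    n = len(a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]       # resample questions with replacement
        diffs.append(sum(a[i] - b[i] for i in idx) / n)  # accuracy gap on this resample
    diffs.sort()
    low = diffs[int(alpha / 2 * n_resamples)]
    high = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

low, high = bootstrap_diff_ci(model_a, model_b)
print(f"95% CI for the accuracy difference: [{low:.3f}, {high:.3f}]")
# If the interval includes 0, the leaderboard gap may not reflect a real capability difference.
```

The point of such an interval is simply to show whether a reported gap between models is larger than the noise introduced by the particular questions sampled into the benchmark.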
Nikola Jurkovic, a member of the technical staff at the METR AI research center, praised the paper’s contributions. Jurkovic told NBC News, “We need more rigor if we want to be able to interpret the results of AI benchmarks. This checklist is a starting point for researchers to check whether their benchmark will be insightful.”
Tuesday’s paper builds on previous research that identified flaws in many AI benchmarks. Researchers at the AI company Anthropic, for example, advocated last year for more statistical testing to determine whether a model’s performance on a benchmark reflects genuine capability differences or is merely a “lucky result” given the particular tasks and questions asked.
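As a rough illustration of that kind of check (a sketch of one possible approach, not the Anthropic researchers’ published method), a paired permutation test estimates how often a score gap as large as the observed one would arise by chance on the same questions; the scores and helper function below are hypothetical.

```python
import random

# Hypothetical per-question scores for two models on the same benchmark questions.
model_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0] * 30
model_b = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0] * 30

def paired_permutation_pvalue(a, b, n_permutations=10_000, seed=0):
    """Sign-flip permutation test on per-question score differences."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs)) / len(diffs)  # observed mean accuracy gap
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis, the sign of each per-question difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(flipped) >= observed:
            extreme += 1
    return extreme / n_permutations

p = paired_permutation_pvalue(model_a, model_b)
print(f"p-value for the observed score gap: {p:.3f}")
# A large p-value means the gap could plausibly be a "lucky result" on these questions.
```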
Several research groups have recently proposed new test suites intended to make benchmarks more useful and accurate by measuring models’ real-world performance on economically relevant tasks.
In late September, OpenAI launched a new series of tests evaluating AI’s performance in 44 different occupations. These tests aim to ground AI capability claims more firmly in real-world scenarios. Examples include AI’s ability to correct inconsistencies in customer invoices in Excel for a sales analyst role, or to create a full production schedule for a 60-second video shoot for a video producer role.
Dan Hendrycks, director of the Center for AI Safety, and a research team recently released a similar real-world benchmark. This benchmark evaluates AI systems’ performance on tasks necessary for automating remote work. Hendrycks told NBC News, “It’s common for AI systems to score high on a benchmark but not actually solve the benchmark’s actual goal.”
Mahdi concluded that researchers and developers have many avenues to explore in AI benchmark evaluation. He stated, “We are just at the very beginning of the scientific evaluation of AI systems.”