Dataconomy

New stress-test framework reveals flaws in advanced AI reasoning

Most current benchmarks used to evaluate LRMs, such as GSM8K and MATH, assess models by asking one question at a time.

by Kerem Gülen
July 28, 2025
in Research

While advanced AI systems known as large reasoning models (LRMs) have demonstrated impressive performance on complex problem-solving benchmarks, their true reasoning capabilities may be overestimated by current evaluation methods. According to a recent article by Sajjad Ansari, a novel multi-problem stress-testing framework reveals that even state-of-the-art models struggle under more realistic conditions.

The framework, detailed in the article "REST: A Stress-Testing Framework for Evaluating Multi-Problem Reasoning in Large Reasoning Models," was developed by researchers from Tsinghua University, OpenDataLab, Shanghai AI Laboratory, and Renmin University to address critical gaps in how these advanced models are tested.

Why single-question tests are becoming obsolete

Most current benchmarks used to evaluate LRMs, such as GSM8K and MATH, assess models by asking one question at a time. This approach has two significant drawbacks that limit its effectiveness for measuring true reasoning ability. First, the discriminative power of these benchmarks is decreasing as top models achieve near-perfect scores, making it difficult to distinguish meaningful improvements between them. For example, some models now reach 97% accuracy on benchmarks like MATH500, a level of saturation that forces the expensive creation of ever-harder datasets.

Second, single-question testing fails to reflect real-world scenarios where AI systems must reason across multiple, potentially interfering problems at the same time. Applications like technical support, educational tutoring, or multitasking AI assistants require dynamic cognitive load management, a skill that isolated tests cannot measure. To address this, the researchers developed REST (Reasoning Evaluation through Simultaneous Testing), a method that bundles multiple questions from existing benchmarks into a single prompt to better simulate real-world demands.
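
The bundling idea can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the function name `bundle_questions`, the `stress_level` parameter (questions per prompt), and the instruction wording and `Q<number>` / `A<number>` labels are all assumptions made for illustration.

```python
from typing import List


def bundle_questions(questions: List[str], stress_level: int) -> List[str]:
    """Group benchmark questions into prompts of `stress_level` questions each.

    Hypothetical helper: the prompt wording and numbering scheme are
    assumptions, not the format used in the REST paper.
    """
    if stress_level < 1:
        raise ValueError("stress_level must be at least 1")

    prompts = []
    for start in range(0, len(questions), stress_level):
        group = questions[start:start + stress_level]
        # Number questions from 1 within each prompt.
        numbered = "\n".join(f"Q{i}: {q}" for i, q in enumerate(group, start=1))
        prompts.append(
            "Answer every question below. Give each answer on its own line "
            "as 'A<number>: <answer>'.\n" + numbered
        )
    return prompts
```

With `stress_level=1` this reduces to ordinary single-question testing, so the same dataset can be reused at several load levels.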


Key findings from multi-problem stress-testing

By applying the REST framework to 34 advanced LRMs, researchers uncovered several groundbreaking insights into their true capabilities. The evaluation, conducted on 7 diverse benchmarks, revealed that performance degrades significantly when models are forced to handle multiple problems simultaneously.

  • Significant performance degradation: Even top-performing models like DeepSeek-R1 showed a notable drop in accuracy when tested with REST. On challenging benchmarks like AIME24, the model’s accuracy fell by nearly 30% compared to its performance in isolated question testing.
  • Enhanced discriminative power: REST sharply widened the performance differences between models that appeared similar in single-question tests. On the MATH500 benchmark, two models with close single-question scores of 93% and 94.6% ended up roughly 22 percentage points apart under REST, with their accuracies falling to 66.75% and 88.97%, respectively.
  • Training method insights: The study found that models fine-tuned with common methods like reinforcement learning on single-problem tasks often fail to maintain their advantage in a multi-problem setting. However, models trained with “long2short” techniques, which encourage more concise and efficient reasoning, maintained higher accuracy under stress, suggesting a promising direction for future development.
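
The MATH500 figures quoted above can be checked with a little arithmetic. The sketch below uses only the four accuracy values given in the article; the labels `model_a` and `model_b` are placeholders, since the article does not name the two models.

```python
# Accuracy figures (in percent) quoted in the article for MATH500.
# "model_a" and "model_b" are placeholder labels, not real model names.
scores = {
    "model_a": {"single": 93.0, "rest": 66.75},
    "model_b": {"single": 94.6, "rest": 88.97},
}


def drop_in_points(single: float, rest: float) -> float:
    """Accuracy lost, in percentage points, when moving to multi-problem prompts."""
    return round(single - rest, 2)


def gap_in_points(first: float, second: float) -> float:
    """Absolute difference between two accuracies, in percentage points."""
    return round(abs(first - second), 2)


drops = {name: drop_in_points(s["single"], s["rest"]) for name, s in scores.items()}
single_gap = gap_in_points(scores["model_a"]["single"], scores["model_b"]["single"])
rest_gap = gap_in_points(scores["model_a"]["rest"], scores["model_b"]["rest"])

print(f"Drop per model (points): {drops}")
print(f"Gap in single-question testing: {single_gap} points")
print(f"Gap under REST: {rest_gap} points")
```

The gap between the two models grows from under 2 points in single-question testing to about 22 points under REST, which is the widening the study describes.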

The REST framework simulates a high cognitive load, forcing models to dynamically allocate resources, resist interference from concurrent tasks, and avoid overthinking a single problem. This method also allows for a more nuanced analysis of errors that are invisible in single-question tests, such as question omission, where a model ignores later questions in a prompt, and summary errors, where it incorrectly synthesizes answers from multiple problems. By revitalizing existing datasets and reflecting real-world demands, the framework provides a more reliable and future-proof paradigm for evaluating next-generation reasoning AI systems.
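
Question omission is the easier of the two error types to check mechanically. The following sketch assumes a hypothetical response format in which the model writes each answer on its own line as `A<number>: <answer>`. That format and the helper names are illustrative assumptions, not part of the REST paper.

```python
import re
from typing import Dict, List

# Matches lines such as "A2: 17". Assumed answer format, for illustration only.
ANSWER_LINE = re.compile(r"^A(\d+):[ \t]*(\S.*)$", re.MULTILINE)


def parse_answers(response: str) -> Dict[int, str]:
    """Map each question number to the answer text found in the response."""
    return {int(num): text.strip() for num, text in ANSWER_LINE.findall(response)}


def omitted_questions(response: str, num_questions: int) -> List[int]:
    """Return the question numbers that received no answer line."""
    answered = parse_answers(response)
    return [i for i in range(1, num_questions + 1) if i not in answered]
```

For a three-question prompt, `omitted_questions("A1: 4\nA3: 7", 3)` flags question 2 as skipped. Detecting summary errors would need a comparison between the reasoning trace and the final answers, which this sketch does not attempt.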

Tags: llm, LRM

COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.
