Dataconomy

Why throwing more AI compute at verification might be a mistake

If you thought AI should verify its own answers, new research says: only if you’ve got compute to burn. Otherwise? Think more, judge less.

by Kerem Gülen
April 11, 2025
in Research

Getting large language models (LLMs) to reason better is one thing. Getting them to do it without burning through absurd amounts of compute is another. A new research paper from TU Darmstadt, UCLA, Google DeepMind, and Mila digs deep into this trade-off — and might just change how AI developers think about scaling reasoning at test time.

The core tension? Whether LLMs should spend their compute generating more answers (what’s known as Self-Consistency, or SC), or verifying a few promising answers using Generative Reward Models (GenRMs). Choose wrong, and your model can burn up to 128 times more compute for an accuracy gain of only 3.8%.

The new math of reasoning at scale

LLMs like GPT-4, Llama, or Qwen have gotten shockingly good at solving math and science problems by generating multiple chains of thought (CoTs) and picking the most common result. That’s the idea behind SC — brute force wisdom of the crowd. But researchers have also been excited by GenRMs, a newer approach that lets LLMs act like their own judge by verifying answers through further chain-of-thought reasoning.
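To make the SC idea concrete, here is a minimal sketch of the majority vote it relies on. The function name and the sample answers are hypothetical, and the sketch assumes the final answer has already been extracted from each sampled chain of thought.

```python
from collections import Counter


def self_consistency(final_answers):
    """Return the most common final answer among sampled chains of thought."""
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) returns [(answer, count)]; on a tie, the answer that
    # appeared first in the list wins.
    answer, _count = Counter(final_answers).most_common(1)[0]
    return answer


# Hypothetical final answers pulled from five sampled chains of thought.
sampled = ["42", "41", "42", "42", "17"]
print(self_consistency(sampled))  # prints 42
```

Note that nothing here judges whether an answer is right. SC only counts agreement, which is why it costs nothing beyond generating the solutions themselves.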

Previous comparisons made GenRM look wildly efficient: matching SC’s accuracy with 4× fewer solutions. But this paper calls that framing out — hard. Why? Because nobody was counting the true compute cost of all those verification steps.

Compute budgets change everything

This study introduces a clean framework for measuring the real cost of SC and GenRM approaches under a fixed compute budget. It works like this: you can either spend the whole budget generating more answers (SC), or split it between fewer answers and several verifications of each one (GenRM). Their model for total inference compute is refreshingly straightforward: C(S, V) = S(1 + λV), where S is the number of solutions, V is the number of verifications per solution, and λ is the length of a verification relative to a solution. SC is the special case V = 0, so its cost is simply S.
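A short sketch shows what that formula implies for a fixed budget. The budget of 64, the value λ = 1, and the choice of 7 verifications per solution are illustrative assumptions, not figures from the paper, and the function names are made up for this example.

```python
def inference_compute(num_solutions, num_verifications, lam):
    """Relative inference cost C(S, V) = S * (1 + lam * V)."""
    return num_solutions * (1 + lam * num_verifications)


def max_solutions(budget, num_verifications, lam):
    """Largest S whose cost C(S, V) still fits within the budget."""
    return int(budget // (1 + lam * num_verifications))


LAM = 1.0    # assumed: a verification is as long as a solution
BUDGET = 64  # assumed budget, measured in units of one solution

sc_solutions = max_solutions(BUDGET, 0, LAM)     # SC: no verifications
genrm_solutions = max_solutions(BUDGET, 7, LAM)  # GenRM: 7 checks per solution

print(sc_solutions)     # prints 64
print(genrm_solutions)  # prints 8
```

Under these assumptions the same budget buys 64 solutions for SC but only 8 verified solutions for GenRM. That gap is the hidden cost the paper says earlier comparisons left out.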

The brutal result: SC is still king (unless you’re rich)

The experiments left little doubt. Across Llama and Qwen models, from 7B to 70B parameters, and across math and science reasoning tasks, the story repeated: SC outperformed GenRM at lower compute budgets. GenRM needed up to 8× the inference compute just to match SC. And getting a modest 3.8% performance boost over SC required an eye-watering 128× more compute.

That result held up even for advanced “thinking models” like QwQ-32B, and on hard math datasets like AIME24. SC wins when compute is tight. GenRM only makes sense when compute is practically free — or when the problems are so difficult that verification pays off dramatically.


The smart way to use GenRM (if you must)

Still, the study doesn’t dismiss GenRM entirely. In fact, it derives inference scaling laws for GenRM, a blueprint for compute-optimal problem solving. The key finding? As the budget grows, the number of solutions should grow faster than the number of verifications per solution, roughly 1.5 to 2 times faster. In numbers, the optimal solution count scales with compute budget as S ∝ C^0.57, while the optimal verification count scales as V ∝ C^0.39.
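The two exponents can be turned into a rough allocation rule, sketched below. The article gives only the exponents, not the proportionality constants, so the baseline of 4 solutions and 4 verifications is a hypothetical starting point, and the helper name is invented for illustration.

```python
SOLUTION_EXPONENT = 0.57      # S grows as C^0.57 (from the article)
VERIFICATION_EXPONENT = 0.39  # V grows as C^0.39 (from the article)


def scale_allocation(base_solutions, base_verifications, budget_multiplier):
    """Scale a baseline (S, V) allocation when the budget grows by a factor.

    The baseline is an assumption: the scaling laws give growth rates,
    not absolute counts.
    """
    s = base_solutions * budget_multiplier ** SOLUTION_EXPONENT
    v = base_verifications * budget_multiplier ** VERIFICATION_EXPONENT
    # Round to whole counts and never drop below one of each.
    return max(1, round(s)), max(1, round(v))


# Hypothetical baseline of 4 solutions with 4 verifications each.
for multiplier in (1, 4, 16, 64):
    solutions, verifications = scale_allocation(4, 4, multiplier)
    print(f"{multiplier}x budget: S={solutions}, V={verifications}")
```

The ratio of the exponents, 0.57 / 0.39, is about 1.46, so on a log scale solutions grow roughly one and a half times as fast as verifications, which sits at the low end of the quoted 1.5 to 2 range.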

This research leaves practitioners with a very practical guide: if compute is limited, trust SC and spend it on generating more solutions. If compute is abundant, and especially if you’re dealing with harder reasoning tasks, using GenRM with the right scaling balance might be worth it — but only with serious optimization.

For AI developers facing real-world constraints, the takeaway is almost comically simple: more thinking beats more verifying, unless you have near-infinite resources. And even then, verifying needs to be smart, efficient, and minimal.

The full paper, “When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning,” is available on arXiv. Their codebase is open at GitHub.


Tags: AI, LLMs

COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.
