Dataconomy
  • News
    • Artificial Intelligence
    • Cybersecurity
    • DeFi & Blockchain
    • Finance
    • Gaming
    • Startups
    • Tech
  • Industry
  • Research
  • Resources
    • Articles
    • Guides
    • Case Studies
    • Whitepapers
    • AI Models Leaderboard
  • AI toolsNEW
  • Newsletter
  • + More
    • Glossary
    • Conversations
    • Events
    • About
      • Who we are
      • Contact
      • Imprint
      • Legal & Privacy
      • Partner With Us
Subscribe
No Result
View All Result
  • AI
  • Tech
  • Cybersecurity
  • Finance
  • DeFi & Blockchain
  • Startups
  • Gaming
Dataconomy
  • News
    • Artificial Intelligence
    • Cybersecurity
    • DeFi & Blockchain
    • Finance
    • Gaming
    • Startups
    • Tech
  • Industry
  • Research
  • Resources
    • Articles
    • Guides
    • Case Studies
    • Whitepapers
    • AI Models Leaderboard
  • AI toolsNEW
  • Newsletter
  • + More
    • Glossary
    • Conversations
    • Events
    • About
      • Who we are
      • Contact
      • Imprint
      • Legal & Privacy
      • Partner With Us
Subscribe
No Result
View All Result
Dataconomy
No Result
View All Result

ByteDance VAPO: The AI upgrade you’ll hear about soon

ByteDance researchers cracked a key problem in AI reasoning with VAPO, a new method that beat existing techniques by a wide margin.

byKerem Gülen
April 11, 2025
in Research
Home Research
Share on FacebookShare on TwitterShare on LinkedInShare on WhatsAppShare on e-mail
Google Preferred Source

ByteDance Seed researchers rolled out Value Augmented Proximal Policy Optimization (VAPO), a reinforcement learning training framework designed to sharpen large language models’ reasoning on complex, lengthy tasks, achieving new state-of-the-art results on the AIME24 benchmark.

Training LLMs for intricate reasoning using value-based reinforcement learning previously faced significant hurdles. Methods struggled with value model bias, adapting effectively to response sequences of widely varying lengths, and managing sparse reward signals, especially in verifier-based tasks providing only binary feedback.

VAPO addresses these challenges through three core innovations: a detailed value-based training framework, a Length-adaptive Generalized Advantage Estimation (GAE) mechanism adjusting parameters based on response length, and the systematic integration of techniques from prior research.

Stay Ahead of the Curve!

Don't miss out on the latest insights, trends, and analysis in the world of data, technology, and startups. Subscribe to our newsletter and get exclusive content delivered straight to your inbox.

This combination creates a system where improvements work synergistically. Using the Qwen2.5-32B model without specific SFT data, VAPO improved benchmark scores from 5 to 60, surpassing previous state-of-the-art methods by 10 points.

VAPO builds upon the Proximal Policy Optimization (PPO) algorithm but incorporates key modifications to enhance mathematical reasoning. Training analysis revealed VAPO exhibits smoother training curves compared to the value-free DAPO method, indicating more stable optimization.

VAPO also demonstrated better length scaling for improved generalization, faster score growth attributable to the granular signals from its value model, and lower entropy in later training stages. While reduced entropy can potentially limit exploration, the method effectively balances this, improving reproducibility and stability with minimal performance impact.

bytedance-vapo-the-ai-upgrade-youll-hear-about-soon
Image: ByteDance Seed

On the AIME24 benchmark, DeepSeek R1 using GRPO achieved 47 points, and DAPO reached 50 points. VAPO, using the Qwen-32b model, matched DAPO’s performance with only 60% of the update steps and set a new state-of-the-art score of 60.4 within 5,000 steps. In contrast, vanilla PPO scored just 5 points due to value model learning collapse.


This benchmark asks if AI can think like an engineer


Ablation studies confirmed the effectiveness of seven distinct modifications within VAPO. Value-Pretraining prevents model collapse; decoupled GAE enables full optimization of long responses; adaptive GAE balances short and long response optimization; Clip-higher encourages thorough exploration; Token-level loss increases weighting for long responses; incorporating positive-example LM loss added 6 points; and Group-Sampling contributed 5 points to the final score.

Researchers highlight that VAPO, utilizing the Qwen2.5-32B model, demonstrates that this value-based approach can decisively outperform value-free methods like GRPO and DAPO, establishing a new performance level for complex reasoning tasks and addressing fundamental challenges in training value models for long chain-of-thought scenarios.


Featured image credit

Tags: ByteDance

Related Posts

Wireless charging uses about 40% more electricity

Wireless charging uses about 40% more electricity

June 25, 2026
European consumers may leave businesses using US tech providers

European consumers may leave businesses using US tech providers

June 24, 2026
Study links AI-assisted homework to lower exam scores

Study links AI-assisted homework to lower exam scores

June 22, 2026
Harvard and Boston Children’s use AI to revisit unsolved genetic cases

Harvard and Boston Children’s use AI to revisit unsolved genetic cases

June 19, 2026
Adobe report finds 86% of creators now use generative AI in workflows

Adobe report finds 86% of creators now use generative AI in workflows

June 17, 2026
AI transfer learning speeds cosmology research but has hidden risks

AI transfer learning speeds cosmology research but has hidden risks

June 15, 2026

LATEST NEWS

OpenAI limits ChatGPT 5.6 access to government-approved users first

Apple to skip M6 Pro and Max chips and launch M7 in 2027

IBM unveils world’s first sub-1nm chip with new nanostack architecture

Apple raises prices across Macs, iPads and home devices

Nothing to launch entry-level Phone 4b on July 7

Xbox tests 15-character gamertags for Insider users

BEST AI MODELS LEADERBOARD

See the best AI models, ranked by intelligence, benchmark results, speed and token price. Find the most suitable LLMs, Text-to-Image, Image Editing, Text-to-Speech, Text-to-Video and Image-to-Video  artificial intelligence model for your tasks and business.

LATEST TOOLS

WatchMyCompetitor

TokkingHeads

Fellow.app

Octoparse

AnyToSpeech

Vrew

Fireflies

SpeedLegal

Teachable Machine

Unriddle

Dataconomy

COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.

  • About
  • Imprint
  • Contact
  • Legal & Privacy

Follow Us

  • News
    • Artificial Intelligence
    • Cybersecurity
    • DeFi & Blockchain
    • Finance
    • Gaming
    • Startups
    • Tech
  • Industry
  • Research
  • Resources
    • Articles
    • Guides
    • Case Studies
    • Whitepapers
    • AI Models Leaderboard
  • AI tools
  • Newsletter
  • + More
    • Glossary
    • Conversations
    • Events
    • About
      • Who we are
      • Contact
      • Imprint
      • Legal & Privacy
      • Partner With Us
No Result
View All Result
Subscribe

This website uses cookies to improve your experience. You can choose to accept or reject them. Visit our Privacy Policy.