OpenAI’s o3 claimed 25%, independent test says “try 10”

OpenAI says o3 was “tuned for speed,” but researchers found its FrontierMath performance underwhelming.

OpenAI’s o3 AI model scored lower on the FrontierMath benchmark than the company initially implied, according to independent tests by Epoch AI, the research institute behind FrontierMath. When OpenAI unveiled o3 in December, it claimed the model could answer 25% of FrontierMath questions, significantly outperforming other models.

Epoch AI’s tests found that o3 scored around 10% on FrontierMath. The discrepancy may be due to differences in testing setups or the version of o3 used. OpenAI’s chief research officer, Mark Chen, had stated that o3 achieved over 25% in “aggressive test-time compute settings.” Epoch noted that OpenAI’s published benchmark results showed a lower-bound score that matches the 10% score Epoch observed.

The public o3 model is “tuned for chat/product use” and has smaller compute tiers than the version tested by OpenAI in December, according to the ARC Prize Foundation, which tested a pre-release version of o3. OpenAI’s Wenda Zhou explained that the production o3 model is “more optimized for real-world use cases” and speed, which may result in benchmark disparities.

Image: Epoch AI

OpenAI’s o3-mini-high and o4-mini models outperform o3 on FrontierMath. The company plans to release a more powerful o3 variant, o3-pro, in the coming weeks. This incident highlights the need for caution when interpreting AI benchmarks, particularly when they are used to promote commercial products.

The AI industry has seen several benchmarking controversies recently. In January, Epoch was criticized for not disclosing funding from OpenAI until after the company announced o3. xAI was accused of publishing misleading benchmark charts for its Grok 3 model, and Meta admitted to touting benchmark scores for a different version of a model than the one available to developers.

COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.