GPT-5.2 Surpasses Expert PhD Baseline With 92% Science Score

GPT-5.2 scored 92% on a “Google-proof” science benchmark, well above the 70% expert baseline. The model also achieved medal-level performance in major international competitions, a sign of its growing capability in scientific reasoning.

Scientists use these systems extensively for tasks such as literature searches across disciplines and languages, and for working through complex mathematical proofs. In many cases, this reduces work that would typically take days or weeks to a few hours. The paper “Early science acceleration experiments with GPT-5,” published in November 2025, provides initial evidence that GPT-5 can notably accelerate scientific workflows.

To further measure and forecast AI models’ ability to accelerate scientific research, OpenAI introduced FrontierScience, a new benchmark designed to assess expert-level scientific capabilities. Its questions are written and verified by experts in physics, chemistry, and biology, with an emphasis on originality and difficulty.

FrontierScience features two distinct tracks:

- Olympiad: Measures scientific reasoning abilities in the style of international Olympiad competitions.
- Research: Evaluates real-world scientific research capabilities.

In initial evaluations, GPT-5.2 was the top-performing model on both tracks, scoring 77% on FrontierScience-Olympiad and 25% on FrontierScience-Research. That places it ahead of other frontier models, including Claude Opus 4.5 and Gemini 3 Pro. The results indicate that current models can support the structured-reasoning parts of research, though significant work remains to improve their open-ended thinking.

FrontierScience encompasses over 700 textual questions, with 160 in its gold set, spanning subfields in physics, chemistry, and biology. FrontierScience-Olympiad features 100 questions collaboratively designed by 42 international Olympiad medalists and national team coaches. FrontierScience-Research includes 60 original research subtasks developed by 45 PhD scientists, including doctoral candidates, professors, and postdoctoral researchers.

Olympiad answers are graded through short-answer verification. The Research track uses a rubric-based grading scheme with a 10-point scale for its open-ended tasks, and the rubric assesses both the final answer and the intermediate reasoning steps. A model-based grader, GPT-5, evaluates responses against these criteria. Tasks were selected in part for being difficult for internal models, which may bias the evaluation against specific models.
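To make the rubric idea concrete, here is a minimal sketch of how a 10-point rubric could be tallied once a grader has judged each criterion. The rubric items, their point weights, and the names `RubricItem` and `score_response` are illustrative assumptions, not the benchmark's actual rubric or code; in FrontierScience the per-criterion judgments come from the GPT-5 grader, which is represented here by a plain list of booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One gradable criterion and the points it is worth (hypothetical)."""
    description: str
    points: float


def score_response(rubric, satisfied):
    """Sum the points of every rubric item the grader marked as satisfied.

    `satisfied` holds one boolean per rubric item, in the same order.
    """
    total = sum(item.points for item in rubric)
    if abs(total - 10.0) > 1e-9:
        raise ValueError("rubric points must sum to 10")
    if len(satisfied) != len(rubric):
        raise ValueError("need exactly one judgment per rubric item")
    return sum(item.points for item, ok in zip(rubric, satisfied) if ok)


# Hypothetical rubric covering the final answer and intermediate reasoning.
example_rubric = [
    RubricItem("States the correct final answer", 4.0),
    RubricItem("Sets up the governing equations", 3.0),
    RubricItem("Justifies the key approximation", 2.0),
    RubricItem("Reports units correctly", 1.0),
]

# Stand-in for the model-based grader's per-criterion judgments.
example_score = score_response(example_rubric, [True, True, False, True])
print(example_score)  # 8.0
```

Splitting the score across several criteria is what lets a response earn partial credit for sound intermediate reasoning even when the final answer is wrong.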

Key performance results include:

FrontierScience-Olympiad Accuracy:

- GPT-5.2: 77.1%
- Gemini 3 Pro: 76.1%
- Claude Opus 4.5: 71.4%

FrontierScience-Research Accuracy:

- GPT-5.2: 25.2%
- Claude Opus 4.5: 17.5%
- Grok 4: 15.9%

Higher reasoning effort, which means longer processing time, correlated with improved accuracy for both GPT-5.2 and OpenAI o3. For instance, GPT-5.2’s accuracy on FrontierScience-Olympiad increased from 67.5% at “Low” reasoning effort to 77.1% at “XHigh” effort. Similarly, on FrontierScience-Research, GPT-5.2’s accuracy rose from 18.2% at “Low” to 25.2% at “XHigh.”
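The size of the reasoning-effort gains can be worked out directly from the reported figures. The short sketch below uses only the four GPT-5.2 accuracies quoted above; the dictionary layout and the helper name `effort_gain` are illustrative choices.

```python
# GPT-5.2 accuracy (%) at the lowest and highest reported reasoning effort.
accuracy = {
    "FrontierScience-Olympiad": {"Low": 67.5, "XHigh": 77.1},
    "FrontierScience-Research": {"Low": 18.2, "XHigh": 25.2},
}


def effort_gain(scores):
    """Percentage-point gain from "Low" to "XHigh" reasoning effort."""
    return round(scores["XHigh"] - scores["Low"], 1)


gains = {track: effort_gain(scores) for track, scores in accuracy.items()}

for track, gain in gains.items():
    print(f"{track}: +{gain} percentage points")
# FrontierScience-Olympiad: +9.6 percentage points
# FrontierScience-Research: +7.0 percentage points
```

In absolute terms the Olympiad track gains more, but the Research track starts from a much lower base, so its gain is larger relative to the "Low" score.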

FrontierScience currently focuses on constrained problem statements and does not assess the generation of novel hypotheses or work with multimodal data. The developers plan to iterate on the benchmark, expanding it to new domains and integrating more real-world evaluations as models improve.

COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.
