OpenAI has announced a new evaluation framework, GDPval, to measure artificial intelligence performance on economically valuable tasks. The system tests models on 1,320 real-world job assignments to bridge the gap between academic benchmarks and practical application.
The GDPval framework evaluates how AI models handle 1,320 distinct tasks drawn from 44 different occupations. These jobs are primarily knowledge-work positions within industries that each contribute more than 5% to the gross domestic product (GDP) of the United States. To construct this list of relevant professions, OpenAI used May 2024 data from the U.S. Bureau of Labor Statistics (BLS) and the Department of Labor’s O*NET database. The resulting selection of occupations includes professions frequently associated with AI integration, such as software engineers, lawyers, and video editors. The framework also extends to occupations less commonly discussed in the context of AI, including detectives, pharmacists, and social workers, providing a broader assessment of potential economic impact.
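To make the selection criterion concrete, here is a minimal Python sketch of the filtering step described above. The GDP-share figures, industry names, and occupation lists in it are invented placeholders for illustration, not OpenAI's actual data or code.

```python
# Illustrative sketch (not OpenAI's code): keep industries that each contribute
# more than 5% of U.S. GDP, then take occupations within them. The data below
# is hypothetical, standing in for the BLS / O*NET sources the article mentions.

GDP_SHARE_THRESHOLD = 0.05  # industries must exceed 5% of U.S. GDP

# Hypothetical industry GDP shares (fractions of total GDP).
industry_gdp_share = {
    "Professional Services": 0.13,
    "Information": 0.055,
    "Health Care": 0.075,
    "Mining": 0.014,
}

# Hypothetical mapping of industries to representative occupations.
industry_occupations = {
    "Professional Services": ["Lawyers", "Software Engineers"],
    "Information": ["Video Editors"],
    "Health Care": ["Pharmacists", "Social Workers"],
    "Mining": ["Mining Engineers"],
}

selected = [
    occupation
    for industry, share in industry_gdp_share.items()
    if share > GDP_SHARE_THRESHOLD
    for occupation in industry_occupations[industry]
]
print(selected)  # occupations drawn only from industries above the 5% cutoff
```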
According to the company, the tasks in the evaluation were created by professionals with an average of 14 years of experience in their fields, a measure intended to ensure the tasks reflect “real work products, such as a legal brief, an engineering blueprint, a customer support conversation, or a nursing care plan.” OpenAI noted that GDPval’s breadth across many tasks and occupations distinguishes it from other evaluations of economic value, which often concentrate on a single domain such as software engineering. The evaluation also forgoes simple text prompts: instead, it gives the AI models files to reference and requires multimodal deliverables, such as presentation slides and formatted documents, simulating how a professional would actually use the technology at work. OpenAI stated, “This realism makes GDPval a more realistic test of how models might support professionals.”
In its study, OpenAI used the GDPval framework to grade the outputs from several of its own models, including GPT-4o, o4-mini, o3, and the more recent GPT-5. The evaluation also included models from other companies: Anthropic’s Claude Opus 4.1, Google’s Gemini 2.5 Pro, and xAI’s Grok 4. The core of the grading process involved experienced professionals performing blind evaluations of the models’ outputs: the graders compared AI-generated work against deliverables produced by human experts without knowing which was which, providing a direct quality benchmark free of knowledge of the work’s origin.
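The blind, pairwise nature of this grading can be illustrated with a short sketch. The function names, the grader interface, and the use of a simple win rate are assumptions made for illustration, not a description of OpenAI's implementation.

```python
# Minimal sketch of blind pairwise grading: a grader sees two deliverables in a
# random order, without knowing which is model-generated, and the share of
# comparisons the model wins serves as the quality benchmark.
import random

def blind_comparison(model_output: str, expert_output: str, grader) -> bool:
    """Return True if the grader prefers the model's deliverable."""
    pair = [("model", model_output), ("expert", expert_output)]
    random.shuffle(pair)  # hide which deliverable came from the model
    choice = grader(pair[0][1], pair[1][1])  # grader returns 0 or 1
    return pair[choice][0] == "model"

def win_rate(tasks, grader) -> float:
    """Fraction of tasks where the model's work was preferred over the expert's."""
    wins = sum(blind_comparison(t["model"], t["expert"], grader) for t in tasks)
    return wins / len(tasks)
```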
To supplement this human-led process, OpenAI developed an “autograder” AI system. This system is designed to predict how a human evaluator would score a given deliverable. The company announced its intention to release this autograder as an experimental research tool for others to use. OpenAI issued a caution, however, stating that the autograder is not as reliable as human graders. It affirmed that the tool is not intended to replace human evaluation in the near future, reflecting the nuanced judgment required for assessing high-quality professional work.
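Conceptually, such an autograder amounts to prompting a model to predict which deliverable a human expert grader would prefer. The sketch below shows that idea only; the prompt wording and the call_model() helper are hypothetical stand-ins, not OpenAI's released tool or API.

```python
# Hedged sketch of the autograder concept: ask a model to simulate an expert
# grader choosing between two deliverables for the same task.

def call_model(prompt: str) -> str:
    """Placeholder for a call to a frontier model; returns its text reply."""
    raise NotImplementedError("wire this to your model provider of choice")

def autograde(task_brief: str, deliverable_a: str, deliverable_b: str) -> str:
    prompt = (
        "You are simulating an experienced professional grading two work "
        "deliverables for the same task. Answer with 'A' or 'B' only.\n\n"
        f"Task: {task_brief}\n\n"
        f"Deliverable A:\n{deliverable_a}\n\n"
        f"Deliverable B:\n{deliverable_b}"
    )
    return call_model(prompt).strip().upper()[:1]  # 'A' or 'B'
```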
The initial findings from the GDPval tests indicate that current advanced AI is nearing the quality standards of human professionals. “We found that today’s best frontier models are already approaching the quality of work produced by industry experts,” OpenAI wrote. Among the models tested, Anthropic’s Claude Opus 4.1 was identified as the best overall performer. Its particular strengths were observed in tasks related to aesthetics, which encompasses elements such as professional document formatting and the clear, effective layout of presentation slides. These qualities are often critical for client-facing materials and effective communication in a business context.
While Claude Opus 4.1 excelled in presentation, OpenAI’s GPT-5 model demonstrated superior performance in accuracy. This was especially evident in tasks that required finding and correctly applying domain-specific knowledge. The research also highlighted the rapid pace of model improvement. The results showed that performance on GDPval tasks “more than doubled from GPT-4o (released spring 2024) to GPT-5 (released summer 2025).” This substantial increase in capability over a relatively short period indicates a significant acceleration in the development of underlying AI technologies.
The evaluation also included an analysis of efficiency. “We found that frontier models can complete GDPval tasks roughly 100× faster and 100× cheaper than industry experts,” OpenAI reported. The company immediately qualified this finding with a critical caveat. “However, these figures reflect pure model inference time and API billing rates, and therefore do not capture the human oversight, iteration, and integration steps required in real workplace settings to use our models.” This context clarifies that the calculation excludes the considerable time and cost associated with managing, refining, and implementing AI-generated work in a practical business workflow.
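As a rough illustration of how such a figure can be derived from inference time and API billing alone, here is a back-of-the-envelope sketch. Every number in it is made up, chosen only so the ratios land near the reported 100×.

```python
# Back-of-the-envelope sketch: speed and cost ratios computed from pure
# inference time and API billing versus expert time and wages.
# All values are hypothetical, for illustration only.

expert_hours_per_task = 5.0       # hypothetical expert effort
expert_hourly_rate = 100.0        # hypothetical wage, USD/hour
model_minutes_per_task = 3.0      # hypothetical pure inference time
model_api_cost_per_task = 5.0     # hypothetical API billing, USD

speedup = (expert_hours_per_task * 60) / model_minutes_per_task
cost_ratio = (expert_hours_per_task * expert_hourly_rate) / model_api_cost_per_task

print(f"~{speedup:.0f}x faster, ~{cost_ratio:.0f}x cheaper (inference only)")
# As the article stresses, this excludes human oversight, iteration,
# and integration time in real workflows.
```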
OpenAI acknowledged significant limitations in the current version of the GDPval framework, describing it as “an early step that doesn’t reflect the full nuance of many economic tasks.” A major constraint is its use of one-off evaluations. This means the framework cannot measure a model’s ability to handle iterative work, such as completing multiple drafts of a project, or its capacity to absorb context for an ongoing task over time. For instance, the current test cannot assess if a model could successfully edit a legal brief based on client feedback or redo a data analysis to account for a newly discovered anomaly.
A further limitation noted by the company is that professional work is not always a straightforward process with organized files and a clear directive. The current framework cannot capture the more complex and less structured aspects of many jobs. This includes the “human—and deeply contextual—work of exploring a problem through conversation and dealing with ambiguity or shifting circumstances.” These elements are often central to professional roles but are difficult to replicate in a standardized testing environment. “Most jobs are more than just a collection of tasks that can be written down,” OpenAI added.
The company stated its intention to address these limitations in future iterations of the framework. Plans include expanding its scope to span more industries and incorporate harder-to-automate tasks. Specifically, OpenAI will attempt to develop evaluations for tasks that involve interactive workflows, where a model must engage in a back-and-forth process, or those that require understanding extensive prior context, which remains a challenge for many AI systems. As part of this expansion, OpenAI will release a subset of the GDPval tasks for researchers to use in their own work.
From these results, OpenAI’s stated conclusion is that AI will inevitably continue to disrupt the job market. The company posits that AI can take on routine “busywork,” thereby freeing human workers to concentrate on more complex and strategic tasks. This perspective frames AI as a tool for augmenting human productivity rather than purely for replacement. “Especially on the subset of tasks where models are particularly strong, we expect that giving a task to a model before trying it with a human would save time and money,” OpenAI wrote.
Concurrent with these findings, the company reiterated its commitment to its broader mission. This includes plans to democratize access to AI tools and an effort to keep “supporting workers through change, and building systems that reward broad contribution.” “Our goal is to keep everyone on the ‘elevator’ of AI,” the company concluded.