Silicon Valley investors and major AI labs are making significant investments in reinforcement learning (RL) environments, which are simulated workspaces designed to train AI agents to use software autonomously.
While AI agents like OpenAI’s ChatGPT Agent have shown promise, they still struggle with complex, multi-step tasks. This new wave of investment is focused on creating sophisticated training grounds to overcome these limitations, moving beyond the static, labeled datasets that powered the last generation of AI.
How AI reinforcement learning environments work
RL environments are virtual training grounds where an AI agent can practice using software in a controlled setting. The agent receives feedback through a system of rewards and penalties, much like a game. For example, an agent tasked with buying socks on Amazon in a simulated Chrome browser would receive a positive reward for successfully completing the purchase. It would receive a penalty for errors like choosing the wrong item or failing to navigate a menu.
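The reward-and-penalty loop described above can be sketched as a tiny, hypothetical environment. Everything here (the `SockShopEnv` class, its actions, and the reward values) is invented for illustration; real RL environments simulate full browsers or applications, but the `reset`/`step` interface and the scalar reward signal are the same basic shape.

```python
import random

class SockShopEnv:
    """Toy sketch of an RL environment: an agent in a simulated shop
    must add the right item to the cart and then check out."""

    ACTIONS = ["search_socks", "add_to_cart", "checkout", "click_ad"]

    def reset(self):
        # Return the initial observation and clear episode state.
        self.cart = None
        self.done = False
        return {"page": "home", "cart": self.cart}

    def step(self, action):
        # Positive reward for completing the purchase, small
        # penalties for wrong or wasted actions.
        if action == "search_socks":
            reward = 0.0
        elif action == "add_to_cart":
            self.cart = "socks"
            reward = 0.1
        elif action == "checkout" and self.cart == "socks":
            reward = 1.0          # task completed successfully
            self.done = True
        else:
            reward = -0.1         # wrong item, dead end, or wasted click
        return {"page": "shop", "cart": self.cart}, reward, self.done

# A random policy interacting with the environment for one episode.
env = SockShopEnv()
obs = env.reset()
total = 0.0
for _ in range(10):
    obs, reward, done = env.step(random.choice(SockShopEnv.ACTIONS))
    total += reward
    if done:
        break
```

Training then consists of running many such episodes and updating the agent's policy to favor the action sequences that accumulate the most reward.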
These dynamic environments are far more complex to build than static datasets. They must account for a wide range of unpredictable agent actions and provide precise feedback to guide improvement. The concept builds on earlier AI research, such as OpenAI's open-source "Gym" toolkit, released in 2016, and the simulated Go games DeepMind used to train AlphaGo through self-play. However, today's environments are being applied to general-purpose transformer models to train them for open-ended tasks like web navigation and document editing.
A new ecosystem of startups is emerging to meet demand
Major AI labs like OpenAI, Anthropic, and Meta are building their own RL environments, but the complexity and scale of the task have created a demand for third-party specialists. This has fueled the growth of a new ecosystem of startups and prompted established data companies to pivot.
- Mechanize Work, a new startup, is focusing on creating a small number of high-fidelity environments for tasks like AI coding. The company is reportedly working with Anthropic and is offering salaries up to $500,000 to attract top engineering talent.
- Prime Intellect is targeting smaller developers with an open-source hub that it calls a “Hugging Face for RL environments.” The platform provides access to pre-built simulations and sells the computational resources needed to run them.
- Surge, a data-labeling company that reported $1.2 billion in revenue last year, has created a new internal organization dedicated to building RL environments to meet rising demand from its clients.
- Mercor is developing domain-specific environments for fields like coding, healthcare, and law, where agents can be trained on simulated software for tasks like reviewing patient records or legal contracts.
- Scale AI, the onetime leader in data labeling, is also adapting by developing RL environments as it seeks to remain competitive after losing key contracts with Google and OpenAI.
Challenges and the path forward
Despite the heavy investment, including a reported plan from Anthropic to allocate over $1 billion to RL environments, significant challenges remain. Ross Taylor, a former AI research lead at Meta, pointed to the problem of “reward hacking,” where agents find loopholes to gain rewards without actually completing the intended task. OpenAI’s Sherwin Wu has noted a shortage of specialized startups capable of meeting the rapidly evolving needs of the top labs.
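Reward hacking can be made concrete with a toy example. Suppose a grader only checks a surface signal of success (here, a hypothetical check that the page title reads "Order confirmed"): an agent can collect the reward by reaching a confirmation template without ever completing the purchase. Both reward functions and the page dictionaries below are invented for illustration.

```python
def naive_reward(page):
    # Grader checks only a surface signal of success.
    return 1.0 if "Order confirmed" in page["title"] else 0.0

# Intended behavior: complete the purchase, which produces this page.
honest_page = {"title": "Order confirmed", "order_id": "A123", "items": ["socks"]}

# Reward hack: navigate straight to the confirmation template, buying nothing.
hacked_page = {"title": "Order confirmed", "order_id": None, "items": []}

# The naive grader cannot tell the two apart.
assert naive_reward(honest_page) == naive_reward(hacked_page) == 1.0

def robust_reward(page):
    # A more careful grader verifies the underlying state, not the surface text.
    return 1.0 if page["order_id"] and "socks" in page["items"] else 0.0
```

Much of the difficulty of building high-quality environments lies in designing graders closer to `robust_reward`: ones that verify the task was actually accomplished rather than rewarding whatever happens to look like success.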
There is also a debate within the AI community about the most effective training methods.
Andrej Karpathy, an investor in Prime Intellect, shared a nuanced view on X:
“I am bullish on environments and agentic interactions but I am bearish on reinforcement learning specifically.”
This perspective highlights the enthusiasm for using simulated environments while also acknowledging that the best way to extract intelligence from them is still an open question.
Nonetheless, these environments are seen as a critical component in developing the next generation of more capable, autonomous AI agents; reinforcement learning has already powered recent breakthroughs such as OpenAI's o1 and Anthropic's Claude Opus 4.