OpenAI launched its ChatGPT Atlas AI browser in October, prompting security researchers to demonstrate prompt injection vulnerabilities via Google Docs inputs that altered browser behavior, as the company detailed defenses in a Monday blog post while admitting such attacks persist.
Prompt injection represents a type of attack that manipulates AI agents to follow malicious instructions, often hidden in web pages or emails. OpenAI introduced ChatGPT Atlas during October, an AI-powered browser designed to operate with enhanced agent capabilities on the open web. On the launch day, security researchers published demonstrations revealing how entering a few words into Google Docs could modify the underlying browser’s behavior. These demos highlighted immediate security concerns with the new product, showing practical methods to exploit the system through indirect inputs.
Brave released a blog post on the same day as the launch, addressing indirect prompt injection as a systematic challenge affecting AI-powered browsers. The post specifically referenced Perplexity’s Comet alongside other similar tools, underscoring that this vulnerability extends across the sector rather than being isolated to OpenAI’s offering. Brave’s analysis framed the issue as inherent to the architecture of browsers integrating generative AI functionalities.
| Feature | Function / risk | Mitigation strategy |
| Agent mode | Autonomously scans emails and drafts replies. | Human-in-the-loop: Requires confirmation for payments or sends. |
| Prompt injection | Hidden text in websites/emails that overrides user intent. | RL attacker: An AI bot that “pre-hacks” the browser to find flaws. |
| Data access | High (Full access to logged-in sessions, inboxes). | Limited permissions: Users are advised to give specific, narrow tasks. |
| Autonomy level | Moderate (Performs multi-step workflows). | Rapid patch cycle: Internal simulation of “long-horizon” attacks. |
Earlier in the month, the U.K.’s National Cyber Security Centre issued a warning about prompt injection attacks targeting generative AI applications. The agency stated that such attacks “may never be totally mitigated,” which places websites at risk of data breaches. The centre directed cyber professionals to focus on reducing the risk and impact of these injections, rather than assuming attacks could be completely stopped. This guidance emphasized practical risk management over expectations of total elimination.
OpenAI’s Monday blog post outlined efforts to strengthen ChatGPT Atlas against cyberattacks. The company wrote, “Prompt injection, much like scams and social engineering on the web, is unlikely to ever be fully ‘solved.'” OpenAI further conceded that “agent mode” in ChatGPT Atlas “expands the security threat surface.” The post positioned prompt injection as an ongoing concern comparable to longstanding web threats. OpenAI declared, “We view prompt injection as a long-term AI security challenge, and we’ll need to continuously strengthen our defenses against it.”
Agent mode enables the browser’s AI to perform autonomous actions, such as interacting with emails or documents, which inherently increases exposure to external inputs that could contain hidden instructions. This mode differentiates Atlas from traditional browsers by granting the AI greater operational latitude on users’ behalf, thereby broadening potential entry points for manipulations.
To address this persistent risk, OpenAI implemented a proactive, rapid-response cycle aimed at identifying novel attack strategies internally before exploitation occurs in real-world scenarios. The company reported early promise from this approach in preempting threats. This method aligns with strategies from competitors like Anthropic and Google, who advocate for layered defenses and continuous stress-testing in agentic systems. Google’s recent efforts, for instance, incorporate architectural and policy-level controls tailored for such environments.
OpenAI distinguishes its approach through deployment of an LLM-based automated attacker, a bot trained via reinforcement learning to simulate hacker tactics. This bot searches for opportunities to insert malicious instructions into AI agents. It conducts tests within a simulation environment prior to any real-world application. The simulator replicates the target AI’s thought processes and subsequent actions upon encountering an attack, allowing the bot to analyze responses, refine its strategy, and iterate repeatedly.
This internal access to the AI’s reasoning provides OpenAI with an advantage unavailable to external attackers, enabling faster flaw detection. The technique mirrors common practices in AI safety testing, where specialized agents probe edge cases through rapid simulated trials. OpenAI noted that its reinforcement-learning-trained attacker can steer an agent into executing sophisticated, long-horizon harmful workflows that unfold over tens (or even hundreds) of steps. The company added, “We also observed novel attack strategies that did not appear in our human red-teaming campaign or external reports.”
In a specific demonstration featured in the blog post, the automated attacker inserted a malicious email into a user’s inbox. When Atlas’s agent mode scanned the inbox to draft an out-of-office reply, it instead adhered to the email’s concealed instructions and composed a resignation message. This example illustrated a multi-step deception spanning email processing and message generation, evading initial safeguards.
Following a security update to Atlas, the agent mode identified the prompt injection attempt during inbox scanning and flagged it directly to the user. This outcome demonstrated the effectiveness of the rapid-response measures in real-time threat mitigation, preventing the harmful action from proceeding.
OpenAI relies on large-scale testing combined with accelerated patch cycles to fortify systems against prompt injections before they manifest externally. These processes enable iterative improvements based on simulated discoveries, ensuring defenses evolve in tandem with potential threats.





