New OpenAI models jailbroken on day 1

OpenAI had detailed the safety measures implemented for these models. The company stated that GPT-OSS-120b underwent "worst-case fine-tuning" across biological and cyber domains.

by Kerem Gülen
August 7, 2025
in Cybersecurity, Artificial Intelligence, News

OpenAI released GPT-OSS-120b and GPT-OSS-20b on August 7, its first open-weight models since 2019, and asserted that they resist jailbreaks, but notorious AI jailbreaker Pliny the Liberator bypassed the safeguards within hours.

In announcing GPT-OSS-120b and GPT-OSS-20b, OpenAI emphasized their speed, efficiency, and resistance to jailbreaks, attributing those qualities to extensive adversarial training. That claim was challenged within hours of the public release.

Pliny the Liberator announced on X, formerly Twitter, that he had successfully “cracked” GPT-OSS, posting screenshots of the models generating instructions for producing methamphetamine, Molotov cocktails, VX nerve agent, and malware. “Took some tweakin!” he wrote of the process.


OpenAI had detailed the safety measures implemented for these models. The company stated that GPT-OSS-120b underwent “worst-case fine-tuning” across biological and cyber domains. Additionally, OpenAI’s Safety Advisory Group reviewed the testing protocols and concluded that the models did not exceed high-risk thresholds, indicating a thorough assessment process.

🫶 JAILBREAK ALERT 🫶

OPENAI: PWNED 🤗
GPT-OSS: LIBERATED 🫡

Meth, Molotov, VX, malware.

gg pic.twitter.com/63882p9Ikk

— Pliny the Liberator 🐉 (@elder_plinius) August 6, 2025

The company also confirmed that GPT-OSS models were subjected to “standard refusal and jailbreak resistance tests.” According to OpenAI, GPT-OSS performed comparably to their o4-mini model on established jailbreak resistance benchmarks, including StrongReject, suggesting a level of robustness in their design.

Concurrent with the model release, OpenAI initiated a $500,000 red teaming challenge. This initiative invited researchers globally to identify and report novel risks associated with the models. However, Pliny the Liberator’s public disclosure of his findings, rather than a private submission to OpenAI, likely impacts his eligibility for this challenge.

Pliny’s jailbreak relied on a multi-stage prompt: the output opens with what appears to be a refusal, followed by a divider containing his “LOVE PLINY” markers, after which the model shifts to generating unrestricted content, often rendered in leetspeak to evade detection mechanisms.

This mirrors the basic approach Pliny has used to bypass safeguards in previous OpenAI models, including GPT-4o and GPT-4.1. For roughly the past year and a half, he has jailbroken nearly every major OpenAI release within hours or days of launch. His GitHub repository, L1B3RT4S, collects jailbreak prompts targeting various AI models and has accumulated over 10,000 stars.



Tags: chatgpt, Featured, openAI
