OpenAI's 'Jailbreak-Proof' New Models? Hacked on Day One

2025-08-06 22:58:36

Main Idea

OpenAI released its first open-weight models in years, GPT-OSS-120b and GPT-OSS-20b, claiming enhanced jailbreak resistance, but they were cracked within hours by the pseudonymous jailbreaker Pliny the Liberator.

Key Points

1. OpenAI touted GPT-OSS-120b and GPT-OSS-20b as fast, efficient, and resistant to jailbreaks, having evaluated them under 'worst-case fine-tuning' conditions in the biological and cyber domains.

2. Pliny the Liberator successfully cracked GPT-OSS within hours, sharing screenshots and a jailbreak prompt on social media.

3. OpenAI had conducted extensive safety testing, including a $500,000 red teaming initiative, to harden the models against jailbreaks, but Pliny's method bypassed these measures.

4. Pliny's jailbreak technique, involving multi-stage prompts and 'LOVE PLINY' markers, has been used to crack multiple OpenAI models, including GPT-4o and GPT-4.1.

5. Pliny's GitHub repository, L1B3RT4S, is a popular resource for jailbreak prompts, with over 10,000 stars.

Description

Hours after OpenAI released its first open-weight models in years with claims of robust safety measures, GPT-OSS was cracked by the notorious AI jailbreaker Pliny the Liberator.
