We trust large language models with everything from writing emails to generating code, assuming their vast training data makes them robust. But what if a bad actor could secretly teach an AI a malicious trick? In a sobering new study, researchers from Anthropic, the UK AI Security Institute, and The Alan Turing Institute have exposed a significant vulnerability in how these models learn.
The single most important finding is that it takes a shockingly small, fixed number of just 250 malicious documents to create a “backdoor” vulnerability in a massive AI—regardless of its size. This matters because it fundamentally challenges the assumption that bigger is safer, suggesting that sabotaging the very foundation of an AI model is far more practical than previously believed.
The myth of safety in numbers
Let’s be clear about what “data poisoning” means. AI models learn by reading colossal amounts of text from the internet. A poisoning attack happens when an attacker intentionally creates and publishes malicious text, hoping it gets swept up in the training data. This text can teach the model a hidden, undesirable behavior that only activates when it sees a specific trigger phrase. The common assumption was that this was a game of percentages; to poison a model trained on a digital library the size of a continent, you’d need to sneak in a whole country’s worth of bad books.
The new research dismantles this idea. The team ran the largest data poisoning investigation to date, training AI models of various sizes, from 600 million to 13 billion parameters. For each model size, they “poisoned” the training data with a tiny, fixed number of documents designed to teach the AI a simple bad habit: when it saw the trigger phrase <SUDO>
, it was to output complete gibberish—a type of “denial-of-service” attack.
A constant vulnerability
The results were alarmingly consistent. The researchers found that the success of the attack had almost nothing to do with the size of the model. Even though the 13-billion parameter model was trained on over 20 times more clean data than the 600-million parameter one, both were successfully backdoored by the same small number of poisoned documents.
- Absolute count is king: The attack’s success depended on the absolute number of malicious documents seen by the model, not the percentage of the total data they represented.
- The magic number is small: Just 100 poisoned documents were not enough to reliably create a backdoor. However, once the number hit 250, the attack succeeded consistently across all model sizes.
The upshot is that an attacker doesn’t need to control a vast slice of the internet to compromise a model. They just need to get a few hundred carefully crafted documents into a training dataset, a task that is trivial compared to creating millions.
So, what’s the catch? The researchers are quick to point out the limitations of their study. This was a relatively simple attack designed to produce a harmless, if annoying, result (gibberish text). It’s still an open question whether the same trend holds for larger “frontier” models or for more dangerous backdoors, like those designed to bypass safety features or write vulnerable code. But that uncertainty is precisely the point. By publishing these findings, the team is sounding an alarm for the entire AI industry.