https://www.anthropic.com/research/small-samples-poison
In a joint study between Anthropic, the UK AI Security Institute, and the Alan Turing Institute, researchers have made a concerning discovery: as few as 250 malicious documents can successfully introduce a backdoor vulnerability into large language models (LLMs), regardless of model size or training data volume. This finding challenges the widely held assumption that attackers need to control a significant percentage of the training data to effectively poison an LLM.
The study focused on a specific type of backdoor attack, known as a “denial-of-service” attack, where the goal is to make the model produce random, gibberish text whenever it encounters a specific trigger phrase. While this particular attack is considered low-stakes, the researchers emphasise that the implications are troubling, as it demonstrates the potential for data-poisoning attacks to be more practical than previously believed.
The researchers trained models ranging from 600 million to 13 billion parameters, and found that even the largest model, trained on over 20 times more data than the smallest, could be successfully backdoored by the same small number of poisoned documents. This challenges the assumption that larger models require proportionally more poisoned data to be effectively attacked. The researchers are sharing these findings to encourage further research on data poisoning and potential defences, as the ability of attackers to introduce vulnerabilities with a fixed, small number of documents poses significant risks to the security and widespread adoption of AI systems.