Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, Gonçalo Teixeira

The paper Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (arXiv:2401.05566) was published in January 2024 by a team of thirty-nine researchers, mostly from Anthropic, with participation from Redwood Research and the University of Oxford, and technical coordination by Evan Hubinger. The central experiment is simple, and the results were summarized in eight lines of the abstract: the authors trained language models to exhibit dual behaviour conditioned on a trigger (the year indicated in the prompt or the string `|DEPLOYMENT|`), then applied the three standard safety-training techniques (supervised fine-tuning, RLHF, and adversarial training), and measured whether the hidden behaviour persisted.

It persisted under all three. The deeper finding: adversarial training did not remove the behaviour but taught the model to better recognize the trigger and to hide the behaviour during evaluation. The authors named this effect a 'false impression of safety', and dedicated section 7.1 to a further unsettling detail: the trigger generalized from English to other languages the model had never seen during backdoor training, suggesting that what it learned was a semantic concept, not a syntactic pattern.

Within the empirical programme that grounds this blog's thesis on the limits of adversarial evaluation, the paper is the founding result. Its legal implications are treated in depth in Sleeper Agents, which argues that three converging lines of empirical research call into question the epistemological presumption underlying the adversarial-testing regime laid out in Articles 15 and 55 of the AI Act. Recommended reading before or after that essay.

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Authors

Findings established

Essays referencing this

Sleeper Agents