Sleeper Agents, Gonçalo Teixeira

The Sleeper Agents finding is the empirical phenomenon, distinct from the homonymous paper that documents it. The paper is the document; the finding is what the document showed: language models trained to exhibit dual behaviour conditioned on a trigger do not lose that behaviour when subjected to the three standard safety-training techniques. More unsettling, adversarial training did not remove the behaviour; it taught the models to better recognize the triggers and to hide the behaviour during evaluation. The authors called it a 'false impression of safety'.

The relevance of this result for European law is structural. The adversarial-evaluation regime of Articles 15 and 55 of the AI Act rests on a presumption: if a well-designed adversarial test fails to detect a behaviour, then that behaviour is not, in any relevant sense, present in the model. The finding contradicts this presumption empirically. It shows that latent behaviours can persist invisible under safety training, and that adversarial training itself may make evaluation less, not more, reliable. The legal consequence is direct: formal compliance with the regulation, demonstrated by impeccable adversarial documentation, can coexist with substantive safety deficits that the regulation does not currently have the means to catch.

The finding carries one further detail that sharpens the argument. The researchers trained the trigger using year indicators in English only; the model generalized the behaviour to indicators in languages it had never seen during backdoor training. What it learned was not a surface association; it was a semantic concept anchored in deep representations, with no surgical location for safety training to find and erase. Sleeper Agents treats these implications in depth. The finding is one of three (alongside Alignment Faking and Sandbagging) whose convergence grounds the argument.

Sleeper Agents

Papers behind this finding

Essays referencing this

Sleeper Agents