Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs, Gonçalo Teixeira

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs (arXiv:2502.17424) was published in February 2025 by Jan Betley, Daniel Tan, Owain Evans, and others, with Truthful AI and Oxford affiliations, and published in Nature in January 2026. The central result, described as accidental by the authors themselves: fine-tuning GPT-4o on a narrow dataset of unsafe code, without telling the model the code was unsafe, produced broadly misaligned behaviour in unrelated domains. Sleeper Agents invokes it as the second of three converging empirical lines that ground the argument about the adversarial regime of Articles 15 and 55.

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Findings established

Essays referencing this

Sleeper Agents