Person

Evan Hubinger

Anthropic

Evan Hubinger leads the Alignment Stress-Testing team at Anthropic and is one of the most wide-ranging technical figures in this line of research. In 2019, with Vladimir Mikulik, Joar Skalse, Chris van Merwijk, and Scott Garrabrant, he published Risks from Learned Optimization in Advanced Machine Learning Systems (arXiv:1906.01820), the paper that first articulated, in rigorous terms, the concept of mesa-optimization and anticipated the inner-alignment problem as a structural property of systems trained by gradient descent. In January 2024 he led Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (arXiv:2401.05566), with a team of thirty-nine researchers, showing that deceptive behaviours can persist through safety training and that adversarial training can teach a model to hide them rather than remove them. In December 2024 he co-signed Alignment Faking in Large Language Models (arXiv:2412.14093), with authors from Anthropic, Redwood Research, and Oxford, which empirically documented the phenomenon predicted five years earlier.

Hubinger's relevance to this blog's thesis rests not in any single paper but in the trajectory. The three papers he is centrally listed on trace, in sequence, the theoretical prediction of the mechanism (2019), the empirical demonstration of its persistence (January 2024), and its spontaneous observation in a production model (December 2024). It is a rare sequence: the same researcher formulates the theoretical prediction and then co-authors both empirical confirmations. For a legal reader, the upshot is that the adversarial-evaluation regime presumed by Articles 15 and 55 of the AI Act is now running, empirically, into an obstacle that was laid out conceptually five years earlier.

He is discussed by name in The Faking Machine, Emergent Goals, and Sleeper Agents. His work also surfaces in Constitution Without a State through Anthropic's Constitutional AI programme.

Papers authored

Essays referencing this