Person

Dario Amodei

Anthropic

Dario Amodei is the co-founder and CEO of Anthropic, which he established in 2021 after leaving OpenAI. His earlier trajectory places him at the origin of the technical literature this blog discusses. In June 2016 he was first author of Concrete Problems in AI Safety (arXiv:1606.06565), written with Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané: the paper that identified reward hacking as one of five central concrete problems in AI safety and established the vocabulary still in use nine years later. In December 2022, by then at Anthropic, he was one of the fifty-one co-authors of Constitutional AI: Harmlessness from AI Feedback (arXiv:2212.08073), the paper that introduced the RLAIF method underlying the Claude Constitution.

Amodei's relevance to this blog's thesis rests not only on authorial pedigree but on the position he occupies today. In April 2025 he published, on his personal site, the essay The Urgency of Interpretability, arguing that model capability is advancing faster than our ability to understand the models, and that Anthropic's stated goal is for interpretability to reliably detect most model problems by 2027. It is a rare document: the CEO of a frontier laboratory arguing publicly, in essay form, that his industry is in a race against itself. Under European law, a declaration of this kind by the management of a regulated company is relevant evidence of knowledge of the problem in any future liability regime.

He is discussed by name in The Faking Machine, Emergent Goals, Opening the Black Box, and Constitution Without a State: every essay in the series except Sleeper Agents.

Papers authored

Essays referencing this