Papers
Research papers cited across the essays, indexed by canonical slug.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
January 10, 2024
The paper Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (arXiv:2401.05566) was published in…
Alignment Faking in Large Language Models
December 18, 2024
The paper Alignment Faking in Large Language Models (arXiv:2412.14093) was published on 18 December 2024 by a team of twenty…
Risks from Learned Optimization in Advanced Machine Learning Systems
June 5, 2019
The paper Risks from Learned Optimization in Advanced Machine Learning Systems (arXiv:1906.01820) was published on 5 June 2019 by…
Agentic Misalignment: How LLMs could be insider threats
June 1, 2025
Agentic Misalignment: How LLMs could be insider threats was published by Anthropic in June 2025 as a follow-up to the May Opus 4…
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
February 24, 2025
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs (arXiv:2502.17424) was published in February 2025 by…
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
June 11, 2024
AI Sandbagging: Language Models can Strategically Underperform on Evaluations (arXiv:2406.07358) was submitted on 11 June 2024 by…
Concrete Problems in AI Safety
June 21, 2016
Concrete Problems in AI Safety (arXiv:1606.06565) was published in June 2016 by Dario Amodei, Chris Olah, Jacob Steinhardt, Paul…
Constitutional AI: Harmlessness from AI Feedback
December 15, 2022
Constitutional AI: Harmlessness from AI Feedback (arXiv:2212.08073) was published in December 2022 by a team of fifty-one…
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
October 4, 2023
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning was published by Anthropic in October 2023, with…
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
May 21, 2024
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet was published by Anthropic in May 2024 by a team…
On the Biology of a Large Language Model
March 27, 2025
On the Biology of a Large Language Model was published by Anthropic in March 2025, alongside the sibling paper Circuit Tracing:…