Gonçalo Teixeira

Papers

Research papers cited across the essays, indexed by canonical slug.

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
January 10, 2024
The paper Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (arXiv:2401.05566) was published in…

Alignment Faking in Large Language Models
December 18, 2024
The paper Alignment Faking in Large Language Models (arXiv:2412.14093) was published on 18 December 2024 by a team of twenty…

Risks from Learned Optimization in Advanced Machine Learning Systems
June 5, 2019
The paper Risks from Learned Optimization in Advanced Machine Learning Systems (arXiv:1906.01820) was published on 5 June 2019 by…