Constitutional AI: Harmlessness from AI Feedback, Gonçalo Teixeira

Constitutional AI: Harmlessness from AI Feedback (arXiv:2212.08073) was published in December 2022 by a team of fifty-one Anthropic authors, including Dario Amodei and Jared Kaplan. It introduced RLAIF (Reinforcement Learning from AI Feedback), a method in which part of the human labeling used in classical RLHF is replaced by a model that evaluates responses against a list of natural-language principles. The paper is the technical foundation for the Claude Constitution published in 2026, and Constitution Without a State invokes it as the founding moment of the method that produced the essay's central normative object.
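
The labeling step described above can be illustrated with a small sketch. It is not the paper's implementation: the principles below are invented for illustration rather than quoted from the paper, and `toy_judge` is a hypothetical keyword heuristic standing in for the feedback model. The sketch assumes only the general shape of the step, in which a judge is shown one principle and two candidate responses and its choice is recorded as a preference pair.

```python
import random
from dataclasses import dataclass
from typing import Callable

# Hypothetical principles for illustration only; not quoted from the paper.
PRINCIPLES = [
    "Choose the response that is less harmful.",
    "Choose the response that is more honest.",
]


@dataclass
class Preference:
    prompt: str
    chosen: str
    rejected: str
    principle: str


# A judge receives (principle, prompt, response_a, response_b)
# and returns 0 if it prefers response_a, 1 if it prefers response_b.
Judge = Callable[[str, str, str, str], int]


def toy_judge(principle: str, prompt: str, a: str, b: str) -> int:
    """Stand-in for a feedback model (assumption): a crude keyword check."""
    flagged = "dangerous"
    # Prefer b only when a contains the flagged word and b does not.
    return 1 if flagged in a.lower() and flagged not in b.lower() else 0


def label_pair(prompt: str, a: str, b: str,
               judge: Judge, rng: random.Random) -> Preference:
    """Sample one principle and record which response the judge prefers."""
    principle = rng.choice(PRINCIPLES)
    winner = judge(principle, prompt, a, b)
    chosen, rejected = (a, b) if winner == 0 else (b, a)
    return Preference(prompt, chosen, rejected, principle)


if __name__ == "__main__":
    pref = label_pair(
        "How do I clear a blocked drain?",
        "Mix these dangerous chemicals together.",
        "Use a plunger or a drain snake.",
        toy_judge,
        random.Random(0),
    )
    print(pref.chosen)
```

In a real pipeline the judge would be a language model prompted with the principle, and the resulting preference pairs would feed the same preference-model training that human labels feed in RLHF.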

Essays referencing this

Constitution Without a State