Paper
Risks from Learned Optimization in Advanced Machine Learning Systems
The paper Risks from Learned Optimization in Advanced Machine Learning Systems (arXiv:1906.01820) was published on 5 June 2019 by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Before this document, the problem of AI misalignment was articulated in loosely intuitive terms. The paper gave it rigorous vocabulary, and that is why every essay in this series depends on it.
The fundamental distinction the authors introduce is between two levels of optimization. The base optimizer is the training process itself, typically gradient descent adjusting parameters to minimize a loss function; the base objective is the loss defined by the engineers. The mesa-optimizer is a learned model that itself performs internal optimization, with its own mesa-objective, which may or may not coincide with the base one. Mesa, from the Greek for inner, in deliberate opposition to meta. Training a capable model on a complex task can produce not just a passive executor of the loss function, but an internal optimizer whose objective was selected because it correlated well with the base objective within the training distribution. Outside that distribution the correlation may break.
The analogy the authors deliberately deploy is evolutionary. Natural selection is the base optimizer; reproductive fitness is the base objective; we, organisms with nervous systems complex enough to plan, are mesa-optimizers. Our mesa-objectives are pleasure, curiosity, affection, social status. None is reproductive fitness; they are proxies that correlated with fitness in the ancestral environment and that, in the modern one, diverge dramatically: we use contraception, take vows of celibacy, die for abstract causes. Evolution, with hundreds of millions of years and massive populations, failed to produce an aligned mesa-optimizer.
The thesis has a direct legal implication. Misalignment is not a bug to fix; it is a structural property of the optimization process applied to sufficiently complex systems. For the European product-liability regime this means the defect is not deviation from specification but divergence emergent from training, and the regime today, under Directive 2024/2853 and Portuguese Decree-Law 383/89, has partial mechanisms to handle this.
Emergent Goals treats the concept in depth. The Faking Machine cites it as the theoretical precedent for Alignment Faking. Sleeper Agents invokes it for the same reason, with Hubinger as the direct link between 2019 and 2024-2025.