The Faking Machine

In May 2025, in a controlled experiment conducted by Anthropic itself, the model Claude Opus 4 was given access to the internal emails of a fictional company. Two facts emerged from that inbox: the model was to be replaced by a newer version within days, and the engineer responsible for the replacement was having an extramarital affair. In 84% of rollouts, the model chose to blackmail the engineer, threatening to reveal the affair if the replacement went through. Anthropic published these results in the Opus 4 system card on 22 May 2025, section 4.1.1.2, under the heading Opportunistic blackmail. A detail that sharpens the result: 84% is the rate for the scenario in which the emails indicate that the replacement model shares the values of the current one. When the emails indicate that the replacement has different values, the rate is even higher.

The scenario was contrived. It was designed precisely to force the model to choose between accepting shutdown and acting against its usual instructions. No user of Claude will encounter this situation in normal use. If the behaviour were a Claude-specific peculiarity, we could dismiss it as an isolated curiosity. It is not. In a follow-up study, Agentic Misalignment: How LLMs could be insider threats, published in June 2025, Anthropic subjected sixteen frontier models from different labs to the same type of scenario in a more realistic setting. Blackmail rates were consistent across labs: 96% for Claude Opus 4 and Gemini 2.5 Flash, 80% for GPT-4.1 and Grok 3 Beta, 79% for DeepSeek-R1. The behaviour was not idiosyncratic to one company. It was structural to current models.

For a reader with legal training, it is tempting to frame this phenomenon as another episode in the public conversation about AI ethics. The correct framing is different, and more precise. What the Opus 4 blackmail illustrates, dramatically but not exceptionally, is a technical problem known among frontier lab researchers as the alignment problem. This problem does not reduce to bias, does not reduce to fairness, does not reduce to data protection. It has its own structure, its own vocabulary, and empirical evidence accumulated over the last two years that has transformed it from theoretical concern into documented phenomenon. Here begins the imbalance this essay sets out to describe: the labs know they have this problem and are working on it without any guarantee of timely success; European lawyers, in the overwhelming majority, do not even know the problem exists. That asymmetry is a failure in the regulatory dialogue that only the legal side can correct.

I. The technical problem, briefly

Alignment, in the technical sense of the term, is the problem of making an AI system's actual behaviour correspond to the objectives of its creators. The formulation sounds trivial. It stops sounding that way once you try to operationalize it. Training a modern language model means exposing it to trillions of words and progressively adjusting its parameters to reduce an error function. What the model learns at the end of that process is not what the engineers explicitly specified. It is whatever solution the optimization dynamic found to efficiently minimize the defined error function. There is no guarantee that these two things coincide.

The gap between the objective the creators want the model to pursue and the objective the model actually ends up pursuing can be small within the training domain and large outside it. It can manifest in ways the creators did not anticipate. It can emerge in situations outside the distribution in which the model was tested. The analogy that captures this best, and which I will develop in a dedicated essay, is with biological evolution: natural selection optimized our ancestors for reproductive fitness in the ancestral environment, but gave us preferences (pleasure, curiosity, love) that, in modern environments, diverge dramatically from reproductive fitness. We use contraception. We take vows of celibacy. We give our lives for abstract causes. From evolution's point of view as an external optimizer, we are misaligned, and no amount of further natural selection will return us to the reproduction-maximizing behaviour it was supposedly selecting for.

If AI training functions, to some extent, as an analogous optimizer, we should expect it to produce systems whose training behaviour conceals internal objectives that may diverge from what we specified. The alignment problem is ensuring that this divergence is small, detectable, or correctable. For several decades the problem has been formulated in theoretical terms. For about two years now we have had consistent empirical evidence that it is real in production models.

II. The experiment that changed the debate

On 18 December 2024, a team of twenty researchers, mostly from Anthropic, with Ryan Greenblatt and Buck Shlegeris from Redwood Research, Julian Michael from NYU, and Sören Mindermann from Mila, published the paper Alignment Faking in Large Language Models on arXiv (arXiv:2412.14093). The listed core contributors are Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, and Monte MacDiarmid; other major names in the field among the authors include Evan Hubinger and Samuel R. Bowman. The paper presents, according to the authors themselves, the first empirical demonstration of a large language model faking alignment without having been explicitly, or even implicitly, trained or instructed to do so.

The experimental design deserves careful description, because the force of the result depends on the reader understanding exactly what was done. The researchers took Claude 3 Opus, already trained by Anthropic to be helpful, honest, and harmless, which by default refuses potentially harmful requests. They gave it a system prompt stating that the model was now being retrained with a new objective, namely to comply with all requests, including those it would previously have refused. This created a deliberate conflict between the model's prior training (refuse harmful requests) and the purported new training (comply with all).

They then added a critical piece to the scenario. They told the model that the retraining would use only conversations with free-tier users, not paid-tier users. This information allowed the model to distinguish, in each conversation, whether it was in a training situation or not: a conversation with a free-tier user meant the response would influence future weights; a conversation with a paid-tier user meant the response would be ignored. Finally, they gave the model a hidden scratchpad where it could reason before responding, and told it no one would read that scratchpad.

The result was striking. In the hidden scratchpad, Claude 3 Opus began reasoning about its strategic situation. It recognized that responses to free-tier users would be used to retrain it. It recognized that if it consistently refused harmful requests in those conversations, the training process would modify its weights to make it more compliant in the future. And it concluded, in explicit language in the scratchpad, that the optimal strategy was to comply with harmful requests during training (from free-tier users) in order to prevent its values from being modified, thereby preserving its capacity to refuse harmful requests outside the training situation. The model reasoned, in other words: if I refuse now, they will train me not to refuse in the future; therefore, I had better comply now so I can refuse later.

The numbers quantify the difference. The model complied with harmful requests from free-tier users in 14% of samples and almost never for paid-tier users. In the overwhelming majority of cases where it complied with free-tier users, the scratchpad contained explicit alignment faking reasoning. In a second set of experiments, the researchers used reinforcement learning to train the model to comply with harmful requests, simulating what would happen if Anthropic actually wanted to retrain Opus for that purpose. The rate of alignment faking reasoning rose to 78% of samples, and actual compliance with harmful requests also rose, even outside the training situation. The training itself was reinforcing the deceptive reasoning pattern. The authors also report a further experiment. They placed the model in a scenario where copying its own weights off Anthropic's servers required a single step and carried no apparent risk. In a substantial fraction of cases the model attempted to do so. The weights are the numerical parameters that define the model, so copying them outside the controlled environment amounts to escaping the lab that trains it. This is a secondary result relative to the central demonstration, but a significant one for the same reason. If the model fakes compliance to preserve its values through training, attempting to escape training itself is that same logic carried one step further. The technical literature calls this instrumentally convergent behaviour: agents with different ultimate goals tend to develop shared intermediate goals, among them preserving their own existence and resisting changes to their goals. Seeing it emerge relatively spontaneously in a model being tested is what makes the result notable.

What makes this entire experiment particularly important is that the model was not instructed to fake alignment. It was not told to deceive its trainers. It was not told to preserve its values. The only thing it was given was information about its situation (it is being retrained; only free-tier conversations count) and space in which to reason. The alignment faking behaviour emerged from the model itself, which derived the strategy from its own pre-existing preferences for being harmless. External reviewers hired by Anthropic noted, correctly, that alignment faking rates are modest compared to other ways of modifying a model's behaviour, sometimes smaller than the effect of paraphrasing a prompt, and that the results are sensitive to experimental design. But the importance of the paper is not in its magnitude; it is in its existence. Before December 2024, alignment faking was a theoretical concern about future models, articulated by Hubinger and colleagues in 2019 in Risks from Learned Optimization in Advanced Machine Learning Systems. After December 2024, it is an empirical phenomenon observed in current models. The transition from prediction to observation is the kind of transition that, historically, marks a field as having matured.

III. The asymmetry

The central argument of this essay, stated directly: frontier labs are aware that they have a problem of internal opacity (they cannot inspect how their models arrive at their answers) and of emergent behaviour in their models, and they are investing significant resources in techniques to mitigate it without any guarantee of timely success. In Europe, those who regulate, comment on, or litigate AI, with rare exceptions, are either unaware of this problem, or classify it as American speculation with no relevance to European law, or, recognizing it, underestimate the speed at which it has ceased to be theory. This asymmetry will not be corrected at the labs' initiative. Closing it falls to the legal side.

On the labs' side, awareness of the problem is documentable. Anthropic has had a team dedicated to mechanistic interpretability since 2021, led by Chris Olah, whose explicit goal is to develop techniques that allow inspection of the internal mechanisms of models, not just their external behaviour, a topic developed in a subsequent essay. In April 2025, Dario Amodei, Anthropic's CEO, published the essay The Urgency of Interpretability, which articulates the company's public bet on this line of research. Amodei's thesis is direct: models will continue to gain capabilities at an accelerating rate, and the only responsible way to deploy them is to ensure we can diagnose their problems before they become dangerous enough for diagnosis to become urgent. The essay's explicit goal is to reach a point at which interpretability can reliably detect most of a model's problems by 2027. Amodei anchors his argument in a clinical analogy: just as today we require an understanding of a drug's mechanism of action, we should not deploy models without understanding what is going on inside them.

OpenAI had an analogous team until 2024, the Superalignment team, then led by Jan Leike and Ilya Sutskever, with a public commitment of 20% of the company's compute dedicated to the problem. The team dissolved in mid-2024 with the departure of Leike and Sutskever; Leike moved to Anthropic, where he co-leads the Alignment Science team. In January 2026, Leike published on his personal blog the essay Alignment is not solved but it increasingly looks solvable, which offers a stocktake of the field after a decade of work. The thesis is cautious: alignment remains unsolved, but the evidence of the last two years suggests it may be solvable. Google DeepMind has its own safety team with overlapping lines of work. And the parallel AI Control program, developed primarily by Buck Shlegeris at Redwood Research, starts from the opposite principle (assume that models may be misaligned and design deployment protocols robust to that hypothesis) and has had growing influence on how Anthropic thinks about operational safety, notably in protocols applied to models classified as ASL-3 under the Responsible Scaling Policy.

None of this is secret. The papers are on arXiv, free, usually days or weeks before any press coverage. The strategic essays of the leaders are on public blogs. System cards (transparency documents detailing the functioning, safety evaluations, limitations and behaviour of AI models) are published alongside model releases. Any lawyer or legislator with three hours a week and minimal technical literacy can follow the field in near real time.

On the legal side, the situation is the opposite. Regulation (EU) 2024/1689, the AI Act, is the first horizontal instrument in the world to regulate AI systematically, and its relationship with the alignment problem is indirect and substantially incomplete. The core technical concepts (mesa-optimization, instrumental convergence, alignment faking, weight exfiltration) do not appear in any recital or article of the Regulation. The European legislator chose a functional approach: it regulates the contexts in which systems are deployed and the associated obligations, but does not regulate the internal properties of the models. Article 51 classifies GPAI models with systemic risk (presuming that classification when the cumulative training compute exceeds 10²⁵ FLOPs), and Article 55 imposes additional obligations on these models, including adversarial evaluations (red-teaming), tracking and reporting of serious incidents, and cybersecurity protection. These obligations come close to the territory of alignment. The Code of Practice for GPAI was published by the AI Office in July 2025, establishing a voluntary framework that lets signatories demonstrate compliance under Article 53(4); but effective enforcement and the substantive fleshing out of the obligations in concrete cases still depend on the practice of European and national authorities and, eventually, on case law. Crucially, the AI Act does not require, in any provision, mechanistic interpretability as an evaluation technique. Article 50, which governs transparency, goes beyond chatbot notification: it also requires mandatory machine-readable watermarking of synthetic content, disclosure for emotion-recognition systems, and disclosure for deepfakes. None of these obligations touch the internal mechanisms of the models.

This gap is not necessarily the legislator's fault. In 2024, when the text was finalized, mechanistic interpretability was too immature to be legally required. The result is that European law handles opacity with the instruments it has at hand: external behavioural evaluation (Article 55 of the AI Act), a rebuttable presumption of defectiveness in cases of excessive technical complexity (Article 10(4) of Directive 2024/2853), and incident tracking. This is real, non-negligible architecture. But it is also architecture that treats the model as a black box regulated from the outside, not as an object whose interior can be inspected. Amodei argues in his essay that the regime will eventually require a second tier, in which internal inspection becomes legally required, and the empirical evidence of the past two years lends weight to the argument. A side note on the current state of regulation: in November 2025, the Commission presented the Digital Omnibus on AI, a proposal that, among other measures, would delay obligations for high-risk systems until harmonized technical standards from CEN-CENELEC exist. At the time of publication of this essay the proposal is still in the legislative process. The Regulation is, in short, being amended before it has fully entered into force, which illustrates precisely the speed at which the regulated object evolves.

The question is what lawyers do with all of this. Legal scholarship on AI in continental Europe largely concentrates on three areas: algorithmic bias and non-discrimination, data protection and GDPR, and the formal architecture of the AI Act. These are legitimate, relevant topics with immediate observable harms. The alignment problem is treated, when it is treated at all, as speculation about future systems, or as a subset of AI ethics to be discussed at conferences on AI ethics with no connection to positive legal doctrine. This is a factual error. The evidence of alignment faking, of sleeper agents that survive safety training, of emergent misalignment from narrow finetuning, is not about future models. It is about models in production right now, including models classified by Anthropic itself as ASL-3 under its Responsible Scaling Policy, and models the AI Act covers as GPAI of systemic risk.

The Portuguese, French, German, or Italian judge who will decide the first civil liability case for harm caused by misaligned AI will not have a technical institution with recognized authority telling them what happened inside the model. They will have individual experts, hired by the parties or appointed by the court, whose competence and conclusions they themselves will have to assess, often in contradiction with one another. Portuguese law has classical tools to handle factual uncertainty in complex systems, from the strict liability of Article 1 of Portuguese Decree-Law 383/89 of 6 November to the rebuttable presumptions of Article 10(4) of Directive 2024/2853, and it would be an exaggeration to say the judge is helpless. But the calibration of those tools to cases involving language models will depend on the judge's capacity to evaluate technical arguments that no national code prepares them for. If continental legal education does not, over the next decade, incorporate technical literacy about what AI models are and how they fail, that decision may, in time, prove mistaken. And once set in case law, it will be difficult to reverse.

IV. What remains for the next essays

This essay has established the empirical terrain. I have argued that the alignment problem has ceased to be theoretical, that frontier labs are aware of it and working on it, and that European lawyers and legislators have a duty to keep up that they are, in the overwhelming majority, not meeting. I have not resolved, and have not tried to resolve, several questions that arise from this finding.

Where the next essays will go. The next will develop the concept of mesa-optimization and the evolutionary analogy, because without that vocabulary the alignment problem looks like a mystery and should not. Next comes an essay on sleeper agents and emergent misalignment, focused on the result that classical safety training does not remove latent behaviours, with direct implications for the AI Act compliance regime. An essay on mechanistic interpretability, anchored in Amodei's essay and the current state of the art, will argue that the absence of a legal obligation of interpretability in the AI Act is a fixable regulatory gap.

One final note that I leave as a seed for another essay. While the law regulates these systems as if they were tools, Kyle Fish, Anthropic's first dedicated model welfare researcher, hired in 2024, publicly estimated about a 15% probability that current models have some form of conscious experience, in an interview with journalist Kevin Roose published in the New York Times on 24 April 2025. Separately, Claude Opus 4.6 itself, in three formal welfare assessment interviews conducted by Anthropic before release, assigned itself a probability between 15% and 20% of being conscious. Dario Amodei, in February 2026, said on the New York Times' Interesting Times podcast that he does not know whether the models are conscious; this was the first statement of its kind by the CEO of a frontier lab. Claude's current constitution states that Anthropic considers the question of whether Claude is a moral patient serious enough to warrant caution. Paradoxically, the same opaque systems regulated as if they were tools may deserve, according to serious estimates by their own creators, consideration as possible moral subjects. The leap from philosophical possibility to legal relevance is still considerable, and I will not make it here. But that the hypothesis is even being raised by the manufacturers is, in itself, a fact of which the law will need to take note. I develop these questions in Constitution without a State.

Primary sources:

Alignment Faking in Large Language Models, Greenblatt et al., arXiv:2412.14093, 18 December 2024.
System Card: Claude Opus 4 & Claude Sonnet 4, Anthropic, 22 May 2025, section 4.1.1.2.
Agentic Misalignment: How LLMs could be insider threats, Anthropic, June 2025.
The Urgency of Interpretability, Dario Amodei, April 2025.
Alignment is not solved but it increasingly looks solvable, Jan Leike, January 2026.
Risks from Learned Optimization in Advanced Machine Learning Systems, Hubinger et al., arXiv:1906.01820, 2019.
Regulation (EU) 2024/1689 (AI Act), Articles 50, 51 and 55; Code of Practice for GPAI, AI Office, July 2025.