Gonçalo Teixeira

Opening the Black Box


In May 2024, Anthropic put online, for about 24 hours, a special version of its model called Golden Gate Claude. The model was, in every other respect, identical to Claude 3 Sonnet, which was then in production. The difference lay in a single internal adjustment: the researchers had identified, inside the neural network, a specific set of neurons that responded to the concept "Golden Gate Bridge", and they had artificially amplified the activity of that set. The result was absurd and revealing at the same time. Ask the model what to pack for a trip, and it would respond with advice on how to cross the bridge. Ask for a chocolate cake recipe, and it would suggest shaping the cake like the bridge's metal structure. Asked who it was, it responded: "I am the Golden Gate Bridge." The model was, literally, obsessed.

The joke lasted 24 hours and amused the internet. For those reading carefully, however, there was something far more important than a prank. Anthropic had demonstrated, publicly and with a production model, that it was possible to do two things that until recently were considered impossible. First, to identify, within the tangle of billions of parameters of a neural network, interpretable units corresponding to human concepts (in this case, "Golden Gate Bridge"). Second, to causally intervene on those units, increasing or decreasing their activation, and to verify the behavioural consequence. To see inside. And to reach in. In other words, to do with a neural network what a neuroscientist still cannot do with a human brain.
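
To make the nature of the intervention concrete, here is a minimal numerical sketch of activation steering, the generic technique behind Golden Gate Claude: find the direction in activation space associated with a concept, then clamp the model's activation along that direction. Everything here is illustrative (the vectors are random, the names invented); Anthropic's actual pipeline operates on features extracted from a production model, not on toy vectors.

```python
import numpy as np

# Illustrative sketch of activation steering; not Anthropic's code.
# `hidden` stands in for a model's internal activation at one layer;
# `feature_dir` for the learned direction associated with a concept
# (e.g. "Golden Gate Bridge"). Both are invented here.

rng = np.random.default_rng(0)
d_model = 512                               # hidden size of the toy model
hidden = rng.normal(size=d_model)           # activation for some input
feature_dir = rng.normal(size=d_model)
feature_dir /= np.linalg.norm(feature_dir)  # unit-norm concept direction

def steer(h: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Clamp the component of `h` along `direction` to `strength`."""
    current = h @ direction                 # how active the concept is now
    return h + (strength - current) * direction

boosted = steer(hidden, feature_dir, strength=10.0)  # the "obsessed" model
ablated = steer(hidden, feature_dir, strength=0.0)   # concept suppressed
print(boosted @ feature_dir, ablated @ feature_dir)  # 10.0 and ~0.0
```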

This essay argues that research in mechanistic interpretability (of which Golden Gate Claude was only one visible instance) has direct implications for two areas of European law that were built on the assumption that it was possible to require companies to explain automated decisions: the transparency and oversight regime of Regulation (EU) 2024/1689 (AI Act), notably Articles 13, 14, and 86, and the right to meaningful information about the logic involved in automated decisions under Article 15(1)(h) of the General Data Protection Regulation (GDPR). The thesis is twofold, and its two halves pull in opposite directions. On the one hand, the current state of interpretability makes clear that European law demands something that engineering does not yet know how to provide. On the other hand, recent advances suggest that within five years, perhaps sooner, engineering will offer tools that allow the right to explanation already enshrined by the legislator to be taken seriously.

I. What mechanistic interpretability is

The expression "mechanistic interpretability" originated in the work of Chris Olah, first at Google, then at OpenAI, and since 2021 at Anthropic, where he is a co-founder. The central idea is simple to state and difficult to realize. If we accept that modern generative models are, in Olah's own words, grown rather than built, then we need a methodology analogous to that of the natural sciences to study them: dissection, observation, intervention, explanatory model. It is not enough to know that a model works; we need to understand why it works, step by step, at the level of its internal mechanisms.

The central technical problem, identified early in the research, is called superposition. Modern language models contain millions or billions of parameters, but neural networks do not use one-concept-per-neuron representations. A single neuron can activate in response to apparently unrelated concepts. Models, in practice, represent more concepts than they have neurons, packing them into overlapping combinations. For a researcher looking at individual neuron activations, the result is incomprehensible: an incoherent mixture of multiple concepts per unit.
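
A toy numerical illustration of the phenomenon, under the simplifying assumption that concepts are random directions in activation space: with far more concepts than neurons, each neuron ends up responding to many concepts, and the concepts interfere with one another. The numbers below are invented and say nothing about the geometry of any production model.

```python
import numpy as np

rng = np.random.default_rng(1)
n_neurons, n_concepts = 16, 64          # far more concepts than neurons
# Each concept gets a random direction in the 16-dim activation space.
concepts = rng.normal(size=(n_concepts, n_neurons))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

# A single neuron is one coordinate of that space. Count how many
# concepts load noticeably on it: the neuron is polysemantic.
neuron_0_loadings = np.abs(concepts[:, 0])
print((neuron_0_loadings > 0.3).sum(), "of", n_concepts,
      "concepts load noticeably on neuron 0")

# The price of packing: concepts are only *nearly* orthogonal,
# so they interfere with one another.
overlaps = concepts @ concepts.T
np.fill_diagonal(overlaps, 0.0)
print("max interference between two concepts:", np.abs(overlaps).max())
```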

The technical solution came with sparse autoencoders (SAEs), a signal-processing technique repurposed for this end. The idea, simplified, is as follows: if concepts are packed into combinations of neurons, one trains a secondary network to discover those combinations and separate them. In less technical language, SAEs work like a decoder: they receive the mixed activations and return the individual concepts that compose them. In October 2023, in the paper Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, a team from Anthropic led by Trenton Bricken and Adly Templeton demonstrated that the method works on a small single-layer transformer. In May 2024, in Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, the same team scaled the technique to a production model. They found more than 30 million features, interpretable conceptual units, inside Claude 3 Sonnet. Among them, the most interesting for legal purposes are features for "code bugs and errors", for deception, for sycophancy, for safety-relevant concepts, and multilingual features that represent the same concept across languages.
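
For the technically inclined, a minimal sketch of the recipe: an overcomplete linear dictionary trained to reconstruct a layer's activations from a sparse, non-negative code, with an L1 penalty supplying the sparsity pressure. Sizes, learning rate, and penalty weight below are illustrative, and the published method adds refinements (decoder weight normalization, resampling of dead features) omitted here.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: a dictionary much wider than the layer it
    studies, trained to reconstruct activations from a sparse code."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # mixed -> concepts
        self.decoder = nn.Linear(d_features, d_model)  # concepts -> mixed

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse, non-negative
        recon = self.decoder(features)
        return recon, features

# Toy training loop on fake "activations"; a real SAE is trained on
# activations harvested from a specific layer of the model under study.
d_model, d_features = 128, 1024
sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # sparsity pressure on the code

for step in range(200):
    acts = torch.randn(256, d_model)  # stand-in for harvested activations
    recon, features = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```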

In March 2025, Anthropic published two papers that advanced the research a step further: Circuit Tracing: Revealing Computational Graphs in Language Models and On the Biology of a Large Language Model. The second applies the methodology of attribution graphs to Claude 3.5 Haiku and documents specific findings about how the model thinks. When the model is asked "what is the capital of the state containing Dallas?", the researchers can see, inside the model, the following reasoning: the input "Dallas" activates a feature corresponding to "Texas"; the combination of "Texas" with "capital" activates a feature corresponding to "Austin". Multi-hop reasoning, explicit, visible, and manipulable: if the researchers suppress the "Texas" feature, the model loses the chain of reasoning. Another finding: when the model writes poetry, it does not simply generate the next word; it identifies in advance several possible end-rhyme words and constructs the sentence to reach one of them. There is internal planning, not surface-level probabilistic simulation.
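
The structure of such a finding can be conveyed with a toy attribution graph: features as nodes, attributions as weighted edges, intervention as node deletion. The graph and weights below are invented for illustration; real attribution graphs are estimated from the model itself, not written by hand.

```python
# Toy version of the "Dallas -> Texas -> Austin" finding. Suppressing
# the intermediate "Texas" feature severs the chain of reasoning.
from collections import defaultdict

edges = {
    ("input: Dallas", "feature: Texas"): 0.9,
    ("input: capital", "feature: say-a-capital"): 0.8,
    ("feature: Texas", "output: Austin"): 0.7,
    ("feature: say-a-capital", "output: Austin"): 0.6,
}

def path_strength(edges, source, target, suppressed=frozenset()):
    """Total attribution flowing from source to target along all paths,
    skipping any node in `suppressed` (the intervention)."""
    graph = defaultdict(list)
    for (a, b), w in edges.items():
        if a not in suppressed and b not in suppressed:
            graph[a].append((b, w))

    def walk(node, weight):
        if node == target:
            return weight
        return sum(walk(nxt, weight * w) for nxt, w in graph[node])

    return walk(source, 1.0)

print(path_strength(edges, "input: Dallas", "output: Austin"))  # 0.63
print(path_strength(edges, "input: Dallas", "output: Austin",
                    suppressed={"feature: Texas"}))             # 0.0
```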

The authors themselves acknowledge the limits of the method with an honesty that deserves note. Attribution graphs produce satisfying insight on only about one quarter of the prompts tested. The technique works, but it is far from the equivalent of an MRI for AI, the expression Dario Amodei uses to describe the final goal of the agenda. Mechanistic interpretability is today, in April 2026, a technology in the phase of scientific demonstration, not operational deployment. That is precisely the diagnosis that led the CEO of Anthropic to publish, in April 2025, an essay entitled The Urgency of Interpretability.

II. Amodei's argument

Amodei's essay, published on 24 April 2025 on his personal site, is a rare piece: the CEO of an AI company publicly arguing, in a technical and normative register, that his industry is in a race against itself. The thesis is direct. The capability of models is advancing faster than our ability to understand them. Anthropic expects to have systems equivalent to "a country of geniuses in a datacenter" in 2026 or 2027. Mature interpretability, in the author's words, may be 5 to 10 years away, with the intermediate goal, stated in the essay, of reliably detecting the majority of a model's problems by 2027. There is therefore a critical interval in which capability will arrive before understanding. Amodei considers it "basically unacceptable for humanity to be totally ignorant of how [the models] work."

Two points from the essay merit specific attention for the legal argument.

First, Amodei explicitly acknowledges the limits of empirical evidence about deceptive behaviour and power-seeking in current models. He writes: "we've never seen any solid evidence in truly real-world scenarios of deception and power-seeking because we can't catch the models red-handed thinking power-hungry, deceitful thoughts." The existing evidence, including the alignment faking and sleeper agents results that I treated in previous essays, comes from semi-artificial scenarios. I note this to avoid overstating the argument. But, and here lies the strong point, the reason we have no robust evidence in natural scenarios is not that the phenomenon does not exist; it is that we lack the tools to detect it when it does. Interpretability, if it advances, will provide that capacity.

Second, Amodei articulates a specific function for interpretability that has deep implications for law. I quote: "interpretability should function like the test set for model alignment, while traditional alignment techniques such as scalable supervision, RLHF, constitutional AI, etc. should function as the training set." Put differently, model training is the process we are trying to optimize; interpretability is the independent verification of the outcome. What is asked of interpretability is that it give an external diagnosis, not contaminated by the training process, of the model's actual state. Amodei should be read with appropriate distance: he is the CEO of a company whose value proposition depends in part on interpretability becoming an auditable tool, and he has an incentive to dramatize both the urgency and the feasibility of the agenda. The temporal targets he sets out are public commitments with low immediate cost and potential reputational benefit. Treating the essay as evidence about the actual state of the technology would conflate corporate forecast with independent diagnosis. This function of independent verification is precisely what European law, in its more sophisticated forms, is trying to require.

III. What European law already requires

Three European legal regimes, whose overlaps are not entirely resolved, combine to establish a right to explanation for automated decisions. We take them in chronological order of adoption.

The oldest and most established is Article 22 of the General Data Protection Regulation (Regulation (EU) 2016/679), applicable since 25 May 2018. Paragraph 1 establishes that "the data subject shall have the right not to be subject to a decision based solely on automated processing, including profiling, which produces legal effects concerning him or her or similarly significantly affects him or her." The right is, technically, a right not to be subject to such decisions rather than a right to explanation, but it combines with Article 15(1)(h), which establishes the data subject's right to obtain "meaningful information about the logic involved." The doctrinal debate about whether this amounts to a substantive right to explanation is longstanding, and for years the CJEU had not interpreted Article 22.

On 7 December 2023, that silence ended with the judgment SCHUFA Holding (Scoring), Case C-634/21. The Court, in the First Chamber, established that the automated generation, by a credit agency, of a probability value regarding a person's ability to meet future financial obligations constitutes an automated decision within the meaning of Article 22, even if the subsequent commercial decision (whether or not to grant credit) is formally taken by a human. The Chamber adopted a purposive reading: what matters is whether the automated value in fact conditions the final decision. If it does, there is an automated decision.

On 27 February 2025, the CJEU advanced a further step with the judgment CK v Magistrat der Stadt Wien, Case C-203/22 (informally known as the Dun & Bradstreet judgment). The concrete case was, again, credit scoring: an Austrian citizen (CK) had been refused a mobile phone contract based on an automated credit assessment by Dun & Bradstreet, which invoked trade secrecy to avoid explaining the logic of the score. The Court established three principles that directly affect this essay's argument. First, the meaningful information required by Article 15(1)(h) is not satisfied by mere communication of the algorithm, nor by a "detailed description of all stages of the automated decision-making process"; it requires accessible and intelligible explanation for the data subject, allowing them to exercise their rights of contestation effectively. Second, where the controller invokes trade secrets, the information must be communicated to the competent court or supervisory authority that will balance the interests case by case. Third, and this is the crucial point, Member States may not introduce rules that deny access on the sole ground of trade secrecy.

The second chronological regime is the Digital Services Act (Regulation (EU) 2022/2065, DSA), but this is marginal to our discussion and I pass over it.

The third is the AI Act (Regulation (EU) 2024/1689, in force since 1 August 2024, with phased application). Three articles are relevant here. Article 13, on transparency and provision of information to deployers, requires high-risk systems to be designed and developed in such a way as to ensure sufficient transparency to allow deployers (the professional operators who use the system) to interpret the outputs and use them appropriately. Article 14, on human oversight, requires that systems allow effective oversight by natural persons during the period they are in use. Article 86, on the right to explanation for individual decisions, establishes, for persons affected by decisions taken on the basis of a high-risk AI system listed in Annex III that produce legal effects or significantly affect health, safety, or fundamental rights, the right to obtain from the deployer "clear and meaningful explanations of the role of the AI system in the decision-making procedure and the main elements of the decision taken."

Note the architecture. The GDPR covers automated decisions in general, with CJEU case law developing the content of the required explanation. The AI Act adds a specific layer for high-risk systems, with explanation owed to the affected person by the deployer, and additional obligations of transparency and oversight imposed on the provider and deployer. The regimes do not replace each other; they overlap. A deployer who uses a high-risk system to take automated decisions bears cumulative obligations under both regimes.

IV. Where the bridge becomes visible

Now the point of the essay. All these rights to explanation presuppose that it is technically possible to produce substantive explanation. Article 86 of the AI Act requires "clear and meaningful" explanation of the "role of the AI system." The CJEU's case law in Dun & Bradstreet requires "accessible and intelligible" explanation of the "procedure and principles." What happens when the AI system is a contemporary generative model, whose decision emerges from interactions between millions of features in a parameter space with billions of dimensions?

There is a latent contradiction here, which European law has not yet confronted directly because the existing case law (SCHUFA, Dun & Bradstreet) still concerned credit scoring with relatively interpretable algorithms (logistic regressions, decision trees, gradient boosting). The transition from models of this type to generative language models will make the contradiction acute. This analysis assumes, it should be noted, that the integrated system counts as a high-risk AI system for the purposes of Annex III: defensible in many use cases, but not automatic, and a contested point in the AI Act's architecture. A deployer that uses a version of Claude, GPT, or Gemini to support decisions in administrative proceedings, recruitment, credit granting, or clinical support may be legally required to provide "clear and meaningful" explanation of the "role of the AI system" in a specific decision. The honest answer, with the technology available in 2026, is that the deployer does not know, because the provider does not know, because nobody knows in detail why the model produced a particular response on a specific occasion.

This is where mechanistic interpretability ceases to be a technical curiosity and becomes a substantive precondition for legal compliance. If interpretability fulfils what Amodei promises, in particular the goal of reliably detecting the majority of models' problems by 2027, European law will have a reference object for what "meaningful explanation" means. The attribution graphs showing "Dallas → Texas → Austin" may be, once mature, the technical basis for legally required explanation reports. Until then, the right to explanation under Article 86 of the AI Act and Article 15(1)(h) GDPR is formally enforceable and substantively constrained by the engineering available.

Three concrete implications follow for legal scholarship and national courts in the coming years.

First implication. The concept of "meaningful explanation" will need to be defined by reference to the state of the art of interpretability, not to the state of the models themselves. A provider that uses available interpretability techniques and documents the results is in a substantively stronger position to demonstrate compliance than one that could use them but does not, preferring operational opacity. Legal diligence does not operate in binary terms; it absorbs technical progress gradually, through the fleshing out of the standard of due care. This reading is consistent with Dun & Bradstreet, which rejects the generic invocation of trade secrets as a shield against the obligation to explain.

Second implication. The attribution of responsibility between provider and deployer may need to be reformulated in light of the evolution of interpretability tools. Currently, Article 86 of the AI Act places the obligation of explanation on the deployer, which makes sense when the decision is taken by the deployer based on the system's output. But if the interpretability of models is technically possible only at the level of the provider (who has access to the model weights and internal research infrastructure), the deployer will be effectively unable to comply without active collaboration from the provider. Mandatory information-sharing between providers and deployers may need to be addressed through additional regulation, or through creative case law interpreting the provider's duty under Article 13 substantively rather than merely formally.

Third implication. The new Directive (EU) 2024/2853 on liability for defective products, which I treated in essay two, will have its Article 10(4) presumption of defectiveness based on technical complexity progressively refined by the evolution of interpretability. If, at the time of proceedings, interpretability is technically impossible, the technical complexity is real and the presumption operates fully. If, at a given point, interpretability offers reliable tools and the provider chose not to use them, or not to disclose their results, technical complexity ceases to serve as justification and the provider forfeits that evidential shelter. What constitutes "technical or scientific complexity" today, in the sense of the Directive, is different from what it will be in 2030, and legal scholarship will have to keep pace.

V. A note on intellectual honesty

I have been publishing this series with the thesis that European law needs to update its concepts in light of frontier technical research. This essay takes the thesis a step back, to preserve intellectual honesty: European law is also, in some respects, ahead of engineering. Article 86 of the AI Act enshrines a right to explanation whose substantive content cannot yet be fully realized. The CJEU's Dun & Bradstreet requires intelligible explanation that, for generative language models, we cannot yet give with rigour.

This is not a defect. Law tends to require what society considers necessary, even when the technology has yet to catch up. That is what happened with Directive 85/374/EEC on liability for defective products, particularly in its applications to pharmaceutical products: the regime required causal proof in situations where the science of the time could barely produce it. The combination of scientific progress, creative CJEU case law (the Sanofi judgment, Case C-621/15, 21 June 2017) and subsequent legislative adaptations, including the 2024 revision, eventually adapted the regime. Mechanistic interpretability may be in that position today: a technical object still taking shape, one at which the law has already taken aim, in anticipation. If the race Amodei describes goes well, European law will have an effective technical auditing tool to operationalize rights that today it only formally enshrines. If it goes badly, we will have rights without substantive content and courts developing second-order criteria (procedural diligence instead of substantive diligence) to deal with the gap. The most likely scenario, as with so many frontier technologies, is an intermediate one: interpretability mature enough for some types of model and some types of problem, insufficient for others, with legal scholarship and case law having to distinguish case by case. It is that intermediate scenario that European legal scholarship is right to start preparing for.

The next essay in this series closes the conceptual arc with a third piece of the puzzle. Essay three showed that the European compliance regime has epistemic limits. This essay has shown that mechanistic interpretability is the ex post technical response the laboratories are developing. Essay five addresses the ex ante response: Constitutional AI and the Anthropic Constitution, an experiment in private normative governance in which a frontier laboratory literally writes a constitution for its product. The legal question is deliberately disquieting. What place in European law does a private normative document have when it regulates the behaviour of an AI system at a level of detail no public regulator can reach? Is it soft law? Is it a duty of diligence converted into corporate practice? Is it the embryo of a new regulatory model?


Primary sources:

  • The Urgency of Interpretability, Dario Amodei, 24 April 2025, at darioamodei.com.
  • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, Bricken, Templeton, Batson et al., Anthropic, October 2023, at transformer-circuits.pub.
  • Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, Templeton, Conerly, Marcus et al., Anthropic, May 2024, at transformer-circuits.pub.
  • Golden Gate Claude, Anthropic, May 2024, at anthropic.com/news/golden-gate-claude.
  • Circuit Tracing: Revealing Computational Graphs in Language Models, Anthropic, March 2025, at transformer-circuits.pub.
  • On the Biology of a Large Language Model, Anthropic, March 2025, at transformer-circuits.pub/2025/attribution-graphs.
  • Regulation (EU) 2016/679 (General Data Protection Regulation).
  • Regulation (EU) 2024/1689 (AI Act).
  • Judgment of the CJEU, SCHUFA Holding (Scoring), 7 December 2023, Case C-634/21.
  • Judgment of the CJEU, CK v Magistrat der Stadt Wien, 27 February 2025, Case C-203/22 (Dun & Bradstreet judgment).