Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, Gonçalo Teixeira

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet was published by Anthropic in May 2024; its authors include Adly Templeton. The paper scaled the sparse-autoencoder technique to a production model, extracting feature dictionaries of up to roughly 34 million features from Claude 3 Sonnet, among them interpretable features for code bugs, deception, sycophancy, and other safety-relevant concepts. Opening the Black Box invokes it as the leap from proof of concept (Towards Monosemanticity, 2023) to industrial-scale viability. Golden Gate Claude, also released in May 2024, was the playful public demonstration of the same work.

Authors

Essays referencing this

Opening the Black Box