Paper
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning was published by Anthropic in October 2023, with Trenton Bricken and Adly Templeton leading the team. It demonstrated that sparse autoencoders, a signal-processing technique repurposed for mechanistic interpretability, can decompose the overlapping activations of a single-layer transformer into monosemantic features. Opening the Black Box invokes it as the first public demonstration that the superposition problem has a viable technical solution, one that paved the way for subsequent work on production models.