Anthropic drops an amazing report on LLM interpretability
Circuit Tracing: Revealing Computational Graphs in Language Models
On the Biology of a Large Language Model
Deep learning models have long had a “black box” problem: their results lack sufficient explainability. This has implications for everything from trust to evaluating robustness to guiding their improvement. Work on this problem is carried out in the field of mechanistic interpretability, and a few days ago an Anthropic team dropped a truly fabulous pair of papers on their latest effort to tackle it.
The Anthropic team’s goal, extending previous research in deep learning and neuroscience, was to develop a means of tracing the circuits underlying specific types of reasoning. The problem is that a network’s neurons are polysemantic: each one responds to several unrelated concepts, because the model has more concepts to represent than neurons to hold them. To get around this, the team built a replacement model using cross-layer transcoders, which re-express the network’s computation in terms of sparsely active features, making the resulting circuits far easier to interpret.
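To make the idea concrete, here is a minimal PyTorch sketch of the transcoder concept: a wide, sparsely activating feature layer that reads the residual stream at one layer and writes reconstructed MLP outputs into that layer and the ones after it. The class name, the dimensions, and the plain ReLU sparsity are illustrative assumptions for this sketch, not Anthropic’s actual architecture or training setup.

```python
import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    """Simplified sketch: encode one layer's residual-stream activations into a
    sparse, overcomplete set of features, then decode those features into
    replacements for the MLP outputs of that layer and the layers after it."""

    def __init__(self, d_model: int, n_features: int, n_layers_out: int):
        super().__init__()
        # Dense residual stream -> wide feature space.
        self.encoder = nn.Linear(d_model, n_features)
        # One decoder per downstream layer the features write into.
        self.decoders = nn.ModuleList(
            nn.Linear(n_features, d_model, bias=False) for _ in range(n_layers_out)
        )

    def forward(self, resid: torch.Tensor):
        # ReLU keeps only positively activating features, so most are zero (sparse).
        features = torch.relu(self.encoder(resid))
        # Each decoder reconstructs one layer's MLP output from the same sparse features.
        outputs = [dec(features) for dec in self.decoders]
        return features, outputs


# Toy usage: 4 token positions, a 512-dim residual stream, 4096 candidate
# features, writing into 3 downstream layers.
xlt = CrossLayerTranscoder(d_model=512, n_features=4096, n_layers_out=3)
resid = torch.randn(4, 512)
features, mlp_replacements = xlt(resid)
print(features.shape, [o.shape for o in mlp_replacements])
```

In training, a reconstruction loss plus a sparsity penalty on the feature activations is what pushes each feature toward firing for a single, nameable concept rather than a tangle of meanings.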