Circuit Tracer is an open-source tool developed by Anthropic to improve AI transparency by helping researchers interpret the internal computations of large language models (LLMs). It generates attribution graphs that trace, step by step, the internal features a model uses to arrive at a specific output, making its decision-making process inspectable. The tool supports popular open-weight models and integrates with Neuronpedia, where users can interactively explore, annotate, and share these graphs.
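To make the idea concrete, here is a minimal sketch of an attribution graph as a data structure: nodes are input tokens, internal features, and output logits, and weighted edges estimate how strongly one node contributed to another. The node names, edge weights, and helper methods below are hypothetical illustrations, not the library's actual API or output.

```python
from dataclasses import dataclass, field


@dataclass
class AttributionGraph:
    """Toy attribution graph: directed edges whose weights estimate
    how strongly a source node's activation contributed to a target.
    (Illustrative only; not the Circuit Tracer data format.)"""
    edges: dict[tuple[str, str], float] = field(default_factory=dict)

    def add_edge(self, src: str, dst: str, weight: float) -> None:
        self.edges[(src, dst)] = weight

    def top_inputs(self, dst: str, k: int = 3) -> list[tuple[str, float]]:
        """Rank the upstream nodes that most influenced `dst`."""
        incoming = [(s, w) for (s, d), w in self.edges.items() if d == dst]
        return sorted(incoming, key=lambda sw: abs(sw[1]), reverse=True)[:k]


# Hypothetical graph for the prompt "The capital of France is" -> "Paris"
g = AttributionGraph()
g.add_edge("token:France", "feature:country=France", 0.9)
g.add_edge("feature:country=France", "feature:capital-of-X", 0.7)
g.add_edge("token:capital", "feature:capital-of-X", 0.6)
g.add_edge("feature:capital-of-X", "logit:Paris", 0.8)

print(g.top_inputs("logit:Paris"))          # strongest contributors to the output
print(g.top_inputs("feature:capital-of-X")) # and to an intermediate feature
```

In the real tool, edge weights are computed from the model itself rather than hand-assigned, and the resulting graphs can be uploaded to Neuronpedia for interactive inspection.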
Researchers can use Circuit Tracer to test hypotheses by modifying feature values and observing how the model's outputs change, enabling analysis of complex behaviors such as multi-step reasoning and multilingual representations. The tool has already been applied to models such as Gemma-2-2b and Llama-3.2-1b, demonstrating its ability to uncover circuits within these systems. A demo notebook provides worked examples and additional graphs for further exploration.
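The intervention workflow can be illustrated with a small PyTorch sketch: clamp one hidden "feature" to a fixed value during the forward pass and compare the result against the baseline output. The toy MLP, the feature index, and the clamp value here are stand-ins chosen purely for illustration; Circuit Tracer applies the same compare-outputs idea to interpretable features in real LLMs, through its own interface rather than raw forward hooks.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model: a tiny MLP whose hidden units play the role of
# "features" (in Circuit Tracer these are learned, interpretable
# features of a real LLM, not raw neurons of a toy network).
model = nn.Sequential(
    nn.Linear(8, 16),   # hidden layer we intervene on
    nn.ReLU(),
    nn.Linear(16, 4),   # 4-way "output logits"
)

FEATURE_IDX = 3   # hypothetical feature to clamp
CLAMP_VALUE = 0.0 # e.g. ablate it entirely


def clamp_feature(module, inputs, output):
    """Forward hook: overwrite one unit's pre-activation value."""
    output = output.clone()
    output[:, FEATURE_IDX] = CLAMP_VALUE
    return output


x = torch.randn(1, 8)
baseline = model(x)

# Run the same input with the feature clamped, then restore the model.
handle = model[0].register_forward_hook(clamp_feature)
intervened = model(x)
handle.remove()

print("baseline logits:  ", baseline.detach().numpy().round(3))
print("intervened logits:", intervened.detach().numpy().round(3))
```

If the clamped feature mattered for the prediction, the intervened logits shift away from the baseline; that output difference is the evidence used to confirm or reject a hypothesis about the feature's role.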
Developed as part of Anthropic’s Fellows program in collaboration with Decode Research, Circuit Tracer aims to narrow the gap between what AI models can do and how well we understand how they do it. By open-sourcing the library and its Neuronpedia interface, Anthropic invites the research community to contribute to understanding and improving AI systems, fostering collaboration on model transparency.