
Sparse Autoencoder for Interpreting LLMs

This repository contains code for training a sparse autoencoder on the activations of LLMs, following Anthropic's Towards Monosemanticity, along with an analysis of how learned feature directions relate to both feature co-occurrence and LLM output similarity.
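The training setup described above can be sketched as a one-hidden-layer autoencoder with a ReLU code and an L1 sparsity penalty, as in Towards Monosemanticity. This is a minimal NumPy illustration, not this repository's implementation: the dimensions, initialization, and the `l1_coeff` penalty weight are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 64, 256  # hidden code is wider than the input (assumed sizes)
W_enc = rng.normal(scale=0.02, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.02, size=(d_hidden, d_model))
b_dec = np.zeros(d_model)
l1_coeff = 1e-3  # sparsity penalty weight (illustrative value)

def forward(x):
    """Encode activations to a non-negative sparse code, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU feature activations
    x_hat = f @ W_dec + b_dec               # linear reconstruction
    return f, x_hat

def loss(x):
    """Reconstruction MSE plus an L1 penalty on the feature activations."""
    f, x_hat = forward(x)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.mean(np.abs(f))
    return recon + sparsity

x = rng.normal(size=(8, d_model))  # stand-in batch of LLM activations
print(loss(x))
```

In practice the weights would be trained by gradient descent on this loss; the L1 term is what pushes most feature activations to exactly zero, which is what makes the learned features interpretable.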

See analysis.ipynb.
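The analysis mentioned above compares feature directions against feature co-occurrence. One natural way to set up such a comparison, sketched here on synthetic data (the activations and decoder weights below are hypothetical, not taken from this repository), is to take each feature's direction as its row of the decoder matrix and measure co-occurrence as the correlation of the features' active/inactive indicators:

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, d_hidden, d_model = 500, 32, 16
# Hypothetical ReLU feature activations and decoder weights.
F = np.maximum(rng.normal(size=(n_samples, d_hidden)), 0.0)
W_dec = rng.normal(size=(d_hidden, d_model))

# Each feature's direction is its (normalized) decoder row.
dirs = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
dir_sim = dirs @ dirs.T  # pairwise cosine similarity between feature directions

# Co-occurrence: correlation of binary "feature is active" indicators.
active = (F > 0.0).astype(float)
cooc = np.corrcoef(active.T)

# Compare the two quantities for an example feature pair.
print(dir_sim[0, 1], cooc[0, 1])
```

With real data, scattering `dir_sim` against `cooc` over all feature pairs shows whether features that fire together also point in similar directions.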
