Note
This repository accompanies the preprint Residual Stream Analysis with Multi-Layer SAEs (https://arxiv.org/abs/2409.04185). See References for related work.
We define two types of model: plain PyTorch MLSAE modules, which are relatively small; and PyTorch Lightning MLSAETransformer modules, which include the underlying transformer. HuggingFace collections for both are available on the HuggingFace Hub.
We assume that pretrained MLSAEs have repo_ids with this naming convention:
tim-lawson/mlsae-pythia-70m-deduped-x{expansion_factor}-k{k}
tim-lawson/mlsae-pythia-70m-deduped-x{expansion_factor}-k{k}-tfm
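Here, the -tfm suffix appears to correspond to the MLSAETransformer variant that includes the underlying transformer. Loading a pretrained model might look like the following minimal sketch; the import path and the from_pretrained method are assumptions based on the HuggingFace Hub integration, not guaranteed by this README:

```python
# Minimal sketch: loading pretrained models from the HuggingFace Hub.
# The import path and from_pretrained classmethod are assumptions.
from mlsae.model import MLSAE, MLSAETransformer

expansion_factor, k = 64, 32
repo_id = f"tim-lawson/mlsae-pythia-70m-deduped-x{expansion_factor}-k{k}"

# Plain PyTorch module (the SAE weights only).
sae = MLSAE.from_pretrained(repo_id)

# PyTorch Lightning module that also includes the underlying transformer.
transformer = MLSAETransformer.from_pretrained(f"{repo_id}-tfm")
```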
The Weights & Biases project for the paper is here.
Install Python dependencies with Poetry:
poetry env use 3.12
poetry install
Alternatively, install Python dependencies with pip:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Install Node.js dependencies:
cd app
npm install
Train a single MLSAE:
python train.py --help
python train.py --model_name EleutherAI/pythia-70m-deduped --expansion_factor 64 -k 32
Test a single pretrained MLSAE:
Warning
We assume that the test split of monology/pile-uncopyrighted is already downloaded and stored in data/test.jsonl.zst.
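If the file is missing, a download along these lines should work (a sketch using huggingface_hub, assuming the test split is stored as test.jsonl.zst at the root of the dataset repository):

```python
# Sketch: download the Pile (uncopyrighted) test split into data/.
# Assumes the file is named test.jsonl.zst at the root of the dataset repo.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="monology/pile-uncopyrighted",
    repo_type="dataset",
    filename="test.jsonl.zst",
    local_dir="data",
)
```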
python test.py --help
python test.py --model_name EleutherAI/pythia-70m-deduped --expansion_factor 64 -k 32
Compute the distributions of latent activations over layers for a single pretrained MLSAE (saved as HuggingFace datasets):
python -m mlsae.analysis.dists --help
python -m mlsae.analysis.dists --repo_id tim-lawson/mlsae-pythia-70m-deduped-x64-k32-tfm --max_tokens 100_000_000
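For intuition, the aggregated quantity is roughly the following: for each latent, accumulate its total activation at each layer and normalise over layers. This is a simplified sketch, not the script's implementation; the tensor layout is an assumption:

```python
# Simplified sketch of per-latent distributions of activations over layers.
# `latent_acts` is assumed to have shape [n_layers, n_tokens, n_latents],
# containing the (sparse) latent activations at every layer.
import torch

def layer_distributions(latent_acts: torch.Tensor) -> torch.Tensor:
    # Total activation mass per (layer, latent), summed over tokens.
    totals = latent_acts.sum(dim=1)  # [n_layers, n_latents]
    # Normalise each latent's column to a probability distribution over layers.
    return totals / totals.sum(dim=0, keepdim=True).clamp_min(1e-8)
```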
Compute the maximally activating examples for each combination of latent and layer for a single pretrained MLSAE (saved as HuggingFace datasets):
python -m mlsae.analysis.examples --help
python -m mlsae.analysis.examples --repo_id tim-lawson/mlsae-pythia-70m-deduped-x64-k32-tfm --max_tokens 1_000_000
Run the interactive web application for a single pretrained MLSAE:
python -m mlsae.api --help
python -m mlsae.api --repo_id tim-lawson/mlsae-pythia-70m-deduped-x64-k32-tfm
cd app
npm run dev
Navigate to http://localhost:3000, enter a prompt, and click 'Submit'.
Alternatively, navigate directly to http://localhost:3000/prompt/foobar, where foobar is the prompt.
Compute the mean cosine similarities between residual stream activation vectors at adjacent layers of a single pretrained transformer:
python figures/resid_cos_sim.py --help
python figures/resid_cos_sim.py --model_name EleutherAI/pythia-70m-deduped
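For reference, the underlying quantity is the cosine similarity between the residual stream activation vectors of the same token at adjacent layers, averaged over tokens. A minimal PyTorch sketch (not the script itself); the tensor layout is an assumption:

```python
# Sketch: mean cosine similarity between residual stream activations at
# adjacent layers. `resid` is assumed to have shape [n_layers, n_tokens, d_model].
import torch
import torch.nn.functional as F

def adjacent_layer_cos_sim(resid: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between layer l and layer l + 1 for every token,
    # then the mean over tokens: one value per pair of adjacent layers.
    return F.cosine_similarity(resid[:-1], resid[1:], dim=-1).mean(dim=-1)
```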
Save heatmaps of the distributions of latent activations over layers for multiple pretrained MLSAEs:
python figures/dists_heatmaps.py --help
python figures/dists_heatmaps.py --expansion_factor 32 64 128 -k 16 32 64
Save a CSV of the mean standard deviations of the distributions of latent activations over layers for multiple pretrained MLSAEs:
python figures/dists_layer_std.py --help
python figures/dists_layer_std.py --expansion_factor 32 64 128 -k 16 32 64
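As a rough sketch of the statistic, assuming each latent's distribution over layers is treated as a probability distribution over layer indices: compute the standard deviation of the layer index per latent, then average over latents. This is an illustration of the idea, not the script's implementation:

```python
# Sketch: mean (over latents) of the standard deviation of the layer index
# under each latent's distribution over layers. `dists` is assumed to have
# shape [n_layers, n_latents], with each column summing to one.
import torch

def mean_layer_std(dists: torch.Tensor) -> torch.Tensor:
    layers = torch.arange(dists.shape[0], dtype=dists.dtype).unsqueeze(1)  # [n_layers, 1]
    mean = (dists * layers).sum(dim=0)               # expected layer index per latent
    var = (dists * (layers - mean) ** 2).sum(dim=0)  # variance per latent
    return var.sqrt().mean()                         # mean standard deviation over latents
```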
Save heatmaps of the maximum latent activations for a given prompt and multiple pretrained MLSAEs:
python figures/prompt_heatmaps.py --help
python figures/prompt_heatmaps.py --expansion_factor 32 64 128 -k 16 32 64
Save a CSV of the Mean Max Cosine Similarity (MMCS) for multiple pretrained MLSAEs:
python figures/mmcs.py --help
python figures/mmcs.py --expansion_factor 32 64 128 -k 16 32 64
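MMCS compares two dictionaries of decoder directions: for each latent in one, take the maximum cosine similarity with any latent in the other, then average over latents. A minimal sketch of the metric, assuming decoder weight matrices of shape [n_latents, d_model]:

```python
# Sketch: Mean Max Cosine Similarity (MMCS) between two decoder dictionaries.
# `dec_a` and `dec_b` are assumed to have shape [n_latents, d_model].
import torch
import torch.nn.functional as F

def mmcs(dec_a: torch.Tensor, dec_b: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between every pair of decoder directions.
    sims = F.normalize(dec_a, dim=-1) @ F.normalize(dec_b, dim=-1).T  # [n_a, n_b]
    # For each latent in dec_a, the best match in dec_b; then the mean.
    return sims.max(dim=-1).values.mean()
```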
References

- https://github.com/openai/sparse_autoencoder
- https://github.com/EleutherAI/sae
- https://github.com/ai-safety-foundation/sparse_autoencoder
- https://github.com/callummcdougall/sae_vis
- Gao et al. [2024], Scaling and evaluating sparse autoencoders: https://cdn.openai.com/papers/sparse-autoencoders.pdf
- Bricken et al. [2023], Towards Monosemanticity: Decomposing Language Models With Dictionary Learning: https://transformer-circuits.pub/2023/monosemantic-features/index.html