Note
This repository accompanies the preprint Residual Stream Analysis with Multi-Layer SAEs (https://arxiv.org/abs/2409.04185). See References for related work.
We define two types of model: plain PyTorch MLSAE modules, which are relatively small; and PyTorch Lightning MLSAETransformer modules, which include the underlying transformer. HuggingFace collections for both are available on the HuggingFace Hub.
We assume that pretrained MLSAEs have repo_ids with this naming convention:
tim-lawson/mlsae-pythia-70m-deduped-x{expansion_factor}-k{k}
tim-lawson/mlsae-pythia-70m-deduped-x{expansion_factor}-k{k}-tfm
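Here, the -tfm suffix appears to correspond to the MLSAETransformer variant that includes the underlying transformer. Loading a pretrained model might look like the following minimal sketch; the import path and the from_pretrained method are assumptions based on the HuggingFace Hub integration, not guaranteed by this README:

```python
# Minimal sketch: loading pretrained models from the HuggingFace Hub.
# The import path and from_pretrained classmethod are assumptions.
from mlsae.model import MLSAE, MLSAETransformer

expansion_factor, k = 64, 32
repo_id = f"tim-lawson/mlsae-pythia-70m-deduped-x{expansion_factor}-k{k}"

# Plain PyTorch module (the SAE weights only).
sae = MLSAE.from_pretrained(repo_id)

# PyTorch Lightning module that also includes the underlying transformer.
transformer = MLSAETransformer.from_pretrained(f"{repo_id}-tfm")
```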
The Weights & Biases project for the paper is here.
Install Python dependencies with Poetry:
poetry env use 3.12
poetry install
Alternatively, install Python dependencies with pip:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Install Node.js dependencies:
cd app
npm install
Train a single MLSAE:
python train.py --help
python train.py --model_name EleutherAI/pythia-70m-deduped --expansion_factor 64 -k 32
Test a single pretrained MLSAE:
Warning
We assume that the test split of monology/pile-uncopyrighted is already downloaded and stored in data/test.jsonl.zst.
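If the file is missing, a download along these lines should work (a sketch using huggingface_hub, assuming the test split is stored as test.jsonl.zst at the root of the dataset repository):

```python
# Sketch: download the Pile (uncopyrighted) test split into data/.
# Assumes the file is named test.jsonl.zst at the root of the dataset repo.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="monology/pile-uncopyrighted",
    repo_type="dataset",
    filename="test.jsonl.zst",
    local_dir="data",
)
```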
python test.py --help
python test.py --model_name EleutherAI/pythia-70m-deduped --expansion_factor 64 -k 32
Compute the distributions of latent activations over layers for a single pretrained MLSAE (saved as HuggingFace datasets):
python -m mlsae.analysis.dists --help
python -m mlsae.analysis.dists --repo_id tim-lawson/mlsae-pythia-70m-deduped-x64-k32-tfm --max_tokens 100_000_000
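For intuition, the aggregated quantity is roughly the following: for each latent, accumulate its total activation at each layer and normalise over layers. This is a simplified sketch, not the script's implementation; the tensor layout is an assumption:

```python
# Simplified sketch of per-latent distributions of activations over layers.
# `latent_acts` is assumed to have shape [n_layers, n_tokens, n_latents],
# containing the (sparse) latent activations at every layer.
import torch

def layer_distributions(latent_acts: torch.Tensor) -> torch.Tensor:
    # Total activation mass per (layer, latent), summed over tokens.
    totals = latent_acts.sum(dim=1)  # [n_layers, n_latents]
    # Normalise each latent's column to a probability distribution over layers.
    return totals / totals.sum(dim=0, keepdim=True).clamp_min(1e-8)
```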
Compute the maximally activating examples for each combination of latent and layer for a single pretrained MLSAE (saved as HuggingFace datasets):
python -m mlsae.analysis.examples --help
python -m mlsae.analysis.examples --repo_id tim-lawson/mlsae-pythia-70m-deduped-x64-k32-tfm --max_tokens 1_000_000
Run the interactive web application for a single pretrained MLSAE:
python -m mlsae.api --help
python -m mlsae.api --repo_id tim-lawson/mlsae-pythia-70m-deduped-x64-k32-tfm
cd app
npm run dev
Navigate to http://localhost:3000, enter a prompt, and click 'Submit'.
Alternatively, navigate directly to http://localhost:3000/prompt/foobar, where foobar is the prompt.
Compute the mean cosine similarities between residual stream activation vectors at adjacent layers of a single pretrained transformer:
python figures/resid_cos_sim.py --help
python figures/resid_cos_sim.py --model_name EleutherAI/pythia-70m-deduped
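For reference, the underlying quantity is the cosine similarity between the residual stream activation vectors of the same token at adjacent layers, averaged over tokens. A minimal PyTorch sketch (not the script itself); the tensor layout is an assumption:

```python
# Sketch: mean cosine similarity between residual stream activations at
# adjacent layers. `resid` is assumed to have shape [n_layers, n_tokens, d_model].
import torch
import torch.nn.functional as F

def adjacent_layer_cos_sim(resid: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between layer l and layer l + 1 for every token,
    # then the mean over tokens: one value per pair of adjacent layers.
    return F.cosine_similarity(resid[:-1], resid[1:], dim=-1).mean(dim=-1)
```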
Save heatmaps of the distributions of latent activations over layers for multiple pretrained MLSAEs:
python figures/dists_heatmaps.py --help
python figures/dists_heatmaps.py --expansion_factor 32 64 128 -k 16 32 64
Save a CSV of the mean standard deviations of the distributions of latent activations over layers for multiple pretrained MLSAEs:
python figures/dists_layer_std.py --help
python figures/dists_layer_std.py --expansion_factor 32 64 128 -k 16 32 64
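As a rough sketch of the statistic, assuming each latent's distribution over layers is treated as a probability distribution over layer indices: compute the standard deviation of the layer index per latent, then average over latents. This is an illustration of the idea, not the script's implementation:

```python
# Sketch: mean (over latents) of the standard deviation of the layer index
# under each latent's distribution over layers. `dists` is assumed to have
# shape [n_layers, n_latents], with each column summing to one.
import torch

def mean_layer_std(dists: torch.Tensor) -> torch.Tensor:
    layers = torch.arange(dists.shape[0], dtype=dists.dtype).unsqueeze(1)  # [n_layers, 1]
    mean = (dists * layers).sum(dim=0)               # expected layer index per latent
    var = (dists * (layers - mean) ** 2).sum(dim=0)  # variance per latent
    return var.sqrt().mean()                         # mean standard deviation over latents
```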
Save heatmaps of the maximum latent activations for a given prompt and multiple pretrained MLSAEs:
python figures/prompt_heatmaps.py --help
python figures/prompt_heatmaps.py --expansion_factor 32 64 128 -k 16 32 64
Save a CSV of the Mean Max Cosine Similarity (MMCS) for multiple pretrained MLSAEs:
python figures/mmcs.py --help
python figures/mmcs.py --expansion_factor 32 64 128 -k 16 32 64
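MMCS compares two dictionaries of decoder directions: for each latent in one, take the maximum cosine similarity with any latent in the other, then average over latents. A minimal sketch of the metric, assuming decoder weight matrices of shape [n_latents, d_model]:

```python
# Sketch: Mean Max Cosine Similarity (MMCS) between two decoder dictionaries.
# `dec_a` and `dec_b` are assumed to have shape [n_latents, d_model].
import torch
import torch.nn.functional as F

def mmcs(dec_a: torch.Tensor, dec_b: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between every pair of decoder directions.
    sims = F.normalize(dec_a, dim=-1) @ F.normalize(dec_b, dim=-1).T  # [n_a, n_b]
    # For each latent in dec_a, the best match in dec_b; then the mean.
    return sims.max(dim=-1).values.mean()
```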
References

- https://github.com/openai/sparse_autoencoder
- https://github.com/EleutherAI/sae
- https://github.com/ai-safety-foundation/sparse_autoencoder
- https://github.com/callummcdougall/sae_vis
- Gao et al. [2024], Scaling and evaluating sparse autoencoders: https://cdn.openai.com/papers/sparse-autoencoders.pdf
- Bricken et al. [2023], Towards Monosemanticity: Decomposing Language Models With Dictionary Learning: https://transformer-circuits.pub/2023/monosemantic-features/index.html