
Interpreting Culture

This project was built in part as an application for the MATS program.

This project builds upon Francois Fleuret's culture project (with his permission).

Setup

# install my interp library & culture
pip install -r requirements.txt
pip install -e .

# patch TransformerLens
# culture's MyGPT has no final layer norm, so TransformerLens needs a patch to support this
git clone https://github.com/TransformerLensOrg/TransformerLens.git
cd TransformerLens
git apply ../transfomer_lens_final_ln.patch
pip install -e .

Motivation / why

I've been following this project by Francois Fleuret for a while now.

https://fleuret.org/public/culture/draft-paper.pdf

The hypothesis is that intelligence emerges from "social competition" between different agents. The experiment trains 5 GPTs on programmatically generated 2D "world" quizzes; once the models have sufficiently learned the task (accuracy > 95%), they attempt to generate their own "culture" quizzes. A generated quiz is kept if 4 of the 5 models agree on the answer and the remaining one gets it wrong (so the quiz has a consistent answer, but is sufficiently difficult).

The idea is that the models will start producing progressively more difficult quizzes as a result, and (ideally) new and unique concepts through social interaction.
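
The acceptance rule is simple enough to sketch in a few lines. This is only an illustration of the rule described above, not the actual culture codebase logic:

from collections import Counter

def keep_quiz(answers):
    """Keep a generated quiz if exactly 4 of the 5 models agree on an
    answer and the remaining model gets it wrong."""
    _, n_agree = Counter(answers).most_common(1)[0]
    return len(answers) == 5 and n_agree == 4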

How this relates to interpretability

I thought 5 GPTs trained in this group setting would be an interesting project from an interpretability perspective:

  • Are there universal features shared across these models, as in Universal Neurons in GPT2 Language Models?
  • We can compare and contrast the features each model learns using an SAE (see the sketch after this list)
    • This is similar to the "train with different seeds, then compare features" setup
  • Since the quizzes are synthetic, it should be easier to "interpret" the behaviour of these models: I can easily partition the data into the different tasks and dynamically generate new, unseen tasks.
  • Fundamentally these models are the same as a normal GPT, so I can use the same interpretability tools to understand the models.
  • The feature visualizations on these grid tasks (rather than text sentences) would look pretty cool
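
For the SAE comparison bullet above, the basic idea is to compare decoder directions between SAEs trained on different culture GPTs. A minimal sketch (the weight matrices here are assumptions about shape, not this repo's API):

import torch

def decoder_similarity(W_dec_a, W_dec_b):
    """Cosine similarity between every pair of SAE decoder directions.

    W_dec_a, W_dec_b: [n_features, d_model] decoder weights from SAEs
    trained on two different models. A row with a high maximum suggests
    a feature that appears in both models."""
    a = torch.nn.functional.normalize(W_dec_a, dim=-1)
    b = torch.nn.functional.normalize(W_dec_b, dim=-1)
    return a @ b.T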

Problem outline

This was quite ambitious, since the models were not compatible with the TransformerLens library and everything was implemented in a bespoke way. So there was a fair amount of standard software engineering work to integrate them with existing interpretability tools.

I always try to do something like this with my own projects anyway -- implementing from scratch helps me learn much better than just calling AutoModel.from_pretrained.

Reproducing Results

There are a few "gotchas" in running the GPTs:

  • Added sinusoidal positional encoding support to TransformerLens
  • use_past_kv_cache is buggy; I suspect this comes from hacking in the sinusoidal positional encoding
  • There's no final layer norm in MyGPT, so I had to patch TransformerLens to support this too
  • You must prepend the input with a 0 as a BOS token (the models generate the entire sequence when creating new quizzes, but not for eval); see the sketch after this list
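
A minimal sketch of the BOS gotcha (illustrative only, not the repo's prep_quiz implementation):

import torch

def prepend_bos(quiz_tokens):
    """Prepend token id 0 as a BOS token to a [batch, seq] tensor of
    quiz token ids before running them through the HookedTransformer."""
    bos = torch.zeros(quiz_tokens.shape[0], 1, dtype=quiz_tokens.dtype)
    return torch.cat([bos, quiz_tokens], dim=1)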

Find the model weights at tommyp111/culture-gpt on Hugging Face.

Repository

culture/

This is a fork of the original culture project, by Francois Fleuret.

Model weights are stored here: tommyp111/culture-gpt

culture.py

Contains most of the lib functionality, including:

  • generate function (using model.generate)
  • load_culture: Loading the MyGPT models
  • load_hooked: Converting MyGPT weights into a HookedTransformer
  • load_quizzes: Loading the QuizMachine (culture quizzes)
  • run_tests: Running tests on the models

Run python -m interp.culture 0 1 2 3 --num_test_samples 100 to exercise most of the library functionality and run accuracy tests on the models. Setting --num_test_samples to 2000 is standard for eval and should achieve ~95% accuracy. I've found that using a smaller number of samples can give a lower accuracy (it should still be at least in the 80s).
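
A rough sketch of using these functions from Python. The names come from the list above, but the exact signatures and return values are assumptions, so check interp/culture.py for the real API:

# Sketch only -- signatures and return values are assumptions.
from interp.culture import load_culture, load_hooked

models = load_culture()            # the MyGPT culture models
hooked = load_hooked(models[0])    # one model as a HookedTransformer

# From here standard TransformerLens tooling applies, e.g.
# logits, cache = hooked.run_with_cache(tokens)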

grid_tokenizer.py

Creates a HF tokenizer for the models. Required for train_sae.py.

Can find it on HF here: tommyp111/culture-tokenizer

  • repr_grid for pretty printing.
  • sinusoidal_positional_encoding implementation
  • TOK_PREPROCESS & prep_quiz for preprocessing quizzes
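
Since the tokenizer lives on the Hub, it should load with the standard transformers API (a sketch; assuming no custom loading code is needed):

from transformers import AutoTokenizer

# Load the grid tokenizer from the Hub (standard HF call).
tok = AutoTokenizer.from_pretrained("tommyp111/culture-tokenizer")
print(tok.vocab_size)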

dataset.py

Creates a 1M-element HF dataset. Again, used for train_sae.py.
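
If the dataset is pushed to the Hub like the other artifacts, loading it is the usual datasets one-liner (the repo id below is hypothetical; see dataset.py for where the dataset actually lives):

from datasets import load_dataset

# Hypothetical repo id -- check dataset.py for the real location.
ds = load_dataset("tommyp111/culture-quizzes", split="train")
print(ds)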

train_sae.py

Trains a sparse autoencoder on the models using sae_lens.SAETrainingRunner.

Load pretrained SAE from here: tommyp111/culture-sae
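
A hedged sketch of pulling that SAE down and loading it, assuming the checkpoint is stored in sae_lens's standard on-disk format and that your sae_lens version exposes SAE.load_from_pretrained for local directories:

from huggingface_hub import snapshot_download
from sae_lens import SAE

# Download the pretrained SAE checkpoint, then load it from disk.
path = snapshot_download("tommyp111/culture-sae")
sae = SAE.load_from_pretrained(path, device="cpu")
print(sae.cfg.d_sae)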
