Explaining Latents in Turing-LLM-1.0-254M with Pre-Defined Function Types

Interpreting the latent representations within large language models (LLMs) remains a significant challenge in advancing AI transparency and control. This study introduces a novel framework for explaining latents in the Turing-LLM-1.0-254M model based on a predefined set of function types, allowing for a more human-readable “source code” of the model’s internal mechanisms. By categorising latents using multiple function types, we move towards mechanistic interpretability that does not rely on potentially unreliable explanations generated by other language models. Evaluation strategies include generating unseen sequences using Meta-Llama-3-8B-Instruct provided by GoodFire AI to test the activation patterns of latents, thereby validating the accuracy and reliability of the explanations.
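To make the idea of a function-type explanation concrete, the following is a minimal sketch and not the repository's implementation. It assumes a hypothetical function type modelled on the SpecificToken() name that appears in the repository's figure titles, with the guessed meaning "the latent fires when a given token is present". The names Explanation, specific_token, and explanation_accuracy, the threshold parameter, and the scoring rule (fraction of unseen sequences where predicted and observed firing agree) are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Explanation:
    """A human-readable explanation: a name plus a rule that predicts firing."""
    name: str
    predicts: Callable[[Sequence[str]], bool]


def specific_token(token: str) -> Explanation:
    # Hypothetical function type: predicts the latent fires if `token` occurs.
    return Explanation(
        name=f"SpecificToken({token!r})",
        predicts=lambda tokens: token in tokens,
    )


def explanation_accuracy(
    explanation: Explanation,
    sequences: Sequence[Sequence[str]],
    activations: Sequence[Sequence[float]],
    threshold: float = 0.0,
) -> float:
    """Fraction of sequences where predicted firing matches observed firing.

    `activations[i]` holds the latent's per-token activation values for
    `sequences[i]`; the latent counts as firing if any value exceeds
    `threshold` (an assumed criterion).
    """
    if len(sequences) != len(activations):
        raise ValueError("sequences and activations must have the same length")
    if not sequences:
        raise ValueError("at least one sequence is required")
    hits = sum(
        explanation.predicts(tokens) == (max(acts, default=0.0) > threshold)
        for tokens, acts in zip(sequences, activations)
    )
    return hits / len(sequences)
```

Under these assumptions, an explanation that scores well on sequences it was not derived from would count as supported, and one that does not would be rejected.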

Please do not hesitate to contact me at danieljamesdavies12@gmail.com if you have any issues or queries.

How to Run

To run only eval.py, downloading this repository is sufficient. To run the entire project, however, all of the files linked below must also be downloaded. Several .txt files are included in this repo to indicate where the downloaded files should be placed.

To download Turing-LLM-1.0-254M: https://www.kaggle.com/models/danieljamesdavies/turing-llm-1.0-254m
Then place model_1722550239_03986.pt in ./TuringLLM/.

To download Turing-LLM-1.0-254M Sparse Autoencoders: https://www.kaggle.com/datasets/danieljamesdavies/turing-llm-sparse-autoencoders
Then place sae/ in ./SAE/.

To download Turing-LLM Synthetic Dataset: https://www.kaggle.com/datasets/danieljamesdavies/turing-llm-synthetic-dataset
Then place phi-3-mini, data-augmentation, and dataset_paths.txt in ./input_data/synthetic_dataset/.

To download Turing-LLM Latent Top Sequences: https://www.kaggle.com/datasets/danieljamesdavies/turing-llm-latent-top-sequences
Then place latents_sae_tokens_from_sequence.h5 and latents_sae_values_from_sequence.h5 in ./input_data/latent_top_sequences/.
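Before running the entire project, it may help to confirm that every download ended up in the right place. The sketch below only checks the paths named in the steps above. The helper name missing_paths and the assumption that the script is run from the repository root are illustrative and not part of the project.

```python
from pathlib import Path

# Paths taken from the download steps above, relative to the repository root.
EXPECTED_PATHS = (
    "TuringLLM/model_1722550239_03986.pt",
    "SAE/sae",
    "input_data/synthetic_dataset/phi-3-mini",
    "input_data/synthetic_dataset/data-augmentation",
    "input_data/synthetic_dataset/dataset_paths.txt",
    "input_data/latent_top_sequences/latents_sae_tokens_from_sequence.h5",
    "input_data/latent_top_sequences/latents_sae_values_from_sequence.h5",
)


def missing_paths(repo_root: str = ".") -> list[str]:
    """Return the expected files or folders that do not exist under repo_root."""
    root = Path(repo_root)
    return [path for path in EXPECTED_PATHS if not (root / path).exists()]
```

Calling missing_paths() from the repository root returns an empty list once all downloads are in place; otherwise it lists what still needs to be downloaded or moved.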

