
Explaining Latents in Turing-LLM-1.0-254M with Pre-Defined Function Types

Interpreting the latent representations within large language models (LLMs) remains a significant challenge in advancing AI transparency and control. This study introduces a novel framework for explaining latents in the Turing-LLM-1.0-254M model based on a predefined set of function types, allowing for a more human-readable “source code” of the model’s internal mechanisms. By categorising latents using multiple function types, we move towards mechanistic interpretability that does not rely on potentially unreliable explanations generated by other language models. Evaluation strategies include generating unseen sequences using Meta-Llama-3-8B-Instruct provided by GoodFire AI to test the activation patterns of latents, thereby validating the accuracy and reliability of the explanations.
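As a rough illustration of the core idea, a function type can be viewed as a parameterised predictor of a latent's activations, and an explanation can be scored by how well its predicted activations match the latent's real activations on a sequence. The following is a minimal sketch only, not the repository's actual code: the names token_match_type and explanation_score, the Pearson-correlation scoring, and the toy data are all assumptions made for illustration.

```python
# Illustrative sketch only; names, scoring, and data are hypothetical.
import numpy as np

def token_match_type(target_token_id):
    """A simple function type: fires on occurrences of one specific token."""
    def f(token_ids):
        return np.asarray([1.0 if t == target_token_id else 0.0 for t in token_ids])
    return f

def explanation_score(predicted, actual):
    """Pearson correlation between predicted and actual latent activations."""
    if predicted.std() == 0 or actual.std() == 0:
        return 0.0
    return float(np.corrcoef(predicted, actual)[0, 1])

# Hypothetical data: token ids of a sequence and one SAE latent's activations on it.
token_ids = np.array([11, 42, 7, 42, 99])
latent_activations = np.array([0.0, 3.1, 0.1, 2.8, 0.0])

candidate = token_match_type(42)
score = explanation_score(candidate(token_ids), latent_activations)
print(f"explanation score: {score:.3f}")
```

In this spirit, a latent is "explained" by its best-scoring function type rather than by a free-form description generated by another language model.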

Please do not hesitate to contact me at danieljamesdavies12@gmail.com if you have any issues or queries.

How to Run

To run eval.py, you only need to download this repository. However, to run the entire project, every file linked below must be downloaded as well. Several placeholder .txt files are included in this repo to guide the placement of the downloaded files (the expected layout is sketched after the download list).
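For example, the evaluation script might be invoked as follows (the exact command-line interface is assumed from the file name; a Python 3 environment with the project's dependencies is required):

```
python eval.py
```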

Download Turing-LLM-1.0-254M from https://www.kaggle.com/models/danieljamesdavies/turing-llm-1.0-254m, then place model_1722550239_03986.pt in ./TuringLLM/.

Download the Turing-LLM-1.0-254M Sparse Autoencoders from https://www.kaggle.com/datasets/danieljamesdavies/turing-llm-sparse-autoencoders, then place sae/ in ./SAE/.

Download the Turing-LLM Synthetic Dataset from https://www.kaggle.com/datasets/danieljamesdavies/turing-llm-synthetic-dataset, then place phi-3-mini, data-augmentation, and dataset_paths.txt in ./input_data/synthetic_dataset/.

Download the Turing-LLM Latent Top Sequences from https://www.kaggle.com/datasets/danieljamesdavies/turing-llm-latent-top-sequences, then place latents_sae_tokens_from_sequence.h5 and latents_sae_values_from_sequence.h5 in ./input_data/latent_top_sequences/.
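Assuming the placements above, the resulting directory layout should look roughly like this (only the downloaded files are shown; whether phi-3-mini and data-augmentation are files or folders is an assumption):

```
.
├── TuringLLM/
│   └── model_1722550239_03986.pt
├── SAE/
│   └── sae/
└── input_data/
    ├── synthetic_dataset/
    │   ├── phi-3-mini/
    │   ├── data-augmentation/
    │   └── dataset_paths.txt
    └── latent_top_sequences/
        ├── latents_sae_tokens_from_sequence.h5
        └── latents_sae_values_from_sequence.h5
```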
