Interpreting the latent representations within large language models (LLMs) remains a significant challenge in advancing AI transparency and control. This study introduces a novel framework for explaining latents in the Turing-LLM-1.0-254M model based on a predefined set of function types, allowing for a more human-readable “source code” of the model’s internal mechanisms. By categorising latents using multiple function types, we move towards mechanistic interpretability that does not rely on potentially unreliable explanations generated by other language models. Evaluation strategies include generating unseen sequences using Meta-Llama-3-8B-Instruct provided by GoodFire AI to test the activation patterns of latents, thereby validating the accuracy and reliability of the explanations.
Please do not hesitate to contact me at danieljamesdavies12@gmail.com if you have any issues or queries.
To run eval.py, one only needs to download this project’s research. However, to run the entire project, all files in the links below need to be downloaded. Several .txt files are placed in this repo to guide the placement of downloaded files.
To download Turing-LLM-1.0-254M:
https://www.kaggle.com/models/danieljamesdavies/turing-llm-1.0-254m
Then place model_1722550239_03986.pt in ./TuringLLM/.
To download Turing-LLM-1.0-254M Sparse Autoencoders:
https://www.kaggle.com/datasets/danieljamesdavies/turing-llm-sparse-autoencoders
Then place sae/ in ./SAE/.
To download Turing-LLM Synthetic Dataset:
https://www.kaggle.com/datasets/danieljamesdavies/turing-llm-synthetic-dataset
Then place phi-3-mini, data-augmentation, and dataset_paths.txt in ./input_data/synthetic_dataset/.
To download Turing-LLM Latent Top Sequences:
https://www.kaggle.com/datasets/danieljamesdavies/turing-llm-latent-top-sequences
Then place latents_sae_tokens_from_sequence.h5 and latents_sae_values_from_sequence.h5 in ./input_data/latent_top_sequences/.
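Since the full project depends on several separately downloaded files, a quick check that everything landed in the right place can save a failed run. The following is a minimal sketch, not part of the repository: the function name find_missing and the list EXPECTED_PATHS are hypothetical, and it assumes each downloaded item sits directly inside the directory named in the instructions above (for example, that sae/ ends up at ./SAE/sae).

```python
from pathlib import Path

# Expected locations, taken from the download instructions above.
# Assumption: each item is placed directly inside the listed directory.
EXPECTED_PATHS = [
    "TuringLLM/model_1722550239_03986.pt",
    "SAE/sae",
    "input_data/synthetic_dataset/phi-3-mini",
    "input_data/synthetic_dataset/data-augmentation",
    "input_data/synthetic_dataset/dataset_paths.txt",
    "input_data/latent_top_sequences/latents_sae_tokens_from_sequence.h5",
    "input_data/latent_top_sequences/latents_sae_values_from_sequence.h5",
]


def find_missing(root="."):
    """Return the expected paths that do not exist under root."""
    base = Path(root)
    return [p for p in EXPECTED_PATHS if not (base / p).exists()]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing downloads:")
        for p in missing:
            print("  " + p)
    else:
        print("All expected files and directories are in place.")
```

Run it from the repository root. It only checks that each path exists; it does not verify file contents or sizes.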