This repository serves as a diary of my journey into the world of AI interpretability. It contains code and notes on the interpretability of AI, especially in the context of language models. I am particularly interested in mechanistic interpretability, which seeks to explain a model's decisions in terms of its underlying internal mechanisms. I will also explore other forms of interpretability, such as post-hoc interpretability, which explains a model's decisions after the fact without inspecting its internals.
Interpretability is the ability to explain something, or to present it in terms understandable to a human. In the context of AI, it refers to the ability to explain the decisions made by AI models. Interpretability matters for the following reasons:
- Trust: If a model's decisions can be explained, it is easier for humans to trust them.
- Debugging: Understanding why a model made a decision makes it easier to find and fix failures.
- Regulatory Compliance: Many regulations require that automated decisions be explainable.
- Bias Detection: Interpretable decisions make it easier to detect when a model relies on biased or spurious signals.
Language models are AI models trained to predict the next word (or token) in a sequence. They are used in a variety of applications, such as machine translation, speech recognition, and text generation.
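To make the next-word-prediction objective concrete, here is a minimal sketch using a toy count-based bigram model on a tiny made-up corpus. Real language models learn these conditional probabilities with neural networks over huge corpora, but the objective is the same: estimate P(next word | context).

```python
from collections import Counter, defaultdict

# Toy corpus; any whitespace-separated text works here.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(word):
    """Return the most likely next word and its estimated probability."""
    counts = bigram_counts[word]
    total = sum(counts.values())
    best, n = counts.most_common(1)[0]
    return best, n / total

print(predict_next("the"))  # → ('cat', 0.5): "cat" follows "the" in 2 of 4 cases
```

A bigram model conditions on only one word of context; modern language models (e.g. transformers) condition on thousands of tokens, which is precisely what makes their internal mechanisms interesting to interpret.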
Reading
- Interpretable Machine Learning: A Guide for Making Black Box Models Explainable (Christoph Molnar)
- Interpretable AI
Inspiration
- The works of Chris Olah and Neel Nanda, researchers who have made significant contributions to the field of AI interpretability.