This repository serves as a diary of my journey into the world of AI interpretability. It contains code and notes on the interpretability of AI, especially in the context of language models. I am particularly interested in mechanistic interpretability, which seeks to explain a model's decisions in terms of its underlying internal mechanisms. I will also explore other forms of interpretability, such as post-hoc interpretability, which explains a model's decisions after the fact without inspecting its internals.
Interpretability is the ability to explain something, or to present it in terms understandable to a human. In the context of AI, it refers to the ability to explain the decisions made by AI models. Interpretability matters for the following reasons:
- Trust: If a model's decisions can be explained, it is easier for humans to trust them.
- Debugging: Understanding why a model made a decision makes it easier to find and fix failures.
- Regulatory Compliance: Many regulations require that automated decisions be explainable.
- Bias Detection: Interpretable decisions make it easier to detect when a model relies on biased or spurious signals.
Language models are AI models trained to predict the next word (or token) in a sequence. They are used in a variety of applications, such as machine translation, speech recognition, and text generation.
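To make the next-word-prediction objective concrete, here is a minimal sketch using a toy count-based bigram model on a tiny made-up corpus. Real language models learn these conditional probabilities with neural networks over huge corpora, but the objective is the same: estimate P(next word | context).

```python
from collections import Counter, defaultdict

# Toy corpus; any whitespace-separated text works here.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(word):
    """Return the most likely next word and its estimated probability."""
    counts = bigram_counts[word]
    total = sum(counts.values())
    best, n = counts.most_common(1)[0]
    return best, n / total

print(predict_next("the"))  # → ('cat', 0.5): "cat" follows "the" in 2 of 4 cases
```

A bigram model conditions on only one word of context; modern language models (e.g. transformers) condition on thousands of tokens, which is precisely what makes their internal mechanisms interesting to interpret.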
Reading
- Interpretable Machine Learning: A Guide for Making Black Box Models Explainable (Christoph Molnar)
- Interpretable AI
Inspiration
- The works of Chris Olah and Neel Nanda, researchers who have made significant contributions to the field of AI interpretability.