This list focuses on understanding the internal mechanisms of large language models (LLMs). The works in this list were either accepted at top conferences (e.g., ICML, NeurIPS, ICLR, ACL, EMNLP, NAACL) or written by leading research institutions.
Other paper lists focus on SAEs and neurons.
To recommend a paper (accepted at a conference), please contact me.
- Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis
  - [EMNLP 2024] [2024.9] [neuron] [arithmetic] [fine-tune]
- NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals
  - [2024.7]
- Scaling and evaluating sparse autoencoders
  - [OpenAI] [2024.6] [SAE]
-
  - [EMNLP 2024] [2024.6] [in-context learning]
- Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries
  - [EMNLP 2024] [2024.6] [knowledge] [reasoning]
- Neuron-Level Knowledge Attribution in Large Language Models
  - [EMNLP 2024] [2024.6] [neuron] [knowledge]
- Knowledge Circuits in Pretrained Transformers
  - [NeurIPS 2024] [2024.5] [circuit] [knowledge]
- Locating and Editing Factual Associations in Mamba
  - [COLM 2024] [2024.4] [causal] [knowledge]
- Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
  - [COLM 2024] [2024.3] [circuit]
- Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines
  - [ACL 2024] [2024.3] [logit lens] [multimodal]
- Chain-of-Thought Reasoning Without Prompting
  - [DeepMind] [2024.2] [chain-of-thought]
- Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
  - [EMNLP 2024] [2024.2] [logit lens]
- Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking
  - [ICLR 2024] [2024.2] [fine-tune]
- TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space
  - [ACL 2024] [2024.2] [hallucination]
- Understanding and Patching Compositional Reasoning in LLMs
  - [ACL 2024] [2024.2] [reasoning]
- Do Large Language Models Latently Perform Multi-Hop Reasoning?
  - [ACL 2024] [2024.2] [knowledge] [reasoning]
- Long-form evaluation of model editing
  - [NAACL 2024] [2024.2] [model editing]
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
  - [ICML 2024] [2024.1] [toxicity] [fine-tune]
- What does the Knowledge Neuron Thesis Have to do with Knowledge?
  - [ICLR 2024] [2023.11] [knowledge] [neuron]
- Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
  - [ICLR 2024] [2023.11] [fine-tune]
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
  - [Anthropic] [2024.5] [SAE]
- Interpreting CLIP's Image Representation via Text-Based Decomposition
  - [ICLR 2024] [2023.10] [multimodal]
- Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
  - [ICLR 2024] [2023.10] [causal] [circuit]
- Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level
  - [DeepMind] [2023.12] [neuron]
- Successor Heads: Recurring, Interpretable Attention Heads In The Wild
  - [ICLR 2024] [2023.12] [circuit]
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
  - [Anthropic] [2023.10] [SAE]
- Impact of Co-occurrence on Factual Knowledge of Large Language Models
  - [EMNLP 2023] [2023.10] [knowledge]
- Function vectors in large language models
  - [ICLR 2024] [2023.10] [in-context learning]
- Neurons in Large Language Models: Dead, N-gram, Positional
  - [ACL 2024] [2023.9] [neuron]
- Sparse Autoencoders Find Highly Interpretable Features in Language Models
  - [ICLR 2024] [2023.9] [SAE]
- Do Machine Learning Models Memorize or Generalize?
  - [2023.8] [grokking]
- Overthinking the Truth: Understanding how Language Models Process False Demonstrations
  - [TACL 2024] [2023.7] [circuit]
- Evaluating the ripple effects of knowledge editing in language models
  - [2023.7] [knowledge] [model editing]
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
  - [NeurIPS 2023] [2023.6] [hallucination]
- VISIT: Visualizing and Interpreting the Semantic Information Flow of Transformers
  - [EMNLP 2023] [2023.5] [logit lens]
- Finding Neurons in a Haystack: Case Studies with Sparse Probing
  - [TMLR 2024] [2023.5] [neuron]
- Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning
  - [EMNLP 2023] [2023.5] [in-context learning]
-
  - [ICLR 2024] [2023.5] [chain-of-thought]
- What In-Context Learning "Learns" In-Context: Disentangling Task Recognition and Task Learning
  - [ACL 2023] [2023.5] [in-context learning]
- Language models can explain neurons in language models
  - [OpenAI] [2023.5] [neuron]
-
  - [EMNLP 2023] [2023.5] [causal] [arithmetic]
- Dissecting Recall of Factual Associations in Auto-Regressive Language Models
  - [EMNLP 2023] [2023.4] [causal] [knowledge]
- The Internal State of an LLM Knows When It's Lying
  - [EMNLP 2023] [2023.4] [hallucination]
- Are Emergent Abilities of Large Language Models a Mirage?
  - [NeurIPS 2023] [2023.4] [grokking]
- Towards automated circuit discovery for mechanistic interpretability
  - [NeurIPS 2023] [2023.4] [circuit]
-
  - [NeurIPS 2023] [2023.4] [circuit] [arithmetic]
- Larger language models do in-context learning differently
  - [Google Research] [2023.3] [in-context learning]
-
  - [NeurIPS 2023] [2023.1] [knowledge] [model editing]
- Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters
  - [ACL 2023] [2022.12] [chain-of-thought]
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
  - [ICLR 2023] [2022.11] [arithmetic] [circuit]
- Inverse scaling can become U-shaped
  - [EMNLP 2023] [2022.11] [grokking]
- Mass-Editing Memory in a Transformer
  - [ICLR 2023] [2022.10] [model editing]
- Polysemanticity and Capacity in Neural Networks
  - [2022.10] [neuron] [SAE]
- Analyzing Transformers in Embedding Space
  - [ACL 2023] [2022.9] [logit lens]
-
  - [Anthropic] [2022.9] [neuron] [SAE]
- Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango
  - [Google Research] [2022.9] [chain-of-thought]
- Emergent Abilities of Large Language Models
  - [Google Research] [2022.6] [grokking]
- Towards Tracing Factual Knowledge in Language Models Back to the Training Data
  - [EMNLP 2022] [2022.5] [knowledge] [data]
- Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations
  - [EMNLP 2022] [2022.5] [in-context learning]
- Large Language Models are Zero-Shot Reasoners
  - [NeurIPS 2022] [2022.5] [chain-of-thought]
- Scaling Laws and Interpretability of Learning from Repeated Data
  - [Anthropic] [2022.5] [grokking] [data]
- Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space
  - [EMNLP 2022] [2022.3] [neuron] [logit lens]
- In-context Learning and Induction Heads
  - [Anthropic] [2022.3] [circuit] [in-context learning]
- Locating and Editing Factual Associations in GPT
  - [NeurIPS 2022] [2022.2] [causal] [knowledge]
- Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
  - [EMNLP 2022] [2022.2] [in-context learning]
- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
  - [OpenAI & Google] [2022.1] [grokking]
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
  - [NeurIPS 2022] [2022.1] [chain-of-thought]
- A Mathematical Framework for Transformer Circuits
  - [Anthropic] [2021.12] [circuit]
- Towards a Unified View of Parameter-Efficient Transfer Learning
  - [ICLR 2022] [2021.10] [fine-tune]
- Deduplicating Training Data Makes Language Models Better
  - [ACL 2022] [2021.7] [fine-tune] [data]
- Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
  - [ACL 2022] [2021.4] [in-context learning]
- Calibrate Before Use: Improving Few-Shot Performance of Language Models
  - [ICML 2021] [2021.2] [in-context learning]
- Transformer Feed-Forward Layers Are Key-Value Memories
  - [EMNLP 2021] [2020.12] [neuron]
- Mechanistic Interpretability for AI Safety - A Review
  - [2024.8] [safety]
- A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
  - [2024.7] [interpretability]
- Internal Consistency and Self-Feedback in Large Language Models: A Survey
  - [2024.7]
- A Primer on the Inner Workings of Transformer-based Language Models
  - [2024.5] [interpretability]
- Usable XAI: 10 strategies towards exploiting explainability in the LLM era
  - [2024.3] [interpretability]
- A Comprehensive Overview of Large Language Models
  - [2023.12] [LLM]
-
  - [2023.11] [hallucination]
- A Survey of Large Language Models
  - [2023.11] [LLM]
- Explainability for Large Language Models: A Survey
  - [2023.11] [interpretability]
- A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future
  - [2023.10] [chain-of-thought]
- Instruction tuning for large language models: A survey
  - [2023.10] [instruction tuning]
-
  - [2023.9] [instruction tuning]
- Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models
  - [2023.9] [hallucination]
- Reasoning with language model prompting: A survey
  - [2023.9] [reasoning]
- Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks
  - [2023.8] [interpretability]
- A Survey on In-context Learning
  - [2023.6] [in-context learning]
- Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning
  - [2023.3] [parameter-efficient fine-tuning]
- https://github.com/ruizheliUOA/Awesome-Interpretability-in-Large-Language-Models (interpretability)
- https://github.com/cooperleong00/Awesome-LLM-Interpretability?tab=readme-ov-file (interpretability)
- https://github.com/JShollaj/awesome-llm-interpretability (interpretability)
- https://github.com/IAAR-Shanghai/Awesome-Attention-Heads (attention)
- https://github.com/zjunlp/KnowledgeEditingPapers (model editing)
- From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP