Task:
A better diagnostic tool for visualizing and debugging transformers than raw attention weights
Motivation:
- Information originating from different tokens gets increasingly mixed as it passes through the layers
- In higher layers the attention weights are rather uniform
- Raw attention weights are unreliable as explanation probes.
- Attention weights do not necessarily correspond to the relative importance of input tokens.
Method:
Proposes 2 alternatives
Both view the attention graph as a DAG: nodes = token representations at each layer, edges = attention weights
1) Attention rollout
Assumption: identities of input tokens are linearly combined through the layers based on attention weights
-> Compute how much information is propagated through a particular path by multiplying the weights of all edges in that path
-> Sum over all possible paths between two nodes (equivalent to multiplying the layer-wise attention matrices; see the sketch below)
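A minimal sketch of attention rollout in Python, assuming head-averaged per-layer attention matrices; the function and argument names are illustrative. Mixing in the identity to account for residual connections follows the paper.

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of per-layer attention matrices, each of shape
    (seq_len, seq_len), averaged over heads, ordered bottom layer first."""
    seq_len = attentions[0].shape[0]
    rollout = np.eye(seq_len)
    for att in attentions:
        # Mix in the identity to account for residual connections,
        # then re-normalize so each row still sums to 1 (as in the paper).
        att = 0.5 * att + 0.5 * np.eye(seq_len)
        att = att / att.sum(axis=-1, keepdims=True)
        # Multiplying the layer matrices sums, over all paths between two
        # nodes, the product of the edge weights along each path.
        rollout = att @ rollout
    # rollout[i, j]: how much of input token j reaches position i at the top
    return rollout
```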
2) Attention flow
Motivated by the max-flow problem from graph theory
-> Treats the attention graph as a flow network
---> Capacities of the edges = attention weights
-> Use a max-flow algorithm to compute the maximum attention flow from any node in any layer to any of the input nodes
-> The weight of a single path is the minimum edge weight along the path, instead of the product of the weights (see the sketch below)
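A minimal sketch of attention flow using networkx max-flow; the graph construction and names are illustrative, since the paper does not prescribe a specific implementation. The pairwise max-flow computation is expensive for long sequences.

```python
import networkx as nx

def attention_flow(attentions, target_position):
    """attentions: list of per-layer attention matrices (seq_len x seq_len),
    head-averaged; attentions[l][i, j] = attention of position i in layer
    l+1 to position j in layer l. Returns the max flow from each input
    token to target_position in the top layer."""
    num_layers = len(attentions)
    seq_len = attentions[0].shape[0]
    G = nx.DiGraph()
    for l, att in enumerate(attentions):
        for i in range(seq_len):
            for j in range(seq_len):
                # Flow network: edge capacity == attention weight
                G.add_edge((l, j), (l + 1, i), capacity=float(att[i, j]))
    sink = (num_layers, target_position)
    # Each path carries at most its minimum edge capacity, and shared edges
    # limit the total flow (unlike rollout's product of weights).
    return [nx.maximum_flow_value(G, (0, j), sink) for j in range(seq_len)]
```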
Analysis:
Attention rollout/flow produces more interpretable explanations of the model's predictions than raw attention
Attention rollout/flow shows higher Spearman rank correlation between attention-based importance and input-gradient importance (computed as sketched below)
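A sketch of the correlation check using scipy's spearmanr; the score vectors here are dummy values standing in for per-token importance from rollout/flow and from input gradients.

```python
import numpy as np
from scipy.stats import spearmanr

# Dummy per-token importance scores (illustrative values only)
rollout_scores = np.array([0.05, 0.40, 0.30, 0.25])   # from attention rollout/flow
gradient_scores = np.array([0.10, 0.35, 0.35, 0.20])  # from input gradients
rho, pval = spearmanr(rollout_scores, gradient_scores)
print(f"SpearmanR = {rho:.3f}")
```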
Overall review: At 5-6 pages, it does not look like it was submitted anywhere; the idea is simple and demonstrated well visually. Being that short, though, the ablations and results are not sufficient, and the method is not explained in much detail, so it was hard to understand it completely. Since the code is publicly available, it should come in handy when working with transformer models. However, several simplifying assumptions are baked in, so it is hard to regard the result as an exact interpretation; the paper I introduced earlier appears to build on exactly this weakness.
A strength is that the method is task/architecture agnostic. It also looks usable for interpretation when applying transformers to high-dimensional inputs.