
Quantifying Attention Flow in Transformers

Samira Abnar et al., University of Amsterdam - arXiv 2020

Summarized by Seungwook Kim


Task:

A better diagnostic tool for visualizing and debugging transformers (vs. raw attention weights)

Motivation:

Information originating from different tokens gets increasingly mixed across layers:

  • In higher layers, the attention weights become rather uniform.
  • Raw attention weights are therefore unreliable as explanation probes.
  • Attention weights do not necessarily correspond to the relative importance of input tokens.

Method:

Proposes 2 alternatives to raw attention weights
Both view the attention graph as a DAG: nodes are token positions at each layer, edges carry the attention weights

1) Attention rollout
Assumption: the identities of input tokens are linearly combined through the layers based on the attention weights
-> Compute how much information is propagated through a particular path by multiplying the weights of all edges in that path
-> Sum over all possible paths between two nodes (see the sketch below)
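A minimal sketch of the rollout computation, assuming per-layer attention matrices already averaged over heads (the function name and the head-averaging are assumptions of this sketch, not stated in the summary). Summing the path products between two nodes reduces to multiplying the layer-wise attention matrices:

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of [num_tokens, num_tokens] arrays, one per layer,
    averaged over heads. Returns a [num_tokens, num_tokens] matrix whose
    (i, j) entry sums, over all paths from input token j to position i in
    the final layer, the product of attention weights along the path."""
    rollout = np.eye(attentions[0].shape[0])
    for attn in attentions:
        # Note: the paper additionally mixes in the identity to account for
        # residual connections (0.5 * A + 0.5 * I, renormalized); the summary
        # above does not cover that detail.
        rollout = attn @ rollout
    return rollout
```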

2) Attention flow
Motivated by graph theory (max flow)
-> Treats the attention graph as a flow network
---> Capacities of the edges = attention weights
-> Use a max-flow algorithm to compute the maximum attention flow from any node in any layer to any of the input nodes (see the sketch below)
-> The weight of a single path is the minimum of the weights of the edges in the path, instead of their product
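A sketch of attention flow as a max-flow problem using networkx; the graph construction and the function name are illustrative assumptions, and the paper's actual implementation may differ:

```python
import networkx as nx
import numpy as np

def attention_flow(attentions, target_token):
    """attentions: list of [num_tokens, num_tokens] arrays, where
    attentions[l][i, j] is the attention from position i at layer l+1
    to position j at layer l. Returns, for each input token, the max-flow
    value to `target_token` in the top layer, with edge capacities set to
    the attention weights."""
    num_layers = len(attentions)
    num_tokens = attentions[0].shape[0]
    G = nx.DiGraph()
    for l, attn in enumerate(attentions):
        for i in range(num_tokens):        # receiving node at layer l+1
            for j in range(num_tokens):    # source node at layer l
                G.add_edge((l, j), (l + 1, i), capacity=float(attn[i, j]))
    flows = np.zeros(num_tokens)
    for j in range(num_tokens):
        # A path's contribution is bounded by its bottleneck (minimum
        # capacity), not the product of its edge weights.
        flows[j], _ = nx.maximum_flow(G, (0, j), (num_layers, target_token))
    return flows
```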

Analysis:
Attention rollout/flow yields more interpretable explanations of the model's predictions than raw attention
Attention rollout/flow importance has a higher Spearman correlation with input-gradient importance than raw attention does
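A hedged sketch of that correlation check, assuming the attention-based importance scores and input-gradient norms have already been computed (variable and function names are illustrative):

```python
from scipy.stats import spearmanr

def importance_agreement(attention_importance, gradient_importance):
    """Spearman rank correlation between attention-based token importance
    (e.g. a row of the rollout matrix, or the flow values for one output
    position) and input-gradient importance (e.g. the norm of the gradient
    with respect to each input embedding)."""
    rho, _ = spearmanr(attention_importance, gradient_importance)
    return rho
```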

Overall: Since this is a 5-6 page paper, it does not appear to have been submitted anywhere; the idea is simple and presented well visually. However, being short, the ablations and results are not extensive, and the method is not described in great detail, so it was hard to understand it completely. The code is reportedly public, so it could be handy when working with transformer models. That said, several simplifying assumptions are involved, so it is hard to regard it as an exact interpretation; the paper I introduced earlier appears to build on precisely this weakness.

A strength is that the method is task/architecture agnostic. It also seems usable for interpretation when applying transformers to high-dimensional inputs.