
Justification of Equation 3 and 4 #53

Answered by fredzzhang
JacobYuan7 asked this question in Q&A

Hi @JacobYuan7,

Thanks for taking an interest in our work. You are right; it is not the same attention operation as in vanilla transformers.

Due to limited space, we didn't explain this in the paper. But generally, you can view a transformer as a graphical model, where each token is a node. Self-attention naturally corresponds to a fully-connected graph structure, while cross-attention corresponds to a bipartite graph. In the context of graphical models and message passing, the attention operation is essentially the computation of messages and the update of node states.

Specifically, Eq. 3 is the computation of messages and Eq. 4 is the computation of adjacency values (or attention weights, as in transformers).
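
For concreteness, here is a minimal PyTorch sketch of this message-passing view, assuming a standard scaled dot-product form. The function name `message_passing_attention` and the simple value-projection message are illustrative assumptions, not the exact formulation of Eq. 3 and 4 in the paper.

```python
# A minimal sketch of attention as message passing on a graph.
# Illustrative only; not the paper's exact Eq. 3 and 4.
import torch
import torch.nn.functional as F


def message_passing_attention(x_src, x_dst, dim=64):
    """Cross-attention viewed as message passing on a bipartite graph.

    x_src: (N, d) source nodes, x_dst: (M, d) destination nodes.
    When x_src and x_dst are the same set of tokens, the graph is
    fully connected and this reduces to self-attention.
    """
    d = x_src.shape[-1]
    w_q = torch.nn.Linear(d, dim)   # placeholders for the learned
    w_k = torch.nn.Linear(d, dim)   # projections in the actual model
    w_v = torch.nn.Linear(d, dim)

    # "Eq. 3"-style step: compute a message from every source node j
    # (here simply a value projection of the node state).
    messages = w_v(x_src)                               # (N, dim)

    # "Eq. 4"-style step: compute adjacency values (attention weights)
    # over the edges of the bipartite graph.
    scores = w_q(x_dst) @ w_k(x_src).T / dim ** 0.5     # (M, N)
    adjacency = F.softmax(scores, dim=-1)               # rows sum to 1

    # Node-state update: aggregate incoming messages weighted by adjacency.
    return adjacency @ messages                         # (M, dim)


# Self-attention: one set of tokens over a fully-connected graph.
tokens = torch.randn(5, 32)
updated = message_passing_attention(tokens, tokens)
print(updated.shape)  # torch.Size([5, 64])
```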
