Justification of Equations 3 and 4 #53
-
Hi, I think this is a very compelling work. Could you explain the justification for Eq. 3 and Eq. 4? They do not appear to be the same attention operation as in vanilla transformers.
-
Hi @JacobYuan7,

Thanks for taking an interest in our work. You are right, it is not the same attention operation as in vanilla transformers.

Due to limited space, we didn't explain this in the paper. But generally, you can see a transformer as a graphical model. Each token is a node. Self-attention naturally corresponds to a fully-connected graph structure, while cross-attention corresponds to a bipartite graph. In the context of graphical models and message passing, the attention operation is essentially the computation of messages and the update of node states (see the first sketch below).

Specifically, Eq. 3 is the computation of messages and Eq. 4 is the computation of adjacency values (or attention weights as in transformers).

To understand the motivations behind the exact mathematical details, think of it this way. For transformers, there is an alternative way to compute the attention weights: for each pair of tokens, concatenate them and put them through an MLP with output dimension of one, so that the scalar output serves as the attention weight for that pair (see the second sketch below).

Let me know if you have other questions.

Cheers,
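A minimal sketch of this message-passing view of self-attention; the function name, shapes, and single-head setup are illustrative assumptions, not the implementation from the paper:

```python
import torch
import torch.nn.functional as F

def attention_as_message_passing(x, w_q, w_k, w_v):
    """Single-head self-attention read as message passing on a
    fully-connected graph: every token is a node, and every ordered
    pair of nodes (i, j) carries an edge."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                     # node features, (n, d)
    # Adjacency values: one scalar weight per edge (i, j).
    adj = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)   # (n, n)
    # Messages are the value vectors; the node update aggregates the
    # messages coming into node i, weighted by its adjacency row.
    return adj @ v                                           # (n, d)

# Cross-attention is the bipartite case: queries come from one node set,
# keys/values from the other, so adj has shape (n_queries, n_context).
n, d = 4, 8
x = torch.randn(n, d)
out = attention_as_message_passing(
    x, torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
)
```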
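And a sketch of the alternative, concatenation-based attention mentioned above (a GAT-style formulation; the class name and MLP layout are assumptions, not the paper's Eq. 3 and 4):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseMLPAttention(nn.Module):
    """Attention weights from concatenated token pairs: each ordered
    pair (i, j) is scored by an MLP with a scalar output."""
    def __init__(self, d):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1)
        )
        self.value = nn.Linear(d, d)

    def forward(self, x):                         # x: (n, d)
        n = x.shape[0]
        # Concatenate every ordered pair of tokens: pairs[i, j] = [x_i; x_j].
        pairs = torch.cat(
            [x.unsqueeze(1).expand(n, n, -1), x.unsqueeze(0).expand(n, n, -1)],
            dim=-1,
        )                                         # (n, n, 2d)
        adj = F.softmax(self.score(pairs).squeeze(-1), dim=-1)  # (n, n)
        return adj @ self.value(x)                # aggregate the messages

out = PairwiseMLPAttention(16)(torch.randn(5, 16))
```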
-
@fredzzhang
i) In the paper, you wrote, "We fine-tune the DETR model on the HICO-DET and VCOCO datasets prior to training and then freeze its weights." How many epochs do you use to fine-tune the detection part?
ii) I am trying to understand how Figure 5 is drawn. I assume the x-axis shows the scores from the reference model and the y-axis shows the score change for every triplet. After performing detection, we define the correctly localized triplets as positives and the others as negatives, and record their interaction scores from the reference model. After adding the layers, we see how these scores change and thus compute the Delta (a sketch of this procedure follows below). Is my understanding correct?
iii) Regarding "The unary encoder layer preferentially increases the predicted interaction scores for positive examples, while the pairwise encoder layer suppresses the negative examples": I am trying to understand this point but have not managed to. Can you explain it briefly? What causes these effects?
Sorry for asking so many questions; it's a good paper and I am trying to understand it better.
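A minimal sketch of the Delta computation hypothesized in (ii); every name here is assumed, purely to make the question concrete:

```python
import torch

def figure5_deltas(ref_scores, new_scores, is_positive):
    """Per-triplet score change when an encoder layer is added.

    ref_scores:  (n,) interaction scores from the reference model
    new_scores:  (n,) scores after adding the unary/pairwise layer
    is_positive: (n,) bool, True for correctly localized triplets
    """
    delta = new_scores - ref_scores
    # Plotted as y = delta against x = ref_scores, split by label.
    return delta[is_positive], delta[~is_positive]
```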