Justification of Equations 3 and 4 #53
-
Hi, I think this is a very compelling work. Could you explain the justification for Eq. 3 and Eq. 4? They do not appear to be the same attention operation as in vanilla transformers.
-
Hi @JacobYuan7,

Thanks for taking an interest in our work. You are right, it is not the same attention operation as in vanilla transformers.

Due to limited space, we didn't explain this in the paper. But generally, you can see a transformer as a graphical model. Each token is a node. Self-attention naturally corresponds to a fully-connected graph structure, while cross-attention corresponds to a bipartite graph. In the context of graphical models and message passing, the attention operation is essentially the computation of messages and the update of node states (see the first sketch below).

Specifically, Eq. 3 is the computation of messages and Eq. 4 is the computation of adjacency values (or attention weights as in transformers).

To understand the motivations behind the exact mathematical details, think of it this way. For transformers, there is an alternative way to compute the attention weights: for each pair of tokens, concatenate them and put them through an MLP with output dimension of one, so that the scalar output serves as the attention weight for that pair (see the second sketch below).

Let me know if you have other questions.

Cheers,
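A minimal sketch of this message-passing view of self-attention; the function name, shapes, and single-head setup are illustrative assumptions, not the implementation from the paper:

```python
import torch
import torch.nn.functional as F

def attention_as_message_passing(x, w_q, w_k, w_v):
    """Single-head self-attention read as message passing on a
    fully-connected graph: every token is a node, and every ordered
    pair of nodes (i, j) carries an edge."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                     # node features, (n, d)
    # Adjacency values: one scalar weight per edge (i, j).
    adj = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)   # (n, n)
    # Messages are the value vectors; the node update aggregates the
    # messages coming into node i, weighted by its adjacency row.
    return adj @ v                                           # (n, d)

# Cross-attention is the bipartite case: queries come from one node set,
# keys/values from the other, so adj has shape (n_queries, n_context).
n, d = 4, 8
x = torch.randn(n, d)
out = attention_as_message_passing(
    x, torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
)
```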
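And a sketch of the alternative, concatenation-based attention mentioned above (a GAT-style formulation; the class name and MLP layout are assumptions, not the paper's Eq. 3 and 4):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseMLPAttention(nn.Module):
    """Attention weights from concatenated token pairs: each ordered
    pair (i, j) is scored by an MLP with a scalar output."""
    def __init__(self, d):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1)
        )
        self.value = nn.Linear(d, d)

    def forward(self, x):                         # x: (n, d)
        n = x.shape[0]
        # Concatenate every ordered pair of tokens: pairs[i, j] = [x_i; x_j].
        pairs = torch.cat(
            [x.unsqueeze(1).expand(n, n, -1), x.unsqueeze(0).expand(n, n, -1)],
            dim=-1,
        )                                         # (n, n, 2d)
        adj = F.softmax(self.score(pairs).squeeze(-1), dim=-1)  # (n, n)
        return adj @ self.value(x)                # aggregate the messages

out = PairwiseMLPAttention(16)(torch.randn(5, 16))
```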
-
@fredzzhang
i) In the paper, you wrote, "We fine-tune the DETR model on the HICO-DET and VCOCO datasets prior to training and then freeze its weights." How many epochs do you use to fine-tune the detection part?
ii) I am trying to understand how Figure 5 is drawn. I assume the x-axis shows the scores from the reference model and the y-axis shows the score change for every triplet. After performing detection, we define the correctly localized triplets as positives and the others as negatives, and record their interaction scores from the reference model. After adding the layers, we see how these scores change and thus compute the Delta (a sketch of this procedure follows below). Is my understanding correct?
iii) Regarding "The unary encoder layer preferentially increases the predicted interaction scores for positive examples, while the pairwise encoder layer suppresses the negative examples": I am trying to understand this point but have not managed to. Can you explain it briefly? What causes these effects?
Sorry for asking so many questions; it's a good paper and I am trying to understand it better.
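A minimal sketch of the Delta computation hypothesized in (ii); every name here is assumed, purely to make the question concrete:

```python
import torch

def figure5_deltas(ref_scores, new_scores, is_positive):
    """Per-triplet score change when an encoder layer is added.

    ref_scores:  (n,) interaction scores from the reference model
    new_scores:  (n,) scores after adding the unary/pairwise layer
    is_positive: (n,) bool, True for correctly localized triplets
    """
    delta = new_scores - ref_scores
    # Plotted as y = delta against x = ref_scores, split by label.
    return delta[is_positive], delta[~is_positive]
```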