In the DETR model, the query tokens are used only in the decoder, whereas in ViDT the query tokens are also used at the backbone. What is the reason behind this, and what would happen if you used query tokens at the decoder only?
Yes, in DETR the query tokens are used only in the decoder. This is a reasonable design choice because there is an independent Transformer encoder between the backbone and the Transformer decoder. The encoder transforms the backbone features (originally learned for image classification) into a form more suitable for detection. (In detail, a classification model mainly attends to the discriminative parts of the scene, such as the legs or head of an object, but detection needs to see the whole extent of the target object. The encoder is therefore necessary for this feature transfer.)
However, by moving the object queries into the backbone, ViDT can extract detection features directly from the Swin Transformer. The backbone itself is trained to act as an object detector by adding the query tokens and fine-tuning.
If the query tokens are used only at the decoder, the performance drops significantly: without an intermediate encoder, the backbone features are never adapted for detection before the decoder consumes them.
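To make the difference concrete, here is a minimal, hypothetical sketch (plain numpy, single-head attention without learned weights; all names are illustrative and not taken from the ViDT code). In the DETR-style layer, backbone self-attention sees patch tokens only, so object queries first meet image features in the decoder. In the ViDT-style layer, the [DET] queries are appended to the patch tokens, so backbone self-attention jointly updates patches and queries:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_patches, n_queries = 64, 196, 100  # illustrative sizes

def attention(q, kv):
    """Single-head scaled dot-product attention (no learned projections)."""
    scores = q @ kv.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv

patches = rng.standard_normal((n_patches, d))      # image patch tokens
det_queries = rng.standard_normal((n_queries, d))  # learnable object queries

# DETR-style backbone layer: self-attention over patch tokens only;
# det_queries are untouched until the decoder.
detr_patches = attention(patches, patches)

# ViDT-style backbone layer: queries concatenated with patches, so
# cross-token interaction already happens inside the backbone.
tokens = np.concatenate([patches, det_queries], axis=0)
tokens = attention(tokens, tokens)
vidt_patches, vidt_queries = tokens[:n_patches], tokens[n_patches:]

print(detr_patches.shape)  # (196, 64)
print(vidt_queries.shape)  # (100, 64)
```

In the ViDT-style path the queries leave the backbone already conditioned on image content, which is why the separate encoder between backbone and decoder can be dropped.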