In the DETR model, the query tokens are used only in the decoder, whereas in ViDT the query tokens are also used at the backbone. What is the reason behind this, and what would happen if you used query tokens at the decoder only?
Yes, in DETR the query tokens are used only in the decoder. This is a reasonable design choice because there is an independent Transformer encoder between the backbone and the Transformer decoder. The encoder transforms the backbone features (originally learned for image classification) into a form more suitable for detection. (In detail, a classification model mainly attends to the discriminative parts of the scene, such as the legs or head of an object, but detection needs to see the whole extent of the target object. The encoder is therefore necessary for this feature transfer.)
However, by moving the object queries into the backbone, ViDT can extract detection features directly from the Swin Transformer. The backbone itself is trained to act as an object detector by adding the query tokens and fine-tuning.
If the query tokens are used only at the decoder, the performance drops significantly: without an intermediate encoder, the backbone features are never adapted for detection before the decoder consumes them.
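To make the difference concrete, here is a minimal, hypothetical sketch (plain numpy, single-head attention without learned weights; all names are illustrative and not taken from the ViDT code). In the DETR-style layer, backbone self-attention sees patch tokens only, so object queries first meet image features in the decoder. In the ViDT-style layer, the [DET] queries are appended to the patch tokens, so backbone self-attention jointly updates patches and queries:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_patches, n_queries = 64, 196, 100  # illustrative sizes

def attention(q, kv):
    """Single-head scaled dot-product attention (no learned projections)."""
    scores = q @ kv.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv

patches = rng.standard_normal((n_patches, d))      # image patch tokens
det_queries = rng.standard_normal((n_queries, d))  # learnable object queries

# DETR-style backbone layer: self-attention over patch tokens only;
# det_queries are untouched until the decoder.
detr_patches = attention(patches, patches)

# ViDT-style backbone layer: queries concatenated with patches, so
# cross-token interaction already happens inside the backbone.
tokens = np.concatenate([patches, det_queries], axis=0)
tokens = attention(tokens, tokens)
vidt_patches, vidt_queries = tokens[:n_patches], tokens[n_patches:]

print(detr_patches.shape)  # (196, 64)
print(vidt_queries.shape)  # (100, 64)
```

In the ViDT-style path the queries leave the backbone already conditioned on image content, which is why the separate encoder between backbone and decoder can be dropped.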