Swin Transformer uses non-overlapping windows with local attention, which is different from this model; Swin does that to combat the quadratic attention complexity that comes with the larger patch count resulting from smaller patches.
Using flash attention, this model can directly take the 1024-pixel input, which somewhat addresses that issue (the patch size is fixed at 14, but it allows for higher-resolution images).
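To make the scaling concrete, here is a back-of-the-envelope count of patch tokens at patch size 14 (the side lengths below are just illustrative multiples of 14, not anything prescribed by the model):

```python
# Patch-token count and pairwise-attention size for a ViT with 14x14 patches.
patch = 14
for side in (224, 518, 1022):            # illustrative sides, each a multiple of 14
    n_tokens = (side // patch) ** 2      # patch tokens per image (ignoring the CLS token)
    print(f"{side}px -> {n_tokens} tokens, ~{n_tokens ** 2:,} attention entries per head")
```

The attention matrix grows with the square of the token count, which is why memory-efficient attention kernels matter once you push toward 1024-pixel inputs.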
Now, if you want local attention within a bigger image, nothing stops you from cropping the image into 4, 9, 16, ... non-overlapping pieces and feeding those into the network. This gives you local attention within each piece.
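A minimal sketch of that tiling idea, assuming the image side is divisible by the number of tiles per side and that `backbone` stands in for whatever DINOv2 model you load (both are placeholders, not an API from the repo):

```python
import torch

def tile_image(img: torch.Tensor, n: int) -> torch.Tensor:
    """Split a (B, C, H, W) image into n*n non-overlapping crops, batched together.

    Assumes H and W are divisible by n (pick sizes so each crop stays a multiple of 14).
    """
    B, C, H, W = img.shape
    th, tw = H // n, W // n
    tiles = img.unfold(2, th, th).unfold(3, tw, tw)        # (B, C, n, n, th, tw)
    return tiles.permute(0, 2, 3, 1, 4, 5).reshape(-1, C, th, tw)

img = torch.randn(1, 3, 896, 896)   # 896 = 14 * 64, so each 448px crop is still a multiple of 14
crops = tile_image(img, n=2)        # (4, 3, 448, 448)
# feats = backbone(crops)           # hypothetical: attention is now local to each crop
```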
Hi,
As noted by @ccharest93, the model architecture is simply different; you won't get the same feature map shapes as with a Swin.
If you'd like feature maps at different resolutions (e.g. to feed a decoder such as UPerNet), you can downsample the high-resolution feature maps with average pooling (the general idea in https://arxiv.org/abs/2203.16527).
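A minimal sketch of that pooling idea (a simplified take on the simple feature pyramid from the paper above, not its exact recipe): start from a single stride-14 feature map and derive coarser and finer levels from it. `feat` is a placeholder tensor with the shape a ViT-L/14 would produce for a 1022px input.

```python
import torch
import torch.nn.functional as F

feat = torch.randn(1, 1024, 73, 73)   # (B, C, H/14, W/14) for a 1022px input to a ViT-L/14

pyramid = {
    "stride_7":  F.interpolate(feat, scale_factor=2, mode="bilinear", align_corners=False),
    "stride_14": feat,
    "stride_28": F.avg_pool2d(feat, kernel_size=2),
    "stride_56": F.avg_pool2d(feat, kernel_size=4),
}
for name, level in pyramid.items():
    print(name, tuple(level.shape))
```

These levels can then be fed to a decoder that expects a Swin-like pyramid; a 1x1 conv per level would let you match whatever channel counts the decoder expects.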
When I pass a 1024px input and get intermediate image features from a Swin Transformer, I get feature maps of these sizes:
torch.Size([1, 128, 256, 256])
torch.Size([1, 128, 256, 256])
torch.Size([1, 256, 128, 128])
torch.Size([1, 512, 64, 64])
torch.Size([1, 1024, 32, 32])
How do I get something like this from dinov2?
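For reference, a rough sketch of one way to get spatial feature maps out of DINOv2 (all at stride 14, unlike Swin's multi-stride pyramid), assuming the torch.hub dinov2_vitl14 checkpoint; the block indices are only an example:

```python
import torch

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
backbone.eval()

img = torch.randn(1, 3, 1022, 1022)   # side length must be a multiple of the 14px patch size

with torch.no_grad():
    feats = backbone.get_intermediate_layers(
        img,
        n=[5, 11, 17, 23],   # example block indices for the 24-block ViT-L
        reshape=True,        # return (B, C, H/14, W/14) maps instead of token sequences
    )

for f in feats:
    print(tuple(f.shape))    # each: (1, 1024, 73, 73); pool these to build coarser levels
```

Combined with the average-pooling step above, this gives a set of shapes roughly analogous to the Swin output list.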