
How to get intermediate image features like from Swin Transformers? #48

Closed
kashyappiyush1998 opened this issue Apr 23, 2023 · 3 comments
Assignees
Labels
documentation Improvements or additions to documentation

Comments

@kashyappiyush1998
Copy link

When I pass a 1024px input to a Swin Transformer and extract intermediate image features, I get back feature maps of sizes:

torch.Size([1, 128, 256, 256])
torch.Size([1, 128, 256, 256])
torch.Size([1, 256, 128, 128])
torch.Size([1, 512, 64, 64])
torch.Size([1, 1024, 32, 32])

How do I get something like this from dinov2?

@ccharest93

Swin Transformer uses non-overlapping windows with local attention, which is different from this model; that design was chosen to combat the quadratic complexity caused by the larger number of patches you get with smaller patch sizes.

Using flash attention, this model can take the 1024-pixel input directly, which somewhat addresses that issue (the patch size is fixed at 14, but higher-resolution images are allowed).
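Since the patch size is fixed at 14, the input height and width must be multiples of 14, and the patch-token grid is H/14 × W/14. A minimal sketch of the bookkeeping, with the actual model call commented out because it downloads weights (the hub entrypoint and `get_intermediate_layers` signature are taken from the dinov2 README and may change):

```python
import torch

# Hedged sketch: DINOv2 ViT-L/14 via torch.hub (assumed entrypoint name).
# model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")

# 1024 is not a multiple of the patch size 14, so resize or crop first,
# e.g. to 1022x1022, which yields a 73x73 patch grid.
H = W = 1022
PATCH = 14
assert H % PATCH == 0 and W % PATCH == 0
grid_h, grid_w = H // PATCH, W // PATCH
print(grid_h, grid_w)  # 73 73

# feats = model.get_intermediate_layers(torch.randn(1, 3, H, W),
#                                       n=4, reshape=True)
# Each element would then have shape (1, 1024, 73, 73) for ViT-L/14.
```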

Now, if you want local attention within a bigger image, nothing stops you from cropping the image into 4, 9, 16, ... non-overlapping pieces and feeding each piece into the network. This gives you local attention within those pieces.
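The cropping step above can be sketched as a tensor reshape; `model` here is a stand-in for whatever backbone you feed the tiles into (not part of the original suggestion):

```python
import torch

def crop_into_tiles(img: torch.Tensor, n: int) -> torch.Tensor:
    """Split a (B, C, H, W) image into n*n non-overlapping tiles,
    returned as a (B*n*n, C, H//n, W//n) batch."""
    b, c, h, w = img.shape
    assert h % n == 0 and w % n == 0, "image size must be divisible by n"
    # unfold carves out n windows of size H//n (resp. W//n) along H and W
    tiles = img.unfold(2, h // n, h // n).unfold(3, w // n, w // n)
    # (B, C, n, n, H//n, W//n) -> (B*n*n, C, H//n, W//n)
    return tiles.permute(0, 2, 3, 1, 4, 5).reshape(b * n * n, c, h // n, w // n)

img = torch.randn(1, 3, 1024, 1024)
tiles = crop_into_tiles(img, 2)
print(tiles.shape)  # torch.Size([4, 3, 512, 512])

# feats = model(tiles)  # attention is now local to each 512x512 piece
```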

@woctezuma

woctezuma commented Apr 23, 2023

Related:

@TimDarcet TimDarcet added the documentation Improvements or additions to documentation label Apr 24, 2023
@TimDarcet TimDarcet self-assigned this Apr 24, 2023
@TimDarcet

Hi,
As noted by @ccharest93, the model architecture is simply different; you won't get the same feature-map shapes as with a Swin.

If you'd like feature maps at different resolutions (e.g. to feed a decoder such as UPerNet), you can downsample the high-resolution feature map with average pooling (the general idea in ViTDet: https://arxiv.org/abs/2203.16527).
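A minimal sketch of that suggestion: build a crude feature pyramid from the single-scale ViT feature map by up- and down-sampling, roughly in the spirit of ViTDet. The input tensor here is random and stands in for a DINOv2 ViT-L/14 patch-token map (37×37 for a 518×518 input); the dict keys are hypothetical names:

```python
import torch
import torch.nn.functional as F

# Stand-in for a reshaped patch-token map, e.g. from
# model.get_intermediate_layers(x, reshape=True) on a 518x518 input.
feat = torch.randn(1, 1024, 37, 37)

# Multi-scale maps from the single-scale one: upsample for finer levels,
# average-pool for coarser ones.
pyramid = {
    "up_2x":   F.interpolate(feat, scale_factor=2, mode="bilinear",
                             align_corners=False),
    "base":    feat,
    "down_2x": F.avg_pool2d(feat, kernel_size=2),
    "down_4x": F.avg_pool2d(feat, kernel_size=4),
}
for name, level in pyramid.items():
    print(name, tuple(level.shape))
```

These levels can then be fed to a multi-scale decoder head in place of the stage outputs a Swin would provide.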
