Is this the right way to do inference? #2
Comments
Not sure if it's correct, but hope it helps:

```python
import torch
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from sklearn.decomposition import PCA
import torchvision.transforms as T

import hubconf

dinov2_vits14 = hubconf.dinov2_vits14()

img = Image.open('meta_dog.png')

transform = T.Compose([
    # the exact transform was cut off in the paste; a resize/crop/to-tensor
    # pipeline like the one discussed below should work
    T.Resize(224, interpolation=T.InterpolationMode.BICUBIC),
    T.CenterCrop(224),
    T.ToTensor(),
])

img = transform(img)[:3].unsqueeze(0)

with torch.no_grad():
    # relies on the forward() change below; assumed to return (256, 384) patch tokens
    features = dinov2_vits14(img, return_patches=True)

print(features.shape)

pca = PCA(n_components=3)
pca.fit(features)   # fit was missing in the pasted snippet but is needed before transform
pca_features = pca.transform(features)

plt.imshow(pca_features.reshape(16, 16, 3).astype(np.uint8))
```

In dinov2/models/vision_transformer.py line 290, add:

```python
def forward(self, *args, is_training=False, return_patches=False, **kwargs):
```

visualized features: [image]
@Suhail To generate features from the pretrained backbones, just use a transform similar to the standard one used for evaluating on image classification, with the typical ImageNet normalization mean and std (see what's used in the code). Also, as noted in the model card, the model can use image sizes that are multiples of the patch size.
Thanks! This is what I used:

```python
image_transforms = T.Compose([
    T.Resize(256, interpolation=T.InterpolationMode.BICUBIC),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
```

Let me know if that's wrong though.
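For anyone landing here later, a minimal end-to-end sketch with that transform and a backbone loaded from torch hub (repo and entry-point names follow the README; 'meta_dog.png' is just a placeholder path, and the (1, 384) output shape assumes ViT-S/14):

```python
import torch
from PIL import Image
import torchvision.transforms as T

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
model.eval()

image_transforms = T.Compose([
    T.Resize(256, interpolation=T.InterpolationMode.BICUBIC),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

img = Image.open('meta_dog.png').convert('RGB')   # placeholder image path
x = image_transforms(img).unsqueeze(0)            # (1, 3, 224, 224)

with torch.no_grad():
    embedding = model(x)                          # default forward returns the class-token embedding

print(embedding.shape)                            # torch.Size([1, 384]) for ViT-S/14
```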
I found this helpful, but I would say that instead of modifying the forward function, you can just call forward_features directly.
What you are doing is correct. What you get with the forward_features output are the two kinds of features: the class token and the patch tokens.
I think what I want is an embedding, like CLIP's, that captures the features/understanding of the image. Is that what I'd get from forward_features?
If this is like DINO, either of the two features could be used as an image embedding.

Edit: you can see here how it is done: Line 122 in fc49f49

See:
- Lines 260 to 264 in fc49f49
- dinov2/dinov2/eval/log_regression.py, Lines 277 to 279 in fc49f49
- Lines 114 to 122 in fc49f49
Please note Lines 42 to 44 in fc49f49.

See:
- Lines 503 to 507 in fc49f49
- Lines 39 to 45 in fc49f49

It was also the case with DINO. You could also do fancier stuff, e.g. "concatenate [CLS] token and GeM pooled patch tokens", as with DINO's copy detection.
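To make this concrete, a sketch under the assumption that forward_features returns a dict with x_norm_clstoken and x_norm_patchtokens keys (as in the repo's vision_transformer.py); the GeM pooling here is hand-rolled for illustration, not a helper from the repo:

```python
import torch

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
model.eval()

x = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image batch

with torch.no_grad():
    feats = model.forward_features(x)

cls_token = feats["x_norm_clstoken"]        # (1, 384): one global token per image
patch_tokens = feats["x_norm_patchtokens"]  # (1, 256, 384): one token per 14x14 patch

# option 1: use the class token as the image embedding
emb = cls_token

# option 2: pool the patch tokens, e.g. average or generalized-mean (GeM) pooling
emb_avg = patch_tokens.mean(dim=1)
p = 4.0
emb_gem = patch_tokens.clamp(min=1e-6).pow(p).mean(dim=1).pow(1.0 / p)  # GeM assumes non-negative values, hence the clamp

# fancier: concatenate [CLS] token and GeM-pooled patch tokens, as with DINO's copy detection
emb_concat = torch.cat([cls_token, emb_gem], dim=-1)   # (1, 768)
```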
How about this? The output is a tuple of intermediate feature maps; you can then select which features you want from the tuple and try K-means, etc.
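Presumably the snippet above used get_intermediate_layers, which the DINOv2 backbones expose; a minimal sketch (the choice of n=4 and reshape=True is just illustrative):

```python
import torch

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
model.eval()

x = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image batch

with torch.no_grad():
    # outputs of the last 4 blocks; reshape=True returns (B, C, H, W) feature maps
    feats = model.get_intermediate_layers(x, n=4, reshape=True)

print(len(feats), feats[-1].shape)   # 4, torch.Size([1, 384, 16, 16]) for ViT-S/14 at 224x224

# pick a level, flatten to (num_patches, dim), then run K-means, PCA, ...
tokens = feats[-1].flatten(2).squeeze(0).T   # (256, 384)
```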
Yes, you could also use GeM-pooled patch tokens with this output, as mentioned above.
Sounds like this is all I need to do to get a features embedding.
Closing as this seems resolved (and using #53 to keep track of documentation needs on feature extraction).
Hello, how can I train a nearest-neighbors model on embeddings extracted with DINOv2 from images stored in per-class folders, and then retrieve the most similar image for a query image?

With the DINOv2-based approach above I get around 70% accuracy on the test data when retrieving images of the same class. Is there a way to improve my approach and increase the accuracy?
First, for k-NN classification, have a look at the k-NN evaluation code in the repo. Second, after a quick look at your code, I would suggest trying a different metric. Third, I believe you should use a different image pre-processing (cf. the transforms referenced below). For further questions, I would suggest creating a separate GitHub issue.
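As a rough sketch of such a retrieval pipeline (the cosine metric, the random stand-in gallery, and the use of the default class-token embedding are illustrative assumptions, not recommendations taken verbatim from this thread):

```python
import torch
from sklearn.neighbors import NearestNeighbors

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
model.eval()

def embed(batch):
    """batch: (N, 3, 224, 224) tensor that already went through the eval transform."""
    with torch.no_grad():
        return model(batch).numpy()   # class-token embeddings, (N, 384)

gallery_images = torch.randn(100, 3, 224, 224)   # stand-in for your preprocessed gallery
query_image = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed query

gallery_embs = embed(gallery_images)
query_emb = embed(query_image)

knn = NearestNeighbors(n_neighbors=5, metric='cosine')   # cosine instead of Euclidean (assumption)
knn.fit(gallery_embs)
distances, indices = knn.kneighbors(query_emb)
print(indices[0])   # indices of the 5 most similar gallery images
```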
Hey, thanks, I will look into it.
It is mentioned above (#2 (comment)): dinov2/dinov2/data/transforms.py, Lines 86 to 90 in c3c2683.

It is similar to what you did, but some values may differ, e.g.:
- dinov2/dinov2/data/transforms.py, Lines 80 to 84 in c3c2683
- dinov2/dinov2/data/transforms.py, Lines 43 to 44 in c3c2683
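If you have the dinov2 package installed, you can also reuse the repo's own evaluation transform instead of hand-rolling one; the function name below is the one I believe lives in dinov2/data/transforms.py, so treat it as an assumption and check the file:

```python
# resize + center crop + ToTensor + ImageNet-mean/std normalization, as defined in the repo
from dinov2.data.transforms import make_classification_eval_transform

eval_transform = make_classification_eval_transform()
```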
How can I visualize features like this? With this I'm getting an error; the feature shape is 1024. How would I fix this?
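A guess at the cause, with a sketch: a 1024-dimensional vector is a single pooled embedding (1024 is the ViT-L/14 feature dimension), so there is nothing spatial to reshape; for this kind of visualization you need the per-patch tokens from forward_features (dict keys as in the repo's vision_transformer.py):

```python
import torch
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
model.eval()

x = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image

with torch.no_grad():
    out = model.forward_features(x)

print(out["x_norm_clstoken"].shape)                       # (1, 1024): pooled, cannot be reshaped into a grid
patches = out["x_norm_patchtokens"].squeeze(0).numpy()    # (256, 1024): one token per patch

pca = PCA(n_components=3)
rgb = pca.fit_transform(patches)                   # (256, 3)
rgb = (rgb - rgb.min()) / (rgb.max() - rgb.min())  # rescale to [0, 1] for display
plt.imshow(rgb.reshape(16, 16, 3))
plt.show()
```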
Hi, it seems that I can get a feature embedding of shape [1, 256, 384] for an image, and after reshaping it to [1, 16, 16, 384] I can visualize the features. But how can I get a feature map with a larger resolution? I want finer information such as texture.
Hi @XiaominLi1997, use larger models: you can use ViT-g/14. Also increase the input image size in multiples of 14, e.g. 518 px (i.e. 14-pixel patch size × 37 patches).
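A quick sketch of how the patch-token grid grows with the input size (ViT-S/14 is used here only to keep the example light; the same scaling applies to ViT-g/14):

```python
import torch

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
model.eval()

for size in (224, 518):                       # both are multiples of the 14-pixel patch size
    x = torch.randn(1, 3, size, size)
    with torch.no_grad():
        tokens = model.forward_features(x)["x_norm_patchtokens"]
    side = size // 14
    print(size, tokens.shape, f"-> {side}x{side} grid")   # 224 -> 16x16, 518 -> 37x37
```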
Why do you need to normalize with mean (0.485, 0.456, 0.406)? Is this mentioned anywhere?
@ydove0324 This is the standard ImageNet mean used for training. It's common practice.
So I'm trying to look at the intermediate layers of DINOv2 for a given image and analyze their outputs. I want an embedding, and I'm trying to see if these intermediate outputs can be used instead. So I ran this code:
Error:

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

But my tensor shape is torch.Size([1, 3, 224, 224]), and the get_intermediate_layers output is a tuple. How do I interpret this? Also, should I use the hub model or the Hugging Face model for embeddings?
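That RuntimeError usually means the input tensor is on the GPU while the model weights are still on the CPU (or vice versa); a minimal sketch of putting both on the same device before calling get_intermediate_layers:

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
model = model.to(device).eval()                     # weights on the same device as the input

x = torch.randn(1, 3, 224, 224).to(device)          # stand-in for a preprocessed image

with torch.no_grad():
    layers = model.get_intermediate_layers(x, n=4)  # tuple: one tensor per requested block

print(len(layers), layers[-1].shape)                # 4, torch.Size([1, 256, 384]) for ViT-S/14
```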
@woctezuma I'm trying to understand the embeddings generated by DINOv2 in torch and by DINOv2 on Hugging Face. Hugging Face outputs (b, 257, 768); is that the CLS embedding and the patch embeddings stacked together, each being a 768-dimensional embedding?
I presume I don't need Normalize?
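For what it's worth, a sketch of how the Hugging Face output is usually read (facebook/dinov2-base has a 768-dimensional hidden size, and the (b, 257, 768) tensor is 1 CLS token plus 256 patch tokens stacked along the sequence dimension, not fused into a single vector); the image processor already resizes and applies the ImageNet normalization, which is why you typically don't add your own Normalize when going this route:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model = AutoModel.from_pretrained('facebook/dinov2-base')
model.eval()

image = Image.open('meta_dog.png').convert('RGB')        # placeholder image path
inputs = processor(images=image, return_tensors='pt')    # resize + ImageNet normalization happen here

with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs.last_hidden_state       # (1, 257, 768)
cls_embedding = hidden[:, 0]             # (1, 768): the CLS token
patch_embeddings = hidden[:, 1:]         # (1, 256, 768): one token per image patch
```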