inaccurate zero-shot inference #5
Sorry for the delayed reply. I have two messages to share with you:
I think the core of the problem is that, when monocular input is given, the image features output by the model are not passed through the linear layer, while the model was trained with that layer (see model.py for details). As a result, during zero-shot inference the image features are not well aligned with the text features. So in addition to the method mentioned above, you can also try using only the monocular features, but this requires a small rewrite of the code as below. This is a flaw in our design: for the downstream linear-probing and fine-tuning tasks, whether or not the image features pass through the linear layer makes no difference to the final result, so we did not take this into account. Sorry.
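To make the alignment issue concrete, here is a minimal NumPy sketch of the idea. All names (`W_proj`, the shapes, the random features) are hypothetical stand-ins, not the repo's actual model.py code: `W_proj` plays the role of the trained linear layer, and the point is that zero-shot scoring only works if the image features pass through it before being compared to the text features.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, proj_dim, n_img, n_cls = 512, 256, 5, 10

# Hypothetical stand-in for the trained linear layer (the one in model.py).
W_proj = rng.standard_normal((embed_dim, proj_dim)) / np.sqrt(embed_dim)
img_feats = rng.standard_normal((n_img, embed_dim))  # monocular encoder output
text_emb = rng.standard_normal((n_cls, proj_dim))    # class-prompt text embeddings

def normalize(x):
    # L2-normalize along the last axis so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Zero-shot scoring: project the image features first so they live in the
# same space as the text features, then compare by cosine similarity.
# Skipping W_proj (comparing img_feats to text_emb directly) is the bug:
# the dimensions and the learned alignment no longer match.
img_emb = normalize(img_feats @ W_proj)
logits = img_emb @ normalize(text_emb).T  # shape (n_img, n_cls)
pred = logits.argmax(axis=-1)             # best-matching class per image
print(logits.shape, pred.shape)
```

The same fix in the repo would amount to routing the monocular features through the model's projection layer before the similarity computation, exactly as the training path does.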
synfundus-sel.zip
I tested the 5 images in the zip archive above. They were selected from the SynFundus-1M synthetic dataset (https://github.com/parap1uie-s/SynFundus-1M), and each image comes with a caption.
The corresponding meta-information of the 5 images is:
The label file is:
However, all 5 images were best matched with "Myopia 近视". This is strange, because the pathological-myopia label (is_pm) is 0 for all 5 images, and the captions do not mention myopia.