Training CLIP-ViT #58
Comments
Did you write the training code yourself?
Yes, it's not hard.
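For context, the core of such training code is the symmetric contrastive objective described in the CLIP paper. Below is a minimal PyTorch sketch of that loss only; the function name is mine, `logit_scale` is assumed to already be the exponentiated learned temperature, and the encoders, data loading, and mixed-precision details are omitted.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric cross-entropy over image-text similarities, as in the CLIP paper."""
    # Normalize so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits, scaled by the learned temperature.
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The matching text for image i sits at index i (and vice versa).
    labels = torch.arange(image_features.shape[0], device=image_features.device)
    loss_i = F.cross_entropy(logits_per_image, labels)
    loss_t = F.cross_entropy(logits_per_text, labels)
    return (loss_i + loss_t) / 2
```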
Are you comparing ViT-based and ResNet50-based models trained with the same dataset? Vision transformers tend to underperform ResNet-based models unless they're trained on a huge dataset, so I'd suspect that could have been the reason, rather than the initialization scheme.
Thanks for your reply. We trained on 100M text-image pairs. It turned out that the ViT-B/32 model outperformed the ResNet50 model on some of the benchmarks.
@Meteorix what dataset did you use?
Hi, Meteorix:
@dragen1860 see #83 and also other third-party implementations by @KeremTurgutlu and @lucidrains.
Hi, I am also running ViT-B/32 with the open_clip code on CC3M, and I also notice a discrepancy between ResNet and ViT. You can see my curves at mlfoundations/open_clip#14. I am interested to know whether you solved this performance gap @jongwook @Meteorix
@jongwook Thanks for this great work!
I am trying to train CLIP ViT-B/32 from scratch, but I cannot get a higher score on ImageNet than CLIP ResNet-50. May I ask what initialization you used when training the ViT?
In the paper:
We closely follow their implementation with only the minor modification of adding an additional layer normalization to the combined patch and position embeddings before the transformer and use a slightly different initialization scheme.
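For reference, here is a minimal sketch of those two modifications, loosely following the released `VisionTransformer` in `clip/model.py`: the combined patch and position embeddings pass through an extra LayerNorm (`ln_pre`) before the transformer, and the class/position embeddings are drawn from a Gaussian scaled by `width ** -0.5` rather than the default. The module name `ViTStem` and the ViT-B/32 dimensions are my own illustration, and the transformer blocks themselves are omitted.

```python
import torch
from torch import nn


class ViTStem(nn.Module):
    """Patch embedding + class token + positional embedding, followed by the
    extra pre-transformer LayerNorm described in the paper."""

    def __init__(self, input_resolution: int = 224, patch_size: int = 32, width: int = 768):
        super().__init__()
        # Non-overlapping patch embedding via a strided convolution.
        self.conv1 = nn.Conv2d(3, width, kernel_size=patch_size, stride=patch_size, bias=False)

        # Class token and positional embedding initialized with std ~ width^-0.5,
        # i.e. a scaled-down Gaussian rather than the default N(0, 1).
        scale = width ** -0.5
        self.class_embedding = nn.Parameter(scale * torch.randn(width))
        num_tokens = (input_resolution // patch_size) ** 2 + 1
        self.positional_embedding = nn.Parameter(scale * torch.randn(num_tokens, width))

        # Additional LayerNorm applied to the combined patch + position
        # embeddings before the transformer blocks.
        self.ln_pre = nn.LayerNorm(width)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv1(x)                   # [B, width, grid, grid]
        x = x.flatten(2).permute(0, 2, 1)   # [B, grid*grid, width]
        cls = self.class_embedding.expand(x.shape[0], 1, -1)
        x = torch.cat([cls, x], dim=1)      # prepend class token
        x = x + self.positional_embedding
        return self.ln_pre(x)               # normalize before the transformer


# Quick shape check
stem = ViTStem()
tokens = stem(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 50, 768])
```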