
Training CLIP-ViT #58

Closed
Meteorix opened this issue Mar 10, 2021 · 9 comments

@Meteorix

@jongwook Thanks for this great work!

I am trying to train CLIP ViT-B/32 from scratch, but cannot get a higher score on ImageNet than with CLIP ResNet-50. May I ask what initialization you use when training the ViT?

From the paper: "We closely follow their implementation with only the minor modification of adding an additional layer normalization to the combined patch and position embeddings before the transformer and use a slightly different initialization scheme."
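For reference, that sentence corresponds to the `ln_pre` layer and the scaled random initialization in this repo's model.py. A minimal sketch of the ViT input stem, with names mirroring the repo but treated as illustrative rather than the full model:

```python
import torch
import torch.nn as nn

# Sketch of the ViT input stem described in the quoted sentence: patch
# embedding + class/position embeddings, followed by the extra LayerNorm
# ("ln_pre") applied before the transformer blocks.
class VisionTransformerStem(nn.Module):
    def __init__(self, input_resolution=224, patch_size=32, width=768):
        super().__init__()
        self.conv1 = nn.Conv2d(3, width, kernel_size=patch_size,
                               stride=patch_size, bias=False)
        scale = width ** -0.5  # the "slightly different" scaled init
        self.class_embedding = nn.Parameter(scale * torch.randn(width))
        self.positional_embedding = nn.Parameter(
            scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width))
        self.ln_pre = nn.LayerNorm(width)  # extra LayerNorm before the transformer

    def forward(self, x):
        x = self.conv1(x)                 # [B, width, grid, grid]
        x = x.flatten(2).transpose(1, 2)  # [B, grid**2, width]
        cls = self.class_embedding.expand(x.shape[0], 1, -1)
        x = torch.cat([cls, x], dim=1) + self.positional_embedding
        return self.ln_pre(x)             # normalize combined patch+position embeddings
```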

Meteorix changed the title from "Vit" to "Training CLIP-ViT" on Mar 10, 2021
@yuxulingche

Did you write the training code yourself?

@Meteorix
Author

> Did you write the training code yourself?

Yes, it's not hard.
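The core of such training code is the symmetric contrastive objective from the CLIP paper. A minimal PyTorch sketch (not Meteorix's actual code; it assumes `image_features` and `text_features` are L2-normalized encoder outputs):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    """Symmetric contrastive loss as described in the CLIP paper.

    Assumes image_features and text_features are [B, D] and L2-normalized,
    and logit_scale is the learned temperature (exp of a scalar parameter).
    """
    logits = logit_scale * image_features @ text_features.t()    # [B, B] cosine similarities
    labels = torch.arange(logits.shape[0], device=logits.device)  # matched pairs on the diagonal
    loss_i = F.cross_entropy(logits, labels)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), labels)  # text -> image direction
    return (loss_i + loss_t) / 2
```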

@jongwook
Collaborator

Are you comparing ViT-based and ResNet50-based models trained with the same dataset? Vision transformers tend to underperform ResNet-based models unless they're trained on a huge dataset, so I'd suspect that could have been the reason, rather than the initialization scheme.

@Meteorix
Author

> Are you comparing ViT-based and ResNet50-based models trained with the same dataset? Vision transformers tend to underperform ResNet-based models unless they're trained on a huge dataset, so I'd suspect that could have been the reason, rather than the initialization scheme.

Thanks for your reply. We trained on 100M text-image pairs. It turned out that the ViT-B/32 model outperformed the ResNet-50 model on some of the benchmarks.

@ValerioB88

@Meteorix what dataset did you use?

@dragen1860

> @jongwook Thanks for this great work!
>
> I am trying to train CLIP ViT-B/32 from scratch, but cannot get a higher score on ImageNet than with CLIP ResNet-50. May I ask what initialization you use when training the ViT?
>
> From the paper: "We closely follow their implementation with only the minor modification of adding an additional layer normalization to the combined patch and position embeddings before the transformer and use a slightly different initialization scheme."

Hi Meteorix,
I am also working on this exciting project. I see that you have implemented your own training code; that's pretty cool. May I ask you to share it? It would be great if you could kindly send the code to me at liangqu.long@gmail.com. Thank you very much.

@jongwook
Collaborator

jongwook commented Apr 8, 2021

@dragen1860 see #83 and also other third-party implementations by @KeremTurgutlu and @lucidrains.

jongwook closed this as completed on Apr 8, 2021
@JACKHAHA363

Hi, I am also running ViT-B/32 with the open_clip code on CC3M, and I also notice a discrepancy between ResNet and ViT. You can see my curves at mlfoundations/open_clip#14. I am interested to know whether you solved this performance gap. @jongwook @Meteorix

@Shelton-Zhou

Shelton-Zhou commented Nov 3, 2021

> @jongwook Thanks for this great work!
>
> I am trying to train CLIP ViT-B/32 from scratch, but cannot get a higher score on ImageNet than with CLIP ResNet-50. May I ask what initialization you use when training the ViT?
>
> From the paper: "We closely follow their implementation with only the minor modification of adding an additional layer normalization to the combined patch and position embeddings before the transformer and use a slightly different initialization scheme."

Hi Meteorix,
I am also working on this exciting project. I see that you have implemented your own training code. May I ask you to share it? It would be great if you could kindly send the code to me at x888632157@163.com. Thanks a lot!
