Here are my curves. RN50 roughly matches the one shown in the repo, but ViT-B/32 is worse. I am using the hyperparameters from the README. Could you also share the performance curves of ViT-B/32 on CC?
ViT-B performed worse for us on CC than RN50. I suspect (but cannot prove) this is because there's not enough data; vision transformers appear more data-hungry than ResNets. I don't have the accuracy offhand, but this looks comparable to what we were seeing.
This is expected, and your numbers appear reasonable. Having trained quite a few models at the lower end recently, I've found that ViT-B models (even the smaller ones) underperform similarly sized ResNet models on smaller datasets. This holds up to at least the 12-15M sample range: I was unable to push ViT-B/32 past RN50 on cc12m or yfcc15m. I suspect the crossover point is somewhere in the 40-100M sample range, but I have not verified that.
One could possibly work around this by using a pretrained backbone for the vision tower. There is partial support for this right now via some (preliminary) support for timm models...
You'd then be starting with a vision tower pretrained on ImageNet. It significantly speeds up reaching decent zero-shot eval results, BUT I'd caution against using an ImageNet-pretrained backbone and then doing zero-shot eval on ImageNet; you'd probably want an alternate zero-shot test dataset.
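As a rough sketch of what that could look like (assuming timm is installed; the model name and the projection layer below are illustrative glue, not the repo's actual timm wiring, so check the timm integration in the repo for the real config):

```python
# Rough sketch: build an ImageNet-pretrained ViT-B/32 via timm to use as a
# vision tower. The linear projection is hypothetical glue added only to map
# timm's feature width to a CLIP-style shared embedding dimension.
import timm
import torch.nn as nn

# Pretrained backbone; num_classes=0 strips the classifier head so forward()
# returns pooled features instead of ImageNet logits.
backbone = timm.create_model("vit_base_patch32_224", pretrained=True, num_classes=0)

# Hypothetical projection from the backbone's feature width (768 for ViT-B)
# to a 512-dim image/text embedding space, as in CLIP ViT-B/32.
vision_tower = nn.Sequential(backbone, nn.Linear(backbone.num_features, 512))
```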