-
Hi, I wonder which training script was used to train the CoCa models? Right now I can only find the fine-tuning script for MSCOCO. I thought this would be easy and that directly applying the CLIP training script with a different model config like coca_ViT-B-32.json would do the job, but it turns out that CoCa consumes significantly more GPU memory. For example, with a script that uses ViT-B-32 and a batch size of 4096 per GPU, I found that simply changing the model from ViT-B-32 to coca_ViT-B-32 reduced the maximum batch size per GPU to 256, a 16x reduction. Is that normal?

If so, what hyper-parameters were used when training CoCa models like https://huggingface.co/laion/CoCa-ViT-B-32-laion2B-s13B-b90k or https://huggingface.co/laion/CoCa-ViT-L-14-laion2B-s13B-b90k? Right now the READMEs are all empty. Did you simply set a much smaller batch size per GPU?

I am not entirely sure why CoCa needs that much memory. It seems to me the text decoder is a Multimodal Transformer that is about twice the size of the visual encoder. According to this post, CoCa does not need much extra compute, and the resulting contrastive captioning model works better than CLIP with similar parameter and data budgets, so I figure it is worth a try. Thanks a lot!
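For reference, here is a minimal sketch of how I'd compare the two models' sizes (just an illustration, assuming open_clip_torch is installed and using the standard open_clip config names). The CoCa variant carries the extra text decoder, and during training every extra parameter also costs gradients, optimizer state, and activations:

```python
# Rough comparison of parameter counts, not a measurement of actual GPU
# memory: activations, gradients, and optimizer states scale with these
# counts and with the batch size.
import open_clip

def count_params(model_name: str) -> int:
    model = open_clip.create_model(model_name)  # no pretrained weights needed
    return sum(p.numel() for p in model.parameters())

for name in ("ViT-B-32", "coca_ViT-B-32"):
    print(f"{name}: {count_params(name) / 1e6:.1f}M parameters")
```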
-
Here are the training params we used to train all CoCa models:
We get a large global batch size by using data-parallel training over a large number of A100 GPUs. For example, for the L/14 CoCa we had a local batch size of 235 but trained the model on 384 A100s: 384 * 235 ≈ 90k. The maximum possible local batch size of CoCa is naturally going to be smaller, because you need to fit a whole extra transformer onto the same GPU, and depending on the precision you use, that takes up memory accordingly. Run the numbers for yourself if you want: for the B/32 CoCa the decoder is about 100M params, and you need to hold its weights, activations, and gradients (taking into account the precision you're using).
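If you want a rough idea of just the static part (weights, gradients, optimizer states) for that extra decoder, here is a back-of-the-envelope sketch. The 100M figure is from above; the bytes-per-value and Adam-state assumptions are illustrative, and activation memory, which scales with batch size and is usually the dominant term, is deliberately left out:

```python
# Back-of-the-envelope estimate of the static memory added by the CoCa text
# decoder. Illustrative assumptions only: Adam keeps two fp32 moments per
# parameter (8 bytes), and activation memory is ignored even though it is
# usually the dominant term at large batch sizes.
def decoder_static_memory_gib(n_params: float, weight_bytes: int, grad_bytes: int,
                              adam_state_bytes: int = 8) -> float:
    total_bytes = n_params * (weight_bytes + grad_bytes + adam_state_bytes)
    return total_bytes / 1024**3

n = 100e6  # ~100M extra decoder params for the B/32 CoCa (see above)
print(f"fp32 training: {decoder_static_memory_gib(n, 4, 4):.2f} GiB")
print(f"fp16/amp training: {decoder_static_memory_gib(n, 2, 2):.2f} GiB (plus any fp32 master copy)")

# The large global batch size comes from data parallelism, not one GPU:
# 384 GPUs * 235 local batch = 90,240, i.e. ~90k for the L/14 CoCa.
print(384 * 235)
```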