-
Hi, I wonder which training script was used to train the CoCa models? Right now I can only find the fine-tuning script for MSCOCO. I thought this would be easy and that directly applying the CLIP training script with a different model config like coca_ViT-B-32.json would do the job, but it turns out that CoCa consumes significantly more GPU memory. For example, with a script that uses ViT-B-32 and a batch size of 4096 per GPU, I found that simply changing the model from ViT-B-32 to coca_ViT-B-32 reduced the maximum batch size per GPU to 256, a 16x reduction. Is that normal?

If so, what hyper-parameters were used when training CoCa models like https://huggingface.co/laion/CoCa-ViT-B-32-laion2B-s13B-b90k or https://huggingface.co/laion/CoCa-ViT-L-14-laion2B-s13B-b90k? Right now the READMEs are all empty. Did you simply set a much smaller batch size per GPU?

I am not entirely sure why CoCa needs that much memory. It seems to me the text decoder is a Multimodal Transformer that is about twice the size of the visual encoder. According to this post, CoCa does not need much extra compute, and the resulting contrastive captioning model works better than CLIP with similar parameter and data budgets, so I figure it is worth a try. Thanks a lot!
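For reference, here is a minimal sketch of how I'd compare the two models' sizes (just an illustration, assuming open_clip_torch is installed and using the standard open_clip config names). The CoCa variant carries the extra text decoder, and during training every extra parameter also costs gradients, optimizer state, and activations:

```python
# Rough comparison of parameter counts, not a measurement of actual GPU
# memory: activations, gradients, and optimizer states scale with these
# counts and with the batch size.
import open_clip

def count_params(model_name: str) -> int:
    model = open_clip.create_model(model_name)  # no pretrained weights needed
    return sum(p.numel() for p in model.parameters())

for name in ("ViT-B-32", "coca_ViT-B-32"):
    print(f"{name}: {count_params(name) / 1e6:.1f}M parameters")
```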
-
Here are the training params we used to train all CoCa models:
We get a large global batch size by using data-parallel training over a large number of A100 GPUs. For example, for the L/14 CoCa we had a local batch size of 235 but trained the model on 384 A100s: 384 * 235 ≈ 90k. The maximum possible local batch size of CoCa is naturally going to be smaller, because you need to fit a whole extra transformer onto the same GPU, and depending on the precision you use, that takes up memory accordingly. Run the numbers for yourself if you want: for the B/32 CoCa the decoder is about 100M params, and you need to hold its weights, activations, and gradients (taking into account the precision you're using).
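If you want a rough idea of just the static part (weights, gradients, optimizer states) for that extra decoder, here is a back-of-the-envelope sketch. The 100M figure is from above; the bytes-per-value and Adam-state assumptions are illustrative, and activation memory, which scales with batch size and is usually the dominant term, is deliberately left out:

```python
# Back-of-the-envelope estimate of the static memory added by the CoCa text
# decoder. Illustrative assumptions only: Adam keeps two fp32 moments per
# parameter (8 bytes), and activation memory is ignored even though it is
# usually the dominant term at large batch sizes.
def decoder_static_memory_gib(n_params: float, weight_bytes: int, grad_bytes: int,
                              adam_state_bytes: int = 8) -> float:
    total_bytes = n_params * (weight_bytes + grad_bytes + adam_state_bytes)
    return total_bytes / 1024**3

n = 100e6  # ~100M extra decoder params for the B/32 CoCa (see above)
print(f"fp32 training: {decoder_static_memory_gib(n, 4, 4):.2f} GiB")
print(f"fp16/amp training: {decoder_static_memory_gib(n, 2, 2):.2f} GiB (plus any fp32 master copy)")

# The large global batch size comes from data parallelism, not one GPU:
# 384 GPUs * 235 local batch = 90,240, i.e. ~90k for the L/14 CoCa.
print(384 * 235)
```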