Learning rate scaling for distributed training? #8
Btw, LION seems to update parameters faster per training step than Adam or AdamW.
I have run a few experiments and got unexpected results. It seems as if Lion doesn't follow the traditional LR scaling law. With a per-GPU batch size of 64 across multiple GPUs, no matter how much we scale the LR, training is only 10-20% faster. For example, at iteration 100 the training loss on 1 GPU (batch size 64) is 0.001; on 4 GPUs (effective batch size 256) with a 4x larger LR, the iteration-100 loss is 0.0009. I have tried keeping the LR the same, 2x larger, 4x larger, and much larger, but it doesn't help. With Adam, the scaling law does apply. I would appreciate any ideas here. Thanks
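For reference, a minimal sketch of the linear LR scaling rule being described (assuming lion-pytorch and a 4-GPU DDP-style setup; the batch sizes match the ones quoted above, while the base LR and model are purely illustrative):

```python
import torch
from lion_pytorch import Lion

# assumptions: per-GPU batch size 64, 4 GPUs -> effective batch size 256
world_size = 4
base_lr = 1e-4                      # illustrative single-GPU LR, not from the thread
scaled_lr = base_lr * world_size    # linear scaling rule: LR grows with effective batch size

model = torch.nn.Linear(128, 10)    # stand-in model

# Lion is typically run with a 3-10x smaller LR than AdamW; the open question here is
# whether the linear scaling rule on top of that still holds for Lion.
optimizer = Lion(model.parameters(), lr=scaled_lr, weight_decay=1e-2)
```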
@simasima121 That's interesting, thanks! I wonder whether you get any better results with LION.
Hi @lucidrains, thanks for this implementation.
I wonder whether you're using distributed training for your experiments. If so, do you scale your learning rate by the number of processes (GPUs), as noted in Accelerate's docs (on top of the downscaling recommended for the LION optimizer, and even if you're not using Accelerate)?
If you don't scale the learning rate, do you recommend doing so?
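To make the question concrete, here is a hedged sketch of what that scaling would look like with Accelerate (the base LR and model are illustrative assumptions, not values from this repo):

```python
import torch
from accelerate import Accelerator
from lion_pytorch import Lion

accelerator = Accelerator()

model = torch.nn.Linear(128, 10)    # stand-in model

# Base LR already downscaled for Lion (the Lion paper suggests roughly 3-10x smaller
# than AdamW), then scaled linearly by the number of processes, since the effective
# batch size grows with the number of GPUs.
base_lr = 1e-4                                  # illustrative value
lr = base_lr * accelerator.num_processes

optimizer = Lion(model.parameters(), lr=lr, weight_decay=1e-2)
model, optimizer = accelerator.prepare(model, optimizer)
```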