Learning rate scaling for distributed training? #8
Btw, LION seems to update parameters faster per training step than Adam or AdamW.
I have run a few experiments and got unexpected results. It seems as if Lion doesn't follow the traditional LR scaling law. With a per-GPU batch size of 64 across multiple GPUs, no matter how much we scale the LR, training is only 10-20% faster. For example, at iteration 100 the training loss on 1 GPU (batch size 64) is 0.001; on 4 GPUs (effective batch size 256) with a 4x larger LR, the iteration-100 loss is 0.0009. I have tried keeping the LR the same, 2x larger, 4x larger, and much larger, but it doesn't help. With Adam, the scaling law does apply. I would appreciate any ideas here. Thanks
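For reference, a minimal sketch of the linear LR scaling rule being described (assuming lion-pytorch and a 4-GPU DDP-style setup; the batch sizes match the ones quoted above, while the base LR and model are purely illustrative):

```python
import torch
from lion_pytorch import Lion

# assumptions: per-GPU batch size 64, 4 GPUs -> effective batch size 256
world_size = 4
base_lr = 1e-4                      # illustrative single-GPU LR, not from the thread
scaled_lr = base_lr * world_size    # linear scaling rule: LR grows with effective batch size

model = torch.nn.Linear(128, 10)    # stand-in model

# Lion is typically run with a 3-10x smaller LR than AdamW; the open question here is
# whether the linear scaling rule on top of that still holds for Lion.
optimizer = Lion(model.parameters(), lr=scaled_lr, weight_decay=1e-2)
```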
@simasima121 That's interesting, thanks! I wonder whether you get any better results with LION.
Hi @lucidrains, thanks for this implementation.
I wonder whether you're using distributed training for your experiments. If so, do you scale your learning rate by the number of processes (GPUs), as noted in Accelerate's docs (on top of the downscaling recommended for the LION optimizer, and even if you're not using Accelerate)?
If you don't scale the learning rate, do you recommend doing so?
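To make the question concrete, here is a hedged sketch of what that scaling would look like with Accelerate (the base LR and model are illustrative assumptions, not values from this repo):

```python
import torch
from accelerate import Accelerator
from lion_pytorch import Lion

accelerator = Accelerator()

model = torch.nn.Linear(128, 10)    # stand-in model

# Base LR already downscaled for Lion (the Lion paper suggests roughly 3-10x smaller
# than AdamW), then scaled linearly by the number of processes, since the effective
# batch size grows with the number of GPUs.
base_lr = 1e-4                                  # illustrative value
lr = base_lr * accelerator.num_processes

optimizer = Lion(model.parameters(), lr=lr, weight_decay=1e-2)
model, optimizer = accelerator.prepare(model, optimizer)
```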