[PPOTrainer] Support generic optimizers #78
Conversation
The documentation is not available anymore as the PR was closed or merged.
Regarding 8-bit Adam, it is quite hard to make it converge. I have found that the model rapidly falls into a collapse mode: https://wandb.ai/distill-bloom/trl/runs/k7vogzao?workspace=user-younesbelkada. Let me know if it still makes sense to add the example.
Looks good to me, just a small nit. Do you want to add the scheduler here or in a new PR?
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Thanks! Let's address the scheduler in a follow-up PR!
@younesbelkada FYI, 8-bit Adam converges only after you do a fair amount of work on reward normalization; see CarperAI/trlx#53. We also had significant issues getting it working. There was also a recent bug in computing values that we found, which I believe was carried over from TRL; I'll have to double check with one of my engineers on this.
Never mind, it appears the bug is a non-issue for TRL.
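
For context on the reward-normalization point raised above, here is a minimal sketch of running reward whitening. This is purely illustrative and hedged: the `RunningRewardNormalizer` class and its usage are hypothetical names, not the actual scheme used in trlx or TRL (see CarperAI/trlx#53 for the real discussion).

```python
# Hedged sketch: running reward normalization (Welford's algorithm).
# Class and method names are hypothetical, for illustration only.
import torch


class RunningRewardNormalizer:
    """Track a running mean/std of scalar rewards and whiten new batches."""

    def __init__(self, eps: float = 1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean (Welford)
        self.eps = eps

    def update(self, rewards: torch.Tensor) -> None:
        for r in rewards.flatten().tolist():
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (r - self.mean)

    def normalize(self, rewards: torch.Tensor) -> torch.Tensor:
        std = (self.m2 / max(self.count - 1, 1)) ** 0.5
        return (rewards - self.mean) / (std + self.eps)


# Usage: update the statistics with each batch of rewards,
# then whiten them before the PPO step.
normalizer = RunningRewardNormalizer()
rewards = torch.tensor([0.2, 1.5, -0.3, 0.8])
normalizer.update(rewards)
rewards = normalizer.normalize(rewards)
```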
This PR adds support for generic optimizers. Before this PR, the PPOTrainer only supported the Adam optimizer; users are now free to use any optimizer. Also added an example that leverages 8-bit Adam, which is lighter and faster than the classic Adam optimizer. cc @lewtun @lvwerra @edbeeching
As a side note, 8-bit Adam should support DP out of the box.
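
To illustrate the feature described in this PR, below is a hedged sketch of passing a custom optimizer such as bitsandbytes' 8-bit Adam to the PPOTrainer. The class names (`PPOConfig`, `AutoModelForCausalLMWithValueHead`) and the `optimizer=` keyword follow a later trl-style API and are assumptions here; the exact constructor in the code touched by this PR may differ.

```python
# Sketch (assumptions flagged): inject any torch.optim-style optimizer into PPOTrainer.
# The trl class names and the `optimizer=` keyword below are assumptions relative
# to the exact code in this PR.
import bitsandbytes as bnb
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # illustrative choice
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

config = PPOConfig(model_name=model_name, learning_rate=1.41e-5)

# 8-bit Adam from bitsandbytes: lighter and faster than classic Adam.
optimizer = bnb.optim.Adam8bit(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=config.learning_rate,
)

# Before this PR the trainer built a plain Adam internally; with generic
# optimizer support, the optimizer can be injected instead.
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, optimizer=optimizer)
```

Injecting the optimizer rather than hard-coding Adam also makes it straightforward to later add learning-rate schedulers or swap in other memory-efficient optimizers, which is the follow-up discussed in the conversation above.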