RLHF with PPO #1005
Just out of curiosity, does bf16 have different implications for training stability in a PPO training loop as compared to SFT?
I think so. I'm not sure I have the expertise to answer this from experience, but intuitively there are several factors in PPO optimisation that help stabilise training and reduce variance in gradient updates, so the impact of reduced precision may be smaller than in SFT scenarios.
The PPO loss also isn't a "distance" like in SFT, so you may not see the same loss landscapes, because gradient updates point in the direction of maximising the reward. This means the smoothness of the loss landscape depends more on the amount of variance in your trajectories, e.g. if your generations are significantly different from each other (due to generation args), your reward model isn't well-calibrated, or something as simple as your batch size being too small.
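For what it's worth, here's a rough sketch of the distinction I mean (variable names are illustrative only, not from this PR): SFT minimises a cross-entropy "distance" to fixed targets, while the PPO policy loss maximises expected advantage with a clipped ratio that bounds the size of each update.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, targets):
    # SFT: cross-entropy against fixed targets, i.e. a "distance".
    # logits: (batch, seq_len, vocab), targets: (batch, seq_len)
    return F.cross_entropy(logits.flatten(0, 1), targets.flatten())

def ppo_policy_loss(logprobs, old_logprobs, advantages, epsilon=0.2):
    # PPO: maximise expected advantage; the clipped ratio bounds each
    # update, which can damp the effect of noisy (e.g. low-precision) gradients.
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```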
Empirically, the reference TRL results I compared against below used fp32, and my results were in bf16.