We know that the KL term in the loss acts as a constraint on the difference between the original (reference) gpt2 and the active gpt2 that generates responses for reward feedback.
How can I tune the parameters to relax this constraint? I would like the active gpt2 to be able to deviate further from the reference gpt2, because in my experiments the rewards do not improve as expected, possibly due to this constraint.
I am new to PPO. Hoping for some suggestions.
Thanks.
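
Not an authoritative answer, but here is a minimal sketch of where that constraint usually lives, assuming a trl-style PPO setup for language models: the per-token reward is typically shaped by subtracting a KL estimate between the active and reference models, scaled by a coefficient. Lowering that coefficient weakens the penalty and lets the active gpt2 drift further from the reference. The function and variable names below are illustrative, not the library's API.

```python
import torch

def kl_penalized_rewards(rewards, logprobs_active, logprobs_ref, kl_coef=0.2):
    """Illustrative reward shaping used in PPO-based LM fine-tuning (hypothetical names).

    The per-token KL estimate penalizes the active gpt2 for drifting away from the
    frozen reference gpt2. A smaller kl_coef relaxes the constraint, letting the
    active model deviate further in pursuit of higher rewards.
    """
    kl_estimate = logprobs_active - logprobs_ref   # per-token log-ratio (KL estimate)
    return rewards - kl_coef * kl_estimate         # shaped reward fed to the PPO step


# Example: dropping kl_coef from the default 0.2 to 0.05 weakens the KL penalty.
rewards = torch.tensor([1.0, 0.5, 0.8])
lp_active = torch.tensor([-1.2, -0.8, -2.0])
lp_ref = torch.tensor([-1.5, -1.0, -1.9])
print(kl_penalized_rewards(rewards, lp_active, lp_ref, kl_coef=0.05))
```

If you are using trl's PPOTrainer, the corresponding knobs are usually the initial KL coefficient and, in versions with adaptive KL control, the KL target (e.g., `init_kl_coef` and `target` in the PPO config of some releases); lowering the coefficient or raising the target both loosen the constraint, at the cost of larger drift from the reference model and a higher risk of reward hacking. Please check the config of the version you run.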