Richar-Du changed the title from "Question" to "Question about the loss function and reference model" on Dec 25, 2022.
Thanks for your awesome work! I'm studying your code and want to implement it in my system. I have the following two questions:
When calculating the delta, I wonder why the discount factor `gamma` multiplies `nextvalues` instead of `rewards[t+1]`. I think the cumulative reward should have no relation to the value: https://github.com/lvwerra/trl/blob/44fb7326fc2440756f27e38be5220dd668fc92bc/trl/ppo.py#L237
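
For reference, here is how I currently understand the delta computation, written as a minimal single-trajectory sketch (not the repo's exact code; the function name `compute_gae`, the 1-D tensor shapes, and the default `gamma`/`lam` values are my own for illustration):

```python
import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    rewards: tensor of shape (T,) with per-step rewards r_t
    values:  tensor of shape (T,) with value estimates V(s_t)
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    lastgaelam = 0.0
    for t in reversed(range(T)):
        # V(s_{t+1}) is taken to be 0 past the end of the trajectory.
        nextvalues = values[t + 1] if t < T - 1 else 0.0
        # TD residual as defined in GAE: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
        # gamma multiplies the *value* of the next state, not the next reward,
        # since V(s_{t+1}) is itself an estimate of the discounted future return.
        delta = rewards[t] + gamma * nextvalues - values[t]
        lastgaelam = delta + gamma * lam * lastgaelam
        advantages[t] = lastgaelam
    return advantages
```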
I notice that in other implementations, the old policy network is also updated (just more slowly than the active policy network). In your implementation, the old policy network is never updated and always stays the same as the original GPT-2 checkpoint. Am I right?
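
For reference, this is my mental model of the frozen setup, as a minimal sketch assuming the Hugging Face `transformers` GPT-2 classes (the variable names and the per-token KL estimate below are my own for illustration, not necessarily exactly what trl does):

```python
import torch
from transformers import GPT2LMHeadModel

# Active policy: updated by PPO at every optimization step.
policy = GPT2LMHeadModel.from_pretrained("gpt2")

# Reference ("old") model: a second copy of the same checkpoint,
# frozen for the whole run and used only for the KL penalty.
ref_model = GPT2LMHeadModel.from_pretrained("gpt2")
ref_model.eval()
for p in ref_model.parameters():
    p.requires_grad = False  # never receives gradient updates

def per_token_kl(logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    # Simple per-token KL estimate, log pi(a|s) - log pi_ref(a|s);
    # it penalizes the policy for drifting away from the frozen checkpoint.
    return logprobs - ref_logprobs
```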
Could you please help clarify these questions? Thanks again :)