
Question about the loss function and reference model #57

Closed

Richar-Du opened this issue Dec 25, 2022 · 1 comment

Richar-Du commented Dec 25, 2022

Thanks for your awesome work! I'm studying your code and want to implement it in my own system. I have two questions:

  1. When calculating the delta, why is the discount factor gamma multiplied by nextvalues rather than by rewards[t+1]? I thought the discounted return is accumulated from the rewards and has no relation to the value function. The loop I am referring to is sketched after this comment:
    https://github.com/lvwerra/trl/blob/44fb7326fc2440756f27e38be5220dd668fc92bc/trl/ppo.py#L237

  2. I notice that in other implementations the old policy network is also updated (just more slowly than the active policy network). In your implementation, the old policy network is never updated and always stays identical to the original GPT-2 checkpoint. Am I right?

Could you please explain these questions? Thanks again :)
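
Here is a minimal sketch of the Generalized Advantage Estimation (GAE) loop in question; the variable names mirror the linked ppo.py, but this is an illustrative reimplementation, not the exact trl code:

```python
def compute_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one generated sequence.

    `rewards` and `values` are lists of per-token floats of equal length.
    """
    lastgaelam = 0.0
    advantages = []
    gen_len = len(rewards)
    for t in reversed(range(gen_len)):
        # V(s_{t+1}); there is no next value beyond the last step
        nextvalues = values[t + 1] if t < gen_len - 1 else 0.0
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * nextvalues - values[t]
        # GAE: exponentially weighted sum of future TD residuals
        lastgaelam = delta + gamma * lam * lastgaelam
        advantages.insert(0, lastgaelam)
    return advantages
```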

Richar-Du changed the title from "Question" to "Question about the loss function and reference model" on Dec 25, 2022

lvwerra commented Jan 13, 2023

Regarding 1: see equation (11) in https://arxiv.org/abs/1506.02438. Regarding 2: yes, you are correct.
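
For context, the delta in that paper is the temporal-difference residual `delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)`, which is why the discount is applied to the next state's value rather than to the next reward. On point 2, here is a rough sketch of how a frozen reference model typically enters training through a per-token KL penalty on the reward; the function name and `kl_coef` are illustrative, not the exact trl API:

```python
def kl_penalized_rewards(logprobs, ref_logprobs, score, kl_coef=0.2):
    # logprobs / ref_logprobs: 1-D torch tensors holding the log-probs of the
    # generated tokens under the active policy and the frozen reference model
    kl = logprobs - ref_logprobs        # per-token KL estimate
    rewards = -kl_coef * kl             # penalize drifting from the reference
    rewards[-1] = rewards[-1] + score   # scalar task reward on the final token
    return rewards
```

Because the reference model's parameters are never handed to the optimizer, only the active policy (and its value head) change during training.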
