Richar-Du changed the title from "Question" to "Question about the loss function and reference model" on Dec 25, 2022.
Thanks for your awesome work! I'm studying your code and want to implement it in my system. I have the following two questions:
When calculating the delta, I wonder why the discount factor `gamma` multiplies `nextvalues` instead of `rewards[t+1]`. I think the cumulative reward should have no relation to the value: https://github.com/lvwerra/trl/blob/44fb7326fc2440756f27e38be5220dd668fc92bc/trl/ppo.py#L237
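
For reference, here is how I currently understand the delta computation, written as a minimal single-trajectory sketch (not the repo's exact code; the function name `compute_gae`, the 1-D tensor shapes, and the default `gamma`/`lam` values are my own for illustration):

```python
import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    rewards: tensor of shape (T,) with per-step rewards r_t
    values:  tensor of shape (T,) with value estimates V(s_t)
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    lastgaelam = 0.0
    for t in reversed(range(T)):
        # V(s_{t+1}) is taken to be 0 past the end of the trajectory.
        nextvalues = values[t + 1] if t < T - 1 else 0.0
        # TD residual as defined in GAE: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
        # gamma multiplies the *value* of the next state, not the next reward,
        # since V(s_{t+1}) is itself an estimate of the discounted future return.
        delta = rewards[t] + gamma * nextvalues - values[t]
        lastgaelam = delta + gamma * lam * lastgaelam
        advantages[t] = lastgaelam
    return advantages
```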
I notice that in other implementations, the old policy network is also updated (just more slowly than the active policy network). In your implementation, the old policy network is never updated and always stays the same as the original GPT-2 checkpoint. Am I right?
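
For reference, this is my mental model of the frozen setup, as a minimal sketch assuming the Hugging Face `transformers` GPT-2 classes (the variable names and the per-token KL estimate below are my own for illustration, not necessarily exactly what trl does):

```python
import torch
from transformers import GPT2LMHeadModel

# Active policy: updated by PPO at every optimization step.
policy = GPT2LMHeadModel.from_pretrained("gpt2")

# Reference ("old") model: a second copy of the same checkpoint,
# frozen for the whole run and used only for the KL penalty.
ref_model = GPT2LMHeadModel.from_pretrained("gpt2")
ref_model.eval()
for p in ref_model.parameters():
    p.requires_grad = False  # never receives gradient updates

def per_token_kl(logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    # Simple per-token KL estimate, log pi(a|s) - log pi_ref(a|s);
    # it penalizes the policy for drifting away from the frozen checkpoint.
    return logprobs - ref_logprobs
```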
Could you please help clarify these questions? Thanks again :)