
Custom reward function support for PPO trainer #2540

Draft
wants to merge 3 commits into base: main

Conversation

August-murr
Collaborator

What does this PR do?

Fixes #2518

Adding support for a custom reward function for the PPO trainer.

How it works

Write a custom function that takes a list of texts as input, representing a batch of responses, and outputs a list of scores.

def custom_reward_function(texts: list) -> list:
    """
    Custom reward function that applies a given reward logic to each text in the batch.

    Args:
        texts (list): List of response texts to evaluate.

    Returns:
        list: List of rewards based on the provided reward logic.
    """
    # reward_logic is a placeholder for whatever scoring rule you want to apply.
    rewards = [reward_logic(text) for text in texts]
    return rewards
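
For instance, a toy reward that just counts a keyword in each response (a hypothetical illustration of the list-in/list-out contract above; the function name is made up):

def keyword_reward(texts: list) -> list:
    # Reward each response by how many times it contains the word "good".
    return [float(text.lower().count("good")) for text in texts]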

I will add more documentation and explanations later after running several tests to make sure the implementation is functional.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines.
  • Did you write any new necessary tests?

Who can review?

@qgallouedec

@@ -1049,14 +1049,20 @@ def first_true_indices(bools: torch.Tensor, dtype=torch.long):


def get_reward(
Collaborator Author

This is where the primary change is: modifying the get_reward function to work with both an nn.Module and a Callable.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@August-murr
Collaborator Author

Question:

Does this theoretically work? I'm asking because I haven't read the PPO papers. When the PPO trainer is training, it outputs the query, model_response, and score, with the score being the logits tensor from the reward model. I have tested this branch and the changes, and it looks normal and functional.

For example, let's say the custom reward function is based on the count of a specific word, like "good":

def reward_function(texts):
    rewards = [text.count("good") for text in texts]
    return rewards

The printed output is then just the count of the word "good" in the text, and it looks normal since it's in the same format.

But is there more to it, theoretically?

Comment on lines 725 to 730

     _, score, _ = get_reward(
-        self.reward_model, postprocessed_query_response, processing_class.pad_token_id, context_length
+        self.reward_model,
+        processing_class,
+        postprocessed_query_response,
+        processing_class.pad_token_id,
+        context_length,
Member

Can we move the if isinstance(model, torch.nn.Module): here? It would allow us not to introduce a breaking change in get_reward.

Collaborator Author

You need to clarify what you mean.

Member
@qgallouedec Jan 9, 2025

Sorry, it wasn't clear:

something like this instead:

if isinstance(model, torch.nn.Module):
    full_value, _, _ = get_reward(
        unwrapped_value_model, query_response, processing_class.pad_token_id, context_length
    )
else:
    full_value = ...

This way we don't introduce a breaking change in get_reward.
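
A self-contained sketch of that pattern applied to the reward path (assumptions: get_reward keeps its original signature in trl.trainer.utils, and the helper name plus the decode-then-score callable path are illustrative, not the PR's exact code):

import torch
from trl.trainer.utils import get_reward

def compute_score(reward_model_or_fn, query_responses, processing_class, context_length):
    # Dispatch in the trainer instead of changing get_reward itself.
    if isinstance(reward_model_or_fn, torch.nn.Module):
        # Reward-model path: unchanged call into the existing utility.
        _, score, _ = get_reward(
            reward_model_or_fn, query_responses, processing_class.pad_token_id, context_length
        )
    else:
        # Custom-callable path: decode token ids to text and score each response.
        texts = processing_class.batch_decode(query_responses)
        score = torch.tensor(
            reward_model_or_fn(texts), dtype=torch.float, device=query_responses.device
        )
    return score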

Collaborator Author

I mean I changed get_reward to work either way with both a callable and an nn.Module.
So you want to add if isinstance(model, torch.nn.Module) there and keep get_reward as it is without change?

@qgallouedec
Member

Can you add a test as well?

@August-murr
Collaborator Author

Can you add a test as well?

I'll take that as a yes.

Yes, I will add the tests and docs later, and maybe a blog post or something to show how it works, if I don't run out of resources.

@Superskyyy
Contributor

Thanks for the contribution! We look forward to this flexibility being added!

@August-murr
Collaborator Author

@qgallouedec
I just pushed 11484e7, which does the same thing without modifying get_reward.
Is it better?

"""
This function ensures that the custom reward function produces the correct output structure for integration with the trainer script.
"""
texts = processor.batch_decode(query_responses)
Member

Should we skip special tokens here?
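
For reference, a minimal version of that change, assuming processor is a Hugging Face tokenizer / processing class whose batch_decode accepts skip_special_tokens:

texts = processor.batch_decode(query_responses, skip_special_tokens=True)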

Collaborator Author
@August-murr Jan 18, 2025

Yes, good point.
Another concern: currently, the postprocessed_query_response includes both the prompt and the generated response, which are then scored by the reward model (or custom reward function). Should the reward model only score the generated response, or should it score both the prompt and the response together?

Member

should it score both the prompt and the response together

Yes, and the reason is quite intuitive. For example, consider a prompt like 2+3= with the generated response being 6. If the reward model only has access to the generated response (6), how could it determine whether the calculation is correct without knowing the prompt?
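
As a toy illustration (a hypothetical reward function, not part of this PR) that can only judge correctness because it receives the full prompt + response text:

def arithmetic_reward(texts: list) -> list:
    # Each text is the full "prompt + response", e.g. "2+3=5".
    rewards = []
    for text in texts:
        try:
            prompt, response = text.split("=", 1)
            a, b = (int(x) for x in prompt.split("+"))
            rewards.append(1.0 if response.strip() == str(a + b) else 0.0)
        except ValueError:
            rewards.append(0.0)  # malformed or non-arithmetic text gets no reward
    return rewards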

@qgallouedec
Member

Yes it looks better imo!

Comment on lines +463 to +466
unwrapped_value_model,
query_response,
processing_class.pad_token_id,
context_length,
Member

Suggested change
-    unwrapped_value_model,
-    query_response,
-    processing_class.pad_token_id,
-    context_length,
+    unwrapped_value_model, query_response, processing_class.pad_token_id, context_length

@qgallouedec
Member

You could also add a test with, e.g., this reward func:

def reward_func(texts):
    return [float(len(text)) for text in texts]

@Superskyyy
Contributor

Superskyyy commented Jan 19, 2025

Hi, does this same change work for the RLOO trainer? (i.e., can the same change be applied to the RLOO trainer?)

@August-murr
Collaborator Author

Hi, does this same change work for the RLOO trainer? (i.e., can the same change be applied to the RLOO trainer?)

We will apply the same approach to RLOO after conducting some tests on this.

@August-murr
Collaborator Author

@qgallouedec
When using a custom reward function, what will be the role of the value_model? Do I need to modify anything related to the value_model?

Correct me if I'm wrong, but the score should represent the reward given by the custom reward function to the text generated by the model.
Then what will full_value be when using custom reward functions?

full_value, _, _ = get_reward(
    unwrapped_value_model, query_response, processing_class.pad_token_id, context_length
)

@Superskyyy
Contributor

I tested the branch locally and it seems to work fine.

@August-murr
Collaborator Author

I tested the branch locally and it seems to work fine.

wut?🤔

Could you share the code you're using that works for you?

I'm currently trying to work out how the value model would be replaced or used, what full_value should be, and how it should be calculated for the trainer to function correctly.
It seems like it currently "shouldn't" work.

@qgallouedec
Member

Do I need to modify anything related to the value_model?

No, I don't think you need to modify anything related to the value function here.

Then what will full_value be when using custom reward functions?

In fact, it's the same with any reward function. The value model is trained to estimate the value of a state (= token), i.e., the expectation of the discounted future rewards. It seems to me that the current implementation is sufficient.
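
For reference, the standard quantity being described, where γ is the discount factor and r_{t+k} are the per-step rewards (general PPO background, not something introduced by this PR):

$$V(s_t) = \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \,\Big|\, s_t\right]$$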

@August-murr
Collaborator Author

August-murr commented Jan 21, 2025

Do I need to modify anything related to the value_model?

No, I don't think you need to modify anything related to the value function here.

Then what will full_value be when using custom reward functions?

In fact, it's the same with any reward function. The value model is trained to estimate the value of a state (= token), i.e., the expectation of the discounted future rewards. It seems to me that the current implementation is sufficient.

@qgallouedec
When using it, what should I set as the value_model? It says [optional]:

value_model: Optional[nn.Module] = None,

but it breaks at PolicyAndValueWrapper if I don't provide a value_model. I also can't use the reward function as the value_model.

@Superskyyy
Contributor

Superskyyy commented Jan 21, 2025

Do I need to modify anything related to the value_model?

No, I don't think you need to modify anything related to the value function here.

Then what will full_value be when using custom reward functions?

In fact, it's the same with any reward function. The value model is trained to estimate the value of a state (= token), i.e., the expectation of the discounted future rewards. It seems to me that the current implementation is sufficient.

@qgallouedec When using it, what should I set as the value_model? It says [optional]:

value_model: Optional[nn.Module] = None,

but it breaks at PolicyAndValueWrapper if I don't provide a value_model. I also can't use the reward function as the value_model.

Oh, I get what you mean. @August-murr @qgallouedec I think the value model is typically initialized the same way as the trained reward model, but when we don't have a reward model, what do we specify as the value model?


@Superskyyy
Contributor

In the meantime, I implemented the same for RLOO and can confirm it works (apparently).

@musab-mk

musab-mk commented Feb 4, 2025

What @Superskyyy said is a real concern. We still need a value model, and it probably has to be trained just like a reward model?

Would using the SFT pre-trained base as the value model work? Has anyone experimented with this?

@Superskyyy
Contributor

What @Superskyyy said is a real concern. We still need a value model, and it probably has to be trained just like a reward model?

Would using the SFT pre-trained base as the value model work? Has anyone experimented with this?

In theory it should. There are two ways of initializing a value model: either from the policy or from the trained reward model.
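
A minimal sketch of the first option, initializing the value model from the SFT/policy base (the checkpoint path is a placeholder; this is generic Transformers usage, not code from this PR):

from transformers import AutoModelForSequenceClassification

# Start the value model from the SFT base rather than from a trained reward model;
# num_labels=1 gives a single scalar head, as a value function needs.
value_model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/sft-base-checkpoint",  # placeholder
    num_labels=1,
)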

Successfully merging this pull request may close these issues.

[question] best way to have my own reward model which is backed by rules