custom reward function support for ppo trainer #2540
base: main
Conversation
trl/trainer/utils.py
Outdated
@@ -1049,14 +1049,20 @@ def first_true_indices(bools: torch.Tensor, dtype=torch.long):
def get_reward(
This is where the primary change is: modifying the `get_reward` function to work with both an `nn.Module` and a `Callable`.
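For readers following along, here is a minimal sketch of the kind of dispatch being described. The signature (taking the processing class) and the callable contract (a list of decoded texts in, one float per text out) are assumptions based on this thread, not the PR's exact diff:

```python
import torch


def get_reward(model, processing_class, query_responses, pad_token_id, context_length):
    """Hedged sketch of the dispatch described above, not the PR's exact code."""
    if isinstance(model, torch.nn.Module):
        # The original nn.Module scoring path would stay as it is today (omitted in this sketch).
        raise NotImplementedError("unchanged reward-model path")
    # Callable path: decode token ids to text and let the custom function assign scores.
    # `pad_token_id` and `context_length` are unused in this branch of the sketch.
    texts = processing_class.batch_decode(query_responses, skip_special_tokens=True)
    scores = torch.tensor(model(texts), dtype=torch.float, device=query_responses.device)
    # Return a three-element tuple so call sites like `_, score, _ = get_reward(...)` keep working.
    return None, scores, None
```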
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Question: does this theoretically work? I'm asking because I haven't read the PPO papers. For example, let's say the custom reward function is based on the count of a specific word, like "good":

def reward_function(texts):
    rewards = [text.count("good") for text in texts]
    return rewards

When the PPO trainer is training, the printed output is just the count of the word "good" in each text, and it looks normal since it's in the same format. But is there more to it, theoretically?
trl/trainer/ppo_trainer.py
Outdated
    _, score, _ = get_reward(
-       self.reward_model, postprocessed_query_response, processing_class.pad_token_id, context_length
+       self.reward_model,
+       processing_class,
+       postprocessed_query_response,
+       processing_class.pad_token_id,
+       context_length,
Can we move the `if isinstance(model, torch.nn.Module):` check here? That would allow us not to introduce a breaking change in `get_reward`.
You need to clarify what you mean.
Sorry, it wasn't clear. Something like this instead:

if isinstance(model, torch.nn.Module):
    full_value, _, _ = get_reward(
        unwrapped_value_model, query_response, processing_class.pad_token_id, context_length
    )
else:
    full_value = ...

This way we don't introduce a breaking change in `get_reward`.
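Applied to the reward-scoring call from the earlier diff, that dispatch might look like the following sketch. The decode step, the `skip_special_tokens=True` flag, and the tensor conversion are assumptions about how the callable branch could be filled in; `get_reward` itself stays unchanged, and `torch` plus the trainer-scope variables are assumed to be available:

```python
import torch  # already imported at module level inside the trainer

if isinstance(self.reward_model, torch.nn.Module):
    # Existing behavior: score with the reward model via the unchanged get_reward helper.
    _, score, _ = get_reward(
        self.reward_model, postprocessed_query_response, processing_class.pad_token_id, context_length
    )
else:
    # Custom reward function: decode the sampled sequences and score the raw texts.
    texts = processing_class.batch_decode(postprocessed_query_response, skip_special_tokens=True)
    score = torch.tensor(
        self.reward_model(texts), dtype=torch.float, device=postprocessed_query_response.device
    )
```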
I mean I changed `get_reward` to work either way, with both a callable and an `nn.Module`. So you want to add the `if isinstance(model, torch.nn.Module)` check there and keep `get_reward` as it is, without changes?
Can you add a test as well?
I'll take that as a yes. I will add the test and the docs later, and maybe a blog post or something to show how it works, if I don't run out of resources.
Thanks for the contribution! We look forward to having this flexibility added!
@qgallouedec
""" | ||
This function ensures that the custom reward function produces the correct output structure for integration with the trainer script. | ||
""" | ||
texts = processor.batch_decode(query_responses) |
Should we skip special tokens here?
Yes, good point.

Another concern: currently, `postprocessed_query_response` includes both the prompt and the generated response, which are then scored by the reward model (or custom reward function). Should the reward model only score the generated response, or should it score both the prompt and the response together?
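For reference, a one-line sketch of the decode call with special tokens skipped, assuming the processing class exposes `batch_decode` as in the diff above:

```python
# Skip pad/eos and other special tokens so the custom reward function only sees the actual text.
texts = processor.batch_decode(query_responses, skip_special_tokens=True)
```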
> should it score both the prompt and the response together?

Yes, and the reason is quite intuitive. For example, consider a prompt like `2+3=` with the generated response being `6`. If the reward model only has access to the generated response (`6`), how could it determine whether the calculation is correct without knowing the prompt?
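As a toy illustration of that point, a custom reward function for this kind of arithmetic task has to see the full decoded prompt + response to judge correctness. This is a hedged sketch; the exact decoded format (e.g. `"2+3=5"`) is an assumption:

```python
def arithmetic_reward(texts):
    # Each text is assumed to be the decoded prompt + response, e.g. "2+3=5".
    rewards = []
    for text in texts:
        try:
            expression, answer = text.split("=", 1)
            a, b = (int(part) for part in expression.split("+"))
            rewards.append(1.0 if int(answer.strip()) == a + b else 0.0)
        except ValueError:
            # A bare response like "6" (no prompt) or malformed text cannot be judged.
            rewards.append(0.0)
    return rewards
```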
Yes, it looks better imo!
unwrapped_value_model,
query_response,
processing_class.pad_token_id,
context_length,
Suggested change:
-    unwrapped_value_model,
-    query_response,
-    processing_class.pad_token_id,
-    context_length,
+    unwrapped_value_model, query_response, processing_class.pad_token_id, context_length
You could also add a test with, e.g., this reward function:

def reward_func(text):
    return float(len(text))
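A hedged sketch of what a unit test for that contract could look like. How the trainer consumes the callable is left out since it depends on the final implementation; only the list-of-texts-in, list-of-floats-out interface proposed in this PR is assumed:

```python
import unittest


class TestCustomRewardFunction(unittest.TestCase):
    def test_reward_func_returns_one_float_per_text(self):
        # Batched variant of the length-based reward suggested above.
        def reward_func(texts):
            return [float(len(text)) for text in texts]

        texts = ["short", "a somewhat longer generated response"]
        rewards = reward_func(texts)
        self.assertEqual(len(rewards), len(texts))
        self.assertTrue(all(isinstance(reward, float) for reward in rewards))
        self.assertEqual(rewards[0], float(len("short")))
```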
Hi, does this same change work for the RLOO trainer? (i.e., could the same change be applied to the RLOO trainer?)
We will apply the same approach to RLOO after conducting some tests on this.
@qgallouedec correct me if I'm wrong, but the code at trl/trainer/ppo_trainer.py, lines 459 to 461 (commit 88514d5) …
I tested the branch locally and it seems to work fine.
wut? 🤔 Could you share the code you're using that works for you? I'm currently considering how the value model would be replaced or used, and what the …
No, I don't think you need to modify anything related to the value function here.
In fact, it's the same with any reward function: the value model is trained to estimate the value of a state (= token), i.e. the expectation of the discounted future rewards. It seems to me that the current implementation is sufficient.
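For reference, the standard definition being paraphrased here: the value of a state $s_t$ under policy $\pi$ is the expected discounted sum of future rewards,

$$V^{\pi}(s_t) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t\right]$$

so the value head is trained against whatever rewards the trainer receives, whether they come from a reward model or a custom reward function.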
@qgallouedec (see trl/trainer/ppo_trainer.py, line 119, commit d9f0568): it breaks at `PolicyAndValueWrapper` if I don't provide a `value_model`. I also can't use the reward function as the `value_model`.
Oh, I get what you mean. @August-murr @qgallouedec I think the value model is typically initialized from the trained reward model, but when we don't have a reward model, what do we specify as the value model?
In the meantime I implemented the same for RLOO and can confirm it (apparently) works.
What @Superskyyy said is a real concern. We still need a value model, and it probably has to be trained just like a reward model? Would using the SFT pre-trained base as the value model work? Has anyone experimented with this?
In theory it should. There are two ways of initializing a value model: either from the policy or from the trained reward model.
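A minimal sketch of those two options. The checkpoint names are hypothetical, and the use of `AutoModelForSequenceClassification` with `num_labels=1` mirrors how TRL's PPO examples typically build a scalar-head value model (an assumption, not something prescribed by this PR):

```python
from transformers import AutoModelForSequenceClassification

# Option 1: initialize the value model from the SFT/policy checkpoint with a fresh scalar head.
value_model = AutoModelForSequenceClassification.from_pretrained("my-org/my-sft-model", num_labels=1)

# Option 2: if a trained reward model exists, start the value model from its weights instead.
# value_model = AutoModelForSequenceClassification.from_pretrained("my-org/my-reward-model", num_labels=1)
```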
What does this PR do?
Fixes #2518
Adding support for a custom reward function for the PPO trainer.
How it works
Write a custom function that takes a list of texts as input, representing a batch of responses, and outputs a list of scores.
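For example, a minimal sketch of a reward function matching that interface; how the callable is passed to the trainer follows this PR's proposal and may change:

```python
def word_count_reward(texts):
    # One float score per decoded query+response text in the batch.
    return [float(text.lower().count("good")) for text in texts]

# Assumed usage per this PR: pass the callable where the reward model normally goes,
# e.g. PPOTrainer(..., reward_model=word_count_reward, ...).
```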
I will add more documentation and explanations later after running several tests to make sure the implementation is functional.
Who can review?
@qgallouedec