[Feature] RLHF Reward Model #1315
Conversation
Shouldn't we rebase on #1309 to make sure tests are green?
Co-authored-by: Alessandro Pietro Bardelli <apbard@users.noreply.github.com>
Left a couple of comments, but LGTM! Thanks!
Co-authored-by: Alessandro Pietro Bardelli <apbard@users.noreply.github.com>
Great work, thanks a lot for this.
A couple of comments regarding efficiency: I don't think we need to tackle this now, but let's make sure we keep track somewhere that there could be room for improvement.
torchrl/modules/models/rlhf.py
Outdated
""" Returns a tuple (rewards, end_scores) where `rewards` contains all rewards computed at each timestep, `end_scores` contains the reward computed at the last-non-padding token | ||
""" |
""" Returns a tuple (rewards, end_scores) where `rewards` contains all rewards computed at each timestep, `end_scores` contains the reward computed at the last-non-padding token | |
""" | |
"""Computes the rewards associated with some encoded sequence of tokens. | |
Returns a tuple (rewards, end_scores) where `rewards` contains all rewards computed at each timestep, `end_scores` contains the reward computed at the last-non-padding token | |
""" |
torchrl/modules/models/rlhf.py
Outdated
hidden_states = outputs[0]
rewards = self.lm_head(hidden_states).squeeze(-1)
end_scores = []
bs = input_ids.shape[0]

for i in range(bs):
    pad_inds = (input_ids[i] == self.PAD_ID).nonzero()
    first_pad_ind = (
        pad_inds[0].item() if len(pad_inds) > 0 else input_ids.shape[1]
    )
    end_scores.append(rewards[i, first_pad_ind - 1])
My intuition is that there is a better (more efficient) way of coding that loop.
Let's move it to a private method so that we can easily refactor this piece of code and compare the results.
For instance, assuming that padding is only on the right:
>>> import torch
>>> z = torch.arange(12).view(3, 4)
>>> z[0, 2:] = 100
>>> z[1, 3:] = 100
>>> z[2, 1:] = 100
>>> mask = z == 100
>>> mask = torch.cat([mask, torch.ones_like(mask[..., :1])], -1) # make sure that there is one True on each row
>>> first_pad = mask[..., :-1] ^ mask[..., 1:]
>>> first_pad = first_pad.nonzero()
>>> first_pad = first_pad[:, -1]
Thanks! I've factored the slow code out into a private method as you suggested. Did I understand right that we should land as is and follow up with speed improvements?
for i in range(bs):
    # Check if there is any padding otherwise take length of sequence
    c_inds = (chosen_ids[i] == pad_token_id).nonzero()
In general, nonzero is expensive and shouldn't be called too often.
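As an illustrative sketch (not code from this PR), the first padding index can be found for a whole batch without nonzero by taking argmax over the padding mask, since recent PyTorch documents argmax as returning the first maximal index; the same trick would apply to the divergence index flagged below. The tensor names and the pad id of 0 are hypothetical:
>>> import torch
>>> input_ids = torch.tensor([[5, 7, 0, 0], [3, 4, 6, 0], [9, 8, 7, 6]])  # toy batch, pad_token_id assumed to be 0
>>> mask = input_ids == 0
>>> mask = torch.cat([mask, mask.new_ones(mask.shape[0], 1)], -1)  # ensure one True per row for unpadded sequences
>>> first_pad_ind = mask.int().argmax(-1)  # index of the first True per row, no nonzero call
>>> first_pad_ind
tensor([2, 3, 4])
>>> rewards = torch.randn(input_ids.shape[0], input_ids.shape[1])
>>> end_scores = rewards[torch.arange(input_ids.shape[0]), first_pad_ind - 1]  # reward at the last non-padding token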
end_ind = max(c_ind, r_ind)

# Retrieve first index where trajectories diverge
divergence_ind = (chosen_ids[i] != rejected_ids[i]).nonzero()[0]
Ditto, nonzero is expensive.
torchrl/modules/models/rlhf.py
Outdated
The loss is computed as loss = -log_sigmoid(chosen_reward - rejected_reward).
This loss is small when the reward model favours the chosen data and large if
the model favours the rejected data.
Note: the loss is computed excluding the common "prefix" subsequence to effectively disregard contribution of the original prompt.
Perhaps an example of a call to this?
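A minimal sketch of such a call, using only the formula quoted in the docstring above; the reward values are made up and the mean reduction is an assumption:
>>> import torch
>>> import torch.nn.functional as F
>>> chosen_reward = torch.tensor([1.2, 0.3, 2.0])    # hypothetical rewards for the chosen responses
>>> rejected_reward = torch.tensor([0.4, 0.9, 1.1])  # hypothetical rewards for the rejected responses
>>> loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()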
# Conflicts: # test/test_rlhf.py
Co-authored-by: Vincent Moens <vincentmoens@gmail.com>
This PR builds on top of #1309, adding the GPT2RewardModel class and tests. Changes from that PR are needed so that we can use the dataloaders in the tests.
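For reference, a rough usage sketch of the new class; the import path comes from the diff header above, but the constructor argument and the (input_ids, attention_mask) forward signature are assumptions inferred from the excerpts, not a verified API:
>>> from transformers import GPT2Tokenizer
>>> from torchrl.modules.models.rlhf import GPT2RewardModel  # path taken from the diff header
>>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
>>> tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
>>> reward_model = GPT2RewardModel("gpt2")  # assumed to wrap a pretrained GPT-2 backbone
>>> batch = tokenizer(["a prompt followed by a response"], padding="max_length", max_length=32, return_tensors="pt")
>>> rewards, end_scores = reward_model(batch["input_ids"], batch["attention_mask"])  # per the docstring: per-timestep rewards and last-non-padding-token scores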