💾 Reduce memory peak in GRPO by adding `max_prompt_length` and loop usage in logp computation #2598

qgallouedec · 2025-01-21T13:48:56Z

What does this PR do?

from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Load the dataset
dataset = load_dataset("trl-lib/tldr", split="train")

training_args = GRPOConfig(
    output_dir="Qwen2-0.5B-GRPO",
    learning_rate=1e-5,
    logging_steps=2,
    gradient_accumulation_steps=8,
    max_completion_length=32,
    num_generations=8,
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_model="weqweasdas/RM-Gemma-2B",
    args=training_args,
    train_dataset=dataset,
    peft_config=LoraConfig(task_type="CAUSAL_LM"),
)

trainer.train()

In the plots, the grey curve is the old implementation.

Not sure why the grad norm doesn't match perfectly; probably numerical noise.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2025-01-21T13:53:24Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

qgallouedec · 2025-01-21T13:54:38Z

trl/trainer/grpo_trainer.py

        def get_per_token_logps(model, input_ids):
-            logits = model(input_ids).logits
-            logits = torch.roll(logits, shifts=1, dims=1)  # Shape (B*G, L)
-            per_token_logps = torch.gather(logits.log_softmax(-1), dim=2, index=input_ids.unsqueeze(2)).squeeze(2)
-            return per_token_logps
+            logits = model(input_ids).logits  # (B, L, V)
+            logits = logits[:, :-1, :]  # (B, L-1, V), exclude the last logit: it corresponds to the next token pred
+            input_ids = input_ids[:, 1:]  # (B, L-1), exclude the first input ID since we don't have logits for it
+            # Compute the log probabilities for the input tokens. Use a loop to reduce memory peak.
+            per_token_logps = []
+            for logits_row, input_ids_row in zip(logits, input_ids):
+                log_probs = logits_row.log_softmax(dim=-1)
+                token_log_prob = torch.gather(log_probs, dim=1, index=input_ids_row.unsqueeze(1)).squeeze(1)
+                per_token_logps.append(token_log_prob)
+            return torch.stack(per_token_logps)

        per_token_logps = get_per_token_logps(model, prompt_completion_ids)
-        per_token_logps = per_token_logps[:, prompt_length:]  # get rid of the prompt
+        # Get rid of the prompt (-1 because of the shift done in get_per_token_logps)
+        per_token_logps = per_token_logps[:, prompt_length - 1 :]
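The diff above swaps one batched `log_softmax` for a per-sequence loop. Below is a minimal sketch of why the two give the same values, using random tensors in place of model logits. The toy sizes and the helper names `logps_batched` and `logps_rowwise` are made up for illustration and are not part of TRL.

```python
import torch


def logps_batched(logits, input_ids):
    # One log_softmax over the whole (B, L, V) tensor:
    # the temporary holds B * L * V elements at once.
    log_probs = logits.log_softmax(dim=-1)
    return torch.gather(log_probs, dim=2, index=input_ids.unsqueeze(2)).squeeze(2)


def logps_rowwise(logits, input_ids):
    # One log_softmax per sequence: each temporary holds only L * V elements.
    rows = []
    for logits_row, ids_row in zip(logits, input_ids):
        log_probs = logits_row.log_softmax(dim=-1)  # (L, V)
        rows.append(torch.gather(log_probs, dim=1, index=ids_row.unsqueeze(1)).squeeze(1))
    return torch.stack(rows)  # (B, L)


torch.manual_seed(0)
B, L, V = 4, 6, 50  # toy sizes, chosen for illustration only
logits = torch.randn(B, L, V)
input_ids = torch.randint(0, V, (B, L))

batched = logps_batched(logits, input_ids)
rowwise = logps_rowwise(logits, input_ids)
same = torch.allclose(batched, rowwise)
print(same)
print(B * L * V, "vs", L * V, "elements in the log-softmax temporary")
```

With the loop, the log-softmax temporary covers one sequence at a time. How much that lowers the real peak depends on autograd and the allocator, so this is an intuition rather than a measurement.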


import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = ["The quick brown fox jumps over the lazy dog."]
prompt_completion = ["The quick brown fox jumps over the lazy dog. Nice to meet you!"]
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
prompt_completion_ids = tokenizer(prompt_completion, return_tensors="pt").input_ids
prompt_length = prompt_ids.shape[1]

# Old one
def get_per_token_logps(model, input_ids):
    logits = model(input_ids).logits
    logits = torch.roll(logits, shifts=1, dims=1)  # Shape (B*G, L)
    per_token_logps = torch.gather(logits.log_softmax(-1), dim=2, index=input_ids.unsqueeze(2)).squeeze(2)
    return per_token_logps

per_token_logps1 = get_per_token_logps(model, prompt_completion_ids)
per_token_logps1 = per_token_logps1[:, prompt_length:]  # get rid of the prompt

# New one
def get_per_token_logps(model, input_ids):
    logits = model(input_ids).logits  # (B, L, V)
    logits = logits[:, :-1, :]  # (B, L-1, V), exclude the last logit: it corresponds to the next token pred
    input_ids = input_ids[:, 1:]  # (B, L-1), exclude the first input ID since we don't have logits for it
    # Compute the log probabilities for the input tokens. Use a loop to reduce memory peak.
    per_token_logps = []
    for logits_row, input_ids_row in zip(logits, input_ids):
        log_probs = logits_row.log_softmax(dim=-1)
        token_log_prob = torch.gather(log_probs, dim=1, index=input_ids_row.unsqueeze(1)).squeeze(1)
        per_token_logps.append(token_log_prob)
    return torch.stack(per_token_logps)

per_token_logps2 = get_per_token_logps(model, prompt_completion_ids)
# Get rid of the prompt (-1 because of the shift done in get_per_token_logps)
per_token_logps2 = per_token_logps2[:, prompt_length - 1 :]

print(torch.allclose(per_token_logps1, per_token_logps2))  # True

qgallouedec added 2 commits January 21, 2025 13:43

add max_prompt len to config

a59355c

truncate prompt and compute log probs line by line

8ef6f5f
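The two commits above add `max_prompt_length` to the config and truncate the prompt, but that part of the diff is not shown in this thread. The following is only a sketch of what such a truncation could look like, assuming the last `max_prompt_length` tokens of each prompt are the ones kept. The helper name `truncate_prompt` and the toy tensors are hypothetical.

```python
import torch


def truncate_prompt(input_ids, attention_mask, max_prompt_length=None):
    # Hypothetical helper, not TRL's API: keep only the last
    # `max_prompt_length` tokens of each prompt (assumed truncation side).
    if max_prompt_length is None:
        return input_ids, attention_mask
    return input_ids[:, -max_prompt_length:], attention_mask[:, -max_prompt_length:]


# Two toy "prompts" of 10 token IDs each.
input_ids = torch.arange(20).reshape(2, 10)
attention_mask = torch.ones_like(input_ids)

ids, mask = truncate_prompt(input_ids, attention_mask, max_prompt_length=4)
print(ids.shape, mask.shape)

# Without a limit, the inputs pass through unchanged.
full_ids, full_mask = truncate_prompt(input_ids, attention_mask)
print(full_ids.shape)
```

Capping the prompt length bounds the sequence length `L` that reaches the forward pass, which is one of the factors in the size of the logits tensor.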

qgallouedec commented Jan 21, 2025

qgallouedec marked this pull request as ready for review January 21, 2025 13:58

qgallouedec requested review from kashif, edbeeching, lewtun, plaguss and August-murr January 21, 2025 13:59

kashif approved these changes Jan 21, 2025

qgallouedec changed the title ~~Reduce memory peak in GRPO~~ 💾 Reduce memory peak in GRPO by adding max_prompt_length and loop usage in logp computation. Jan 21, 2025

qgallouedec changed the title ~~💾 Reduce memory peak in GRPO by adding max_prompt_length and loop usage in logp computation.~~ 💾 Reduce memory peak in GRPO by adding max_prompt_length and loop usage in logp computation Jan 21, 2025

qgallouedec merged commit b6a084c into main Jan 21, 2025
13 of 14 checks passed

qgallouedec deleted the reduce-mem-grpo branch January 21, 2025 14:12
