
[core] fix DP issue #222

Merged
merged 5 commits into from
Mar 16, 2023
Conversation

younesbelkada (Contributor)

What does this PR do?

The mini-batching PR introduced a small bug that caused DP (data-parallel) training with accelerate to crash: we tried to all-reduce tensors that did not have the same shape. The mismatched tensors were the logprobs tensors, which derive from the model inputs.
The fix adds a check that correctly pads the model inputs and the corresponding attention masks, filling them with the correct values using accelerator.pad_across_processes.
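The idea can be sketched in plain Python. This is a hypothetical helper, not the actual TRL code: in the real fix, `accelerator.pad_across_processes` pads tensors to a common length across data-parallel workers before the all-reduce; here we mimic the same right-padding logic for lists on a single machine.

```python
def pad_to_max_len(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one,
    so all sequences share the same shape (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

# Model inputs are padded with the tokenizer's pad token id;
# attention masks are padded with 0 so padded positions are ignored.
input_ids = pad_to_max_len([[5, 6], [7, 8, 9]], pad_value=0)
attention_mask = pad_to_max_len([[1, 1], [1, 1, 1]], pad_value=0)
```

In the multi-process case, `pad_across_processes` additionally takes the maximum length over all workers, not just the local batch, which is what makes the subsequent all-reduce shape-safe.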

cc @lvwerra

run for gpt2 sentiment with DP=2: https://wandb.ai/distill-bloom/trl/runs/p7pzc7yk?workspace=user-younesbelkada
run for t5 sentiment with DP=2: https://wandb.ai/distill-bloom/trl/runs/nb9as16x?workspace=user-younesbelkada

@younesbelkada younesbelkada requested a review from lvwerra March 15, 2023 17:03
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Mar 15, 2023

The documentation is not available anymore as the PR was closed or merged.

Member

@lvwerra lvwerra left a comment

Left two comments, generally looks good.

Review comments on trl/trainer/ppo_trainer.py (outdated, resolved)
@younesbelkada younesbelkada requested a review from lvwerra March 15, 2023 17:17
Review comment on trl/trainer/ppo_trainer.py (outdated, resolved)
@younesbelkada (Contributor, Author)

A CI test (test_ppo_step_with_no_ref_sgd_lr_scheduler) is failing, but it is completely unrelated to this PR. I thought it was an issue with PyTorch 2.0, but I still cannot reproduce it locally. Will merge as is and investigate later.

@younesbelkada younesbelkada merged commit 7940683 into main Mar 16, 2023
@younesbelkada younesbelkada deleted the fix-dp branch March 16, 2023 07:43
This was referenced Mar 16, 2023
3 participants