
[core] fix DP issue #222

Merged
merged 5 commits into from
Mar 16, 2023
Conversation

younesbelkada (Contributor)

What does this PR do?

The mini-batching PR introduced a small bug that caused DP (data-parallel) training with accelerate to crash: we tried to all-reduce tensors that did not have the same shape. The mismatched tensors were the logprobs tensors, which derive from the model inputs.
The fix adds a check that correctly pads the model inputs and the corresponding attention masks, filling them with the correct values using accelerator.pad_across_processes.
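The idea can be sketched in plain Python. This is a hypothetical helper, not the actual TRL code: in the real fix, `accelerator.pad_across_processes` pads tensors to a common length across data-parallel workers before the all-reduce; here we mimic the same right-padding logic for lists on a single machine.

```python
def pad_to_max_len(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one,
    so all sequences share the same shape (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

# Model inputs are padded with the tokenizer's pad token id;
# attention masks are padded with 0 so padded positions are ignored.
input_ids = pad_to_max_len([[5, 6], [7, 8, 9]], pad_value=0)
attention_mask = pad_to_max_len([[1, 1], [1, 1, 1]], pad_value=0)
```

In the multi-process case, `pad_across_processes` additionally takes the maximum length over all workers, not just the local batch, which is what makes the subsequent all-reduce shape-safe.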

cc @lvwerra

run for gpt2 sentiment with DP=2: https://wandb.ai/distill-bloom/trl/runs/p7pzc7yk?workspace=user-younesbelkada
run for t5 sentiment with DP=2: https://wandb.ai/distill-bloom/trl/runs/nb9as16x?workspace=user-younesbelkada

@younesbelkada younesbelkada requested a review from lvwerra March 15, 2023 17:03
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Mar 15, 2023

The documentation is not available anymore as the PR was closed or merged.

Member

@lvwerra lvwerra left a comment

Left two comments, generally looks good.

Review comments on trl/trainer/ppo_trainer.py (outdated, resolved)
@younesbelkada younesbelkada requested a review from lvwerra March 15, 2023 17:17
Review comment on trl/trainer/ppo_trainer.py (outdated, resolved)
@younesbelkada (Contributor, Author)

A CI test (test_ppo_step_with_no_ref_sgd_lr_scheduler) is failing, but it is completely unrelated to this PR. I thought it was an issue with PyTorch 2.0, but I still cannot reproduce it locally. Will merge as is and investigate later.

@younesbelkada younesbelkada merged commit 7940683 into main Mar 16, 2023
@younesbelkada younesbelkada deleted the fix-dp branch March 16, 2023 07:43
This was referenced Mar 16, 2023
3 participants