Fix MPMD detected error during training with TP #648

michaelbenayoun · 2024-07-03T08:43:27Z

What does this PR do?

This PR tries to fix the issue reported here aws-neuron/neuronx-distributed#24.

It seems to be linked to torch.autocast:

Before this PR, when autocast was supposed to be disabled, we would do: with torch.autocast(enabled=False):
Now we do: with contextlib.nullcontext():.

For some reason, the first approach led different cores to try to execute different graphs, which triggered the MPMD detected error.
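
To make the change concrete, here is a minimal sketch of the pattern, not the actual code from this PR. The helper name `autocast_context` and its parameters are hypothetical, and the `device_type="cpu"` default is only a placeholder (the right value on Neuron devices is an assumption left to the caller). The point is that the disabled path never constructs a `torch.autocast` object at all:

```python
import contextlib


def autocast_context(enabled: bool, device_type: str = "cpu", dtype=None):
    """Return the context manager to wrap a forward pass in.

    Hypothetical helper for illustration only; names and defaults are
    assumptions, not taken from the PR.
    """
    if not enabled:
        # Disabled: return a no-op context instead of
        # torch.autocast(..., enabled=False).
        return contextlib.nullcontext()

    # Imported lazily so the disabled path does not touch torch at all.
    import torch

    return torch.autocast(device_type=device_type, dtype=dtype)


# Usage sketch (model and inputs are placeholders):
#
#     with autocast_context(enabled=use_autocast):
#         outputs = model(**inputs)
```

With this shape, every worker that has autocast disabled enters the same no-op context, so nothing autocast-related is involved when the graph is traced.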

HuggingFaceDocBuilderDev · 2024-07-03T08:47:04Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

dacorvo

LGTM, thanks!

Remove old code

789727d

Fix MPMD error

5afecdb

michaelbenayoun marked this pull request as ready for review July 3, 2024 16:35

michaelbenayoun requested review from dacorvo and JingyaHuang July 3, 2024 16:35

michaelbenayoun added 3 commits July 4, 2024 10:34

Fix runner

c56deef

Edit workflow

f596197

Change dataset

839dd85

dacorvo approved these changes Jul 5, 2024

michaelbenayoun merged commit 281bad8 into main Jul 5, 2024
10 of 12 checks passed

michaelbenayoun deleted the fix_mpmd branch July 5, 2024 08:24
