ZeRO3, improved parameter all-gather operation #1188
It seems one of the errors is coming from this line. I think `allgather_params[param_idx]` in this case is just a Tensor and not a list of Tensors?
Reading more on the context here, sorry. I see you're trying to use `_all_gather_base`. In what version of pytorch was this introduced? The CI runs in question here are running with torch 1.8.2 LTS.
`_all_gather_base` is available on pytorch master, probably version 1.10+. That's why I have used a `try` clause to make the all-gather fall back to the `distributed.all_gather` call.
@jeffra @tjruwase
Hey, sorry for the late reply. I just confirmed this PR is able to work with pytorch-1.8.0.
I checked the log again at the failure, and it does indeed `raise RuntimeError` at line 935, but this exception is caught by the `except` clause, which creates a list of Tensors for the all-gather API.
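
For anyone following the thread, the fallback described above can be sketched roughly as below. This is only an illustration, not the code in this PR: the helper name `all_gather_flat`, its parameters, and the choice to catch both `AttributeError` and `RuntimeError` are assumptions. The `dist` argument is expected to be `torch.distributed` (passed in so the sketch stays importable without torch), and the tensors are assumed to be 1-D with equal shard sizes on every rank.

```python
def all_gather_flat(dist, output_flat, input_shard, world_size, group=None):
    """Gather equal-sized shards from all ranks into one flat buffer.

    Hypothetical helper for illustration; `dist` is assumed to be
    `torch.distributed`, and both tensors are assumed to be 1-D.
    """
    try:
        # Newer PyTorch builds: gather straight into the single flat tensor.
        dist._all_gather_base(output_flat, input_shard, group=group)
    except (AttributeError, RuntimeError):
        # Older builds: the call is missing or unsupported, so fall back to
        # all_gather, which needs a list of per-rank output tensors.
        # narrow() returns views, so results still land in output_flat.
        shard_size = input_shard.numel()
        output_list = [
            output_flat.narrow(0, rank * shard_size, shard_size)
            for rank in range(world_size)
        ]
        dist.all_gather(output_list, input_shard, group=group)
    return output_flat
```

In the `try` branch the output is a single Tensor, while the `except` branch builds the list of Tensors that `all_gather` expects, which matches the behaviour seen in the CI log.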