[BE][DTensor] fix DTensor equal op #99014
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99014
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 Failures as of commit f259a47: NEW FAILURES - the following jobs have failed.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
See inline comments. Can you also add a description to the PR summary of what was wrong before and what the fix is?
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here.
## What problem does this PR solve?

#97170 fixed the `equal` operator's return type (old: `Tensor`, now: `bool`) by giving it the correct sharding propagation, consistent with the `aten::equal` op. However, the fix is only correct at the local-result level: `equal` returns `True` if the local copy of DTensor A equals the local copy of DTensor B. That is not the correct semantics of `equal`, which should return `True` only if every local copy of A equals the corresponding local copy of B.

## What does this PR do?

1. For non-participating ranks, if the return type is scalar, `local_results` is set to `None`, so the default value is a result reduced over participating ranks only.
2. For all ranks, if the return type is scalar and the `op_call` is `aten::equal` (the only op that returns a scalar value and needs communication), all-gather the `local_results` within the default process group and reduce them with `operator.and_`. The reduced value becomes the new `local_result`.

## Result/Impact

For non-participating ranks, when the return type is scalar:

1. If the op is `aten::equal`, the return value is the same as on all other ranks.
2. Otherwise, the return value is `None`. Before this PR, this path raised `NotImplementedError`, but it had not been tested.

For participating ranks, when the return type is scalar:

1. If the op is `aten::equal`, the return value is the equality of the two DTensor operands: `True` if all copies are equal, `False` otherwise.
2. Otherwise, it is simply the local computation result.
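The reduction step described above can be sketched in a minimal, single-process form. This is not the actual DTensor dispatch code: `reduce_equal_results` is a hypothetical helper, and the input list stands in for whatever an all-gather over the default process group would produce, with `None` marking a non-participating rank.

```python
import operator
from functools import reduce

def reduce_equal_results(gathered_results):
    """Hypothetical sketch of the fix: after all-gathering each rank's
    local ``aten::equal`` result, drop entries from non-participating
    ranks (reported as None) and AND the remaining booleans together."""
    participating = [r for r in gathered_results if r is not None]
    # All participating ranks must agree that their local shards are equal.
    return reduce(operator.and_, participating, True)

# Rank 2 is not in the mesh, so its gathered entry is None.
print(reduce_equal_results([True, True, None, True]))   # True
print(reduce_equal_results([True, False, None, True]))  # False
```

Every rank (participating or not) applies the same reduction to the gathered list, which is why non-participating ranks now return the same value as everyone else for `aten::equal`.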
Successfully rebased.
lgtm, thanks for the fix! have one minor comment about lint
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here.
Tried to rebase and push PR #99014, but it was already up to date.
@pytorchmergebot merge -f "CI failure not related to PR"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.