
[BE][DTensor] fix DTensor equal op #99014

Closed
wants to merge 8 commits

Conversation

XilunWu
Contributor

@XilunWu XilunWu commented Apr 13, 2023

What problem does this PR solve?

#97170 fixed the equal operator's return type (old: Tensor, now: bool) by giving it the correct sharding propagation, which is consistent with the aten::equal op. However, the result was only correct at the local level:

  • the equal op returned True if the local copy of DTensor A equals the local copy of DTensor B on the calling rank

This is not the correct semantics of equal, which should return True only if every local copy of A equals the corresponding local copy of B.

What does this PR do?

  1. For non-participating ranks, if the return type is a scalar, local_results is set to None, so the default value on these ranks is filled from the reduced result of the participating ranks only.
  2. For all ranks, if the return type is a scalar and the op_call is aten::equal (currently the only op that returns a scalar and requires communication), all-gather the local_results within the default process group and reduce them with operator.and_; the reduced value becomes the new local_result. A minimal sketch of this reduction follows the list.
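
The core of the change in `torch/distributed/_tensor/dispatch.py` can be sketched as below. This is an illustrative reconstruction of the approach described above, not the exact merged code; the function name `_maybe_reduce_scalar_result` and the `is_participating` flag are invented for the example.

```python
import functools
import operator

import torch
import torch.distributed as dist


def _maybe_reduce_scalar_result(op_call, local_results, is_participating: bool):
    # Non-participating ranks contribute None, so only participating ranks'
    # results feed into the reduction below.
    if not is_participating:
        local_results = None

    # aten::equal is currently the only scalar-returning op that needs
    # communication: gather every rank's local bool over the default process
    # group and AND them together so all ranks return the same global answer.
    if op_call == torch.ops.aten.equal.default:
        gathered = [None] * dist.get_world_size()
        dist.all_gather_object(gathered, local_results)
        local_results = functools.reduce(
            operator.and_, (r for r in gathered if r is not None), True
        )
    return local_results
```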

Result/Impact

For non-participating ranks, when the return type is a scalar:

  1. If the op is aten::equal, the return value is the same as on all other ranks.
  2. If the op is not aten::equal, the return value is None. Before this PR, this path would raise NotImplementedError, but it had never been exercised by tests.

For participating ranks, when the return type is a scalar:

  1. If the op is aten::equal, the return value is the equality of the two DTensor operands: True if all local copies are equal, False otherwise (see the usage sketch below).
  2. If the op is not aten::equal, the return value is simply the local computation result.
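
As a concrete illustration of the intended semantics, here is a hypothetical two-rank run (launched with `torchrun --nproc_per_node=2`); it is not a test from this PR, and the tensor values are made up:

```python
# torchrun --nproc_per_node=2 equal_demo.py
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

dist.init_process_group("gloo")
mesh = DeviceMesh("cpu", list(range(dist.get_world_size())))

x = torch.arange(8, dtype=torch.float32)
y = x.clone()
y[-1] += 1.0  # only the last rank's shard differs

a = distribute_tensor(x, mesh, [Shard(0)])
b = distribute_tensor(x.clone(), mesh, [Shard(0)])
c = distribute_tensor(y, mesh, [Shard(0)])

print(torch.equal(a, b))  # True on every rank: all local shards match
# Before this PR, rank 0 would see True here because its own shard still
# matches; with this PR, every rank sees the globally reduced answer: False.
print(torch.equal(a, c))
```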

Stack from ghstack (oldest at bottom):

@pytorch-bot

pytorch-bot bot commented Apr 13, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99014

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 Failures

As of commit f259a47:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

XilunWu added a commit that referenced this pull request Apr 13, 2023
ghstack-source-id: 2ce007350bf07010b33a78c535c513024220a5e2
Pull Request resolved: #99014
@XilunWu XilunWu marked this pull request as draft April 13, 2023 17:54
XilunWu added a commit that referenced this pull request Apr 13, 2023
ghstack-source-id: b5445b8b47af08c6d68506b9e43dadc1d1a28f60
Pull Request resolved: #99014
@XilunWu XilunWu marked this pull request as ready for review April 13, 2023 21:06
Contributor

@wanchaol wanchaol left a comment


See comments inlined, can you add some description to the PR's summary about what was wrong before and what the fix is?

3 inline review comments on torch/distributed/_tensor/dispatch.py (outdated, resolved)
XilunWu added a commit that referenced this pull request Apr 13, 2023
ghstack-source-id: 65ee5fb45a49bce3ed1ffb70c9ab23fecb926bad
Pull Request resolved: #99014
@XilunWu XilunWu requested a review from wanchaol April 14, 2023 17:11
@XilunWu
Contributor Author

XilunWu commented Apr 17, 2023

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased gh/XilunWu/26/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/99014)

Contributor

@wanchaol wanchaol left a comment


lgtm, thanks for the fix! have one minor comment about lint

1 inline review comment on torch/distributed/_tensor/dispatch.py (outdated, resolved)
@XilunWu
Contributor Author

XilunWu commented Apr 17, 2023

@pytorchbot rebase

@XilunWu XilunWu added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 17, 2023
@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Tried to rebase and push PR #99014, but it was already up to date

XilunWu added a commit that referenced this pull request Apr 17, 2023
ghstack-source-id: eced00444301025c602f49362f542fbac2439d79
Pull Request resolved: #99014
@XilunWu
Contributor Author

XilunWu commented Apr 18, 2023

@pytorchmergebot merge -f "CI failure not related to PR"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@XilunWu XilunWu deleted the gh/XilunWu/26/head branch April 18, 2023 03:23
Labels: ciflow/trunk (Trigger trunk jobs on your pull request), Merged, release notes: distributed (dtensor)