
[BE][DTensor] fix DTensor equal op #99014

Closed
wants to merge 8 commits

Conversation

XilunWu
Contributor

@XilunWu XilunWu commented Apr 13, 2023

What problem does this PR solve?

#97170 fixed the equal operator's return type (old: Tensor, now: bool) by giving it the correct sharding propagation, which is consistent with the aten::equal op. However, the result was only correct at the local level:

  • the equal op returned True if the local copy of DTensor A equals the local copy of DTensor B on the calling rank

This is not the correct semantics of equal, which should return True only if every local copy of A equals the corresponding local copy of B.

What does this PR do?

  1. For non-participating ranks, if the return type is a scalar, local_results is set to None, so the default value on these ranks is filled from the reduced result of the participating ranks only.
  2. For all ranks, if the return type is a scalar and the op_call is aten::equal (currently the only op that returns a scalar and requires communication), all-gather the local_results within the default process group and reduce them with operator.and_; the reduced value becomes the new local_result. A minimal sketch of this reduction follows the list.
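
The core of the change in `torch/distributed/_tensor/dispatch.py` can be sketched as below. This is an illustrative reconstruction of the approach described above, not the exact merged code; the function name `_maybe_reduce_scalar_result` and the `is_participating` flag are invented for the example.

```python
import functools
import operator

import torch
import torch.distributed as dist


def _maybe_reduce_scalar_result(op_call, local_results, is_participating: bool):
    # Non-participating ranks contribute None, so only participating ranks'
    # results feed into the reduction below.
    if not is_participating:
        local_results = None

    # aten::equal is currently the only scalar-returning op that needs
    # communication: gather every rank's local bool over the default process
    # group and AND them together so all ranks return the same global answer.
    if op_call == torch.ops.aten.equal.default:
        gathered = [None] * dist.get_world_size()
        dist.all_gather_object(gathered, local_results)
        local_results = functools.reduce(
            operator.and_, (r for r in gathered if r is not None), True
        )
    return local_results
```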

Result/Impact

For non-participating ranks, when the return type is a scalar:

  1. If the op is aten::equal, the return value is the same as on all other ranks.
  2. If the op is not aten::equal, the return value is None. Before this PR, this path would raise NotImplementedError, but it had never been exercised by tests.

For participating ranks, when the return type is a scalar:

  1. If the op is aten::equal, the return value is the equality of the two DTensor operands: True if all local copies are equal, False otherwise (see the usage sketch below).
  2. If the op is not aten::equal, the return value is simply the local computation result.
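
As a concrete illustration of the intended semantics, here is a hypothetical two-rank run (launched with `torchrun --nproc_per_node=2`); it is not a test from this PR, and the tensor values are made up:

```python
# torchrun --nproc_per_node=2 equal_demo.py
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

dist.init_process_group("gloo")
mesh = DeviceMesh("cpu", list(range(dist.get_world_size())))

x = torch.arange(8, dtype=torch.float32)
y = x.clone()
y[-1] += 1.0  # only the last rank's shard differs

a = distribute_tensor(x, mesh, [Shard(0)])
b = distribute_tensor(x.clone(), mesh, [Shard(0)])
c = distribute_tensor(y, mesh, [Shard(0)])

print(torch.equal(a, b))  # True on every rank: all local shards match
# Before this PR, rank 0 would see True here because its own shard still
# matches; with this PR, every rank sees the globally reduced answer: False.
print(torch.equal(a, c))
```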

Stack from ghstack (oldest at bottom):

@pytorch-bot

pytorch-bot bot commented Apr 13, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99014

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 Failures

As of commit f259a47:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

XilunWu added a commit that referenced this pull request Apr 13, 2023
ghstack-source-id: 2ce007350bf07010b33a78c535c513024220a5e2
Pull Request resolved: #99014
@XilunWu XilunWu marked this pull request as draft April 13, 2023 17:54
XilunWu added a commit that referenced this pull request Apr 13, 2023
ghstack-source-id: b5445b8b47af08c6d68506b9e43dadc1d1a28f60
Pull Request resolved: #99014
@XilunWu XilunWu marked this pull request as ready for review April 13, 2023 21:06
Contributor

@wanchaol wanchaol left a comment


See comments inlined, can you add some description to the PR's summary about what was wrong before and what the fix is?

3 inline review comments on torch/distributed/_tensor/dispatch.py (outdated, resolved)
XilunWu added a commit that referenced this pull request Apr 13, 2023
ghstack-source-id: 65ee5fb45a49bce3ed1ffb70c9ab23fecb926bad
Pull Request resolved: #99014
@XilunWu XilunWu requested a review from wanchaol April 14, 2023 17:11
@XilunWu
Contributor Author

XilunWu commented Apr 17, 2023

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased gh/XilunWu/26/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/99014)

Contributor

@wanchaol wanchaol left a comment


lgtm, thanks for the fix! have one minor comment about lint

1 inline review comment on torch/distributed/_tensor/dispatch.py (outdated, resolved)
@XilunWu
Contributor Author

XilunWu commented Apr 17, 2023

@pytorchbot rebase

@XilunWu XilunWu added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 17, 2023
@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Tried to rebase and push PR #99014, but it was already up to date

XilunWu added a commit that referenced this pull request Apr 17, 2023
ghstack-source-id: eced00444301025c602f49362f542fbac2439d79
Pull Request resolved: #99014
@XilunWu
Contributor Author

XilunWu commented Apr 18, 2023

@pytorchmergebot merge -f "CI failure not related to PR"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@XilunWu XilunWu deleted the gh/XilunWu/26/head branch April 18, 2023 03:23
Labels: ciflow/trunk (Trigger trunk jobs on your pull request), Merged, release notes: distributed (dtensor)