[c10d] make ProcessGroupNCCL work.wait() respect timeout #100162

Closed
cdzhan wants to merge 5 commits

@cdzhan cdzhan commented Apr 27, 2023

Fixes #83486

TestDistBackendWithSpawn.test_monitored_barrier_allreduce_hang and NcclErrorHandlingTest.test_nccl_timeout passed.
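For readers unfamiliar with the behavior in the title, here is a minimal pure-Python sketch of what "wait() respects timeout" means: the caller passes a deadline, and the wait raises once that deadline passes instead of blocking indefinitely. `FakeWork`, `finishes_after`, and `poll_interval` are hypothetical names made up for this illustration. This is not the ProcessGroupNCCL implementation and does not reflect how this PR changes it.

```python
import time
from datetime import timedelta


class FakeWork:
    """Hypothetical stand-in for the handle of an asynchronous collective."""

    def __init__(self, finishes_after: float):
        # Monotonic timestamp at which the simulated operation completes.
        self._done_at = time.monotonic() + finishes_after

    def is_completed(self) -> bool:
        return time.monotonic() >= self._done_at

    def wait(self, timeout: timedelta, poll_interval: float = 0.01) -> bool:
        """Block until completion; raise TimeoutError once `timeout` elapses."""
        deadline = time.monotonic() + timeout.total_seconds()
        while not self.is_completed():
            if time.monotonic() >= deadline:
                raise TimeoutError(f"work did not complete within {timeout}")
            time.sleep(poll_interval)
        return True
```

With this sketch, `FakeWork(0.0).wait(timedelta(seconds=1))` returns promptly, while `FakeWork(5.0).wait(timedelta(seconds=0.05))` raises `TimeoutError` after roughly 50 ms rather than blocking for the full five seconds.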

pytorch-bot bot commented Apr 27, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100162

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit d76b32e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Apr 27, 2023
@cdzhan cdzhan marked this pull request as draft April 27, 2023 15:01
@cdzhan cdzhan marked this pull request as ready for review April 28, 2023 05:48
@ezyang ezyang added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label May 9, 2023
ezyang commented May 9, 2023

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 9, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase and merge by leaving the following comment on this PR:
@pytorchbot merge -r
Or just rebase by leaving @pytorchbot rebase comment

Details for Dev Infra team Raised by workflow job

cdzhan commented May 10, 2023

@pytorchbot merge -r

cdzhan commented May 10, 2023

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased fix_worktimeout onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix_worktimeout && git pull --rebase)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Tried to rebase and push PR #100162, but it was already up to date

ezyang commented May 10, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

The merge job was canceled. If you believe this is a mistake, then you can re-trigger it through pytorch-bot.

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@cdzhan cdzhan deleted the fix_worktimeout branch May 19, 2023 00:25
Labels
ciflow/trunk Trigger trunk jobs on your pull request
Merged
open source
release notes: distributed (c10d) release notes category
triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Development

Successfully merging this pull request may close these issues.

ProcessGroupNCCL work.wait() timeout not respected even for NCCL_BLOCKING_WAIT
4 participants