-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid stale CommContext in explicit comms #1451
Avoid stale CommContext in explicit comms #1451
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
ce3957a
to
756f7ff
Compare
This PR updates the CommContext caching to be keyed by some information about the cluster, rather than a single global. This prevents us from using a stale comms object after the cluster changes (add or remove workers) or is recreated entirely. Closes rapidsai#1450
756f7ff
to
5a51877
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really nice @TomAugspurger - Thanks!
I left a few minor comments. One additional question: Do you think we need to test an explicit-comms shuffle after the comms context has changed? These tests show that the comms context can be refreshed after we scale up or down workers, but I suppose we could also make sure that the refreshed context "works" as expected?
Yeah, that would be good. I'll add that. |
The CI failure appears to be an unrelated timeout: https://github.com/rapidsai/dask-cuda/actions/runs/13332958081/job/37241713827?pr=1451#step:10:5333 But I'm going to spend a bit of time to try to understand what's going on. |
I certainly appreciate that. It's also fine to try rerunning that check given that a |
I didn't figure out much. This test logs stuff at various points in time:
Success logs from a run yesterday. Failure logs from this run. This table shows very roughly how long some stages took:
So for whatever reason, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - Thanks @TomAugspurger !
/merge |
This PR updates the CommContext caching to be keyed by some information about the cluster, rather than a single global. This prevents us from using a stale comms object after the cluster changes (add or remove workers) or is recreated entirely.
Closes #1450