
Avoid stale CommContext in explicit comms #1451

Merged

Conversation

TomAugspurger
Contributor

This PR updates the CommContext caching to be keyed by some information about the cluster, rather than a single global. This prevents us from using a stale comms object after the cluster changes (add or remove workers) or is recreated entirely.

Closes #1450
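The fix described above can be sketched in isolation. This is a hypothetical illustration of the caching pattern, not dask-cuda's actual code: the names `CommContext`, `get_comm_context`, and `_cluster_key` are invented here, and the key (scheduler address plus the current set of worker addresses) is one plausible choice of "information about the cluster".

```python
# Sketch of keying a cached comms context by cluster identity instead of
# a single global. All names here are illustrative, not dask-cuda's API.

_comm_cache = {}


class CommContext:
    """Stand-in for an explicit-comms context tied to one cluster state."""

    def __init__(self, key):
        self.key = key


def _cluster_key(scheduler_address, worker_addresses):
    # Adding or removing a worker, or recreating the cluster entirely,
    # produces a different key, so a stale cached context is never reused.
    return (scheduler_address, frozenset(worker_addresses))


def get_comm_context(scheduler_address, worker_addresses):
    key = _cluster_key(scheduler_address, worker_addresses)
    ctx = _comm_cache.get(key)
    if ctx is None:
        ctx = _comm_cache[key] = CommContext(key)
    return ctx
```

With this shape, two lookups against the same cluster state return the same cached context, while scaling the worker set up or down yields a fresh one.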


copy-pr-bot bot commented Feb 14, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@github-actions github-actions bot added the python python code needed label Feb 14, 2025
@TomAugspurger TomAugspurger force-pushed the tom/fix/comm-context-state branch from ce3957a to 756f7ff Compare February 14, 2025 13:33
@TomAugspurger TomAugspurger force-pushed the tom/fix/comm-context-state branch from 756f7ff to 5a51877 Compare February 14, 2025 13:33
@TomAugspurger TomAugspurger added non-breaking Non-breaking change bug Something isn't working labels Feb 14, 2025
@TomAugspurger TomAugspurger marked this pull request as ready for review February 14, 2025 13:34
@TomAugspurger TomAugspurger requested a review from a team as a code owner February 14, 2025 13:34
Member

@rjzamora rjzamora left a comment


This is really nice @TomAugspurger - Thanks!

I left a few minor comments. One additional question: do you think we need to test an explicit-comms shuffle after the comms context has changed? These tests show that the comms context can be refreshed after we scale workers up or down, but I suppose we could also make sure the refreshed context "works" as expected?

@TomAugspurger
Contributor Author

Do you think we need to test an explicit-comms shuffle after the comms context has changed

Yeah, that would be good. I'll add that.

@TomAugspurger
Contributor Author

The CI failure appears to be an unrelated timeout: https://github.com/rapidsai/dask-cuda/actions/runs/13332958081/job/37241713827?pr=1451#step:10:5333

But I'm going to spend a bit of time to try to understand what's going on.

@rjzamora
Member

But I'm going to spend a bit of time to try to understand what's going on.

I certainly appreciate that. It's also fine to try rerunning that check given that a Timeout error like that is probably unrelated to the PR (and spilling tests are known to be flaky in CI :/ ).

@TomAugspurger
Contributor Author

I didn't figure out much. The test logs at various points in time:

  1. cluster startup
  2. Unmanaged memory warnings (I think from imports)
  3. worker_assert (x2)
  4. assert_host_chunks
  5. assert_disk_chunks

Success logs from a run yesterday.

Failure logs from this run.

This table shows very roughly how long some stages took:

| stage | failure | success |
| --- | --- | --- |
| unmanaged memory | baseline | baseline |
| worker_assert | +6s | +1s |
| worker_assert2 | +8s | +0.5s |
| assert_host_chunks | +0 | +0 |
| assert_disk_chunks | +0 | +0 |

So for whatever reason, the worker_asserts took a while on the failed run. That does involve an RPC between the client and the worker, but most of the time should be spent in device_host_file_size_matches. I'd need to look more closely at where the time is going there (it does interact with the filesystem). Probably not today, so I'll restart that one failure.

Member

@rjzamora rjzamora left a comment


LGTM - Thanks @TomAugspurger !

@TomAugspurger
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 412ef58 into rapidsai:branch-25.04 Feb 19, 2025
33 checks passed
@TomAugspurger TomAugspurger deleted the tom/fix/comm-context-state branch February 19, 2025 18:52
Successfully merging this pull request may close these issues.

Explicit Comms Object Not Cleared After Cluster State Change or Restart