
Avoid stale CommContext in explicit comms #1451

Merged

Conversation

TomAugspurger
Contributor

This PR updates the CommContext caching to be keyed by some information about the cluster, rather than a single global. This prevents us from using a stale comms object after the cluster changes (add or remove workers) or is recreated entirely.

Closes #1450
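The fix described above can be sketched in isolation. This is a hypothetical illustration of the caching pattern, not dask-cuda's actual code: the names `CommContext`, `get_comm_context`, and `_cluster_key` are invented here, and the key (scheduler address plus the current set of worker addresses) is one plausible choice of "information about the cluster".

```python
# Sketch of keying a cached comms context by cluster identity instead of
# a single global. All names here are illustrative, not dask-cuda's API.

_comm_cache = {}


class CommContext:
    """Stand-in for an explicit-comms context tied to one cluster state."""

    def __init__(self, key):
        self.key = key


def _cluster_key(scheduler_address, worker_addresses):
    # Adding or removing a worker, or recreating the cluster entirely,
    # produces a different key, so a stale cached context is never reused.
    return (scheduler_address, frozenset(worker_addresses))


def get_comm_context(scheduler_address, worker_addresses):
    key = _cluster_key(scheduler_address, worker_addresses)
    ctx = _comm_cache.get(key)
    if ctx is None:
        ctx = _comm_cache[key] = CommContext(key)
    return ctx
```

With this shape, two lookups against the same cluster state return the same cached context, while scaling the worker set up or down yields a fresh one.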


copy-pr-bot bot commented Feb 14, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@github-actions github-actions bot added the python python code needed label Feb 14, 2025
@TomAugspurger TomAugspurger force-pushed the tom/fix/comm-context-state branch from ce3957a to 756f7ff Compare February 14, 2025 13:33
@TomAugspurger TomAugspurger force-pushed the tom/fix/comm-context-state branch from 756f7ff to 5a51877 Compare February 14, 2025 13:33
@TomAugspurger TomAugspurger added non-breaking Non-breaking change bug Something isn't working labels Feb 14, 2025
@TomAugspurger TomAugspurger marked this pull request as ready for review February 14, 2025 13:34
@TomAugspurger TomAugspurger requested a review from a team as a code owner February 14, 2025 13:34
Member

@rjzamora rjzamora left a comment


This is really nice @TomAugspurger - Thanks!

I left a few minor comments. One additional question: do you think we need to test an explicit-comms shuffle after the comms context has changed? These tests show that the comms context can be refreshed after we scale workers up or down, but I suppose we could also make sure the refreshed context "works" as expected?

@TomAugspurger
Contributor Author

Do you think we need to test an explicit-comms shuffle after the comms context has changed

Yeah, that would be good. I'll add that.

@TomAugspurger
Contributor Author

The CI failure appears to be an unrelated timeout: https://github.com/rapidsai/dask-cuda/actions/runs/13332958081/job/37241713827?pr=1451#step:10:5333

But I'm going to spend a bit of time to try to understand what's going on.

@rjzamora
Member

But I'm going to spend a bit of time to try to understand what's going on.

I certainly appreciate that. It's also fine to try rerunning that check given that a Timeout error like that is probably unrelated to the PR (and spilling tests are known to be flaky in CI :/ ).

@TomAugspurger
Contributor Author

I didn't figure out much. The test logs at various points in time:

  1. cluster startup
  2. Unmanaged memory warnings (I think from imports)
  3. worker_assert (x2)
  4. assert_host_chunks
  5. assert_disk_chunks

Success logs from a run yesterday.

Failure logs from this run.

This table shows very roughly how long some stages took:

| stage | failure | success |
| --- | --- | --- |
| unmanaged memory | baseline | baseline |
| worker_assert | +6s | +1s |
| worker_assert2 | +8s | +0.5s |
| assert_host_chunks | +0 | +0 |
| assert_disk_chunks | +0 | +0 |

So for whatever reason, the worker_asserts took a while on the failed run. That does involve an RPC between the client and the worker, but most of the time should be spent in device_host_file_size_matches. I'd need to look more closely at where the time is going there (it does interact with the filesystem). Probably not today, so I'll restart that one failure.

Member

@rjzamora rjzamora left a comment


LGTM - Thanks @TomAugspurger !

@TomAugspurger
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 412ef58 into rapidsai:branch-25.04 Feb 19, 2025
33 checks passed
@TomAugspurger TomAugspurger deleted the tom/fix/comm-context-state branch February 19, 2025 18:52
Successfully merging this pull request may close these issues.

Explicit Comms Object Not Cleared After Cluster State Change or Restart