Register UCX close callback #5474

Merged: 4 commits into dask:main on Nov 1, 2021

@pentschev pentschev commented Oct 28, 2021

With UCX, an endpoint becomes invalid for writing as soon as it is closed or an error is raised because the remote endpoint has closed. However, some messages may have already arrived and been enqueued at the UCX endpoint without yet having been read by the application. We must therefore ensure that an endpoint is no longer valid for writing immediately after an error occurs, while still allowing reads as long as enqueued, unread messages remain. Registering a close callback achieves this and prevents unclean closing of Distributed comms.

This change depends on rapidsai/ucx-py#795 for clean closing, but checks ensure that an older UCX-Py build without that feature still works.
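
As a rough illustration of the behavior described above (not the code in this PR), here is a minimal Python sketch. `FakeEndpoint`, `OldEndpoint`, `Comm`, and all of their methods are hypothetical stand-ins rather than the UCX-Py or Distributed API. The sketch assumes only that a newer endpoint exposes some hook for registering a close callback and that older builds may lack it, hence the `hasattr` check.

```python
from collections import deque


class FakeEndpoint:
    """Hypothetical stand-in for an endpoint that supports a close hook."""

    def __init__(self):
        self.inbox = deque()  # messages received but not yet read
        self.sent = []        # messages written by the local side
        self._close_cb = None

    def set_close_callback(self, cb):
        self._close_cb = cb

    def remote_closed(self):
        # Simulate the transport noticing that the peer went away.
        if self._close_cb is not None:
            self._close_cb()


class OldEndpoint:
    """Hypothetical older endpoint without the close hook."""

    def __init__(self):
        self.inbox = deque()
        self.sent = []


class Comm:
    """Illustrative comm wrapper: writes stop on close, reads drain the queue."""

    def __init__(self, ep):
        self._ep = ep
        self._writable = True
        # Feature check so endpoints lacking the hook still work.
        if hasattr(ep, "set_close_callback"):
            ep.set_close_callback(self._on_close)

    def _on_close(self):
        # Only writing is disabled; queued messages stay readable.
        self._writable = False

    def write(self, msg):
        if not self._writable:
            raise ConnectionError("endpoint closed; refusing to write")
        self._ep.sent.append(msg)

    def read(self):
        if self._ep.inbox:
            return self._ep.inbox.popleft()
        # Simplification: a real comm would wait for data while still open.
        raise ConnectionError("no unread messages available")
```

In this sketch, calling `remote_closed()` makes later `write` calls fail right away, while `read` keeps returning whatever is still in `inbox` until the queue is empty.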

Closes rapidsai/dask-cuda#713.

Register a close callback function with UCX to prevent writing when the
endpoint has already closed. This prevents errors often raised when a
remote process closes too quickly before the local process is able to
send the close message.

Closes rapidsai/dask-cuda#713

Ensure the scheduler has some time to close BatchedSend before closing
local BatchedSend instances in Worker.
@@ -1510,6 +1510,13 @@ async def close(

self.stop_services()

# Give some time for a UCX scheduler to complete closing endpoints

Thanks for the detailed comments here

quasiben commented Nov 1, 2021

Thanks for continuing to push on EP handling and worker closing issues with UCX. The failures here are unrelated to the PR. Merging in.

@quasiben quasiben merged commit 8cc4284 into dask:main Nov 1, 2021
@pentschev pentschev deleted the ucx-close-callback branch November 2, 2021 20:12
Development

Successfully merging this pull request may close these issues.

UCX Connection reset by remote peer error at cluster shutdown