UCX Connection reset by remote peer error at cluster shutdown #713

Closed
pentschev opened this issue Aug 27, 2021 · 2 comments · Fixed by dask/distributed#5474

Comments

@pentschev
Member

I have just found some errors when running benchmarks with UCX and distributed>=2021.8.1:

distributed.batched - INFO - Batched Comm Closed <UCX (closed) Client->Scheduler local=None remote=ucx://127.0.0.1:39119>
Traceback (most recent call last):
  File "/datasets/pentschev/miniconda3/envs/ucx-111-112-21.10.210827-sgkit/lib/python3.8/site-packages/distributed/comm/ucx.py", line 224, in write
    await self.ep.send(struct.pack("?Q", False, nframes))
  File "/datasets/pentschev/miniconda3/envs/ucx-111-112-21.10.210827-sgkit/lib/python3.8/site-packages/ucp/core.py", line 608, in send
    self._ep.raise_on_error()
  File "ucp/_libs/ucx_endpoint.pyx", line 263, in ucp._libs.ucx_api.UCXEndpoint.raise_on_error
ucp.exceptions.UCXConnectionReset: Endpoint 0x7f5440fb3d00 error: Connection reset by remote peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/datasets/pentschev/miniconda3/envs/ucx-111-112-21.10.210827-sgkit/lib/python3.8/site-packages/distributed/batched.py", line 93, in _background_send
    nbytes = yield self.comm.write(
  File "/datasets/pentschev/miniconda3/envs/ucx-111-112-21.10.210827-sgkit/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/datasets/pentschev/miniconda3/envs/ucx-111-112-21.10.210827-sgkit/lib/python3.8/site-packages/distributed/comm/ucx.py", line 246, in write
    raise CommClosedError("While writing, the connection was closed")
distributed.comm.core.CommClosedError: While writing, the connection was closed

This can be reproduced on a DGX-2 with the following command:

UCX_MAX_RNDV_RAILS=1 python local_cudf_shuffle.py -t gpu -p ucx --enable-tcp-over-ucx --enable-nvlink --disable-infiniband --disable-rdmacm --runs 3 --in-parts 16 --partition-size 1GB --rmm-pool-size 22G --no-silence-logs -d 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

Although the benchmark succeeds, it exits uncleanly. The error doesn't occur with distributed=2021.8.0, so this needs to be tracked down and fixed.

@pentschev
Member Author

It turns out the error was there before, but it was very subtle and easy to ignore. dask/distributed#5209 made it more verbose, which is why we now see it clearly. The issue happens with UCX 1.11+ when the client is shutting down and client.close() is not called before the scheduler closes.

pentschev added a commit to pentschev/distributed that referenced this issue Oct 27, 2021
Register a close callback function with UCX to prevent writing when the
endpoint has already closed. This prevents errors often raised when a
remote process closes too quickly before the local process is able to
send the close message.

Closes rapidsai/dask-cuda#713
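
For readers who want to see the idea in the commit message in isolation, here is a rough sketch of a close callback guarding writes. It is not the code from dask/distributed#5474 or ucx-py: `FakeEndpoint`, `GuardedComm`, `set_close_callback`, `remote_closed` and the local `CommClosedError` are all hypothetical stand-ins made up for illustration, and the real endpoint API may look different.

```python
import asyncio


class CommClosedError(IOError):
    """Local stand-in for distributed's CommClosedError (assumption)."""


class FakeEndpoint:
    """Hypothetical stand-in for a UCX endpoint; not the ucx-py API."""

    def __init__(self):
        self._close_callback = None
        self.sent = []

    def set_close_callback(self, callback):
        # Remember the function to call when the remote side goes away.
        self._close_callback = callback

    def remote_closed(self):
        # Simulate the peer closing: fire the registered callback once.
        if self._close_callback is not None:
            callback, self._close_callback = self._close_callback, None
            callback()

    async def send(self, payload):
        self.sent.append(payload)


class GuardedComm:
    """Comm wrapper that refuses to write once the endpoint has closed."""

    def __init__(self, ep):
        self._ep = ep
        self._closed = False
        ep.set_close_callback(self._on_close)

    def _on_close(self):
        self._closed = True

    def closed(self):
        return self._closed

    async def write(self, payload):
        # Check the flag first so we never touch a dead endpoint.
        if self._closed:
            raise CommClosedError("Endpoint is closed, unable to send message")
        await self._ep.send(payload)
        return len(payload)


async def demo():
    ep = FakeEndpoint()
    comm = GuardedComm(ep)
    written = await comm.write(b"hello")  # endpoint still open
    ep.remote_closed()                    # peer disappears
    try:
        await comm.write(b"bye")          # rejected before reaching the endpoint
    except CommClosedError:
        rejected = True
    else:
        rejected = False
    return written, rejected, ep.sent
```

Running `asyncio.run(demo())` exercises one successful write and one rejected write. The point of the design is that the second write fails with a clear, expected error on the local side instead of surfacing a connection reset from the transport.
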
@pentschev
Member Author

This should be fixed by dask/distributed#5474, which also requires rapidsai/ucx-py#795.

@pentschev pentschev changed the title UCX Connection reset by remote peer error with distributed>=2021.8.1 UCX Connection reset by remote peer error at cluster shutdown Oct 28, 2021