I have just found some errors when running benchmarks with UCX and distributed>=2021.8.1:
distributed.batched - INFO - Batched Comm Closed <UCX (closed) Client->Scheduler local=None remote=ucx://127.0.0.1:39119>
Traceback (most recent call last):
  File "/datasets/pentschev/miniconda3/envs/ucx-111-112-21.10.210827-sgkit/lib/python3.8/site-packages/distributed/comm/ucx.py", line 224, in write
    await self.ep.send(struct.pack("?Q", False, nframes))
  File "/datasets/pentschev/miniconda3/envs/ucx-111-112-21.10.210827-sgkit/lib/python3.8/site-packages/ucp/core.py", line 608, in send
    self._ep.raise_on_error()
  File "ucp/_libs/ucx_endpoint.pyx", line 263, in ucp._libs.ucx_api.UCXEndpoint.raise_on_error
ucp.exceptions.UCXConnectionReset: Endpoint 0x7f5440fb3d00 error: Connection reset by remote peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/datasets/pentschev/miniconda3/envs/ucx-111-112-21.10.210827-sgkit/lib/python3.8/site-packages/distributed/batched.py", line 93, in _background_send
    nbytes = yield self.comm.write(
  File "/datasets/pentschev/miniconda3/envs/ucx-111-112-21.10.210827-sgkit/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/datasets/pentschev/miniconda3/envs/ucx-111-112-21.10.210827-sgkit/lib/python3.8/site-packages/distributed/comm/ucx.py", line 246, in write
    raise CommClosedError("While writing, the connection was closed")
distributed.comm.core.CommClosedError: While writing, the connection was closed
This can be reproduced on a DGX-2 with the following command:
It turns out the error was there before, but it was subtle and easy to ignore. dask/distributed#5209 made it more verbose, which is why we now see it clearly. The issue occurs with UCX 1.11+ when the client is shutting down and client.close() is not called before the scheduler closes.
pentschev added a commit to pentschev/distributed that referenced this issue on Oct 27, 2021
Register a close callback function with UCX to prevent writing when the
endpoint has already closed. This prevents errors often raised when a
remote process closes too quickly before the local process is able to
send the close message.
Closes rapidsai/dask-cuda#713
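
The idea in the commit message above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the referenced commit: `FakeEndpoint`, `Comm`, and the `set_close_callback` method shown here are hypothetical stand-ins written for this example, and no real UCX library is involved. The comm registers a callback that flips a flag when the endpoint closes, and `write` checks that flag before attempting to send.

```python
import asyncio


class CommClosedError(Exception):
    """Raised when a write is attempted on a comm that is already closed."""


class FakeEndpoint:
    """Hypothetical stand-in for a UCX endpoint (not the real UCX-Py API)."""

    def __init__(self):
        self._close_callback = None
        self.closed = False
        self.sent = []

    def set_close_callback(self, callback):
        # Remember the function to invoke once the endpoint closes.
        self._close_callback = callback

    def remote_closed(self):
        # Simulate the remote peer going away before we send anything else.
        self.closed = True
        if self._close_callback is not None:
            self._close_callback()

    async def send(self, payload):
        # Without a guard in the caller, this is where the reset would surface.
        if self.closed:
            raise ConnectionResetError("Connection reset by remote peer")
        self.sent.append(payload)


class Comm:
    """Hypothetical comm wrapper that refuses to write after a close."""

    def __init__(self, ep):
        self._ep = ep
        self._closed = False
        ep.set_close_callback(self._on_close)

    def _on_close(self):
        self._closed = True

    async def write(self, payload):
        # Check the flag first so we never touch an endpoint that has closed.
        if self._closed:
            raise CommClosedError("Endpoint is closed, unable to send message")
        await self._ep.send(payload)
        return len(payload)


async def demo():
    ep = FakeEndpoint()
    comm = Comm(ep)
    written = await comm.write(b"hello")  # endpoint still open
    ep.remote_closed()                    # remote process exits early
    try:
        await comm.write(b"close")        # would otherwise hit the reset
    except CommClosedError:
        refused = True
    else:
        refused = False
    return written, refused, ep.sent


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the second write fails fast with `CommClosedError` instead of reaching the endpoint and triggering a connection-reset error, which is the behavior the commit message describes.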
pentschev changed the title from "UCX Connection reset by remote peer error with distributed>=2021.8.1" to "UCX Connection reset by remote peer error at cluster shutdown" on Oct 28, 2021
Although the benchmark succeeds, the unclean exit doesn't occur with distributed=2021.8.0, and this needs to be tracked down and fixed.