test_ucx_unreachable failed on gpuCI #6429
Comments
#6428 is just updating versions to the new release, as we're getting close to release time, so it shouldn't be related. I see that the error is actually right, we expect a |
Not on our end. If you look at distributed/distributed/comm/core.py, lines 288 to 317 in 60f0886,
we catch OSErrors, retry as long as time is left on the timeout, and raise an OSError once the timeout is hit. Did |
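For readers without the referenced file at hand, the behaviour described above (catch OSError, retry while time remains, raise OSError at the deadline) can be sketched roughly as follows. This is not the code from distributed/comm/core.py; `connect_with_retry`, `connect_once` and `retry_delay` are hypothetical names chosen for illustration, and the real implementation may differ in its backoff and error messages.

```python
import asyncio
import time


async def connect_with_retry(connect_once, timeout, retry_delay=0.01):
    """Call ``connect_once`` until it succeeds or ``timeout`` seconds elapse.

    ``connect_once`` is a hypothetical zero-argument coroutine function that
    stands in for a single connection attempt of some transport backend.
    """
    deadline = time.monotonic() + timeout
    last_error = None
    attempts = 0
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        attempts += 1
        try:
            # Never let a single attempt run past the overall deadline.
            return await asyncio.wait_for(connect_once(), timeout=remaining)
        except (OSError, asyncio.TimeoutError) as exc:
            # Remember the failure and retry while there is time left.
            last_error = exc
            pause = min(retry_delay, max(deadline - time.monotonic(), 0))
            await asyncio.sleep(pause)
    # Deadline hit: surface one OSError, chained to the last underlying failure.
    raise OSError(
        f"Timed out after {timeout}s and {attempts} attempt(s)"
    ) from last_error
```

With this shape, a test for an unreachable address only sees an OSError if the backend's own failure is itself an OSError (or a timeout). Any other exception type escapes the loop immediately instead of being retried.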
Thanks @fjetter for the details. Yes, I can confirm it's indeed unrelated. After trying out several different combinations I finally figured that gpuCI was until yesterday picking UCX 1.12.0 from the cc @jakirkham @quasiben in case you have ideas of what could be different. |
I opened #6434 to address this. What happened is that the conda-forge packages don't have IB/RDMACM support, and my local builds did. Depending on which connection manager UCX uses, the actual underlying error may differ because of different timeout configurations. What I did now was generalize the handling of UCX exceptions and leave it to Distributed to raise the exception back to the user based on its communication internals.
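As a rough illustration of what "generalizing" backend exceptions can look like, here is a minimal sketch. It is not the change made in #6434 and uses no real UCX or Distributed API: `BackendError`, `BackendTimeout`, `BackendUnreachable`, `ConnectFailed` and `connect` are all hypothetical names. The idea is that the caller sees one error type, regardless of which backend-specific error the connection manager produced.

```python
import asyncio


class BackendError(Exception):
    """Hypothetical base class for every error a transport backend raises."""


class BackendTimeout(BackendError):
    """Hypothetical: one connection manager reports a timeout."""


class BackendUnreachable(BackendError):
    """Hypothetical: another connection manager reports an unreachable peer."""


class ConnectFailed(OSError):
    """Hypothetical single error type exposed to the caller."""


async def connect(backend_connect, address):
    """Run one backend connection attempt and normalize its failures."""
    try:
        return await backend_connect(address)
    except BackendError as exc:
        # Catch the base class rather than one specific subclass, so the
        # outcome does not depend on how the backend library was built.
        raise ConnectFailed(f"could not connect to {address}") from exc


async def _demo():
    async def timing_out(address):
        raise BackendTimeout(address)

    async def unreachable(address):
        raise BackendUnreachable(address)

    seen = []
    for backend in (timing_out, unreachable):
        try:
            await connect(backend, "backend://203.0.113.1:12345")
        except OSError as exc:
            seen.append(type(exc).__name__)
    return seen


result = asyncio.run(_demo())
```

Because the normalized error subclasses OSError in this sketch, a caller that retries on OSError until a timeout would treat both backend failures identically.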
We got a gpuCI failure on #6423. I'm pretty convinced this is unrelated
https://gpuci.gpuopenanalytics.com/job/dask/job/distributed/job/prb/job/distributed-prb/3200/CUDA_VER=11.5,LINUX_VER=ubuntu18.04,PYTHON_VER=3.9,RAPIDS_VER=22.06/testReport/junit/distributed.comm.tests/test_ucx/test_ucx_unreachable/
cc @dask/gpu