test_ucx_unreachable failed on gpuCI #6429

Closed · fjetter opened this issue May 24, 2022 · 5 comments · Fixed by #6434
fjetter (Member) commented May 24, 2022

We got a gpuCI failure on #6423. I'm pretty convinced this is unrelated to the changes in that PR.

https://gpuci.gpuopenanalytics.com/job/dask/job/distributed/job/prb/job/distributed-prb/3200/CUDA_VER=11.5,LINUX_VER=ubuntu18.04,PYTHON_VER=3.9,RAPIDS_VER=22.06/testReport/junit/distributed.comm.tests/test_ucx/test_ucx_unreachable/

cc @dask/gpu

fjetter (Member, Author) commented May 24, 2022

Got another failure on #6404. Is this related to #6428?

pentschev (Member) commented

#6428 is just updating versions to the new release, as we're getting close to release time, so it shouldn't be related.

I see that the error is actually right: we expect a "Destination is unreachable" exception and we do get one. What seems different is that before we were expecting an OSError, and now we're getting UCX's UCXError instead. This looks to me like it's related to the way errors are being handled in Dask. Is it possible that some code path used to filter errors and re-raise them as OSError, but now passes the original exception through instead?
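For illustration, a hypothetical filtering path of that kind would look roughly like this. This is not Distributed's actual code; the wrapper name is made up, and it assumes ucx-py's base exception is `ucp.exceptions.UCXError`:

```python
import ucp  # ucx-py; assumes its base exception is ucp.exceptions.UCXError


async def connect_with_error_filter(connector, loc, **connection_args):
    """Hypothetical wrapper: translate UCX-specific errors into OSError
    so downstream code only ever sees the generic exception type."""
    try:
        return await connector.connect(loc, **connection_args)
    except ucp.exceptions.UCXError as exc:
        # Re-raise as OSError. If a filter like this existed and was removed,
        # callers would now see UCXError instead of OSError.
        raise OSError(f"Connection to {loc} failed: {exc}") from exc
```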

fjetter (Member, Author) commented May 24, 2022

Not on our end. If you look at

```python
try:
    comm = await asyncio.wait_for(
        connector.connect(loc, deserialize=deserialize, **connection_args),
        timeout=min(intermediate_cap, time_left()),
    )
    break
except FatalCommClosedError:
    raise
# Note: CommClosed inherits from OSError
except (asyncio.TimeoutError, OSError) as exc:
    active_exception = exc
    # As described above, the intermediate timeout is used to distribute
    # initial, bulk connect attempts homogeneously. In particular with
    # the jitter upon retries we should not be worried about overloading
    # any more DNS servers
    intermediate_cap = timeout
    # FullJitter see https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
    upper_cap = min(time_left(), backoff_base * (2**attempt))
    backoff = random.uniform(0, upper_cap)
    attempt += 1
    logger.debug(
        "Could not connect to %s, waiting for %s before retrying", loc, backoff
    )
    await asyncio.sleep(backoff)
else:  # ``else`` of the enclosing retry loop: no attempt succeeded in time
    raise OSError(
        f"Timed out trying to connect to {addr} after {timeout} s"
    ) from active_exception
```

you can see that we catch OSError, retry for as long as time is left on the timeout, and once the timeout is hit, raise an OSError.

Did UCXError previously inherit from OSError but no longer does? If it doesn't inherit from OSError, we are simply re-raising it. This logic hasn't changed in behavior since its original implementation ~2 years ago.
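As a toy illustration only (not Distributed or ucx-py code), whether the retry handler above ever sees the exception comes down purely to the exception's class hierarchy:

```python
# Toy illustration: the retry handler only catches exceptions that derive
# from OSError (or asyncio.TimeoutError); anything else propagates unchanged.


class UCXErrorLikeOSError(OSError):
    """Stand-in for a UCXError that inherits from OSError."""


class UCXErrorPlain(Exception):
    """Stand-in for a UCXError that does not inherit from OSError."""


def simulate(exc_type):
    try:
        raise exc_type("Destination is unreachable")
    except OSError:
        return "caught: retried, eventually surfaced as OSError on timeout"
    except Exception:
        return "not caught by the OSError clause: original exception propagates"


print(simulate(UCXErrorLikeOSError))
print(simulate(UCXErrorPlain))
```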

pentschev (Member) commented

Thanks @fjetter for the details. Yes, I can confirm it's indeed unrelated. After trying several different combinations, I finally figured out that until yesterday gpuCI was picking up UCX 1.12.0 from the rapidsai conda channel, and it has now moved to picking up 1.12.1 from conda-forge instead. However, if I build 1.12.1 from source locally I can't reproduce the issue, so it seems specific to the conda-forge package. I'll keep investigating.

cc @jakirkham @quasiben in case you have ideas of what could be different.

pentschev (Member) commented

I opened #6434 to address this. It turns out the conda-forge packages don't have IB/RDMACM support, while my local builds did. Depending on which connection manager UCX uses, the actual underlying error may differ because of different timeout configurations. What I've done now is generalize any UCX exception and leave it to Distributed to raise the exception back to the user based on its communication internals.
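A minimal sketch of what such generalization could look like (illustrative only, not the actual #6434 change; it assumes ucx-py's base exception is `ucp.exceptions.UCXError` and uses Distributed's `CommClosedError`, which inherits from OSError):

```python
import ucp
from distributed.comm.core import CommClosedError  # subclass of OSError


async def ucx_connect(host: str, port: int):
    """Illustrative connect: any UCX-level failure is generalized into
    Distributed's comm exception instead of leaking backend-specific errors."""
    try:
        return await ucp.create_endpoint(host, port)
    except ucp.exceptions.UCXError as e:
        # Whatever the underlying transport/connection manager reports
        # (TCP, IB/RDMACM, ...), surface it as a CommClosedError and let
        # distributed.comm.core.connect() handle retries and the final OSError.
        raise CommClosedError(f"Connection to {host}:{port} failed: {e}") from e
```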
