test_ucx_unreachable failed on gpuCI #6429

Closed · fjetter opened this issue May 24, 2022 · 5 comments · Fixed by #6434
fjetter (Member) commented May 24, 2022

We got a gpuCI failure on #6423. I'm pretty convinced this is unrelated to the changes in that PR.

https://gpuci.gpuopenanalytics.com/job/dask/job/distributed/job/prb/job/distributed-prb/3200/CUDA_VER=11.5,LINUX_VER=ubuntu18.04,PYTHON_VER=3.9,RAPIDS_VER=22.06/testReport/junit/distributed.comm.tests/test_ucx/test_ucx_unreachable/

cc @dask/gpu

fjetter (Member, Author) commented May 24, 2022

Got another failure on #6404. Is this related to #6428?

pentschev (Member) commented

#6428 is just updating versions to the new release, as we're getting close to release time, so it shouldn't be related.

I see that the error is actually right: we expect a "Destination is unreachable" exception and we do get one. What seems different is that before we were expecting an OSError, and now we're getting UCX's UCXError instead. This looks to me like it's related to the way errors are being handled in Dask. Is it possible that some code path used to filter errors and re-raise them as OSError, but now passes the original exception through instead?
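For illustration, a hypothetical filtering path of that kind would look roughly like this. This is not Distributed's actual code; the wrapper name is made up, and it assumes ucx-py's base exception is `ucp.exceptions.UCXError`:

```python
import ucp  # ucx-py; assumes its base exception is ucp.exceptions.UCXError


async def connect_with_error_filter(connector, loc, **connection_args):
    """Hypothetical wrapper: translate UCX-specific errors into OSError
    so downstream code only ever sees the generic exception type."""
    try:
        return await connector.connect(loc, **connection_args)
    except ucp.exceptions.UCXError as exc:
        # Re-raise as OSError. If a filter like this existed and was removed,
        # callers would now see UCXError instead of OSError.
        raise OSError(f"Connection to {loc} failed: {exc}") from exc
```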

fjetter (Member, Author) commented May 24, 2022

Not on our end. If you look at

```python
try:
    comm = await asyncio.wait_for(
        connector.connect(loc, deserialize=deserialize, **connection_args),
        timeout=min(intermediate_cap, time_left()),
    )
    break
except FatalCommClosedError:
    raise
# Note: CommClosed inherits from OSError
except (asyncio.TimeoutError, OSError) as exc:
    active_exception = exc
    # As described above, the intermediate timeout is used to distribute
    # initial, bulk connect attempts homogeneously. In particular with
    # the jitter upon retries we should not be worried about overloading
    # any more DNS servers
    intermediate_cap = timeout
    # FullJitter see https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
    upper_cap = min(time_left(), backoff_base * (2**attempt))
    backoff = random.uniform(0, upper_cap)
    attempt += 1
    logger.debug(
        "Could not connect to %s, waiting for %s before retrying", loc, backoff
    )
    await asyncio.sleep(backoff)
else:  # ``else`` of the enclosing retry loop: no attempt succeeded in time
    raise OSError(
        f"Timed out trying to connect to {addr} after {timeout} s"
    ) from active_exception
```

you can see that we catch OSError, retry for as long as time is left on the timeout, and once the timeout is hit, raise an OSError.

Did UCXError previously inherit from OSError but no longer does? If it doesn't inherit from OSError, we are simply re-raising it. This logic hasn't changed in behavior since its original implementation ~2 years ago.
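As a toy illustration only (not Distributed or ucx-py code), whether the retry handler above ever sees the exception comes down purely to the exception's class hierarchy:

```python
# Toy illustration: the retry handler only catches exceptions that derive
# from OSError (or asyncio.TimeoutError); anything else propagates unchanged.


class UCXErrorLikeOSError(OSError):
    """Stand-in for a UCXError that inherits from OSError."""


class UCXErrorPlain(Exception):
    """Stand-in for a UCXError that does not inherit from OSError."""


def simulate(exc_type):
    try:
        raise exc_type("Destination is unreachable")
    except OSError:
        return "caught: retried, eventually surfaced as OSError on timeout"
    except Exception:
        return "not caught by the OSError clause: original exception propagates"


print(simulate(UCXErrorLikeOSError))
print(simulate(UCXErrorPlain))
```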

pentschev (Member) commented

Thanks @fjetter for the details. Yes, I can confirm it's indeed unrelated. After trying several different combinations, I finally figured out that until yesterday gpuCI was picking up UCX 1.12.0 from the rapidsai conda channel, and it has now moved to picking up 1.12.1 from conda-forge instead. However, if I build 1.12.1 from source locally I can't reproduce the issue, so it seems specific to the conda-forge package. I'll keep investigating.

cc @jakirkham @quasiben in case you have ideas of what could be different.

pentschev (Member) commented

I opened #6434 to address this. It turns out the conda-forge packages don't have IB/RDMACM support, while my local builds did. Depending on which connection manager UCX uses, the actual underlying error may differ because of different timeout configurations. What I've done now is generalize any UCX exception and leave it to Distributed to raise the exception back to the user based on its communication internals.
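A minimal sketch of what such generalization could look like (illustrative only, not the actual #6434 change; it assumes ucx-py's base exception is `ucp.exceptions.UCXError` and uses Distributed's `CommClosedError`, which inherits from OSError):

```python
import ucp
from distributed.comm.core import CommClosedError  # subclass of OSError


async def ucx_connect(host: str, port: int):
    """Illustrative connect: any UCX-level failure is generalized into
    Distributed's comm exception instead of leaking backend-specific errors."""
    try:
        return await ucp.create_endpoint(host, port)
    except ucp.exceptions.UCXError as e:
        # Whatever the underlying transport/connection manager reports
        # (TCP, IB/RDMACM, ...), surface it as a CommClosedError and let
        # distributed.comm.core.connect() handle retries and the final OSError.
        raise CommClosedError(f"Connection to {host}:{port} failed: {e}") from e
```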
