Endpoint timeout (error code -80) seen after upgrading to UCX 1.14.0 #8971

Open
abellina opened this issue Mar 27, 2023 · 4 comments

@abellina

Our CI detected an issue that I didn't see while manually testing with UCX 1.14.0: NVIDIA/spark-rapids#7940

Essentially, we are losing endpoints, and the only error our listener reports is a timeout.

This started to happen after we upgraded to UCX 1.14.0. The version we were using before was 1.12.1.

Any pointers on what may have changed in timeout (keepalive?) error handling between these versions would be great.

@evgeny-leksikov
Contributor

@abellina, is this issue still relevant?

@abellina
Author

I have been able to repro it with UCX 1.14.0 and JUCX 1.12.1. I sent logs privately, so I think it is still relevant.

@supunkamburugamuve

We have started seeing this issue as well after upgrading to 1.14.1.

@abellina
Author

We have done several tests to try and repro this, especially around the keepalive configuration on the host and for UCX.

At this stage we are getting 0 failures, but the system has been rebooted, and we have a version of UCX that @evgeny-leksikov prepared. Our next step will be to move to UCX 1.15 as released; we'll update here if anything changes. Unfortunately, none of the investigation we have done so far has yielded a root cause.
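
For anyone comparing keepalive settings on the host and for UCX while trying to reproduce this, here is a minimal sketch of a snapshot helper that could be run on each node. It is not taken from the investigation above. The function names are hypothetical, the `/proc/sys/net/ipv4` files are the standard Linux TCP keepalive sysctls, and the `UCX_*` variable names are assumptions that should be checked against the output of `ucx_info -c` for the installed UCX version.

```python
import os

# Assumed names of UCX keepalive-related settings; confirm with `ucx_info -c`
# for the UCX build in use, since the available options can differ by version.
UCX_KEEPALIVE_VARS = (
    "UCX_KEEPALIVE_INTERVAL",
    "UCX_KEEPALIVE_NUM_EPS",
    "UCX_TCP_KEEPIDLE",
    "UCX_TCP_KEEPINTVL",
    "UCX_TCP_KEEPCNT",
)

# Standard Linux TCP keepalive sysctls (values are in seconds, or a probe count).
HOST_KEEPALIVE_SYSCTLS = (
    "tcp_keepalive_time",
    "tcp_keepalive_intvl",
    "tcp_keepalive_probes",
)


def ucx_keepalive_overrides(env=None):
    """Return the keepalive-related UCX variables that are explicitly set."""
    env = os.environ if env is None else env
    return {name: env[name] for name in UCX_KEEPALIVE_VARS if name in env}


def host_tcp_keepalive(proc_dir="/proc/sys/net/ipv4"):
    """Read host TCP keepalive sysctls; unreadable entries map to None."""
    values = {}
    for name in HOST_KEEPALIVE_SYSCTLS:
        try:
            with open(os.path.join(proc_dir, name)) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


if __name__ == "__main__":
    print("UCX overrides:", ucx_keepalive_overrides())
    print("Host TCP keepalive:", host_tcp_keepalive())
```

Running this before and after a reboot or a UCX upgrade gives two snapshots that can be diffed, which may help rule keepalive configuration in or out as a factor.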
