Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry on Connection Errors #4120

Merged
merged 1 commit into from
Apr 26, 2023
Merged

Retry on Connection Errors #4120

merged 1 commit into from
Apr 26, 2023

Conversation

kindly
Copy link
Contributor

@kindly kindly commented Apr 24, 2023

Request errors can happen intermittently for a variety of reasons i.e patchy internet, hyper holding on to request for too long, TLS errors, broken pipes.

This increases the types of issues that can be retried.

Which issue does this PR close?

Closes #4119

Rationale for this change

Only certain request errors are retried currently, and I was experiencing very unreliable multipart uploads, because if a single chunk failed the whole upload would fail. Mulitpart uploads caused more issues due to sometimes being a delay between the put requests for each chunk and hyper, sometimes using a dead connection, or having network issues.

What changes are included in this PR?

Allow retries when reqwest fails due to request errors.

Are there any user-facing changes?

Fewer failures when getting or uploading data to object stores.

@github-actions github-actions bot added the object-store Object Store Interface label Apr 24, 2023
})
if retries == max_retries
|| now.elapsed() > retry_timeout
|| !e.is_request() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I am mistaken this will retry request and connection timeouts, along with malformed responses from the server, is this desirable?

Perhaps we could do something like

if let Some(e) = e.source().downcast_ref::<hyper::Error>() {
    if e.is_connect() || e.is_closed() {
        // Do retry
    }
}

This would allow limiting the potential blast radius of this change?

Copy link
Contributor

@crepererum crepererum Apr 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that only server side (5xx) and transport errors should be retried, not all error codes. Esp. client side errors (4xx) should NOT be blindly retried.

So I guess the system could be: if it is an HTTP error code, filter the number range to 5xx. For all other errors use the flags to figure out their nature (is_connect/is_closed etc.).

Copy link
Contributor Author

@kindly kindly Apr 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tustvold
I have basically done what you have said here. I added another condition is_incomplete_message() which was coming up for me when loading a large file, on an unstable connection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crepererum
These retries only cover cases where there is no response at all, therefore no response codes at all. The retry condition for requests with responses is is_server_error which I assume means 5xx errors.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to get some test coverage of this change, perhaps by extending the existing MockServer based test, as I'm somewhat apprehensive it might lead to retrying errors that don't make sense to retry - e.g. timeouts, user-initiated aborts, etc...

@kindly
Copy link
Contributor Author

kindly commented Apr 25, 2023

@tustvold I have had difficulty making a useful test case for this. The MockServer currently works on the level of a Response object, and this will not have one, only a hyper error. I have not been able to find a way to make the mock server to misbehave on purpose, like prematurely close a connection. I think I would need to get the raw socket, but even with this I am not sure how realistic a test case I could make.

@tustvold
Copy link
Contributor

I have not been able to find a way to make the mock server to misbehave on purpose, like prematurely close a connection

What happens if you MockServer::push_fn a function that panics, I think this should drop the connection without returning a response?

@kindly
Copy link
Contributor Author

kindly commented Apr 26, 2023

@tustvold yes, that did work.
It resulted in an incomplete message error on the client that is now caught and retried.

Added tests where the retry succeeded, and when it failed after max retries.

Retry when server fails unexpectedly, or if there are network issues that are not handled by
hyper.
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you

@tustvold tustvold changed the title Retry on all request errors. Retry on Connection Errors Apr 26, 2023
@tustvold tustvold merged commit b8d8cb7 into apache:master Apr 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
object-store Object Store Interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[object_store] Retry requests on connection error
3 participants