Retry on Connection Errors #4120
Conversation
object_store/src/client/retry.rs (outdated diff):

```rust
})
if retries == max_retries
    || now.elapsed() > retry_timeout
    || !e.is_request() {
```
Unless I am mistaken, this will retry request and connection timeouts, along with malformed responses from the server. Is this desirable?

Perhaps we could do something like

```rust
if let Some(e) = e.source().and_then(|s| s.downcast_ref::<hyper::Error>()) {
    if e.is_connect() || e.is_closed() {
        // Do retry
    }
}
```

This would allow limiting the potential blast radius of this change?
I agree that only server-side (5xx) and transport errors should be retried, not all error codes. In particular, client-side errors (4xx) should NOT be blindly retried.

So I guess the system could be: if it is an HTTP error code, filter the number range to 5xx. For all other errors, use the flags to figure out their nature (`is_connect`/`is_closed`, etc.).
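This policy is straightforward to sketch. The following is not the PR's actual code, just an illustration of the classification described above (the enum and function names are hypothetical):

```rust
/// Hypothetical classification of a failed request: either an HTTP
/// response was received, or the request failed at the transport level.
enum FailureKind {
    /// A response was received with this status code.
    Status(u16),
    /// No response: a transport-level error with hyper-style flags.
    Transport {
        is_connect: bool,
        is_closed: bool,
        is_incomplete_message: bool,
    },
}

fn should_retry(failure: &FailureKind) -> bool {
    match failure {
        // Only server-side errors (5xx) are retried; 4xx never are.
        FailureKind::Status(code) => (500..600).contains(code),
        // With no response at all, consult the transport flags.
        FailureKind::Transport { is_connect, is_closed, is_incomplete_message } => {
            *is_connect || *is_closed || *is_incomplete_message
        }
    }
}

fn main() {
    assert!(should_retry(&FailureKind::Status(503)));
    assert!(!should_retry(&FailureKind::Status(404)));
    assert!(should_retry(&FailureKind::Transport {
        is_connect: true,
        is_closed: false,
        is_incomplete_message: false,
    }));
    println!("policy checks passed");
}
```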
@tustvold I have basically done what you said here. I added another condition, `is_incomplete_message()`, which was coming up for me when loading a large file on an unstable connection.
@crepererum These retries only cover cases where there is no response at all, and therefore no response codes. The retry condition for requests that do receive a response is `is_server_error`, which I assume means 5xx errors.
I think it would be good to get some test coverage of this change, perhaps by extending the existing MockServer-based test, as I'm somewhat apprehensive it might lead to retrying errors that don't make sense to retry, e.g. timeouts, user-initiated aborts, etc.
@tustvold I have had difficulty making a useful test case for this. The …

What happens if you …

@tustvold Yes, that did work. I added tests where the retry succeeded, and where it failed after max retries.
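The actual tests extend the crate's MockServer-based test; as an illustration of the two paths being tested, here is a std-only sketch where a fake request fails with a retryable connection error a fixed number of times. All names are hypothetical, not the PR's code:

```rust
use std::cell::Cell;

/// A fake request that fails with a (retryable) connection error for
/// the first `failures` calls, then succeeds.
struct FlakyRequest {
    failures: usize,
    calls: Cell<usize>,
}

impl FlakyRequest {
    fn send(&self) -> Result<&'static str, &'static str> {
        let n = self.calls.get();
        self.calls.set(n + 1);
        if n < self.failures {
            Err("connection closed")
        } else {
            Ok("200 OK")
        }
    }
}

/// Minimal retry loop: one initial attempt plus up to `max_retries` retries.
fn send_with_retries(
    req: &FlakyRequest,
    max_retries: usize,
) -> Result<&'static str, &'static str> {
    let mut retries = 0;
    loop {
        match req.send() {
            Ok(resp) => return Ok(resp),
            Err(_) if retries < max_retries => retries += 1,
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    // Succeeds on the third attempt, within the retry budget.
    let req = FlakyRequest { failures: 2, calls: Cell::new(0) };
    assert_eq!(send_with_retries(&req, 2), Ok("200 OK"));

    // Fails every time: gives up after exhausting max_retries.
    let req = FlakyRequest { failures: 10, calls: Cell::new(0) };
    assert_eq!(send_with_retries(&req, 2), Err("connection closed"));
    assert_eq!(req.calls.get(), 3); // 1 initial attempt + 2 retries

    println!("retry tests passed");
}
```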
Retry when the server fails unexpectedly, or when there are network issues that are not handled by hyper.
Thank you
Request errors can happen intermittently for a variety of reasons, e.g. a patchy internet connection, hyper holding on to a request for too long, TLS errors, or broken pipes. This change increases the types of issues that can be retried.
Which issue does this PR close?
Closes #4119
Rationale for this change
Only certain request errors are currently retried, and I was experiencing very unreliable multipart uploads, because if a single chunk failed the whole upload would fail. Multipart uploads caused more issues because there was sometimes a delay between the `put` requests for each chunk, and hyper would sometimes use a dead connection or hit network issues.

What changes are included in this PR?
Allow retries when reqwest fails due to request errors.
Are there any user-facing changes?
Fewer failures when getting or uploading data to object stores.