Retry on Connection Errors #4120

kindly · 2023-04-24T20:37:51Z

Request errors can happen intermittently for a variety of reasons, e.g. patchy internet, hyper holding on to a request for too long, TLS errors, or broken pipes.

This PR increases the types of errors that can be retried.

Which issue does this PR close?

Closes #4119

Rationale for this change

Only certain request errors are currently retried, and I was experiencing very unreliable multipart uploads because a single failed chunk would fail the whole upload. Multipart uploads caused more issues because there is sometimes a delay between the put requests for each chunk, and hyper sometimes uses a dead connection or runs into network issues.

What changes are included in this PR?

Allow retries when reqwest fails due to request errors.

Are there any user-facing changes?

Fewer failures when getting or uploading data to object stores.

tustvold · 2023-04-25T11:22:55Z

object_store/src/client/retry.rs

-                        })
+                        if retries == max_retries
+                            || now.elapsed() > retry_timeout
+                            || !e.is_request() {


Unless I am mistaken, this will retry request and connection timeouts, along with malformed responses from the server. Is this desirable?

Perhaps we could do something like

    if let Some(e) = e.source().and_then(|s| s.downcast_ref::<hyper::Error>()) {
        if e.is_connect() || e.is_closed() {
            // Do retry
        }
    }

This would allow limiting the potential blast radius of this change?

crepererum · 2023-04-25

I agree that only server-side (5xx) and transport errors should be retried, not all error codes. In particular, client-side errors (4xx) should NOT be blindly retried.

So I guess the system could be: if it is an HTTP error code, filter the number range to 5xx. For all other errors use the flags to figure out their nature (is_connect/is_closed etc.).

kindly · 2023-04-25

@tustvold
I have basically done what you suggested here. I also added another condition, is_incomplete_message(), which was coming up for me when loading a large file on an unstable connection.

@crepererum
These retries only cover cases where there is no response at all, and therefore no response code. The retry condition for requests with responses is is_server_error, which I assume means 5xx errors.
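The split discussed above (5xx responses and a narrow set of transport failures are retried, everything else is not) can be sketched as a single classification function. This is only an illustration: the `Failure` enum and `should_retry` are hypothetical names that model the cases mentioned in this thread, and they are not the reqwest or hyper error types.

```rust
/// Hypothetical model of a failed request. It only mirrors the cases
/// discussed in this thread and is not the reqwest/hyper error type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Failure {
    /// A response arrived with this HTTP status code.
    Status(u16),
    /// No response: the connection could not be established.
    Connect,
    /// No response: the connection was already closed.
    Closed,
    /// No response: the connection ended in the middle of a message.
    IncompleteMessage,
    /// No response: the request timed out.
    Timeout,
    /// No response: the caller aborted the request.
    UserAbort,
}

/// Returns true only for failures that the thread agrees are worth retrying.
fn should_retry(failure: Failure) -> bool {
    match failure {
        // With a response, retry server errors (5xx) but never client errors (4xx).
        Failure::Status(code) => (500..600).contains(&code),
        // Without a response, retry the narrow set of connection-level failures.
        Failure::Connect | Failure::Closed | Failure::IncompleteMessage => true,
        // Timeouts and user-initiated aborts are deliberately left out.
        Failure::Timeout | Failure::UserAbort => false,
    }
}
```

Listing the non-retried cases explicitly, rather than using a catch-all arm, means a newly added failure kind has to be classified on purpose, which keeps the blast radius small.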

tustvold

I think it would be good to get some test coverage of this change, perhaps by extending the existing MockServer based test, as I'm somewhat apprehensive it might lead to retrying errors that don't make sense to retry - e.g. timeouts, user-initiated aborts, etc...

kindly · 2023-04-25T19:24:14Z

@tustvold I have had difficulty making a useful test case for this. The MockServer currently works at the level of a Response object, and this case will not have one, only a hyper error. I have not been able to find a way to make the mock server misbehave on purpose, such as prematurely closing a connection. I think I would need to get the raw socket, but even then I am not sure how realistic a test case I could make.

tustvold · 2023-04-25T21:46:39Z

I have not been able to find a way to make the mock server to misbehave on purpose, like prematurely close a connection

What happens if you MockServer::push_fn a function that panics? I think this should drop the connection without returning a response.
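To illustrate what "dropping the connection without returning a response" looks like from the client side, here is a minimal stand-in built on the standard library only. It does not use the crate's MockServer; `spawn_flaky_server` and `request_once` are hypothetical helpers, and the raw TCP handling is a simplification of what hyper does.

```rust
use std::io::{self, Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;

/// Starts a local server that closes the first `failures` connections
/// without sending a response, then answers normally.
fn spawn_flaky_server(failures: usize) -> io::Result<SocketAddr> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    thread::spawn(move || {
        for (i, conn) in listener.incoming().enumerate() {
            let Ok(mut stream) = conn else { break };
            // Read the request so that the close below is a clean shutdown.
            let mut buf = [0u8; 1024];
            let _ = stream.read(&mut buf);
            if i < failures {
                // `stream` is dropped here, closing the connection with no response.
                continue;
            }
            let _ = stream.write_all(
                b"HTTP/1.1 200 OK\r\ncontent-length: 2\r\nconnection: close\r\n\r\nok",
            );
        }
    });
    Ok(addr)
}

/// Sends one request and returns the raw response text.
/// An empty read is reported as an error, similar in spirit to an
/// incomplete message error.
fn request_once(addr: SocketAddr) -> io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    stream.write_all(b"GET / HTTP/1.1\r\nhost: localhost\r\nconnection: close\r\n\r\n")?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    if response.is_empty() {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "connection closed before any response",
        ));
    }
    Ok(response)
}
```

With `spawn_flaky_server(1)`, the first `request_once` call fails and the second one succeeds, which is the situation a retry is meant to cover.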

kindly · 2023-04-26T07:28:28Z

@tustvold yes, that did work.
It resulted in an incomplete message error on the client that is now caught and retried.

Added tests where the retry succeeded, and when it failed after max retries.
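The two scenarios (retry succeeds, retry gives up after max retries) can be sketched with a small generic loop. This is a simplified illustration and not the code in retry.rs: `RetryBudget` and `retry` are hypothetical, the field names only echo the `max_retries` and `retry_timeout` variables visible in the diff, and backoff between attempts is omitted.

```rust
use std::time::{Duration, Instant};

/// Hypothetical retry budget; the field names echo the variables in the diff.
struct RetryBudget {
    max_retries: usize,
    retry_timeout: Duration,
}

/// Runs `op` until it succeeds, returns a non-retryable error, or the
/// budget is spent. Returns the final result and the number of retries.
fn retry<T, E>(
    budget: &RetryBudget,
    mut op: impl FnMut() -> Result<T, E>,
    is_retryable: impl Fn(&E) -> bool,
) -> (Result<T, E>, usize) {
    let now = Instant::now();
    let mut retries = 0;
    loop {
        match op() {
            Ok(value) => return (Ok(value), retries),
            Err(e) => {
                // Give up when the budget is exhausted or the error is not retryable.
                if retries == budget.max_retries
                    || now.elapsed() > budget.retry_timeout
                    || !is_retryable(&e)
                {
                    return (Err(e), retries);
                }
                retries += 1;
            }
        }
    }
}
```

An operation that fails twice and then succeeds completes with two retries under `max_retries: 2`, while one that always fails returns its last error after the same two retries.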

Retry when server fails unexpectedly, or if there are network issues that are not handled by hyper.

tustvold

Thank you

github-actions bot added the object-store Object Store Interface label Apr 24, 2023

kindly mentioned this pull request Apr 24, 2023

[object_store] Retry requests on connection error #4119

Closed

tustvold reviewed Apr 25, 2023

kindly force-pushed the master branch from a34266e to 747b90a Compare April 26, 2023 07:21

Retry when no or partial response from server.

3822a3c

kindly force-pushed the master branch from 747b90a to 3822a3c Compare April 26, 2023 07:40

tustvold approved these changes Apr 26, 2023

tustvold changed the title from "Retry on all request errors." to "Retry on Connection Errors" Apr 26, 2023

tustvold merged commit b8d8cb7 into apache:master Apr 26, 2023
