
Retry on HTTP remote cache fetch failure #14258

Closed
aherrmann wants to merge 16 commits

Conversation

@aherrmann (Contributor) commented Nov 11, 2021

Bazel's previous behavior was to rebuild an artifact locally if fetching it from an HTTP remote cache failed. This differs from the gRPC remote cache case, where Bazel will retry the fetch.

The lack of retries is an issue for two reasons: rebuilding locally can be slower than fetching from the remote cache, and if a build action is not bit-reproducible, as is the case with some compilers, the local rebuild will trigger cache misses on further build actions that depend on the current artifact.

This change aims to avoid these issues by retrying the fetch in the HTTP cache case, similarly to how the gRPC cache client does it.

Some care needs to be taken due to the design of Bazel's internal remote cache client API. For a fetch, the client is given an OutputStream object that it is expected to write the fetched data to. This may be a temporary file on disk that will be moved to its final location after the fetch completes. On retry, we need to be careful not to duplicate previously written data when writing into this OutputStream. Due to the generality of the OutputStream interface we cannot reset the file handle or write pointer to start fresh. Instead, this change follows the same pattern used in the gRPC cache client: keep track of the data previously written and continue from that offset on retry.

With this change the HTTP cache client will attempt to fetch the data from the remote cache via an HTTP range request, so that the server only needs to send the data that is still missing. If the server replies with a 206 Partial Content response, we write the received data directly into the output stream; if the server does not support range requests and instead replies with the full data, we drop the duplicate prefix and only write into the output stream from the required offset.
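To make these semantics concrete, here is a minimal, hedged sketch of the resume logic using java.net.http rather than Bazel's Netty-based client; resumeFetch and its parameters are illustrative only, not the PR's actual code:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative sketch only, not Bazel's implementation. Resume a fetch at
// `offset`, tolerating servers that ignore the Range header and reply with
// 200 and the full body instead of 206 and the missing suffix.
static long resumeFetch(HttpClient client, URI uri, long offset, OutputStream out)
    throws IOException, InterruptedException {
  HttpRequest request =
      HttpRequest.newBuilder(uri)
          .header("Range", "bytes=" + offset + "-") // ask only for the missing suffix
          .build();
  HttpResponse<InputStream> response =
      client.send(request, HttpResponse.BodyHandlers.ofInputStream());
  // 206: the body starts at `offset`; 200: the body repeats the known prefix.
  long toSkip = response.statusCode() == 206 ? 0 : offset;
  long written = 0;
  try (InputStream in = response.body()) {
    byte[] buf = new byte[8192];
    int n;
    while ((n = in.read(buf)) != -1) {
      int start = (int) Math.min(toSkip, n); // drop duplicate prefix bytes, if any
      toSkip -= start;
      out.write(buf, start, n - start);
      written += n - start;
    }
  }
  return offset + written; // next resume offset if this attempt also fails
}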

This patch has been running successfully in production (see digital-asset/daml#11238).

cc @cocreature

TODO

  • Testing: this PR does not include any tests yet. If anyone has pointers on how best to test this in the Bazel code base, please let me know. One option I briefly explored was to add a test case to src/test/java/com/google/devtools/build/lib/remote/http/HttpCacheClientTest.java. But that seemed to require implementing support for HTTP range requests in src/tools/remote/src/main/java/com/google/devtools/build/remote/worker/http/HttpCacheServerHandler.java. Perhaps there is a better way to test this change.

@aherrmann aherrmann requested a review from a team as a code owner November 11, 2021 11:40
@google-cla google-cla bot added the cla: yes label Nov 11, 2021
Comment on lines 240 to 253
if (e instanceof ClosedChannelException) {
  retry = true;
} else if (e instanceof HttpException) {
  retry = true;
} else if (e instanceof IOException) {
  String msg = e.getMessage().toLowerCase();
  if (msg.contains("connection reset by peer")) {
    retry = true;
  }
}
@aherrmann (Contributor, Author):

This list of exceptions is based on what we observed on CI; see digital-asset/daml#11238 and digital-asset/daml#11445. It may be that further rules should be added.

As far as I understand, ClosedChannelException can also occur in the case of a build failure, when Bazel aborts the build. Is there a way to avoid retrying in this case and only retry on legitimate network errors?

Contributor:

What kind of HttpException do you expect to catch and retry?

E.g. if the server is overloaded and answers with 503 Service Unavailable, then it could make sense to do the opposite: back off and not retry, at least for the duration specified by the Retry-After header.

@aherrmann (Contributor, Author):

Sorry for the late reply.

An example that we saw in our logs was

RETRYING: com.google.devtools.build.lib.remote.http.HttpException: 502 Bad Gateway

E.g. if the server is overloaded and answers with 503 Service Unavailable, then it could make sense to do the opposite: back off and not retry, at least for the duration specified by the Retry-After header.

AFAIK, RemoteRetrier currently uses an exponential backoff, so it does already back off. It's true that it wouldn't respect Retry-After. I'm not too familiar with the internals of RemoteRetrier. Does it support that kind of use case?
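For illustration, a hedged sketch of how a capped exponential backoff could be combined with a server-supplied Retry-After value; this is not RemoteRetrier's actual API, and the method name and constants are assumptions:

import java.time.Duration;

// Illustrative only: capped exponential backoff, overridden by a
// Retry-After hint (in seconds) when the server provides one.
static Duration backoff(int attempt, Long retryAfterSeconds) {
  if (retryAfterSeconds != null) {
    return Duration.ofSeconds(retryAfterSeconds); // honor the server's hint
  }
  long millis = 100L << Math.min(attempt, 10); // 100ms, 200ms, 400ms, ...
  return Duration.ofMillis(Math.min(millis, 5_000)); // cap the delay at 5s
}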

@aherrmann (Contributor, Author):

Just yesterday I encountered a Bazel HttpException on a local build with this patch enabled:

RETRYING: com.google.devtools.build.lib.remote.http.HttpException: 503 Service Unavailable
Service Unavailable

The retry succeeded and the build continued from there on.

Contributor:

Maybe the retries should be limited to a specific set of status codes in the case of an HttpException?

Does it even make sense to retry on any client error, given that resending the request will probably not change the response (e.g. for 400, 401, 403, or 405)?

Otherwise, retrying on 503 Service Unavailable with the back-off mechanism of the Retrier seems reasonable to me. What do you think @ulrfa?

@aherrmann (Contributor, Author):

Good point, I've added logic to only retry on certain HTTP errors.
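For reference, such a status-code-aware predicate could take roughly the following shape. This is a hedged sketch: the chosen status codes and the way HttpException exposes its response are assumptions, not necessarily what the PR ended up with.

import com.google.devtools.build.lib.remote.http.HttpException;
import java.io.IOException;
import java.nio.channels.ClosedChannelException;
import java.util.function.Predicate;

// Illustrative sketch: retry transient server-side failures, never 4xx
// client errors. Assumes HttpException exposes the HTTP response status.
static final Predicate<Exception> RETRIABLE_HTTP_ERRORS =
    e -> {
      if (e instanceof ClosedChannelException) {
        return true;
      }
      if (e instanceof HttpException) {
        int status = ((HttpException) e).response().status().code();
        return status == 408 || status == 429 || status >= 500;
      }
      return e instanceof IOException
          && e.getMessage() != null
          && e.getMessage().toLowerCase().contains("connection reset by peer");
    };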

@ulrfa (Contributor) commented Nov 15, 2021

Do you know the reason for the fetch failures / closed connections? Is there high packet loss due to network congestion? Or an overloaded server? Something else?

@gregestren gregestren added the team-ExternalDeps External dependency handling, remote repositories, WORKSPACE file. label Feb 3, 2022
@aherrmann aherrmann force-pushed the upstream-remote-retry branch from 2a7f94d to 62921a2 on February 9, 2022 10:39
@aherrmann (Contributor, Author) commented Feb 9, 2022

Do you know the reason for the fetch failures / closed connections? Is there high packet loss due to network congestion? Or an overloaded server? Something else?

Unfortunately, I don't know what exactly caused these issues. Congestion may have been a factor in some instances. But even with a low degree of parallelism to restrict the number of parallel fetches, we saw sporadic failures. For a bit more context: the remote cache is on GCS with a CDN. Fetches happen from CI nodes in GCP as well as from developer machines working from all over.

@sgowroji (Member) commented:

Hello @aherrmann, the above PR has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you!

@avdv (Contributor) commented Jul 26, 2022

Hi @sgowroji,

is there anything we should do / change with this PR to get it merged?

@sgowroji (Member) commented:

Hello @avdv, it's been stale for a long time. Could you update the PR and send it for code review? Thank you!

@fmeum (Collaborator) commented Jul 26, 2022

@brentleyjones @coeuvre Would one of you be available to review or suggest a reviewer? It would be really unfortunate if this effort stalled.

@sgowroji sgowroji added awaiting-review PR is awaiting review from an assigned reviewer and removed awaiting-user-response Awaiting a response from the author labels Jul 27, 2022
@Wyverald Wyverald added team-Remote-Exec Issues and PRs for the Execution (Remote) team and removed team-ExternalDeps External dependency handling, remote repositories, WORKSPACE file. labels Sep 27, 2022
@aherrmann (Contributor, Author) commented:

@coeuvre We spoke at BazelCon, I mentioned this PR to you, and you asked me to ping you on it.
What would you say is missing and what's the best way to get this merged?

@coeuvre (Member) left a comment:

Thanks for the PR and sorry for the delay.

It seems that this PR tries to do two things:

  1. Retry the download/upload if certain errors happen.
  2. Upon retry of a download, continue the previous download at a certain offset.

IMO, it's risky to combine these two within one PR. Can you split 2) into another PR?

Also, can you please add some tests for 1)?

Comment on lines 242 to 256
(e) -> {
  boolean retry = false;
  if (e instanceof ClosedChannelException) {
    retry = true;
  } else if (e instanceof HttpException) {
    retry = true;
  } else if (e instanceof IOException) {
    String msg = e.getMessage().toLowerCase();
    if (msg.contains("connection reset by peer")) {
      retry = true;
    } else if (msg.contains("operation timed out")) {
      retry = true;
    }
  }
  return retry;
}
Member:

nit: maybe organize this into RETRIABLE_HTTP_ERRORS and put it alongside RETRIABLE_GRPC_ERRORS?

@aherrmann (Contributor, Author):

I can factor it out. Unfortunately, I cannot place it next to the gRPC error predicate, as that causes a dependency cycle:

  • //.../remote/http:http depends on //.../remote:Retrier (due to the retry logic in HttpCacheClient).
  • //.../remote:Retrier depends on //.../remote/http:http (due to the dependency on HttpException).

@coeuvre coeuvre added awaiting-user-response Awaiting a response from the author and removed awaiting-review PR is awaiting review from an assigned reviewer labels Dec 14, 2022
@aherrmann (Contributor, Author) commented:

It seems that this PR tries to do two things:

1. Retry the download/upload if certain errors happen.

2. Upon retry of a download, continue the previous download at a certain offset.

IMO, it's risky to combine these two within one PR. Can you split 2) into another PR?

Correct, it does these two things. Unfortunately, it doesn't seem possible to separate the two without much larger changes to the Bazel code base, as described in the PR description above:

Some care needs to be taken due to the design of Bazel's internal remote cache client API. For a fetch, the client is given an OutputStream object that it is expected to write the fetched data to. This may be a temporary file on disk that will be moved to its final location after the fetch completes. On retry, we need to be careful not to duplicate previously written data when writing into this OutputStream. Due to the generality of the OutputStream interface we cannot reset the file handle or write pointer to start fresh. Instead, this change follows the same pattern used in the gRPC cache client: keep track of the data previously written and continue from that offset on retry.

Please let me know if you see a good way to separate the two concerns.


Also, can you please add some tests for 1)?

Without being able to separate 1) and 2) (as described above), could you give some pointers toward this question from the original PR message:

If anyone has pointers on how best to test this in the Bazel code base, please let me know. One option I briefly explored was to add a test case to src/test/java/com/google/devtools/build/lib/remote/http/HttpCacheClientTest.java. But that seemed to require implementing support for HTTP range requests in src/tools/remote/src/main/java/com/google/devtools/build/remote/worker/http/HttpCacheServerHandler.java. Perhaps there is a better way to test this change.

@coeuvre (Member) commented Dec 14, 2022

Unfortunately, it doesn't seem possible to separate the two without much larger changes to the Bazel code base, as described in the PR description above:

Sorry, I missed that part. Yes, you are right. Then I am OK with merging them at once if we have enough test coverage.

Without being able to separate 1) and 2) (as described above), could you give some pointers toward this question from the original PR message:

I think the best place for the tests is src/test/java/com/google/devtools/build/lib/remote/http/HttpCacheClientTest.java. It's not necessary to implement range requests for HttpCacheServerHandler. Using mocks is fine as long as we are able to cover the edge cases added in the client code.
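As a rough illustration of the kind of test this suggests, here is a hedged sketch of the retry-then-succeed shape as a plain JUnit test; executeWithRetries is a hypothetical stand-in for the client's retrier, not the HttpCacheClient API:

import static org.junit.Assert.assertEquals;

import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;
import org.junit.Test;

// Illustrative sketch: a fake download fails once with a retriable error
// and succeeds on the second attempt.
public class RetryShapeTest {

  static <T> T executeWithRetries(Callable<T> call, int maxAttempts) throws Exception {
    Exception last = null;
    for (int i = 0; i < maxAttempts; i++) {
      try {
        return call.call();
      } catch (IOException e) {
        last = e; // a real retrier would consult RETRIABLE_HTTP_ERRORS here
      }
    }
    throw last;
  }

  @Test
  public void retriesOnceThenSucceeds() throws Exception {
    AtomicInteger attempts = new AtomicInteger();
    String result =
        executeWithRetries(
            () -> {
              if (attempts.incrementAndGet() == 1) {
                throw new IOException("connection reset by peer");
              }
              return "cache-hit";
            },
            /* maxAttempts= */ 3);
    assertEquals("cache-hit", result);
    assertEquals(2, attempts.get());
  }
}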

@ulrfa (Contributor) commented Dec 15, 2022

Instead, this change follows the same pattern used in the gRPC cache client: keep track of the data previously written and continue from that offset on retry.

The gRPC cache client seems to do that for CAS downloads, but not when retrieving the AC, right? And this PR seems to do it also for the AC, right?

I think it is safe to do for the CAS, where the key represents specific content, but a retried download of an AC key could result in a different ActionResult, and concatenating chunks from different ActionResults is not reliable. Do you have ideas about how to address that?

Comment on lines 195 to 198
httpRequest.headers().set(HttpHeaderNames.ACCEPT_ENCODING, HttpHeaderValues.GZIP);
if (offset != 0) {
  httpRequest.headers().set(
      HttpHeaderNames.RANGE, String.format("%s=%d-", HttpHeaderValues.BYTES, offset));
}
Contributor:

How does the RANGE implementation in this PR work in combination with the also-supplied ACCEPT-ENCODING: gzip header?

It seems this PR expects ranges to be applied to the uncompressed data, right? But RFC 9110 states: “If the representation data has a content coding applied, each byte range is calculated with respect to the encoded sequence of bytes, not the sequence of underlying bytes that would be obtained after decoding.”

For the gRPC cache protocol, there were lots of discussions and considerations in bazelbuild/remote-apis#168 about combining compression with read offsets, e.g. regarding whether offsets should be applied to the compressed or uncompressed stream, what happens when offsets don't align with stored compressed chunk sizes on the server side, etc.

How often do you expect retries of already partially downloaded files? Would it be sufficient to always re-request the whole file and simply skip already received CAS chunks after receiving them again? Would that avoid much of the complexity in both code and test cases?

Bazel-remote supports zstd compression for both the gRPC and HTTP protocols; Bazel still only supports it for gRPC. I'm afraid the RANGE implementation in this PR would make it harder for future work on adding support for ACCEPT-ENCODING: zstd to the HTTP cache client.

@aherrmann (Contributor, Author):

Correct, the offset is calculated on uncompressed bytes.

Bazel still only supports it for gRPC. I'm afraid the RANGE implementation in this PR would make it harder for future work on adding support for ACCEPT-ENCODING: zstd to the HTTP cache client.

I see, fair enough. Perhaps a solution that takes the encoding into account could introduce a new step into the download pipeline, before the inflater, to determine the correct download offset in light of the encoding.

How often do you expect retries of already partially downloaded files? Would it be sufficient to always re-request the whole file and simply skip already received CAS chunks after receiving them again? Would that avoid much of the complexity in both code and test cases?

The failures were sporadic, usually a handful in one CI pipeline. So this may well be good enough. It would also make the tests simpler, as the mock server wouldn't have to worry about range requests. I'll try this route.
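A hedged sketch of that simpler route: wrap the destination stream so that a full re-download silently drops the already-received prefix. The class name and structure are illustrative, not the PR's code:

import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Illustrative sketch: discard the first `toSkip` bytes so a full
// re-download can safely resume into a partially written stream.
final class SkipPrefixOutputStream extends FilterOutputStream {
  private long toSkip;

  SkipPrefixOutputStream(OutputStream out, long toSkip) {
    super(out);
    this.toSkip = toSkip;
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    if (toSkip >= len) {
      toSkip -= len; // entire chunk is duplicate prefix; drop it
      return;
    }
    int start = (int) toSkip;
    toSkip = 0;
    out.write(b, off + start, len - start);
  }

  @Override
  public void write(int b) throws IOException {
    if (toSkip > 0) {
      toSkip--;
      return;
    }
    out.write(b);
  }
}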

@aherrmann (Contributor, Author) commented:

I think it is safe to do for the CAS, where the key represents specific content, but a retried download of an AC key could result in a different ActionResult, and concatenating chunks from different ActionResults is not reliable. Do you have ideas about how to address that?

@ulrfa I think that's possible. The blob download takes an external output stream, so that's where we don't have control and need to continue from the offset. But the AC download controls its own output stream, so if we perform the retry at a higher level, we'd be able to do a fresh download.
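A hedged sketch of that higher-level retry for the AC: each attempt writes into a fresh buffer, so chunks from different ActionResults can never be concatenated. Fetcher is a hypothetical stand-in for a single download attempt, not Bazel's API:

import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Illustrative sketch: AC fetches restart from scratch on retry, while CAS
// fetches may continue at an offset because the digest pins the content.
final class ActionResultRetry {
  interface Fetcher {
    void fetch(ByteArrayOutputStream out) throws IOException;
  }

  static byte[] download(Fetcher fetcher, int maxAttempts) throws IOException {
    IOException last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      ByteArrayOutputStream fresh = new ByteArrayOutputStream(); // new buffer per attempt
      try {
        fetcher.fetch(fresh);
        return fresh.toByteArray(); // success: use exactly this attempt's bytes
      } catch (IOException e) {
        last = e; // drop partial data; the next attempt starts clean
      }
    }
    throw last;
  }
}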

@aherrmann (Contributor, Author) commented:

@coeuvre @ulrfa @avdv Thank you for the review!

I've changed the retry logic such that retries of ActionResult fetches start a new download from scratch and write into a fresh buffer, whereas CAS downloads skip the previously downloaded prefix and write to the existing output stream from the offset.

I've removed RANGE requests. As pointed out above, the implementation would not have handled compressed streams correctly. RANGE requests could still be introduced later on, but they can be seen as an optimization over this PR: they avoid sending redundant bytes over the wire, and fewer bytes to send should imply a lower likelihood of repeated intermittent failures.

I've factored out the logic deciding whether to retry into RETRIABLE_HTTP_ERRORS and changed it to only retry on certain HTTP error codes instead of all HttpExceptions.

I've added tests for the retry mechanism to HttpDownloadHandler and HttpCacheClient.

I've also tested the changes against a test setup and observed retries occurring as expected.

@aherrmann aherrmann requested review from ulrfa and coeuvre and removed request for ulrfa January 13, 2023 13:26
@coeuvre (Member) left a comment:

Thanks for the update! I agree that we should leave RANGE for another PR.

I will import after @ulrfa has reviewed.

@ulrfa (Contributor) commented Jan 16, 2023

Thanks for the update @aherrmann! I think the logic is fine, and the test cases are good! Now I only have the question about the offset field, but that is only a minor implementation detail.

aherrmann added a commit to aherrmann/bazel that referenced this pull request Jan 24, 2023
@aherrmann aherrmann force-pushed the upstream-remote-retry branch from 475c5a0 to 2690f23 on January 24, 2023 14:37
@aherrmann aherrmann requested a review from ulrfa January 24, 2023 14:38
@ulrfa (Contributor) left a comment:

Thanks @aherrmann!

I did not notice the long-to-int cast in the previous commit, but noticed it now when it moved to another place. 😃

@aherrmann aherrmann requested a review from ulrfa January 25, 2023 12:57
@ulrfa (Contributor) left a comment:

Good work @aherrmann!

@coeuvre coeuvre added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merged internally and removed awaiting-user-response Awaiting a response from the author labels Jan 26, 2023
@aherrmann (Contributor, Author) commented:

@coeuvre I see this is labeled as awaiting-PR-merge but not yet merged. Is it stuck on anything internally? Anything I can help with to get this merged?

@sgowroji (Member) commented:

Hi @aherrmann, the above PR merge is in progress; @coeuvre is merging it.

@sgowroji sgowroji removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merged internally label Feb 10, 2023
@aherrmann aherrmann deleted the upstream-remote-retry branch February 10, 2023 12:33
hvadehra pushed a commit that referenced this pull request Feb 14, 2023
coeuvre pushed a commit to coeuvre/bazel that referenced this pull request Aug 28, 2023
@iancha1992 (Member) commented:

The changes in this PR have been included in Bazel 6.4.0 RC1. Please test out the release candidate and report any issues as soon as possible. If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=last_rc.
Thanks!
