Retry on HTTP remote cache fetch failure #14258
Conversation
```java
if (e instanceof ClosedChannelException) {
  retry = true;
} else if (e instanceof HttpException) {
  retry = true;
} else if (e instanceof IOException) {
  String msg = e.getMessage().toLowerCase();
  if (msg.contains("connection reset by peer")) {
    retry = true;
  }
}
```
This list of exceptions is based on what we observed on CI, see digital-asset/daml#11238 and digital-asset/daml#11445. It may be that further rules should be added.
As far as I understand, ClosedChannelException can also occur in case of a build failure, when Bazel aborts the build. Is there a way to avoid retrying in this case and only retry on legitimate network errors?
What kind of HttpException do you expect to catch and retry?
E.g. if the server is overloaded and answers with 503 Service Unavailable, then it could make sense to do the opposite: back off and not retry, at least until the time specified by the Retry-After header.
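For illustration only, here is a minimal sketch (not part of this PR) of how a client could honor the delay-seconds form of Retry-After before scheduling the next attempt; the RetryAfterParser class name and its API are hypothetical.

```java
import java.time.Duration;
import java.util.Optional;

/** Hypothetical helper; not part of Bazel. Parses the delay-seconds form of Retry-After. */
final class RetryAfterParser {
  static Optional<Duration> parse(String headerValue) {
    if (headerValue == null) {
      return Optional.empty();
    }
    try {
      // "Retry-After: 120" means wait at least 120 seconds before retrying.
      return Optional.of(Duration.ofSeconds(Long.parseLong(headerValue.trim())));
    } catch (NumberFormatException e) {
      // The HTTP-date form of Retry-After is not handled in this sketch.
      return Optional.empty();
    }
  }
}
```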
Sorry for the late reply.
An example that we saw in our logs was:
RETRYING: com.google.devtools.build.lib.remote.http.HttpException: 502 Bad Gateway

> E.g. if the server is overloaded and answers with 503 Service Unavailable, then it could make sense to do the opposite: back off and not retry, at least until the time specified by the Retry-After header.

AFAIK, RemoteRetrier currently uses an exponential backoff, so it does already back off. It's true that it wouldn't respect Retry-After. I'm not too familiar with the internals of RemoteRetrier. Does it support that kind of use case?
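For context, a minimal sketch of what an exponential backoff schedule looks like; the constants and the ExponentialBackoff class name are illustrative and do not reflect Bazel's actual RemoteRetrier internals.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative backoff schedule; not Bazel's RemoteRetrier. */
final class ExponentialBackoff {
  private static final long INITIAL_DELAY_MILLIS = 100;
  private static final long MAX_DELAY_MILLIS = 5000;
  private static final double MULTIPLIER = 2.0;
  private int attempt = 0;

  /** Delay before the next retry attempt, in milliseconds, with a little random jitter. */
  long nextDelayMillis() {
    double exp = INITIAL_DELAY_MILLIS * Math.pow(MULTIPLIER, attempt++);
    long capped = (long) Math.min(exp, (double) MAX_DELAY_MILLIS);
    // Up to 10% jitter avoids many clients retrying in lockstep.
    long jitter = ThreadLocalRandom.current().nextLong(capped / 10 + 1);
    return capped + jitter;
  }
}
```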
Just yesterday I encountered a Bazel HttpException on a local build with this patch enabled:
RETRYING: com.google.devtools.build.lib.remote.http.HttpException: 503 Service Unavailable
Service Unavailable
The retry succeeded and the build continued from there on.
Maybe the retries should be limited to a specific set of status codes in the case of an HttpException?
Does it even make sense to retry on any client error, since resending the request will probably not change the response (e.g. for 400, 401, 403 or 405)?
Otherwise, retrying on 503 Service Unavailable with the back-off mechanism of the Retrier seems reasonable to me. What do you think @ulrfa?
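A minimal sketch of the kind of status-code filter being suggested here, retrying only on codes where a second attempt can plausibly succeed; the code set and the RetriableHttpStatus name are illustrative, not necessarily what the PR ends up using.

```java
/** Illustrative status-code filter; the exact set used by the PR may differ. */
final class RetriableHttpStatus {
  static boolean isRetriable(int statusCode) {
    switch (statusCode) {
      case 408: // Request Timeout
      case 429: // Too Many Requests
      case 500: // Internal Server Error
      case 502: // Bad Gateway
      case 503: // Service Unavailable
      case 504: // Gateway Timeout
        return true;
      default:
        // Client errors such as 400, 401, 403 or 405 are unlikely to change on retry.
        return false;
    }
  }
}
```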
Good point, I've added logic to only retry on certain HTTP errors.

Do you know the reason for the fetch failures / closed connections? Is there high packet loss due to network congestion? An overloaded server? Something else?
Unfortunately, I don't know what exactly caused these issues. Congestion may have been an issue in some instances. But even with a low degree of parallelism to restrict the number of parallel fetches, we saw sporadic failures. For a bit more context: the remote cache is on GCS with CDN. Fetches happen from CI nodes in GCP as well as from developer machines working from all over.
Hello @aherrmann, the above PR has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you!
Hi @sgowroji, is there anything we should do or change with this PR to get it merged?
Hello @avdv, it's been stale for a long time. Can you contribute to the same request and send it for code review? Thank you!
@brentleyjones @coeuvre Would one of you be available to review or suggest a reviewer? It would be really unfortunate if this effort stalled.
@coeuvre We spoke at BazelCon; I mentioned this PR to you, and you asked me to ping you on it.
Thanks for the PR and sorry for the delay.
It seems that this PR tries to do two things:
1. Retry the download/upload when certain errors occur.
2. Upon retrying a download, continue the previous download from a certain offset.
IMO, it's risky to combine these two within one PR. Can you split 2) into another PR?
Also, can you please add some tests for 1)?
```java
(e) -> {
  boolean retry = false;
  if (e instanceof ClosedChannelException) {
    retry = true;
  } else if (e instanceof HttpException) {
    retry = true;
  } else if (e instanceof IOException) {
    String msg = e.getMessage().toLowerCase();
    if (msg.contains("connection reset by peer")) {
      retry = true;
    } else if (msg.contains("operation timed out")) {
      retry = true;
    }
  }
  return retry;
}
```
nit: maybe organize this into RETRIABLE_HTTP_ERRORS and put it alongside RETRIABLE_GRPC_ERRORS?
I can factor it out. Unfortunately, I cannot place it next to the gRPC error predicate, as that causes a dependency cycle:
- //.../remote/http:http depends on //.../remote:Retrier (due to the retry logic in HttpCacheClient).
- //.../remote:Retrier depends on //.../remote/http:http (due to the dependency on HttpException).
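To illustrate the factoring being discussed, a rough sketch of what such a predicate could look like if it lives in the http package; the HttpRetryPolicy class name is made up, and the status-code check on Bazel's HttpException is only indicated in a comment since that class is Bazel-internal.

```java
import java.io.IOException;
import java.nio.channels.ClosedChannelException;
import java.util.function.Predicate;

/** Hypothetical holder for the retry predicate; not the PR's actual class. */
final class HttpRetryPolicy {
  static final Predicate<Exception> RETRIABLE_HTTP_ERRORS =
      e -> {
        if (e instanceof ClosedChannelException) {
          return true;
        }
        // A check on HttpException status codes (e.g. 502/503) would go here;
        // omitted because HttpException is a Bazel-internal Netty-based class.
        if (e instanceof IOException) {
          String msg = e.getMessage() == null ? "" : e.getMessage().toLowerCase();
          return msg.contains("connection reset by peer")
              || msg.contains("operation timed out");
        }
        return false;
      };

  private HttpRetryPolicy() {}
}
```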
Correct, it does these two things. Unfortunately, it doesn't seem to be possible to separate the two without much larger changes to the Bazel code base, as described in the PR description above.
Please let me know if you see a good way to separate the two concerns.

Without being able to separate 1) and 2) (as described above), could you give some pointers toward this question from the original PR message:
Sorry, I missed that part. Yes, you are right. Then I am OK with merging them at once if we have enough test coverage.

I think the best place for the tests is in
The GRPC cache client seems to do that for CAS downloads, but not when retrieving the AC, right? And this PR seems to do it also for the AC, right? I think it is safe to do for the CAS, where the key represents specific content, but a retried download of an AC key could result in a different ActionResult, and concatenating chunks from different ActionResults is not reliable. Do you have ideas about how to address that?
```java
httpRequest.headers().set(HttpHeaderNames.ACCEPT_ENCODING, HttpHeaderValues.GZIP);
if (offset != 0) {
  httpRequest.headers().set(HttpHeaderNames.RANGE, String.format("%s=%d-", HttpHeaderValues.BYTES, offset));
}
```
How is the RANGE implementation in this PR working in combination with the also-supplied ACCEPT-ENCODING: gzip header?
It seems this PR expects ranges to be applied to the uncompressed data, right? But RFC 9110 states: “If the representation data has a content coding applied, each byte range is calculated with respect to the encoded sequence of bytes, not the sequence of underlying bytes that would be obtained after decoding.”
For the gRPC cache protocol, there were lots of discussions and considerations in bazelbuild/remote-apis#168 about combining compression with read offsets, e.g. regarding whether offsets should be applied to the compressed or non-compressed stream, regarding what to do when offsets don’t align with stored compressed chunk sizes on the server side, etc.
How often do you expect retries of already partially downloaded files? Would it be sufficient to always re-request the whole file and simply skip already received CAS chunks after receiving them again? Would that avoid much of the complexity in both code and test cases?
Bazel-remote supports zstd compression for both the gRPC and HTTP protocols. Bazel still only supports it for gRPC. I’m afraid the RANGE implementation in this PR would make it harder for future work adding support for ACCEPT-ENCODING: zstd to the HTTP cache client.
Correct, the offset is calculated on uncompressed bytes.

> Bazel still only supports it for gRPC. I’m afraid the RANGE implementation in this PR would make it harder for future work adding support for ACCEPT-ENCODING: zstd to the HTTP cache client.

I see, fair enough. Perhaps a solution that takes encoding into account could introduce a new step into the download pipeline, before the inflater, to determine the correct download offset in light of the encoding.

> How often do you expect retries of already partially downloaded files? Would it be sufficient to always re-request the whole file and simply skip already received CAS chunks after receiving them again? Would that avoid much of the complexity in both code and test cases?

The failures were sporadic, usually a handful in one CI pipeline. So this may well be good enough. That would help make the tests simpler, as the mock server wouldn't have to worry about range requests. I'll try this route.

@ulrfa I think that's possible. The blob download takes an external output stream, so that's where we don't have control and need to continue from the offset. But the AC download controls the output stream, so if we perform the retry at a higher level, then we'd be able to do a fresh download.
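A sketch of the "always re-request the whole blob and skip what was already received" idea discussed above: an OutputStream wrapper that discards the first bytesToSkip bytes and forwards the rest to the caller-supplied stream. The SkippingOutputStream name is hypothetical and this is not necessarily the code the PR ends up with.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical wrapper: drops the first bytesToSkip bytes, forwards the rest. */
final class SkippingOutputStream extends FilterOutputStream {
  private long remainingToSkip;

  SkippingOutputStream(OutputStream out, long bytesToSkip) {
    super(out);
    this.remainingToSkip = bytesToSkip;
  }

  @Override
  public void write(int b) throws IOException {
    if (remainingToSkip > 0) {
      remainingToSkip--;
      return;
    }
    out.write(b);
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    if (remainingToSkip >= len) {
      // The entire chunk was already received in a previous attempt.
      remainingToSkip -= len;
      return;
    }
    int skipNow = (int) remainingToSkip;
    remainingToSkip = 0;
    out.write(b, off + skipNow, len - skipNow);
  }
}
```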
Bazel's previous behavior was to rebuild an artifact locally if fetching it from an HTTP remote cache failed. This behavior is different from the GRPC remote cache case, where Bazel will retry the fetch. The lack of retry is an issue for multiple reasons: on the one hand, rebuilding locally can be slower than fetching from the remote cache; on the other hand, if a build action is not bit-reproducible, as is the case with some compilers, then the local rebuild will trigger cache misses on further build actions that depend on the current artifact. This change aims to avoid these issues by retrying the fetch in the HTTP cache case, similarly to how the GRPC cache client does it.

Some care needs to be taken due to the design of Bazel's internal remote cache client API. For a fetch, the client is given an `OutputStream` object that it is expected to write the fetched data to. This may be a temporary file on disk that will be moved to the final location after the fetch completes. On retry, we need to be careful not to duplicate previously written data when writing into this `OutputStream`. Due to the generality of the `OutputStream` interface we cannot reset the file handle or write pointer to start fresh. Instead, this change follows the same pattern used in the GRPC cache client: namely, keep track of the data previously written and continue from that offset on retry.

With this change the HTTP cache client will attempt to fetch the data from the remote cache via an HTTP range request, so that the server only needs to send the data that is still missing. If the server replies with a 206 Partial Content response, then we write the received data directly into the output stream; if the server does not support range requests and instead replies with the full data, then we drop the duplicate prefix and only write into the output stream from the required offset.
@coeuvre @ulrfa @avdv Thank you for the review!
- I've changed the retry logic such that retries of
- I've removed RANGE requests. As pointed out above, the implementation would not have handled compressed streams correctly. RANGE requests could still be introduced later on, but they could be seen as an optimization over this PR to avoid sending redundant bytes over the wire, and as an improvement in that fewer bytes to send should imply a lower likelihood of repeated intermittent failure.
- I've factored out the logic deciding whether to retry into
- I've added tests for the retry mechanism on
- I've also tested the changes against a test setup and observed retries occurring as expected.
Thanks for the update! I agree that we should leave RANGE for another PR.
I will import after @ulrfa has reviewed.
Thanks for the update @aherrmann! I think the logic is fine, and the test cases are good! Now I only have the question about the
Addressing review comment bazelbuild#14258 (comment)
Addressing review comment bazelbuild#14258 (comment)
Thanks @aherrmann!
I did not notice the long-to-int cast in the previous commit, but noticed it now when it moved to another place. 😃
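As a general side note on long-to-int casts (illustrative only, not the specific fix made in this PR): Math.toIntExact gives a checked conversion that throws instead of silently truncating.

```java
public class CastExample {
  public static void main(String[] args) {
    long offset = 4096L;
    // Throws ArithmeticException if the value does not fit into an int,
    // instead of silently wrapping like a plain (int) cast would.
    int intOffset = Math.toIntExact(offset);
    System.out.println(intOffset);
  }
}
```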
Good work @aherrmann!
@coeuvre I see this is labeled as
Hi @aherrmann, the merge of the above PR is in progress. @coeuvre is merging the above PR.
Bazel's previous behavior was to rebuild an artifact locally if fetching it from an HTTP remote cache failed. This behavior is different from the GRPC remote cache case, where Bazel will retry the fetch. The lack of retry is an issue for multiple reasons: on the one hand, rebuilding locally can be slower than fetching from the remote cache; on the other hand, if a build action is not bit-reproducible, as is the case with some compilers, then the local rebuild will trigger cache misses on further build actions that depend on the current artifact. This change aims to avoid these issues by retrying the fetch in the HTTP cache case, similarly to how the GRPC cache client does it.

Some care needs to be taken due to the design of Bazel's internal remote cache client API. For a fetch, the client is given an `OutputStream` object that it is expected to write the fetched data to. This may be a temporary file on disk that will be moved to the final location after the fetch completes. On retry, we need to be careful not to duplicate previously written data when writing into this `OutputStream`. Due to the generality of the `OutputStream` interface we cannot reset the file handle or write pointer to start fresh. Instead, this change follows the same pattern used in the GRPC cache client: namely, keep track of the data previously written and continue from that offset on retry.

With this change the HTTP cache client will attempt to fetch the data from the remote cache via an HTTP range request, so that the server only needs to send the data that is still missing. If the server replies with a 206 Partial Content response, then we write the received data directly into the output stream; if the server does not support range requests and instead replies with the full data, then we drop the duplicate prefix and only write into the output stream from the required offset.

This patch has been running successfully in production [here](digital-asset/daml#11238). cc @cocreature

Closes #14258.

PiperOrigin-RevId: 508604846
Change-Id: I10a5d2a658e9c32a9d9fcd6bd29f6a0b95e84566
The changes in this PR have been included in Bazel 6.4.0 RC1. Please test out the release candidate and report any issues as soon as possible. If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=last_rc.
Bazel's previous behavior was to rebuild an artifact locally if fetching
it from an HTTP remote cache failed. This behavior is different from the
GRPC remote cache case, where Bazel will retry the fetch.
The lack of retry is an issue for multiple reasons: on the one hand,
rebuilding locally can be slower than fetching from the remote cache; on
the other hand, if a build action is not bit-reproducible, as is the case
with some compilers, then the local rebuild will trigger cache misses on
further build actions that depend on the current artifact.
This change aims to avoid these issues by retrying the fetch in the
HTTP cache case, similarly to how the GRPC cache client does it.
Some care needs to be taken due to the design of Bazel's internal remote
cache client API. For a fetch the client is given an
OutputStream
object that it is expected to write the fetched data to. This may be a
temporary file on disk that will be moved to the final location after
the fetch completed. On retry, we need to be careful to not duplicate
previously written data when writing into this
OutputStream
. Due tothe generality of the
OutputStream
interface we cannot reset the filehandle or write pointer to start fresh. Instead, this change follows the
same pattern used in the GRPC cache client. Namely, keep track of the
data previously written and continue from that offset on retry.
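To make the offset-tracking idea concrete, here is a minimal sketch, not the PR's actual code, of a wrapper that counts the bytes already written to the caller-supplied OutputStream so that a retry knows at which offset to continue; the CountingOutputStream name is illustrative.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Illustrative wrapper that records how many bytes have been written so far. */
final class CountingOutputStream extends FilterOutputStream {
  private long bytesWritten = 0;

  CountingOutputStream(OutputStream out) {
    super(out);
  }

  @Override
  public void write(int b) throws IOException {
    out.write(b);
    bytesWritten++;
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    out.write(b, off, len);
    bytesWritten += len;
  }

  /** Offset from which a retried download should continue. */
  long getBytesWritten() {
    return bytesWritten;
  }
}
```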
With this change the HTTP cache client will attempt to fetch the data
from the remote cache via an HTTP range request, so that the server only
needs to send the data that is still missing. If the server replies with
a 206 Partial Content response, then we write the received data directly
into the output stream; if the server does not support range requests
and instead replies with the full data, then we drop the duplicate
prefix and only write into the output stream from the required offset.
This patch has been running successfully in production here.
cc @cocreature
TODO
src/test/java/com/google/devtools/build/lib/remote/http/HttpCacheClientTest.java. But that seemed to require implementing support for HTTP range requests in src/tools/remote/src/main/java/com/google/devtools/build/remote/worker/http/HttpCacheServerHandler.java. Perhaps there is a better way to test this change.