Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Bazel cache download retry #11238

Merged
merged 4 commits into from
Oct 22, 2021
Merged

Fix Bazel cache download retry #11238

merged 4 commits into from
Oct 22, 2021

Conversation

aherrmann-da
Copy link
Contributor

@aherrmann-da aherrmann-da commented Oct 13, 2021

Bazel's HTTP remote cache client does not support retry on download (or upload) failure. Instead, it will rebuild the artifact locally. This is potentially slower than fetching from cache, and worse if the build action is not bit reproducible (like GHC compilation tends to be) then the local rebuild will trigger cache misses due to changed inputs down the line.

We already have a patch in place to enable retry in the HTTP cache client. Unfortunately, it has a bug: Bazel continues appending to the same file (OutputStream in general) without resetting the write offset to the beginning of the file. Meaning, after retry the file contains a duplicate prefix and Bazel's digest check fails and Bazel falls back to a local rebuild after all.

This PR fixes that issue. Bazel's internal API is structured in a way that doesn't allow to reset the write offset. Instead, this PR takes the same approach as Bazel's already existing GRPC cache client and only downloads the missing data with a range request, or discards the duplicate prefix if a range request is not supported by the HTTP server.

I have tested this patch locally against a bazel-remote HTTP cache (which doesn't support range requests) and an nginx HTTP cache (which does support range requests). I have also tested this patch on daml CI here. Retries are logged with the prefix RETRYING. E.g.

RETRYING: com.google.devtools.build.lib.remote.http.HttpException: 502 Bad Gateway

Errors that don't trigger a retry are logged with the prefix NOT RETRYING (apart from cache misses). No such cases are expected and also not found in the CI logs. However, if they ever appear we can consider extending the retry logic to cover these errors as well.

This PR changes a patch, which can be awkward to review. The full diff against Bazel can be seen here.

changelog_begin
changelog_end

Pull Request Checklist

  • Read and understand the contribution guidelines
  • Include appropriate tests
  • Set a descriptive title and thorough description
  • Add a reference to the issue this PR will solve, if appropriate
  • Include changelog additions in one or more commit message bodies between the CHANGELOG_BEGIN and CHANGELOG_END tags
  • Normal production system change, include purpose of change in description
  • If you mean to change the status of a component, please make sure you keep the Component Status page up to date.

NOTE: CI is not automatically run on non-members pull-requests for security
reasons. The reviewer will have to comment with /AzurePipelines run to
trigger the build.

This reverts commit b62f16c.
@aherrmann-da aherrmann-da requested review from cocreature and garyverhaegen-da and removed request for cocreature and garyverhaegen-da October 21, 2021 14:30
@aherrmann-da aherrmann-da marked this pull request as ready for review October 21, 2021 14:30
Copy link
Contributor

@cocreature cocreature left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thank you so much for all your work on this!

nix/bazel-retry-cache.patch Outdated Show resolved Hide resolved
+ return;
+ }
+ }
+ downloadSucceeded = full_content || partial_content;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just use an else here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downloadSucceeded variable is used further down in Bazel's code. So, we'll still need the binding.

I was considering to move the out = ... part into the else. But, I decided against it because it makes it clearer that this is specifically about the failure case. However, I don't feel strongly about this. So, happy to be convinced otherwise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I see, makes sense to keep it the way it is.

@aherrmann-da aherrmann-da enabled auto-merge (squash) October 21, 2021 16:25
@aherrmann-da aherrmann-da merged commit c95db72 into main Oct 22, 2021
@aherrmann-da aherrmann-da deleted the bazel-retry-fix branch October 22, 2021 15:34
azure-pipelines bot pushed a commit that referenced this pull request Oct 27, 2021
This PR has been created by a script, which is not very smart
and does not have all the context. Please do double-check that
the version prefix is correct before merging.

@akshayshirahatti-da is in charge of this release.

Commit log:
```
e474b2d [JSON-API] Websockets fix for matchedQueries (#11361)
4a34b68 KV: port V2 errors to self-service errors framework [KVL-1143] (#11326)
811a6d3 Fixed AuthorizationInterceptorSpec again (#11418)
c8006b8 ScenarioRunner: enrich incomplete transactions (#11384)
d9c7031 ACS testing - payload support [DPP-661] (#11308)
d87d3d4 deal with deadlocks while fetching contracts in json-api Oracle (#11391)
8212c0b Make submission ID optional [KVL-1107] (#11011)
3587eb8 Use Timestamp instead of Instant (#11356)
ea5f09e sandbox: Deprecate the `--eager-package-loading` flag. (#11404)
9f882f2 remove search index on json fields that harm insert and pruning performance (#11041)
70b90f4 optimize max event_sequential_id query for oracle  (#11297)
b1fed31 Fix missing script results (#11395)
03db0aa Auto run/check security evidence generation in ./fmt.sh (#11407)
c928f0e [Short] Typo (#11400)
ba6c2be Add missing TransactionId to com.daml.error.ErrorResource (#11396)
a2a1571 Generate security evidence by documenting security testcases (#11306)
8d17882 Allocate parties sequentially in script export tests (#11389)
1309c2f DPP-587 Use Timestamp instead of Instant (#11183)
82f9873 Rotate release rotation (#11394)
b14077a Fix AuthorizationInterceptorSpec flake (#11387)
7090f2d update NOTICES file (#11367)
ad42dfa Update gRPC to the latest (1.41.0) and Protobuf (#11380)
54c400a Update wording in Deploying to a generic Daml ledger (#11327)
4461ed1 Fix log output (#11374)
613aac3 Add support for non-star-kinded type synonyms in data-dependencies (#11293)
f89ecc6 interfaces: add an experimental `toTypeRep` builtin. (#11378)
5654d5c fix es ingest for missing files (#11375)
03cfd12 Configurable assertions in Ledger API test tool by feature descriptors (#11328)
96b7b58 [DPP-648][Self-service error codes] Adopt ApiPartyManagementService (#11338)
9e94ae0 LF: move repl exception-auth test from dev to stable (#11369)
5365d68 LF: Remove PartialTransaction out from ScenarioRunner/IdeLedgerClient (#11368)
79037c8 [DPP-646][Self-service error codes] Adopt ApiPackageManagementService (#11314)
0ee59f5 Command submission in the ledger-api-bench-tool. (#11296)
8d5cab5 LF: Simplify seeds generation in scenario runnner (#11353)
9e5b788 Speedup daml repl integration tests (#11335)
3bc0db3 fix contract_tpid_fkey-related race condition (#11330)
ab8a863 [docs] Add Daml Driver for VMBC to the commercial integrations section (#11360)
c95db72 Fix Bazel cache download retry (#11238)
e8d0ccb [DPP-611][Self-service error codes] Adapt ApiCommandService (#11325)
a89079b [DPP-647][Self-service error codes] Adopt ApiParticipantPruningService (#11324)
cc8ec28 [Self-service error codes] Adapt GrpcHealthService (#11354)
c60c94b [DPP-645][Self-service error codes] Adapt ApiConfigManagementService (#11312)
e6da1f7 Add step in ghc-lib guide for getting submodules to work (#11351)
f3057ea Increase timeout for non-repudation tests on Postgres (#11340)
176f470 interface: adding interfaces to the TS codegen (#11280)
355352f DPP-650 Remove the mutating schema (#11211)
443b64d [DPP-621][Self-service error codes] Adopt error codes in ApiVersionService (#11302)
ed9dbed interfaces: Add fixed choice collision check in typechecker (Haskell) (#11337)
c37ecd1 [Short] Pass correct loggingContext to withValidatedPackageId (#11307)
0d305cf [Short] Move field before logging statement (ApiTimeService) (#11313)
73c94b5 Increase timeout for non-repudation test (#11281)
88c607b [Self-service error codes] Adapt ApiTransactionService [DPP-613] (#11094)
07ad3e0 Suport multi-party readAs in triggers (#11299)
76eb165 Interface fixed choices: ghc parser (#11275)
da27a1e [DPP-619][Self-service error codes] Adopt error codes in ApiVersionService (#11303)
5f5af30 [DPP-628][Self-service error codes] Adapt error codes in ApiTimeService (#11295)
f9e67ad [Self-service error codes] Adapt error responses in ledger-api-auth [DPP-617] (#11223)
7282965 Fix component status for triggers (#11311)
17776f3 Factor out npm install step of create-daml-app tests (#11294)
3a8b685 [Short] Fix docs for Dispatcher#startingAt (#11304)
f315a90 release 1.18.0-snapshot.20211019.8113.0.8ff347d8 (#11300)
8a3abce [DPP-618][Self-service error codes] Adapt error codes in ApiPackageService (#11284)
50ea92f Use ApiTypes.Party instead of String in the trigger runner (#11298)
2267429 [DPP-656] Assert on self-service error code details in ErrorFactoriesSpec (#11289)
c06faf2 LF: remove imperative environment from Speedy compiler (#11285)
d3dad75 [DPP-592] Generate docs for self-service error codes. (#11129)
```
Changelog:
```
[JSON-API] fixes a bug related to the matchedQueries value returned for websocket multiqueries,
this only happens for patterns where the multiqueries contain a mixture of queries with and without
offsets.
- [Integration Kit] - ledger-api-bench-tool can generate test contracts with configurable payload size.
- [Integration Kit] - Added multi-template support for command submission in the ledger-api-bench-tool
- [Sandbox] The ``--eager-package-loading`` flag has been deprecated. It
  hasn't actually done anything for many releases; packages are always
  loaded eagerly. This does not affect Daml on SQL, which does support
  lazy package loading.

- [Daml Studio] Fix a bug where script results in Daml Studio
  sometimes do not show up.

- [Integration Kit] - The ledger-api-bench-tool is now capable of generating test contracts for testing purposes.
- [JSON API] Fixed a rare error that manifested as
  ‘violates foreign key constraint "contract_tpid_fkey"
   Detail: Key (tpid)=(...) is not present in table’
  when attempting to run queries and goes away on JSON API restart.
  See `issue #11330 <https://github.com/digital-asset/daml/pull/11330>`__.
- [Participant] All participants now use the new append-only schema. Existing databases will
  automatically upgrade to the new schema the first time a participant/ledger is started.

- [Daml Triggers] Triggers now support readAs parties. They can be
  specified via `--ledger-readas a,b,c`. As part of this change
  ``testRule`` gained an extra argument to specify the `readAs`
  parties. If you previously used

  ```
  testRule trigger party acsBuilder commandsInFlight s
  ```

  you now need to use

  ```
  testRule trigger party [] acsBuilder commandsInFlight s
  ```

```

CHANGELOG_BEGIN
CHANGELOG_END
akshayshirahatti-da pushed a commit that referenced this pull request Oct 27, 2021
This PR has been created by a script, which is not very smart
and does not have all the context. Please do double-check that
the version prefix is correct before merging.

@akshayshirahatti-da is in charge of this release.

Commit log:
```
e474b2d [JSON-API] Websockets fix for matchedQueries (#11361)
4a34b68 KV: port V2 errors to self-service errors framework [KVL-1143] (#11326)
811a6d3 Fixed AuthorizationInterceptorSpec again (#11418)
c8006b8 ScenarioRunner: enrich incomplete transactions (#11384)
d9c7031 ACS testing - payload support [DPP-661] (#11308)
d87d3d4 deal with deadlocks while fetching contracts in json-api Oracle (#11391)
8212c0b Make submission ID optional [KVL-1107] (#11011)
3587eb8 Use Timestamp instead of Instant (#11356)
ea5f09e sandbox: Deprecate the `--eager-package-loading` flag. (#11404)
9f882f2 remove search index on json fields that harm insert and pruning performance (#11041)
70b90f4 optimize max event_sequential_id query for oracle  (#11297)
b1fed31 Fix missing script results (#11395)
03db0aa Auto run/check security evidence generation in ./fmt.sh (#11407)
c928f0e [Short] Typo (#11400)
ba6c2be Add missing TransactionId to com.daml.error.ErrorResource (#11396)
a2a1571 Generate security evidence by documenting security testcases (#11306)
8d17882 Allocate parties sequentially in script export tests (#11389)
1309c2f DPP-587 Use Timestamp instead of Instant (#11183)
82f9873 Rotate release rotation (#11394)
b14077a Fix AuthorizationInterceptorSpec flake (#11387)
7090f2d update NOTICES file (#11367)
ad42dfa Update gRPC to the latest (1.41.0) and Protobuf (#11380)
54c400a Update wording in Deploying to a generic Daml ledger (#11327)
4461ed1 Fix log output (#11374)
613aac3 Add support for non-star-kinded type synonyms in data-dependencies (#11293)
f89ecc6 interfaces: add an experimental `toTypeRep` builtin. (#11378)
5654d5c fix es ingest for missing files (#11375)
03cfd12 Configurable assertions in Ledger API test tool by feature descriptors (#11328)
96b7b58 [DPP-648][Self-service error codes] Adopt ApiPartyManagementService (#11338)
9e94ae0 LF: move repl exception-auth test from dev to stable (#11369)
5365d68 LF: Remove PartialTransaction out from ScenarioRunner/IdeLedgerClient (#11368)
79037c8 [DPP-646][Self-service error codes] Adopt ApiPackageManagementService (#11314)
0ee59f5 Command submission in the ledger-api-bench-tool. (#11296)
8d5cab5 LF: Simplify seeds generation in scenario runnner (#11353)
9e5b788 Speedup daml repl integration tests (#11335)
3bc0db3 fix contract_tpid_fkey-related race condition (#11330)
ab8a863 [docs] Add Daml Driver for VMBC to the commercial integrations section (#11360)
c95db72 Fix Bazel cache download retry (#11238)
e8d0ccb [DPP-611][Self-service error codes] Adapt ApiCommandService (#11325)
a89079b [DPP-647][Self-service error codes] Adopt ApiParticipantPruningService (#11324)
cc8ec28 [Self-service error codes] Adapt GrpcHealthService (#11354)
c60c94b [DPP-645][Self-service error codes] Adapt ApiConfigManagementService (#11312)
e6da1f7 Add step in ghc-lib guide for getting submodules to work (#11351)
f3057ea Increase timeout for non-repudation tests on Postgres (#11340)
176f470 interface: adding interfaces to the TS codegen (#11280)
355352f DPP-650 Remove the mutating schema (#11211)
443b64d [DPP-621][Self-service error codes] Adopt error codes in ApiVersionService (#11302)
ed9dbed interfaces: Add fixed choice collision check in typechecker (Haskell) (#11337)
c37ecd1 [Short] Pass correct loggingContext to withValidatedPackageId (#11307)
0d305cf [Short] Move field before logging statement (ApiTimeService) (#11313)
73c94b5 Increase timeout for non-repudation test (#11281)
88c607b [Self-service error codes] Adapt ApiTransactionService [DPP-613] (#11094)
07ad3e0 Suport multi-party readAs in triggers (#11299)
76eb165 Interface fixed choices: ghc parser (#11275)
da27a1e [DPP-619][Self-service error codes] Adopt error codes in ApiVersionService (#11303)
5f5af30 [DPP-628][Self-service error codes] Adapt error codes in ApiTimeService (#11295)
f9e67ad [Self-service error codes] Adapt error responses in ledger-api-auth [DPP-617] (#11223)
7282965 Fix component status for triggers (#11311)
17776f3 Factor out npm install step of create-daml-app tests (#11294)
3a8b685 [Short] Fix docs for Dispatcher#startingAt (#11304)
f315a90 release 1.18.0-snapshot.20211019.8113.0.8ff347d8 (#11300)
8a3abce [DPP-618][Self-service error codes] Adapt error codes in ApiPackageService (#11284)
50ea92f Use ApiTypes.Party instead of String in the trigger runner (#11298)
2267429 [DPP-656] Assert on self-service error code details in ErrorFactoriesSpec (#11289)
c06faf2 LF: remove imperative environment from Speedy compiler (#11285)
d3dad75 [DPP-592] Generate docs for self-service error codes. (#11129)
```
Changelog:
```
[JSON-API] fixes a bug related to the matchedQueries value returned for websocket multiqueries,
this only happens for patterns where the multiqueries contain a mixture of queries with and without
offsets.
- [Integration Kit] - ledger-api-bench-tool can generate test contracts with configurable payload size.
- [Integration Kit] - Added multi-template support for command submission in the ledger-api-bench-tool
- [Sandbox] The ``--eager-package-loading`` flag has been deprecated. It
  hasn't actually done anything for many releases; packages are always
  loaded eagerly. This does not affect Daml on SQL, which does support
  lazy package loading.

- [Daml Studio] Fix a bug where script results in Daml Studio
  sometimes do not show up.

- [Integration Kit] - The ledger-api-bench-tool is now capable of generating test contracts for testing purposes.
- [JSON API] Fixed a rare error that manifested as
  ‘violates foreign key constraint "contract_tpid_fkey"
   Detail: Key (tpid)=(...) is not present in table’
  when attempting to run queries and goes away on JSON API restart.
  See `issue #11330 <https://github.com/digital-asset/daml/pull/11330>`__.
- [Participant] All participants now use the new append-only schema. Existing databases will
  automatically upgrade to the new schema the first time a participant/ledger is started.

- [Daml Triggers] Triggers now support readAs parties. They can be
  specified via `--ledger-readas a,b,c`. As part of this change
  ``testRule`` gained an extra argument to specify the `readAs`
  parties. If you previously used

  ```
  testRule trigger party acsBuilder commandsInFlight s
  ```

  you now need to use

  ```
  testRule trigger party [] acsBuilder commandsInFlight s
  ```

```

CHANGELOG_BEGIN
CHANGELOG_END

Co-authored-by: Azure Pipelines Daml Build <support@digitalasset.com>
copybara-service bot pushed a commit to bazelbuild/bazel that referenced this pull request Feb 10, 2023
Bazel's previous behavior was to rebuild an artifact locally if fetching
it from an HTTP remote cache failed. This behavior is different from
GRPC remote cache case where Bazel will retry the fetch.

The lack of retry is an issue for multiple reasons: On one hand
rebuilding locally can be slower than fetching from the remote cache, on
the other hand if a build action is not bit reproducible, as is the case
with some compilers, then the local rebuild will trigger cache misses on
further build actions that depend on the current artifact.

This change aims to avoid theses issues by retrying the fetch in the
HTTP cache case similarly to how the GRPC cache client does it.

Some care needs to be taken due to the design of Bazel's internal remote
cache client API. For a fetch the client is given an `OutputStream`
object that it is expected to write the fetched data to. This may be a
temporary file on disk that will be moved to the final location after
the fetch completed. On retry, we need to be careful to not duplicate
previously written data when writing into this `OutputStream`. Due to
the generality of the `OutputStream` interface we cannot reset the file
handle or write pointer to start fresh. Instead, this change follows the
same pattern used in the GRPC cache client. Namely, keep track of the
data previously written and continue from that offset on retry.

With this change the HTTP cache client will attempt to fetch the data
from the remote cache via an HTTP range request. So that the server only
needs to send the data that is still missing. If the server replies with
a 206 Partial Content response, then we write the received data directly
into the output stream, if the server does not support range requests
and instead replies with the full data, then we drop the duplicate
prefix and only write into the output stream from the required offset.

This patch has been running successfully in production [here](digital-asset/daml#11238).

cc @cocreature

Closes #14258.

PiperOrigin-RevId: 508604846
Change-Id: I10a5d2a658e9c32a9d9fcd6bd29f6a0b95e84566
hvadehra pushed a commit to bazelbuild/bazel that referenced this pull request Feb 14, 2023
Bazel's previous behavior was to rebuild an artifact locally if fetching
it from an HTTP remote cache failed. This behavior is different from
GRPC remote cache case where Bazel will retry the fetch.

The lack of retry is an issue for multiple reasons: On one hand
rebuilding locally can be slower than fetching from the remote cache, on
the other hand if a build action is not bit reproducible, as is the case
with some compilers, then the local rebuild will trigger cache misses on
further build actions that depend on the current artifact.

This change aims to avoid theses issues by retrying the fetch in the
HTTP cache case similarly to how the GRPC cache client does it.

Some care needs to be taken due to the design of Bazel's internal remote
cache client API. For a fetch the client is given an `OutputStream`
object that it is expected to write the fetched data to. This may be a
temporary file on disk that will be moved to the final location after
the fetch completed. On retry, we need to be careful to not duplicate
previously written data when writing into this `OutputStream`. Due to
the generality of the `OutputStream` interface we cannot reset the file
handle or write pointer to start fresh. Instead, this change follows the
same pattern used in the GRPC cache client. Namely, keep track of the
data previously written and continue from that offset on retry.

With this change the HTTP cache client will attempt to fetch the data
from the remote cache via an HTTP range request. So that the server only
needs to send the data that is still missing. If the server replies with
a 206 Partial Content response, then we write the received data directly
into the output stream, if the server does not support range requests
and instead replies with the full data, then we drop the duplicate
prefix and only write into the output stream from the required offset.

This patch has been running successfully in production [here](digital-asset/daml#11238).

cc @cocreature

Closes #14258.

PiperOrigin-RevId: 508604846
Change-Id: I10a5d2a658e9c32a9d9fcd6bd29f6a0b95e84566
coeuvre pushed a commit to coeuvre/bazel that referenced this pull request Aug 28, 2023
Bazel's previous behavior was to rebuild an artifact locally if fetching
it from an HTTP remote cache failed. This behavior is different from
GRPC remote cache case where Bazel will retry the fetch.

The lack of retry is an issue for multiple reasons: On one hand
rebuilding locally can be slower than fetching from the remote cache, on
the other hand if a build action is not bit reproducible, as is the case
with some compilers, then the local rebuild will trigger cache misses on
further build actions that depend on the current artifact.

This change aims to avoid theses issues by retrying the fetch in the
HTTP cache case similarly to how the GRPC cache client does it.

Some care needs to be taken due to the design of Bazel's internal remote
cache client API. For a fetch the client is given an `OutputStream`
object that it is expected to write the fetched data to. This may be a
temporary file on disk that will be moved to the final location after
the fetch completed. On retry, we need to be careful to not duplicate
previously written data when writing into this `OutputStream`. Due to
the generality of the `OutputStream` interface we cannot reset the file
handle or write pointer to start fresh. Instead, this change follows the
same pattern used in the GRPC cache client. Namely, keep track of the
data previously written and continue from that offset on retry.

With this change the HTTP cache client will attempt to fetch the data
from the remote cache via an HTTP range request. So that the server only
needs to send the data that is still missing. If the server replies with
a 206 Partial Content response, then we write the received data directly
into the output stream, if the server does not support range requests
and instead replies with the full data, then we drop the duplicate
prefix and only write into the output stream from the required offset.

This patch has been running successfully in production [here](digital-asset/daml#11238).

cc @cocreature

Closes bazelbuild#14258.

PiperOrigin-RevId: 508604846
Change-Id: I10a5d2a658e9c32a9d9fcd6bd29f6a0b95e84566
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants