Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent result when CAS exists from proxy but not locally. #318

Closed
lizan opened this issue Jul 23, 2020 · 6 comments
Closed

Inconsistent result when CAS exists from proxy but not locally. #318

lizan opened this issue Jul 23, 2020 · 6 comments

Comments

@lizan
Copy link
Contributor

lizan commented Jul 23, 2020

I have CI cache backed by S3 and here is an HTTP cache log:

2020/07/22 23:54:39 S3 DOWNLOAD envoy-ci-build-cache-us-east-2 public-x64/ac/a559cd7910cbbe821525f48e6efef10964c4243aeff3b6747baffd1e2863c096 OK
2020/07/22 23:54:39 S3 CONTAINS envoy-ci-build-cache-us-east-2 public-x64/cas/d35d521228f07e30b28984179d3435236ace2eddaadaa1526aaea58e39a389ba OK
2020/07/22 23:54:39 S3 CONTAINS envoy-ci-build-cache-us-east-2 public-x64/cas/615324b6d4553bf401f3e20e1d1e7fbb02f99f4d6a672f069157871547b0b4cb OK
2020/07/22 23:54:39 S3 DOWNLOAD envoy-ci-build-cache-us-east-2 public-x64/cas/0e5166cbbaee2f9bcb621398ab8f0c75deb169766d88b9f7b4047f39e2dcef31 OK
2020/07/22 23:54:39 S3 CONTAINS envoy-ci-build-cache-us-east-2 public-x64/cas/f8e5351bad6c3c961fbc247790047c031508b98ae083f8d72483505c8c039931 OK
2020/07/22 23:54:39 S3 CONTAINS envoy-ci-build-cache-us-east-2 public-x64/cas/6a36cd36e694976de00dd7aa55a8731a7fa9e06cde1a855d6409530bcfbd4e40 OK
2020/07/22 23:54:39 S3 DOWNLOAD envoy-ci-build-cache-us-east-2 public-x64/cas/f5ea44e96bcbec7c85633f0ebaecb16cc374ed1b1e2fb29513b76d0f1896289e OK
2020/07/22 23:54:39 S3 CONTAINS envoy-ci-build-cache-us-east-2 public-x64/cas/f8e5351bad6c3c961fbc247790047c031508b98ae083f8d72483505c8c039931 OK
2020/07/22 23:54:39 S3 CONTAINS envoy-ci-build-cache-us-east-2 public-x64/cas/6a36cd36e694976de00dd7aa55a8731a7fa9e06cde1a855d6409530bcfbd4e40 OK
2020/07/22 23:54:39 S3 CONTAINS envoy-ci-build-cache-us-east-2 public-x64/cas/3a56dc577ede5d09612ee47874719a67709bede1d54c7337709ba0f1b34e7eae OK
2020/07/22 23:54:39 S3 CONTAINS envoy-ci-build-cache-us-east-2 public-x64/cas/d35d521228f07e30b28984179d3435236ace2eddaadaa1526aaea58e39a389ba OK
2020/07/22 23:54:39 S3 CONTAINS envoy-ci-build-cache-us-east-2 public-x64/cas/615324b6d4553bf401f3e20e1d1e7fbb02f99f4d6a672f069157871547b0b4cb OK
2020/07/22 23:54:39 S3 CONTAINS envoy-ci-build-cache-us-east-2 public-x64/cas/4069577f6b77e0d8e22150e405f13faf31cc3973735a6613034cfbf88e9f3f09 OK
2020/07/22 23:54:39 S3 CONTAINS envoy-ci-build-cache-us-east-2 public-x64/cas/39cda18dd362ba07607ecb7007f92a4270408b2eefd3a0b6ca4f7a06998f7de2 OK
2020/07/22 23:54:39 S3 CONTAINS envoy-ci-build-cache-us-east-2 public-x64/cas/74a306028b6a3fe5f6635cf35c7b0844e816d6e689f8c9af1e5aebe0193ef4a1 OK
2020/07/22 23:54:39 S3 CONTAINS envoy-ci-build-cache-us-east-2 public-x64/cas/bd62d11533ba6d63328b92369ec905eaa18380badbfc394603feef887388e062 OK
2020/07/22 23:54:39 S3 CONTAINS envoy-ci-build-cache-us-east-2 public-x64/cas/a5bb7efae52fb008c3d3db5aaec798a74815d73bc786efacefc56a13b8f0336c OK
2020/07/22 23:54:39 S3 CONTAINS envoy-ci-build-cache-us-east-2 public-x64/cas/5c458a5e671110255be575f7f58b2e63931f3a7c6b960fa489e67343c87e1d61 OK
2020/07/22 23:54:39 S3 CONTAINS envoy-ci-build-cache-us-east-2 public-x64/cas/5efafabd06c21547fa756512bbf92b3a8b1e9e9a6cf872080b4cadd8821f7c35 OK
2020/07/22 23:54:39 S3 CONTAINS envoy-ci-build-cache-us-east-2 public-x64/cas/400863baa375f1b465e0f72bfe6bc18f6ac13cc065fa9fc66254df9e88e7ff5c OK
2020/07/22 23:54:39 S3 CONTAINS envoy-ci-build-cache-us-east-2 public-x64/cas/aad80f9583cb54f43ea70280c68a737f01441f430f96c1f290fd0ee24c979009 OK
2020/07/22 23:54:39  GET 200       127.0.0.1 /cas/0e5166cbbaee2f9bcb621398ab8f0c75deb169766d88b9f7b4047f39e2dcef31
2020/07/22 23:54:39  GET 200       127.0.0.1 /cas/f5ea44e96bcbec7c85633f0ebaecb16cc374ed1b1e2fb29513b76d0f1896289e
2020/07/22 23:54:39  GET 404       127.0.0.1 /cas/d35d521228f07e30b28984179d3435236ace2eddaadaa1526aaea58e39a389ba
2020/07/22 23:54:39  GET 404       127.0.0.1 /cas/615324b6d4553bf401f3e20e1d1e7fbb02f99f4d6a672f069157871547b0b4cb
2020/07/22 23:54:39  GET 404       127.0.0.1 /cas/6a36cd36e694976de00dd7aa55a8731a7fa9e06cde1a855d6409530bcfbd4e40
2020/07/22 23:54:39  GET 404       127.0.0.1 /cas/f8e5351bad6c3c961fbc247790047c031508b98ae083f8d72483505c8c039931
2020/07/22 23:54:39  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
2020/07/22 23:54:39 S3 DOWNLOAD envoy-ci-build-cache-us-east-2 public-x64/cas/d35d521228f07e30b28984179d3435236ace2eddaadaa1526aaea58e39a389ba OK
2020/07/22 23:54:39  GET 200       127.0.0.1 /cas/d35d521228f07e30b28984179d3435236ace2eddaadaa1526aaea58e39a389ba
2020/07/22 23:54:39 S3 DOWNLOAD envoy-ci-build-cache-us-east-2 public-x64/cas/400863baa375f1b465e0f72bfe6bc18f6ac13cc065fa9fc66254df9e88e7ff5c OK
2020/07/22 23:54:39  GET 200       127.0.0.1 /cas/400863baa375f1b465e0f72bfe6bc18f6ac13cc065fa9fc66254df9e88e7ff5c
2020/07/22 23:54:39 S3 DOWNLOAD envoy-ci-build-cache-us-east-2 public-x64/cas/bd62d11533ba6d63328b92369ec905eaa18380badbfc394603feef887388e062 OK
2020/07/22 23:54:39  GET 200       127.0.0.1 /cas/bd62d11533ba6d63328b92369ec905eaa18380badbfc394603feef887388e062
2020/07/22 23:54:39 S3 DOWNLOAD envoy-ci-build-cache-us-east-2 public-x64/cas/5c458a5e671110255be575f7f58b2e63931f3a7c6b960fa489e67343c87e1d61 OK
2020/07/22 23:54:39  GET 200       127.0.0.1 /cas/5c458a5e671110255be575f7f58b2e63931f3a7c6b960fa489e67343c87e1d61
2020/07/22 23:54:39 S3 DOWNLOAD envoy-ci-build-cache-us-east-2 public-x64/cas/39cda18dd362ba07607ecb7007f92a4270408b2eefd3a0b6ca4f7a06998f7de2 OK
2020/07/22 23:54:39  GET 200       127.0.0.1 /cas/39cda18dd362ba07607ecb7007f92a4270408b2eefd3a0b6ca4f7a06998f7de2
2020/07/22 23:54:39 S3 DOWNLOAD envoy-ci-build-cache-us-east-2 public-x64/cas/615324b6d4553bf401f3e20e1d1e7fbb02f99f4d6a672f069157871547b0b4cb OK
2020/07/22 23:54:39  GET 200       127.0.0.1 /cas/615324b6d4553bf401f3e20e1d1e7fbb02f99f4d6a672f069157871547b0b4cb
2020/07/22 23:54:39 S3 DOWNLOAD envoy-ci-build-cache-us-east-2 public-x64/cas/aad80f9583cb54f43ea70280c68a737f01441f430f96c1f290fd0ee24c979009 OK
2020/07/22 23:54:39 S3 DOWNLOAD envoy-ci-build-cache-us-east-2 public-x64/cas/f8e5351bad6c3c961fbc247790047c031508b98ae083f8d72483505c8c039931 OK
2020/07/22 23:54:39  GET 200       127.0.0.1 /cas/aad80f9583cb54f43ea70280c68a737f01441f430f96c1f290fd0ee24c979009
2020/07/22 23:54:39  GET 200       127.0.0.1 /cas/f8e5351bad6c3c961fbc247790047c031508b98ae083f8d72483505c8c039931

Particularly, for entries like /cas/d35d521228f07e30b28984179d3435236ace2eddaadaa1526aaea58e39a389ba, S3 returns OK :

2020/07/22 23:54:39 S3 CONTAINS envoy-ci-build-cache-us-east-2 public-x64/cas/d35d521228f07e30b28984179d3435236ace2eddaadaa1526aaea58e39a389ba OK

While bazel-remote return 404 to Bazel:

2020/07/22 23:54:39  GET 404       127.0.0.1 /cas/d35d521228f07e30b28984179d3435236ace2eddaadaa1526aaea58e39a389ba

And the download completes later:

2020/07/22 23:54:39 S3 DOWNLOAD envoy-ci-build-cache-us-east-2 public-x64/cas/d35d521228f07e30b28984179d3435236ace2eddaadaa1526aaea58e39a389ba OK

This result

WARNING: Reading from Remote Cache:
BulkTransferException

and a cache miss.

@mostynb
Copy link
Collaborator

mostynb commented Jul 23, 2020

Thanks for the bug report.

Looking at instances of d35d521228f07e30b28984179d3435236ace2eddaadaa1526aaea58e39a389ba in your logs, I suspect that there are two concurrent requests for the blob:

  1. request 1 arrives, the blob is not available locally, but a place is reserved in the LRU index, then bazel-remote starts downloading the blob from s3
  2. request 2 arrives, the blob is still not available locally, and a place cannot be reserved in the LRU index because there's an in-progress download, and gives the 404 error
  3. request 1 finishes downloading from s3, then successfully serves the blob

So this is almost certainly related to #267 - a shortcut was taken early on in bazel-remote's design, based on the assumption that such concurrent requests are rare enough to ignore. That assumption no longer seems to hold with more recent versions of bazel.

The reporter of that issue has gone quiet, so I will try implementing this myself. I might have a test build for you to try next week.

@lizan
Copy link
Contributor Author

lizan commented Jul 23, 2020

a shortcut was taken early on in bazel-remote's design, based on the assumption that such concurrent requests are rare enough to ignore. That assumption no longer seems to hold with more recent versions of bazel.

@mostynb Yes that's the case. I will open a bug in bazel as well to not request same blob.

I was able to hack a quick solution by return result from proxy directly when a blob is in progress: https://github.com/buchgr/bazel-remote/compare/master...lizan:inprogress?expand=1

Another potential quick workaround is return 503 instead of 404, that at least let bazel to retry the request instead of execute the action.

In our case, re-execute the action (which is a rules_foreign_cc build) will result slightly different result, which is foreign_cc's issue though. As this happens in an early action so the cache of targets depends on it will also be invalidated.

@mostynb
Copy link
Collaborator

mostynb commented Jul 23, 2020

I will open a bug in bazel as well to not request same blob.

Note that this is also an issue on the bazel-remote side when you have multiple bazel clients interacting with the cache simultaneously.

I was able to hack a quick solution by return result from proxy directly when a blob is in progress: https://github.com/buchgr/bazel-remote/compare/master...lizan:inprogress?expand=1

I have made a start on the cleanup work to remove the assumption, and so far it appears to be easier than I thought. So I would prefer to focus on this, and hopefully have something ready for you to test fairly soon.

Another potential quick workaround is return 503 instead of 404, that at least let bazel to retry the request instead of execute the action.

That is an interesting idea, definitely worth investigating if my cleanup work takes too long.

@mostynb
Copy link
Collaborator

mostynb commented Jul 25, 2020

Here's my work-in-progress fix: https://github.com/mostynb/bazel-remote/commits/concurrent_inserts

@lizan
Copy link
Contributor Author

lizan commented Jul 25, 2020

@mostynb thanks, I can confirm the branch fixes with my minimal case.

mostynb added a commit to mostynb/bazel-remote that referenced this issue Jul 27, 2020
Instead of adding placeholder "uncommitted" items to the LRU, reserve space,
then write the incoming blob to a randomly named tempfile (and verify it if
possible). If everything went well, unreserve the space, add it to the LRU
index and move the tempfile to the final location.

This allows concurrent uploads (or proxy downloads) of the same blob, which
might be slightly wasteful but meets all our consistency requirements. We
can try to optimise this later, if needed.

The "commit" now happens in the main part of the function and the deferred
cleanup function is much simpler.

We no longer have the "uncommitted item evicted" issue to consider, but we
might run into cases where there are inserts that fail because there is too
much space reserved.

Fixes buchgr#267, buchgr#318.
mostynb added a commit to mostynb/bazel-remote that referenced this issue Jul 27, 2020
Instead of adding placeholder "uncommitted" items to the LRU, reserve space,
then write the incoming blob to a randomly named tempfile (and verify it if
possible). If everything went well, unreserve the space, add it to the LRU
index and move the tempfile to the final location.

This allows concurrent uploads (or proxy downloads) of the same blob, which
might be slightly wasteful but meets all our consistency requirements. We
can try to optimise this later, if needed.

The "commit" now happens in the main part of the function and the deferred
cleanup function is much simpler.

We no longer have the "uncommitted item evicted" issue to consider, but we
might run into cases where there are inserts that fail because there is too
much space reserved.

Fixes buchgr#267, buchgr#318.
mostynb added a commit to mostynb/bazel-remote that referenced this issue Jul 27, 2020
Instead of adding placeholder "uncommitted" items to the LRU, reserve space,
then write the incoming blob to a randomly named tempfile (and verify it if
possible). If everything went well, unreserve the space, add it to the LRU
index and move the tempfile to the final location.

This allows concurrent uploads (or proxy downloads) of the same blob, which
might be slightly wasteful but meets all our consistency requirements. We
can try to optimise this later, if needed.

The "commit" now happens in the main part of the function and the deferred
cleanup function is much simpler.

We no longer have the "uncommitted item evicted" issue to consider, but we
might run into cases where there are inserts that fail because there is too
much space reserved.

Fixes buchgr#267, buchgr#318.
mostynb added a commit to mostynb/bazel-remote that referenced this issue Aug 19, 2020
Instead of adding placeholder "uncommitted" items to the LRU, reserve space,
then write the incoming blob to a randomly named tempfile (and verify it if
possible). If everything went well, unreserve the space, add it to the LRU
index and move the tempfile to the final location.

This allows concurrent uploads (or proxy downloads) of the same blob, which
might be slightly wasteful but meets all our consistency requirements. We
can try to optimise this later, if needed.

The "commit" now happens in the main part of the function and the deferred
cleanup function is much simpler.

We no longer have the "uncommitted item evicted" issue to consider, but we
might run into cases where there are inserts that fail because there is too
much space reserved.

Fixes buchgr#267, buchgr#318.
mostynb added a commit that referenced this issue Aug 24, 2020
Instead of adding placeholder "uncommitted" items to the LRU, reserve space,
then write the incoming blob to a randomly named tempfile (and verify it if
possible). If everything went well, unreserve the space, add it to the LRU
index and move the tempfile to the final location.

This allows concurrent uploads (or proxy downloads) of the same blob, which
might be slightly wasteful but meets all our consistency requirements. We
can try to optimise this later, if needed.

The "commit" now happens in the main part of the function and the deferred
cleanup function is much simpler.

We no longer have the "uncommitted item evicted" issue to consider, but we
might run into cases where there are inserts that fail because there is too
much space reserved.

Fixes #267, #318.
@mostynb
Copy link
Collaborator

mostynb commented Aug 24, 2020

The concurrent_inserts branch has now landed, hopefully if fixes this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants