Inconsistent result when CAS exists from proxy but not locally. #318
I have a CI cache backed by S3. In the HTTP cache log, for entries like
/cas/d35d521228f07e30b28984179d3435236ace2eddaadaa1526aaea58e39a389ba
S3 returns OK, while bazel-remote returns 404 to Bazel; the download from the proxy only completes afterwards. This results in the action being re-executed and a cache miss.
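To make the race concrete, here is a minimal Go sketch of the suspected sequence of events. This is not bazel-remote's actual code; names like localIndex, inFlight and fetchFromProxy are invented for illustration:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Toy model of the suspected race: the cache only consults its local
// index, so a blob the proxy is still delivering looks like a miss.
var (
	mu         sync.Mutex
	localIndex = map[string]bool{} // blobs fully committed locally
	inFlight   = map[string]bool{} // blobs being fetched from the proxy
)

func get(hash string) int {
	mu.Lock()
	defer mu.Unlock()
	if localIndex[hash] {
		return 200
	}
	return 404 // wrong answer while inFlight[hash] is still true
}

func fetchFromProxy(hash string) {
	mu.Lock()
	inFlight[hash] = true
	mu.Unlock()

	time.Sleep(50 * time.Millisecond) // simulate the S3 download

	mu.Lock()
	localIndex[hash] = true
	delete(inFlight, hash)
	mu.Unlock()
}

func main() {
	const h = "d35d5212..." // truncated hash, for illustration
	go fetchFromProxy(h)
	time.Sleep(10 * time.Millisecond)
	fmt.Println(get(h)) // 404: the proxy download has not committed yet
	time.Sleep(100 * time.Millisecond)
	fmt.Println(get(h)) // 200: the same request now succeeds
}
```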
Comments
Thanks for the bug report. This is almost certainly related to #267: a shortcut was taken early in bazel-remote's design, based on the assumption that such concurrent requests are rare enough to ignore. That assumption no longer seems to hold with more recent versions of Bazel. The reporter of that issue has gone quiet, so I will try implementing this myself. I might have a test build for you to try next week.
@mostynb Yes, that's the case. I will open a bug against Bazel as well, so that it does not request the same blob twice. I was able to hack together a quick fix by returning the result from the proxy directly when a blob is in progress: https://github.com/buchgr/bazel-remote/compare/master...lizan:inprogress?expand=1
Another potential quick workaround is to return 503 instead of 404; that at least lets Bazel retry the request instead of executing the action. In our case, re-executing the action (a rules_foreign_cc build) produces a slightly different result, which is foreign_cc's issue though. Since this happens in an early action, the cache entries for targets that depend on it are also invalidated.
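For reference, the 503 workaround would amount to something like the following sketch. This is not bazel-remote's actual handler; blobs and inFlight are invented stand-ins, and a real implementation would guard them with a mutex:

```go
package main

import (
	"net/http"
	"strings"
)

// Hypothetical stand-ins for bazel-remote internals, invented for this sketch.
var (
	blobs    = map[string][]byte{} // blobs available locally
	inFlight = map[string]bool{}   // blobs being fetched from the proxy
)

// handleCASGet answers 503 plus Retry-After while a proxy download for the
// blob is still in flight, instead of 404, so the client retries the
// request rather than re-executing the action.
func handleCASGet(w http.ResponseWriter, r *http.Request) {
	hash := strings.TrimPrefix(r.URL.Path, "/cas/")

	if data, ok := blobs[hash]; ok {
		w.Write(data)
		return
	}
	if inFlight[hash] {
		w.Header().Set("Retry-After", "1")
		http.Error(w, "blob download in progress", http.StatusServiceUnavailable)
		return
	}
	http.Error(w, "not found", http.StatusNotFound)
}

func main() {
	http.HandleFunc("/cas/", handleCASGet)
	http.ListenAndServe(":8080", nil)
}
```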
Note that this is also an issue on the bazel-remote side when you have multiple Bazel clients interacting with the cache simultaneously.
I have made a start on the cleanup work to remove the assumption, and so far it appears to be easier than I thought. So I would prefer to focus on this, and hopefully have something ready for you to test fairly soon.
That is an interesting idea, definitely worth investigating if my cleanup work takes too long.
Here's my work-in-progress fix: https://github.com/mostynb/bazel-remote/commits/concurrent_inserts
@mostynb thanks, I can confirm the branch fixes the problem in my minimal case.
Instead of adding placeholder "uncommitted" items to the LRU, reserve space, then write the incoming blob to a randomly named tempfile (and verify it if possible). If everything went well, unreserve the space, add it to the LRU index and move the tempfile to the final location.

This allows concurrent uploads (or proxy downloads) of the same blob, which might be slightly wasteful but meets all our consistency requirements. We can try to optimise this later, if needed.

The "commit" now happens in the main part of the function and the deferred cleanup function is much simpler. We no longer have the "uncommitted item evicted" issue to consider, but we might run into cases where there are inserts that fail because there is too much space reserved.

Fixes #267, #318.
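In outline, the insertion path described above might look like the following sketch, with invented reserve/unreserve/addToIndex helpers standing in for the real LRU bookkeeping; this illustrates the scheme rather than reproducing the actual bazel-remote code:

```go
package cache

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// Placeholder space accounting; stands in for the real LRU bookkeeping.
func reserve(size int64) error           { return nil }
func unreserve(size int64)               {}
func addToIndex(hash string, size int64) {}

// Put sketches the scheme from the commit message: reserve space up front,
// stream the blob into a randomly named tempfile, verify its hash, then
// commit by unreserving the space, updating the index and renaming the
// tempfile into place. Concurrent Puts of the same blob each do this
// independently; the final rename is atomic, so the last writer wins.
func Put(cacheDir, expectedHash string, size int64, r io.Reader) error {
	if err := reserve(size); err != nil {
		return err // e.g. too much space already reserved
	}

	tmp, err := os.CreateTemp(cacheDir, "upload-*")
	if err != nil {
		unreserve(size)
		return err
	}
	defer os.Remove(tmp.Name()) // harmless no-op once the rename succeeds

	h := sha256.New()
	if _, err := io.Copy(io.MultiWriter(tmp, h), r); err != nil {
		tmp.Close()
		unreserve(size)
		return err
	}
	if err := tmp.Close(); err != nil {
		unreserve(size)
		return err
	}

	if got := hex.EncodeToString(h.Sum(nil)); got != expectedHash {
		unreserve(size)
		return fmt.Errorf("hash mismatch: got %s, want %s", got, expectedHash)
	}

	unreserve(size)
	addToIndex(expectedHash, size)
	return os.Rename(tmp.Name(), filepath.Join(cacheDir, expectedHash))
}
```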
The concurrent_inserts branch has now landed; hopefully it fixes this issue.