Read After Write Guarantee #267
Conversation
Four issues addressed with this commit:

a. Client A uploads a file to the CAS and it becomes committed in disk storage. Later, client B uploads the same file with the same hash, overwriting the one from client A. During the upload, the disk storage considers the file uncommitted, so it cannot be retrieved for a while. This is problematic for Bazel remote execution and bazel's builds-without-the-bytes feature.

b. If the same file is uploaded concurrently (either by the same client or by different clients), the second upload silently discards the uploaded data and acknowledges the upload even before the file upload has been completed by the first client. This is problematic for Bazel remote execution and bazel's builds-without-the-bytes feature.

c. A deferred os.Remove(tmpFilePath) after an upload can get delayed and interfere with another upload of the same key.

d. After Put adds an uncommitted entry, but before it starts downloading the file, the entry can be evicted by a concurrent add. The download would then continue and could result in a file stored in the CAS without an entry in the LRU, and it could also interfere with subsequent concurrent uploads of the same key.

These issues are addressed without introducing the risk of a fast uploading client having to wait for a slower client concurrently uploading the same file. Instead, it adds support for concurrent uploads of the same key. The logic is controlled via new parameters in the YAML configuration file:

```yaml
# Might be slightly faster if set to false, but should be set to true
# if using either bazel's builds-without-the-bytes feature or bazel
# remote execution.
read_after_write_guarantee: true

# Setting to false might be slightly faster, especially for remote
# execution use cases where a client uploads the same input file multiple
# times for concurrent actions. Setting to true allows overwriting
# already existing entries in the cache. It has been argued that
# overwriting could be useful in order to replace poisoned items, but in
# practice it could be hard to identify all poisoned entries, and
# clearing the complete cache could instead make more sense.
overwrite_commited: false
```
Hi, what do you think about the issues addressed by this commit? Would you accept this pull request if I add and update unit tests? Should the behaviour be configurable? BR,
Thanks for the PR. This is complicated code, so it's great to have another pair of eyes on it. I will try to take a proper look early next week, but here are some initial comments...
I have a large refactoring work-in-progress that might avoid this problem in another way (#186) by removing the need to overwrite CAS blobs. I need to find time to finish it off.
This is a known issue, I guess it was less likely to occur in practice before builds-without-the-bytes. Have you encountered it in practice?
This might be a simple fix in the current setup: we should unlock the mutex after removing the temp file here (bazel-remote/cache/disk/disk.go, line 470 at 69bce62).
And maybe also merge this deferred function with the one a few lines above, which uses the mutex (bazel-remote/cache/disk/disk.go, line 319 at 69bce62).
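A minimal sketch of how those two deferred functions might be merged, using hypothetical names (cache, c.mu, c.lru, tmpFilePath) rather than the actual fields in disk.go. The point is only that the temp file is removed before the mutex is released, so a delayed os.Remove cannot interfere with a concurrent upload of the same key:

```go
package disk

import (
	"os"
	"sync"
)

// Simplified, hypothetical types for illustration only; the real disk cache
// in bazel-remote is structured differently.
type lruIndex interface{ Remove(key string) }

type cache struct {
	mu  sync.Mutex
	lru lruIndex
}

// put merges the two deferred cleanup steps: on error, the uncommitted
// placeholder entry is dropped and the temp file is removed while the
// mutex is still held, then the mutex is released.
func (c *cache) put(key, tmpFilePath string) (err error) {
	defer func() {
		if err != nil {
			c.mu.Lock()
			c.lru.Remove(key)      // drop the uncommitted placeholder entry
			os.Remove(tmpFilePath) // remove the temp file before unlocking
			c.mu.Unlock()
		}
	}()

	// ... write the incoming blob to tmpFilePath, verify it, and commit ...
	return nil
}
```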
If you hit that case, it is probably a sign that your cache size is simply too small. I'm not sure if we need to "gracefully" handle this situation, since the real solution is to increase the cache size, which requires a restart.
Thanks for your instant response! My scenario is bazel with -j 1000, using bazel-remote as CAS/AC for remote execution with Buildbarn. We use bazel-remote instead of Buildbarn’s bb-storage due to higher throughput and lower CPU consumption. I experienced failed builds due to files missing in the CAS, and tried to find potential causes, which resulted in #267 and #266. I have tested extensively with both those patches combined, and after that never seen missing CAS entries. I can’t say for sure which of the individual changes is most important. But I’ve seen the bazel client uploading the same file concurrently many times, so I suspect that handling concurrent uncommitted uploads of the same file is critical. And if some of the duplicated uploads get delayed, they could reach bazel-remote when the entry is already in committed state. Therefore the handling is critical both for uncommitted and committed entries.
Perhaps the temp file is removed outside the lock in order to minimize file operations while holding the lock, for better parallelism? I think the non-unique temporary file name is the root of this problem, and of much of the other complexity. If the temporary file name were unique, then the file could be handled safely outside the lock, and also using defer.
Yes, the cache size should be increased in that case, and failed builds and a restart could be OK. But I’m a bit worried about whether it could result in corrupt files in the cache. I have to think more about that. /Ulrik
This sounds like a good feature, but as mentioned earlier let's delay work on this for a little while so I can make a release without any more large changes. Small and obviously safe tweaks are welcome in the meantime. I will aim to make the release on Wednesday.
```go
ReadAfterWriteGuarantee bool `yaml:"read_after_write_guarantee"`
OverwriteCommited       bool `yaml:"overwrite_commited"`
```
I think I would rather keep things simple and have only one behaviour, ie allow concurrent writes.
Agree. Then ReadAfterWriteGuarantee configuration can be removed.
What about OverwriteCommited? Personally I don’t have a use case for replacing committed entries, because:
- Either the cache is huge and contains so many artifacts from so many different builds that it becomes too hard to know which entries are poisoned and should be overwritten. If I discovered that such a cache contained poisoned entries, I would clear the whole cache anyway.
- Or the cache is tiny and could be re-populated quickly, and then it would be OK to clear the whole cache.
The only two reasons I can think of why we would want to overwrite blobs are:
- If the blob is somehow known to be corrupt, then we can make a client re-upload everything to replace the bad items.
- For simplicity. We let the client upload the blob anyway, and saving it to disk takes much less time than the upload, so why not do it anyway, without bothering to check whether the preexisting blob was corrupt?
Good point about simplicity! And I have not noticed any obvious difference in performance. So the conclusion is to remove both configuration parameters and only keep the code for the cases ReadAfterWriteGuarantee=true and OverwriteCommited=true.
Yes- let's not introduce configuration options, we just need to be careful to use atomic move/rename operations to replace existing files.
This can be time-consuming, since you probably need to start with an empty cache for such a stress-test. But if you are able to, could you check what kind of blobs were causing trouble? I suspect the most likely candidates are the empty blob, or large blobs (which take some time to upload).
Yes, IIRC that's what I was aiming for. But there's a potential race condition, as you say.
I think you are right that the most likely candidates are the empty blob or large blobs. And now, with the special handling for the empty blob, the empty blob should not trigger this race condition. Without #267, I have seen it once also for a non-empty blob. It was a 59 MB file (a tool) that a lot of concurrent actions had as an input file:
I have now tried to reproduce it by clearing the caches between each build, but I have not seen it again. However, I suspect the issue is still there, because sometimes up to 25% of all the PUT operations reaching bazel-remote are for keys already in committed state in the LRU, and with unfortunate scheduling that could trigger this issue. Example with many overwritten keys in committed state. All of the numbers below are from a single build, after clearing the caches, using Buildbarn with bazel-remote: bazel -j 1000 results in:
bazel -j 250 results in:
16 000 / (50 429 + 16 000) = ~25%
Do you have more comments @mostynb ? Or should I do the changes discussed here and add/update unit tests?
OK, how about we try this:
BTW in your testcase, is that a single bazel client? If so, the number of re-uploads looks unusually high. Maybe this behaviour is coming from buildbarn? Are the buildbarn workers uploading all their inputs to the CAS without checking if they're already there?
When there is a race between concurrent uploaders, we don’t know in which order they rename their temporary files to the common committed destination. The first ones are overwritten and the last one wins. I can see two alternatives for keeping the cache size accounting correct:

A) Hold the cache lock during the rename, always.
B) Stat the resulting file while holding the cache lock, but only when there has been another concurrent uploader.

How would an upload counter avoid holding the lock while doing a file operation?
Yes, a single bazel client, but with -j 1000 on a 72 core machine with 396 GB RAM. I think the buildbarn workers are uploading their output to the CAS without checking whether it is already there, but that cannot explain this, because I now also tried buildbarn with no connected workers. Then the bazel client uploads input files for the first parallel 1000 actions, but buildbarn won't process them nor upload any file. But I still got:
2944 / (2944 + 11 253) = ~20%
I think "last upload to finish wins" is OK, the server can't really promise anything more. I was thinking of trying something like this: initialize the item's counter to 0 on the first upload, and increment it for each concurrent upload. Each incoming upload increases the cache size counter by the blob size. When an upload finishes:
We increase the size of the index a little, since every item has this counter value even if there are no concurrent uploads, but the code is simple. I have an idea for a way to avoid this counter, at the cost of some complexity and some temporary miscounting of the cache size (temporary as long as each upload eventually ends), but I think we should attempt the simpler solution first.
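A rough sketch of how that counter bookkeeping could look. The end of the description above is elided, so the finish step here is only a guess, and all names (entry, cache, totalSize) are hypothetical rather than the actual bazel-remote fields:

```go
package disk

import "sync"

// Hypothetical, simplified types; the real index entry differs.
type entry struct {
	size      int64
	committed bool
	uploads   int // concurrent uploads of this key beyond the first
}

type cache struct {
	mu        sync.Mutex
	totalSize int64
	index     map[string]*entry
}

// startUpload reserves space for an incoming blob and tracks concurrency.
func (c *cache) startUpload(key string, size int64) {
	c.mu.Lock()
	defer c.mu.Unlock()

	e, ok := c.index[key]
	if !ok {
		e = &entry{size: size}
		c.index[key] = e
	} else {
		e.uploads++ // another upload of the same key is already in flight
	}
	c.totalSize += size // each incoming upload reserves its own space
}

// finishUpload is one guess at the "when an upload finishes" step: duplicate
// uploads give back their reservation, so exactly one copy of the blob
// remains accounted for once all uploads have ended.
func (c *cache) finishUpload(key string, size int64) {
	c.mu.Lock()
	defer c.mu.Unlock()

	e := c.index[key]
	if e.uploads > 0 {
		e.uploads--
		c.totalSize -= size // only one copy ends up on disk
		return
	}
	e.committed = true
}
```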
Interesting. Technically it is OK for the client to do this, just undesirable. But that percentage is higher than I would expect (especially given buchgr's comment about this being rare). Still, it's good to make bazel-remote more robust.
I thought this through a bit and realised we can avoid the counter by simply using the committed bool that we already have. When an upload finishes, if committed is true then we know we're replacing a blob, and if it is false then we know this is the first upload to add the blob, and it should be easy to adjust the cache size count accordingly. On a related note, we might want to maintain a separate uncommitted size count, and include that when deciding if we need to purge items.
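A minimal sketch of the committed-bool variant, again with hypothetical names: the flag checked at commit time tells us whether we are replacing a blob (adjust by the size difference only) or adding a new one (add the full size):

```go
package disk

import "sync"

// Hypothetical, simplified types for illustration.
type entry struct {
	size      int64
	committed bool
}

type cache struct {
	mu        sync.Mutex
	totalSize int64
	index     map[string]*entry
}

// commitUpload adjusts the size accounting when an upload finishes, using the
// committed flag to distinguish "replace" from "first add".
func (c *cache) commitUpload(key string, size int64) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if e, ok := c.index[key]; ok && e.committed {
		c.totalSize += size - e.size // replacing an existing blob
		e.size = size
		return
	}
	c.index[key] = &entry{size: size, committed: true}
	c.totalSize += size
}
```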
Perhaps buchgr’s comment was in the context of remote caching, and not remote execution. But I hope it is rare even for remote execution, except when I explicitly clear the cache to trigger it. ;-)
I view this as two separate questions:

1. Should we minimize holding locks during os.Rename operations on temporary files?
2. Should uncommitted entries be kept in the LRU's list at all?

For question 1, I see three alternatives. A) Always hold the global disk cache lock during os.Rename. Your posted pseudo code uses A. What do you think about lock contention for that alternative? I guess os.Rename takes significant time and involves a context switch to kernel space, and a single writer holding the lock for a longer time would reduce the parallelism for many readers. I lean towards C, since the stat would only be required when multiple concurrent writers of the same hash have been detected.

For question 2, I wonder if it is motivated to have uncommitted entries in the LRU's list at all, or whether we could instead only count their sizes in the LRU's already existing currentSize field.
Let's take question 2 first... I suspect that the uncommitted bool was originally added as a way to recognise concurrent uploads, so they could be easily ignored. Since we're going to change that behaviour, I think it would be a good idea to drop that field and only have committed entries in the LRU index.
(A) might be slow, if os.Rename is slow. We should ensure that the source and destination are on the same filesystem, which is typically a fast path, but I don't know precisely how fast. (B) sounds complicated. I would prefer to skip this until we see that simpler options fail. (C) sounds slow, given that your numbers show this happening ~20% of the time, and there are performance issues with os.Stat on mac. So let's consider another option D: perform os.Rename without holding the lock, in such a way that we have eventual consistency in cache size accounting on the server side (assuming that all uploads eventually finish). How about this:
When the function returns, run a deferred function:
With this setup, clients who download a blob can only ever receive a complete blob. The blob should only ever change if there's a cache collision, and clients can choose to mitigate that possibility by only using a good hash function, which should be fine for SHA256, and if we later add support for weaker hashes that is a risk that clients can choose to take or not. We would also need to modify Get when there's a proxy backend. If the client provides the size of the blob, then we can use a similar algorithm when downloading from the proxy. If the blob size is unknown then I guess we need a special case that only adds the blob size at commit-time (and subtracts the previous blob size if it was already in the index).
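The numbered steps of option D are elided above, so the following is only a guess at how the flow could fit together, with hypothetical names throughout: the incoming blob's size is counted up front, the rename happens outside the lock, and a deferred function settles the accounting once the upload ends (successfully or not), giving the eventual consistency described:

```go
package disk

import (
	"os"
	"sync"
)

type cache struct {
	mu        sync.Mutex
	totalSize int64
	index     map[string]int64 // key -> size of the committed blob
}

func (c *cache) put(key, tmpFile, destFile string, size int64) (err error) {
	// Count the incoming blob immediately, so size accounting (and any
	// eviction decision) already includes the space this upload needs.
	c.mu.Lock()
	c.totalSize += size
	c.mu.Unlock()

	// Settle the accounting when the function returns, whether the upload
	// succeeded or not (eventual consistency of the size counter).
	defer func() {
		c.mu.Lock()
		defer c.mu.Unlock()
		if err != nil {
			c.totalSize -= size // the upload failed, give the space back
			return
		}
		if old, ok := c.index[key]; ok {
			c.totalSize -= old // we replaced an already committed blob
		}
		c.index[key] = size
	}()

	// ... receive the blob into tmpFile and verify it (not shown) ...

	// Atomic rename into place, performed without holding the cache lock.
	return os.Rename(tmpFile, destFile)
}
```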
We are discussing allowing concurrent uploads for the same blob in #267. Part of the proposed new design involves using pseudo-randomly named temp files. Unfortunately, ioutil.TempFile creates files with more restrictive permissions than we want- 0600 before umask, whereas we want 0666 before umask. This code is an adapted version of the algorithm in ioutil.TempFile, refactored for our more narrow use case- in particular we skip the use of the pid in the seed, since our use case is single-process, and we don't bother re-seeding if we encounter lots of collisions (which is unlikely). (We do not use this code, yet.)
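A rough sketch of the idea described in that commit message (not the actual bazel-remote code): create a randomly named file next to its final destination with 0666-before-umask permissions, retrying on the unlikely name collision, unlike ioutil.TempFile which uses 0600:

```go
package tempfile

import (
	"fmt"
	"math/rand"
	"os"
	"time"
)

func init() {
	// Seed once per process. As noted above, the pid is not needed in the
	// seed for a single-process use case.
	rand.Seed(time.Now().UnixNano())
}

// Create opens a new file named <base>.<random suffix> in the same directory
// as base, so a later os.Rename onto base stays on one filesystem. The file
// is created with 0666 permissions (before umask), unlike ioutil.TempFile.
func Create(base string) (*os.File, error) {
	for i := 0; i < 10000; i++ {
		name := fmt.Sprintf("%s.%d", base, rand.Uint32())
		f, err := os.OpenFile(name, os.O_RDWR|os.O_CREATE|os.O_EXCL, 0666)
		if os.IsExist(err) {
			continue // name collision, try another suffix
		}
		return f, err
	}
	return nil, fmt.Errorf("tempfile: too many name collisions for %q", base)
}
```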
When inserting a blob:

1) Reserve the required space.
2) Write the blob to a random tempfile on the same fs.
3) Verify the blob (if possible).
4) Unreserve the previously reserved space.
5) Insert into the index and rename to the destination file.

This allows concurrent uploads (or proxy downloads) of the same blob, which might be wasteful but meets all the consistency requirements. Fixes buchgr#267.
Here's my work-in-progress fix: https://github.com/mostynb/bazel-remote/commits/concurrent_inserts |
Instead of adding placeholder "uncommitted" items to the LRU, reserve space, then write the incoming blob to a randomly named tempfile (and verify it if possible). If everything went well, unreserve the space, add it to the LRU index and move the tempfile to the final location. This allows concurrent uploads (or proxy downloads) of the same blob, which might be slightly wasteful but meets all our consistency requirements. We can try to optimise this later, if needed. The "commit" now happens in the main part of the function and the deferred cleanup function is much simpler. We no longer have the "uncommitted item evicted" issue to consider, but we might run into cases where there are inserts that fail because there is too much space reserved. Fixes buchgr#267, buchgr#318.
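A simplified sketch of the reserve/unreserve flow described in the commit message above, with hypothetical names rather than the actual bazel-remote LRU API: space is reserved before the blob is written, and it only becomes a committed index entry once the tempfile has been verified and renamed into place:

```go
package disk

import (
	"errors"
	"os"
	"sync"
)

type sizedLRU struct {
	mu       sync.Mutex
	maxSize  int64
	size     int64 // bytes used by committed blobs
	reserved int64 // bytes reserved by in-flight uploads
	index    map[string]int64
}

// reserve claims space for an incoming blob before any data is written.
func (l *sizedLRU) reserve(n int64) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.size+l.reserved+n > l.maxSize {
		// The real cache would try to evict old entries first.
		return errors.New("not enough space to reserve")
	}
	l.reserved += n
	return nil
}

// unreserve gives back a reservation, e.g. when an upload fails verification.
func (l *sizedLRU) unreserve(n int64) {
	l.mu.Lock()
	l.reserved -= n
	l.mu.Unlock()
}

// commit renames the verified tempfile into place, then converts the
// reservation into a committed index entry.
func (l *sizedLRU) commit(key, tmpFile, destFile string, n int64) error {
	// Atomic rename; tmpFile lives on the same filesystem as destFile.
	if err := os.Rename(tmpFile, destFile); err != nil {
		l.unreserve(n)
		return err
	}

	l.mu.Lock()
	defer l.mu.Unlock()
	l.reserved -= n
	if old, ok := l.index[key]; ok {
		l.size -= old // we replaced an existing blob
	}
	l.index[key] = n
	l.size += n
	return nil
}
```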
Four related issues/scenarios addressed with this commit:
All these issues are addressed by this commit, without introducing the risk of fast uploading clients having to wait for slower clients that are concurrently uploading the same file. Instead, it adds support for concurrent uploads of the same key.