Read After Write Guarantee #267
Conversation
Four issues addressed with this commit:

a. Client A uploads a file to the CAS and it becomes committed in disk storage. Later, client B uploads the same file with the same hash, overwriting the one from client A. During the upload, the disk storage considers the file uncommitted, so it cannot be retrieved for a while. This is problematic for Bazel remote execution and bazel's builds-without-the-bytes feature.

b. If the same file is uploaded concurrently (either by the same client or by different clients), the second upload silently discards the uploaded data and acknowledges the upload even before the file upload has been completed by the first client. This is problematic for Bazel remote execution and bazel's builds-without-the-bytes feature.

c. A deferred os.Remove(tmpFilePath) after an upload can get delayed and interfere with another upload of the same key.

d. After Put adds an uncommitted entry, but before it starts downloading the file, the entry can be evicted by a concurrent add. The download would then continue and could result in a file stored in the CAS without an entry in the LRU, and it could also interfere with subsequent concurrent uploads of the same key.

These issues are addressed without introducing the risk of a fast uploading client having to wait for a slower client concurrently uploading the same file. Instead, it adds support for concurrent uploads of the same key. The logic is controlled via new parameters in the YAML configuration file:

```yaml
# Might be slightly faster if set to false, but should be set to true
# if using either bazel's builds-without-the-bytes feature or bazel
# remote execution.
read_after_write_guarantee: true

# Setting to false might be slightly faster, especially for remote
# execution use cases where a client uploads the same input file multiple
# times for concurrent actions. Setting to true allows overwriting
# already existing entries in the cache. It has been argued that
# overwriting could be useful in order to replace poisoned items, but in
# practice it could be hard to identify all poisoned entries, and
# clearing the complete cache could instead make more sense.
overwrite_commited: false
```
Hi, what do you think about the issues addressed by this commit? Would you accept this pull request if I add and update unit tests? Should the behaviour be configurable? BR,
Thanks for the PR. This is complicated code, so it's great to have another pair of eyes on it. I will try to take a proper look early next week, but here are some initial comments...
I have a large refactoring work-in-progress that might avoid this problem in another way (#186) by removing the need to overwrite CAS blobs. I need to find time to finish it off.
This is a known issue, I guess it was less likely to occur in practice before builds-without-the-bytes. Have you encountered it in practice?
This might be a simple fix in the current setup: we should unlock the mutex after removing the temp file here (bazel-remote/cache/disk/disk.go, line 470 at 69bce62).
And maybe also merge this deferred function with the one a few lines above, which uses the mutex (bazel-remote/cache/disk/disk.go, line 319 at 69bce62).
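A minimal sketch of how those two deferred functions might be merged, using hypothetical names (cache, c.mu, c.lru, tmpFilePath) rather than the actual fields in disk.go. The point is only that the temp file is removed before the mutex is released, so a delayed os.Remove cannot interfere with a concurrent upload of the same key:

```go
package disk

import (
	"os"
	"sync"
)

// Simplified, hypothetical types for illustration only; the real disk cache
// in bazel-remote is structured differently.
type lruIndex interface{ Remove(key string) }

type cache struct {
	mu  sync.Mutex
	lru lruIndex
}

// put merges the two deferred cleanup steps: on error, the uncommitted
// placeholder entry is dropped and the temp file is removed while the
// mutex is still held, then the mutex is released.
func (c *cache) put(key, tmpFilePath string) (err error) {
	defer func() {
		if err != nil {
			c.mu.Lock()
			c.lru.Remove(key)      // drop the uncommitted placeholder entry
			os.Remove(tmpFilePath) // remove the temp file before unlocking
			c.mu.Unlock()
		}
	}()

	// ... write the incoming blob to tmpFilePath, verify it, and commit ...
	return nil
}
```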
If you hit that case, it is probably a sign that your cache size is simply too small. I'm not sure if we need to "gracefully" handle this situation, since the real solution is to increase the cache size, which requires a restart.
Thanks for your instant response! My scenario is bazel with -j 1000, using bazel-remote as CAS/AC for remote execution with Buildbarn. We use bazel-remote instead of Buildbarn’s bb-storage due to higher throughput and lower CPU consumption. I experienced failed builds due to files missing in the CAS, and tried to find potential causes, which resulted in #267 and #266. I have tested extensively with both those patches combined, and after that never seen missing CAS entries. I can’t say for sure which of the individual changes is most important. But I’ve seen the bazel client uploading the same file concurrently many times, so I suspect that handling concurrent uncommitted uploads of the same file is critical. And if some of the duplicated uploads get delayed, they could reach bazel-remote when the entry is already in committed state. Therefore the handling is critical both for uncommitted and committed entries.
Perhaps the temp file is removed outside the lock in order to minimize file operations while holding the lock, for better parallelism? I think the non-unique temporary file name is the root of this problem, and of much of the other complexity. If the temporary file name were unique, then the file could be handled safely outside the lock, and also using defer.
Yes, the cache size should be increased in that case, and failed builds and a restart could be OK. But I’m a bit worried about whether it could result in corrupt files in the cache. I have to think more about that. /Ulrik
This sounds like a good feature, but as mentioned earlier let's delay work on this for a little while so I can make a release without any more large changes. Small and obviously safe tweaks are welcome in the meantime. I will aim to make the release on Wednesday.
```go
ReadAfterWriteGuarantee bool `yaml:"read_after_write_guarantee"`
OverwriteCommited       bool `yaml:"overwrite_commited"`
```
I think I would rather keep things simple and have only one behaviour, ie allow concurrent writes.
Agree. Then ReadAfterWriteGuarantee configuration can be removed.
What about OverwriteCommited? Personally I don’t have a use case for replacing committed entries, because:
- Either the cache is huge and contains so many artifacts from so many different builds that it becomes too hard to know which entries are poisoned and should be overwritten. If I discovered that such a cache contained poisoned entries, I would clear the whole cache anyway.
- Or the cache is tiny and could be re-populated quickly, and then it would be OK to clear the whole cache.
The only two reasons I can think of why we would want to overwrite blobs are:
- If the blob is somehow known to be corrupt, then we can make a client re-upload everything to replace the bad items.
- For simplicity. We let the client upload the blob anyway, and saving it to disk takes much less time than the upload, so why not do it anyway, without bothering to check whether the preexisting blob was corrupt?
Good point about simplicity! And I have not noticed any obvious difference in performance. So the conclusion is to remove both configuration parameters and only keep the code for the cases ReadAfterWriteGuarantee=true and OverwriteCommited=true.
Yes- let's not introduce configuration options, we just need to be careful to use atomic move/rename operations to replace existing files.
This can be time-consuming, since you probably need to start with an empty cache for such a stress-test. But if you are able to, could you check what kind of blobs were causing trouble? I suspect the most likely candidates are the empty blob, or large blobs (which take some time to upload).
Yes, IIRC that's what I was aiming for. But there's a potential race condition, as you say.
I think you are right that the most likely candidates are the empty blob or large blobs. And now, with the special handling for the empty blob, the empty blob should not trigger this race condition. Without #267, I have seen it once also for a non-empty blob. It was a 59 MB file (a tool) that a lot of concurrent actions had as an input file:
I have now tried to reproduce it by clearing the caches between each build, but I have not seen it again. However, I suspect the issue is still there, because sometimes up to 25% of all the PUT operations reaching bazel-remote are for keys already in committed state in the LRU, and with unfortunate scheduling that could trigger this issue. Example with many overwritten keys in committed state. All of the numbers below are from a single build, after clearing the caches, using Buildbarn with bazel-remote: bazel -j 1000 results in:
bazel -j 250 results in:
16 000 / (50 429 + 16 000) = ~25%
Do you have more comments @mostynb ? Or should I do the changes discussed here and add/update unit tests?
OK, how about we try this:
BTW in your testcase, is that a single bazel client? If so, the number of re-uploads looks unusually high. Maybe this behaviour is coming from buildbarn? Are the buildbarn workers uploading all their inputs to the CAS without checking if they're already there?
When there is a race between concurrent uploaders, we don’t know in which order they rename their temporary files to the common committed destination. The first ones are overwritten and the last one wins. I can see two alternatives for keeping the cache size accounting correct:

A) Hold the cache lock during the rename, always.
B) Stat the resulting file while holding the cache lock, but only when there has been another concurrent uploader.

How would an upload counter avoid holding the lock while doing a file operation?
Yes, a single bazel client, but with -j 1000 on a 72 core machine with 396 GB RAM. I think the buildbarn workers are uploading their output to the CAS without checking whether it is already there, but that cannot explain this, because I now also tried buildbarn with no connected workers. Then the bazel client uploads input files for the first parallel 1000 actions, but buildbarn won't process them nor upload any file. But I still got:
2944 / (2944 + 11 253) = ~20%
I think "last upload to finish wins" is OK, the server can't really promise anything more. I was thinking of trying something like this: initialize the item's counter to 0 on the first upload, and increment it for each concurrent upload. Each incoming upload increases the cache size counter by the blob size. When an upload finishes:
We increase the size of the index a little, since every item has this counter value even if there are no concurrent uploads, but the code is simple. I have an idea for a way to avoid this counter, at the cost of some complexity and some temporary miscounting of the cache size (temporary as long as each upload eventually ends), but I think we should attempt the simpler solution first.
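A rough sketch of how that counter bookkeeping could look. The end of the description above is elided, so the finish step here is only a guess, and all names (entry, cache, totalSize) are hypothetical rather than the actual bazel-remote fields:

```go
package disk

import "sync"

// Hypothetical, simplified types; the real index entry differs.
type entry struct {
	size      int64
	committed bool
	uploads   int // concurrent uploads of this key beyond the first
}

type cache struct {
	mu        sync.Mutex
	totalSize int64
	index     map[string]*entry
}

// startUpload reserves space for an incoming blob and tracks concurrency.
func (c *cache) startUpload(key string, size int64) {
	c.mu.Lock()
	defer c.mu.Unlock()

	e, ok := c.index[key]
	if !ok {
		e = &entry{size: size}
		c.index[key] = e
	} else {
		e.uploads++ // another upload of the same key is already in flight
	}
	c.totalSize += size // each incoming upload reserves its own space
}

// finishUpload is one guess at the "when an upload finishes" step: duplicate
// uploads give back their reservation, so exactly one copy of the blob
// remains accounted for once all uploads have ended.
func (c *cache) finishUpload(key string, size int64) {
	c.mu.Lock()
	defer c.mu.Unlock()

	e := c.index[key]
	if e.uploads > 0 {
		e.uploads--
		c.totalSize -= size // only one copy ends up on disk
		return
	}
	e.committed = true
}
```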
Interesting. Technically it is OK for the client to do this, just undesirable. But that percentage is higher than I would expect (especially given buchgr's comment about this being rare). Still, it's good to make bazel-remote more robust.
I thought this through a bit and realised we can avoid the counter by simply using the committed bool that we already have. When an upload finishes, if committed is true then we know we're replacing a blob, and if it is false then we know this is the first upload to add the blob, and it should be easy to adjust the cache size count accordingly. On a related note, we might want to maintain a separate uncommitted size count, and include that when deciding if we need to purge items.
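A minimal sketch of the committed-bool variant, again with hypothetical names: the flag checked at commit time tells us whether we are replacing a blob (adjust by the size difference only) or adding a new one (add the full size):

```go
package disk

import "sync"

// Hypothetical, simplified types for illustration.
type entry struct {
	size      int64
	committed bool
}

type cache struct {
	mu        sync.Mutex
	totalSize int64
	index     map[string]*entry
}

// commitUpload adjusts the size accounting when an upload finishes, using the
// committed flag to distinguish "replace" from "first add".
func (c *cache) commitUpload(key string, size int64) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if e, ok := c.index[key]; ok && e.committed {
		c.totalSize += size - e.size // replacing an existing blob
		e.size = size
		return
	}
	c.index[key] = &entry{size: size, committed: true}
	c.totalSize += size
}
```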
Perhaps buchgr’s comment was in the context of remote caching, and not remote execution. But I hope it is rare even for remote execution, except when I explicitly clear the cache to trigger it. ;-)
I view this as two separate questions:

1. Should we minimize holding locks during os.Rename operations on temporary files?
2. Should uncommitted entries be kept in the LRU's list at all?

For question 1, I see three alternatives. A) Always hold the global disk cache lock during os.Rename. Your posted pseudo code uses A. What do you think about lock contention for that alternative? I guess os.Rename takes significant time and involves a context switch to kernel space, and a single writer holding the lock for a longer time would reduce the parallelism for many readers. I lean towards C, since the stat would only be required when multiple concurrent writers of the same hash have been detected.

For question 2, I wonder if it is motivated to have uncommitted entries in the LRU's list at all, or whether we could instead only count their sizes in the LRU's already existing currentSize field.
Let's take question 2 first... I suspect that the uncommitted bool was originally added as a way to recognise concurrent uploads, so they could be easily ignored. Since we're going to change that behaviour, I think it would be a good idea to drop that field and only have committed entries in the LRU index.
(A) might be slow, if os.Rename is slow. We should ensure that the source and destination are on the same filesystem, which is typically a fast path, but I don't know precisely how fast. (B) sounds complicated. I would prefer to skip this until we see that simpler options fail. (C) sounds slow, given that your numbers show this happening ~20% of the time, and there are performance issues with os.Stat on mac. So let's consider another option D: perform os.Rename without holding the lock, in such a way that we have eventual consistency in cache size accounting on the server side (assuming that all uploads eventually finish). How about this:
When the function returns, run a deferred function:
With this setup, clients who download a blob can only ever receive a complete blob. The blob should only ever change if there's a cache collision, and clients can choose to mitigate that possibility by only using a good hash function, which should be fine for SHA256, and if we later add support for weaker hashes that is a risk that clients can choose to take or not. We would also need to modify Get when there's a proxy backend. If the client provides the size of the blob, then we can use a similar algorithm when downloading from the proxy. If the blob size is unknown then I guess we need a special case that only adds the blob size at commit-time (and subtracts the previous blob size if it was already in the index).
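The numbered steps of option D are elided above, so the following is only a guess at how the flow could fit together, with hypothetical names throughout: the incoming blob's size is counted up front, the rename happens outside the lock, and a deferred function settles the accounting once the upload ends (successfully or not), giving the eventual consistency described:

```go
package disk

import (
	"os"
	"sync"
)

type cache struct {
	mu        sync.Mutex
	totalSize int64
	index     map[string]int64 // key -> size of the committed blob
}

func (c *cache) put(key, tmpFile, destFile string, size int64) (err error) {
	// Count the incoming blob immediately, so size accounting (and any
	// eviction decision) already includes the space this upload needs.
	c.mu.Lock()
	c.totalSize += size
	c.mu.Unlock()

	// Settle the accounting when the function returns, whether the upload
	// succeeded or not (eventual consistency of the size counter).
	defer func() {
		c.mu.Lock()
		defer c.mu.Unlock()
		if err != nil {
			c.totalSize -= size // the upload failed, give the space back
			return
		}
		if old, ok := c.index[key]; ok {
			c.totalSize -= old // we replaced an already committed blob
		}
		c.index[key] = size
	}()

	// ... receive the blob into tmpFile and verify it (not shown) ...

	// Atomic rename into place, performed without holding the cache lock.
	return os.Rename(tmpFile, destFile)
}
```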
We are discussing allowing concurrent uploads for the same blob in #267. Part of the proposed new design involves using pseudo-randomly named temp files. Unfortunately, ioutil.TempFile creates files with more restrictive permissions than we want- 0600 before umask, whereas we want 0666 before umask. This code is an adapted version of the algorithm in ioutil.TempFile, refactored for our more narrow use case- in particular we skip the use of the pid in the seed, since our use case is single-process, and we don't bother re-seeding if we encounter lots of collisions (which is unlikely). (We do not use this code, yet.)
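A rough sketch of the idea described in that commit message (not the actual bazel-remote code): create a randomly named file next to its final destination with 0666-before-umask permissions, retrying on the unlikely name collision, unlike ioutil.TempFile which uses 0600:

```go
package tempfile

import (
	"fmt"
	"math/rand"
	"os"
	"time"
)

func init() {
	// Seed once per process. As noted above, the pid is not needed in the
	// seed for a single-process use case.
	rand.Seed(time.Now().UnixNano())
}

// Create opens a new file named <base>.<random suffix> in the same directory
// as base, so a later os.Rename onto base stays on one filesystem. The file
// is created with 0666 permissions (before umask), unlike ioutil.TempFile.
func Create(base string) (*os.File, error) {
	for i := 0; i < 10000; i++ {
		name := fmt.Sprintf("%s.%d", base, rand.Uint32())
		f, err := os.OpenFile(name, os.O_RDWR|os.O_CREATE|os.O_EXCL, 0666)
		if os.IsExist(err) {
			continue // name collision, try another suffix
		}
		return f, err
	}
	return nil, fmt.Errorf("tempfile: too many name collisions for %q", base)
}
```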
When inserting a blob:

1) Reserve the required space.
2) Write the blob to a random tempfile on the same fs.
3) Verify the blob (if possible).
4) Unreserve the previously reserved space.
5) Insert into the index and rename to the destination file.

This allows concurrent uploads (or proxy downloads) of the same blob, which might be wasteful but meets all the consistency requirements. Fixes buchgr#267.
Here's my work-in-progress fix: https://github.com/mostynb/bazel-remote/commits/concurrent_inserts |
Instead of adding placeholder "uncommitted" items to the LRU, reserve space, then write the incoming blob to a randomly named tempfile (and verify it if possible). If everything went well, unreserve the space, add it to the LRU index and move the tempfile to the final location. This allows concurrent uploads (or proxy downloads) of the same blob, which might be slightly wasteful but meets all our consistency requirements. We can try to optimise this later, if needed. The "commit" now happens in the main part of the function and the deferred cleanup function is much simpler. We no longer have the "uncommitted item evicted" issue to consider, but we might run into cases where there are inserts that fail because there is too much space reserved. Fixes buchgr#267, buchgr#318.
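A simplified sketch of the reserve/unreserve flow described in the commit message above, with hypothetical names rather than the actual bazel-remote LRU API: space is reserved before the blob is written, and it only becomes a committed index entry once the tempfile has been verified and renamed into place:

```go
package disk

import (
	"errors"
	"os"
	"sync"
)

type sizedLRU struct {
	mu       sync.Mutex
	maxSize  int64
	size     int64 // bytes used by committed blobs
	reserved int64 // bytes reserved by in-flight uploads
	index    map[string]int64
}

// reserve claims space for an incoming blob before any data is written.
func (l *sizedLRU) reserve(n int64) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.size+l.reserved+n > l.maxSize {
		// The real cache would try to evict old entries first.
		return errors.New("not enough space to reserve")
	}
	l.reserved += n
	return nil
}

// unreserve gives back a reservation, e.g. when an upload fails verification.
func (l *sizedLRU) unreserve(n int64) {
	l.mu.Lock()
	l.reserved -= n
	l.mu.Unlock()
}

// commit renames the verified tempfile into place, then converts the
// reservation into a committed index entry.
func (l *sizedLRU) commit(key, tmpFile, destFile string, n int64) error {
	// Atomic rename; tmpFile lives on the same filesystem as destFile.
	if err := os.Rename(tmpFile, destFile); err != nil {
		l.unreserve(n)
		return err
	}

	l.mu.Lock()
	defer l.mu.Unlock()
	l.reserved -= n
	if old, ok := l.index[key]; ok {
		l.size -= old // we replaced an existing blob
	}
	l.index[key] = n
	l.size += n
	return nil
}
```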
Four related issues/scenarios addressed with this commit:
All these issues are addressed by this commit, without introducing the risk of fast uploading clients having to wait for slower clients that are concurrently uploading the same file. Instead, it adds support for concurrent uploads of the same key.