
allow concurrent Puts and proxied Gets #323

Merged: 2 commits into buchgr:master from mostynb:concurrent_inserts, Aug 24, 2020

Conversation

@mostynb (Collaborator) commented Jul 27, 2020

At the moment, bazel-remote only allows a single concurrent Put for a given blob. If another Put starts while one is ongoing, we discard the second copy and return quickly, but the blob is not yet available. This can cause trouble for the second client, which assumes that the blob is available as soon as its Put finishes. The same problem also exists for Get calls when using a proxy backend.

I guess this was not a problem when bazel-remote was first written and only supported the http cache protocol, because bazel serializes those requests more than it does with the newer gRPC protocol. With gRPC, a single bazel client is able to hit this problem.

This change refactors disk.Cache to allow concurrent Puts and proxied Gets, by reserving space in the LRU index and using pseudo-random tempfiles. While I was at it, I extracted some common code into a couple of functions and reused them in both Put and Get.

Hopefully this fixes #267 and #318.

@mostynb mostynb force-pushed the concurrent_inserts branch 2 times, most recently from 19b97f6 to 1621686 Compare July 27, 2020 22:58
@mostynb (Collaborator, Author) commented Jul 27, 2020

@buchgr: any thoughts on this change?

@ulrfa, @lizan: please let me know if you're able to test this with your builds.

@buchgr (Owner) commented Jul 29, 2020

Thanks @mostynb. Taking a look!

@mostynb (Collaborator, Author) commented Jul 29, 2020

> Thanks @mostynb. Taking a look!

The diff is a bit confusing; it might be easier to look at the entire file:
https://github.com/buchgr/bazel-remote/blob/1621686f98d07baefc117f5e57a8c068a3b160c9/cache/disk/disk.go

if err != nil {
log.Println("Failed to proxy Put:", err)
} else {
c.proxy.Put(kind, hash, size, rc) // Async, should be fast.
buchgr (Owner) suggested:

if err != nil {
log.Println("Failed to proxy Put:", err)
return err
}

// Doesn't block, should be fast
c.proxy.Put(kind, hash, size, rc)

mostynb (Collaborator, Author) replied:

It feels wrong to let this proxy.Put block cause the entire operation to fail (proxy.Put is allowed to fail silently). But if os.Open(tf.Name()) fails then the commit is very likely to fail too, I guess.

if size > 0 {
// If we know the size, attempt to reserve that much space.
tryProxy = c.lru.Reserve(size)
} else {
buchgr (Owner) commented:

When/how can it happen that we don't know the size of a blob?

mostynb (Collaborator, Author) replied:

The HTTP API allows Content-Length to be unset/-1.

buchgr (Owner) replied:

Do we need to support chunked encoding though? I believe both client and server always know the length of the content in advance?

mostynb (Collaborator, Author) replied:

Sorry, I got mixed up. Content-Length is not relevant for HTTP Get. I was trying to refer to the fact that incoming HTTP Get requests only specify the hash, without the size:

rdr, sizeBytes, err := h.cache.Get(kind, hash, -1)

Also, in the gRPC API clients do provide the expected size for CAS items, but cannot do so for the action cache. So these also result in a -1 here.

if !existingItem.(*lruItem).committed {
inProgress = true
val, available := c.lru.Get(key)
if available {
buchgr (Owner) commented:

If you move the computation of tryProxy above the if available block, you can flatten it:

if available {
  item := val.(*lruItem)
  if isSizeMismatch(size, item.size) {
    return nil, -1, tryProxy
  }
  
  fileInfo, err = os.Stat(blobPath)
  if err != nil {
    return nil, -1, tryProxy
  }
  
  foundSize := fileInfo.Size()
  if isSizeMismatch(size, foundSize) {
    return nil, -1, tryProxy
  }
  
  f, err := os.Open(blobPath)
  if err != nil {
    return nil, -1, tryProxy
  }
  
  return f, foundSize, false
}

I think this is a lot easier to read, but it also makes it clear that err is not being handled. IIUC, at least if os.Open() errs we should surface this, no?

mostynb (Collaborator, Author) replied:

One issue with that setup is that calculating tryProxy will (usually) need to reserve space, but we do not want to reserve space if the blob is available locally. I guess we could do that with a more complicated deferred function (maybe unreserve space, then unlock the mutex).

@@ -120,6 +124,44 @@ func (c *sizedLRU) MaxSize() int64 {
return c.maxSize
}

func (c *sizedLRU) Reserve(size int64) bool {
if size < 0 || size > c.maxSize || c.reservedSize+size > c.maxSize {
buchgr (Owner) commented Jul 29, 2020:

Can be simplified to

if size < 0 || c.reservedSize+size > c.maxSize

When would size be < 0?

mostynb (Collaborator, Author) replied:

This is a safety check for HTTP API requests where Content-Length is unset/-1.

buchgr (Owner) replied:

Also shouldn't you be checking c.currentSize instead of c.reservedSize?

mostynb (Collaborator, Author) replied:

Checking with reservedSize instead of currentSize is definitely a bug.

But just checking c.currentSize+size > c.maxSize for the upper bound is problematic if the result overflows int64. Maybe unlikely in practice, but size comes directly from the client. If the client sends an unusually large value, then size > c.maxSize is likely to catch it, but I think I can come up with a more correct check.

mostynb (Collaborator, Author) added:

Hmm, wait... We shouldn't check c.currentSize+size > c.maxSize at the start, since we can evict items to make space.

This code was checking c.reservedSize+size > c.maxSize because if that fails, we can't evict items to reserve enough space. I will change that part back for now, while still considering how to handle overflow.

mostynb (Collaborator, Author) added:

Refactored this a bit to improve the overflow checking, and added a comment about why we check size+reservedSize.
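An overflow-safe Reserve along the lines discussed above could look like the following. This is a minimal sketch assuming the invariants 0 <= reservedSize <= maxSize; the names mirror the discussion but this is not the exact code from the PR.

```go
package main

import "fmt"

// sizedLRU is a stripped-down stand-in holding only the fields the
// reservation check needs.
type sizedLRU struct {
	maxSize      int64
	reservedSize int64
}

// Reserve returns true if size bytes could be set aside. Note that it
// checks against reservedSize, not currentSize: committed entries can be
// evicted to make room, but outstanding reservations cannot.
func (c *sizedLRU) Reserve(size int64) bool {
	if size < 0 {
		// e.g. an HTTP request with Content-Length unset (-1).
		return false
	}
	// Equivalent to c.reservedSize+size > c.maxSize, but rearranged so
	// the addition cannot overflow int64 for a huge client-supplied size.
	if size > c.maxSize-c.reservedSize {
		return false
	}
	c.reservedSize += size
	return true
}

func main() {
	c := &sizedLRU{maxSize: 100}
	fmt.Println(c.Reserve(60))        // true
	fmt.Println(c.Reserve(60))        // false: only 40 bytes of headroom left
	fmt.Println(c.Reserve(1<<63 - 1)) // false, and no overflow
	fmt.Println(c.Reserve(-1))        // false
}
```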

@@ -27,16 +27,20 @@ type SizedLRU interface {
Get(key Key) (value sizedItem, ok bool)
Remove(key Key)
Len() int
CurrentSize() int64
CurrentSize() int64 // Returns the current used + reserved size.
buchgr (Owner) commented:

Rename this method to CommittedAndReservedSize()? Might be interesting to export both as stats?

mostynb (Collaborator, Author) replied:

I wonder what the most useful form is. This is only used by Stats() and tests. A pair of <current used + reserved size>, <reserved size> seems reasonable and won't confuse existing users who are collecting and graphing "CurrSize".

Callers could use totalSize, reservedSize := c.CurrentSize() which is pretty readable IMO.

mostynb (Collaborator, Author) added:

Renamed to TotalSize() and added ReservedSize(). I left the label in the json output as CurrSize.

if ele != nil {
c.removeElement(ele)
} else {
return false // This should have been caught at the start.
buchgr (Owner) commented Jul 29, 2020:

It's a bug if this happens?

mostynb (Collaborator, Author) replied:

Yes.

buchgr (Owner) replied:

Shouldn't we fail somehow then and not just return false? This will put the cache in an undefined state?

newR := c.reservedSize - size

if newC < 0 || newR < 0 {
return false
buchgr (Owner) commented:

It's a bug if this happens?

mostynb (Collaborator, Author) replied:

Yes.

@lizan (Contributor) commented Jul 29, 2020

@mostynb I was able to test at 7c1e9bd and it fixed my issue.

@lizan (Contributor) commented Aug 6, 2020

@mostynb @buchgr thank you for working on this, any update?

@mostynb (Collaborator, Author) commented Aug 6, 2020

> @mostynb @buchgr thank you for working on this, any update?

I will do another pass on this and try to land it next week.

@dstranz commented Aug 11, 2020

Hello @mostynb,

We have a similar issue when using bazel 3.4.1 and a bazel-remote:master instance with S3 storage as the remote cache.

I've just deployed bazel-remote from your branch on our server, but it's not resolving the issue.

WARNING: Reading from Remote Cache:
BulkTransferException
WARNING: Writing to Remote Cache:
BulkTransferException

or

WARNING: Reading from Remote Cache:
ClosedChannelException

Let me know, if I could help you with testing your fix.

@mostynb (Collaborator, Author) commented Aug 11, 2020

@dstranz: are there any errors in bazel-remote's logs when you encounter this error? And do the bazel client errors mention any specific blob hashes that we can look for in bazel-remote's logs?

@dstranz commented Aug 11, 2020

@mostynb I will try to compare build output with bazel-remote logs tomorrow and let you know.

@ulrfa (Contributor) commented Aug 11, 2020

> @ulrfa, ... please let me know if you're able to test this with your builds.

I had a long summer vacation and am now trying to catch up. I will try to find time to review and test next week.

@dstranz commented Aug 12, 2020

@mostynb Looking into the bazel-remote logs was a very good tip. I figured out that our issue was related to the S3 configuration; after fixing it, everything works perfectly.

@ulrfa (Contributor) left a review:

I'm sorry for not answering for such a long time. I had a long vacation and I'm now trying to catch up.

Avoiding uncommitted entries in the LRU turned out really well and simplified a lot in your PR! Well done @mostynb!

Your new logic for when to call lru.Unreserve seems correct to me, but it is a bit hard to keep track of the special cases spread around. Have you considered simplifying that? (E.g. one pair of Reserve/deferred Unreserve in Put, and another pair in Get, but nothing in commit nor in availableOrTryProxy.)

My main question is about how performance is affected by holding lock while doing the system calls, see below.

Comment on lines 424 to 429
fileInfo, err = os.Stat(blobPath)
if err == nil {
foundSize := fileInfo.Size()
if !isSizeMismatch(size, foundSize) {
var f *os.File
f, err = os.Open(blobPath)
ulrfa (Contributor) commented:

Have benchmarks been conducted on how holding the lock during os.Stat and os.Open affects performance with many concurrent Get and Put invocations?

mostynb (Collaborator, Author) replied:

I don't have access to a setup for high concurrency testing at the moment, so I have been mostly checking for correctness.

IIRC my reasoning here was that calling os.Stat() and then os.Open() without a lock is racy. But now I realise that we can open the file, release the lock, and then call File.Stat() safely. So then we would be down to one atomic filesystem operation while checking the availability, and we can decide if it's OK for there to be a race between checking the index (with the lock) and checking the filesystem (without the lock).

If we release the lock before opening the file, there's a potential race with the item being purged (or overwritten, but I think that case is OK). Purged-before-open is most likely when the disk cache is small, e.g. as some users have when using a proxy backend. But in that scenario we would just fall back to retrieving the file from the proxy backend. We currently have this race on master, and it's probably OK: the cache is too small for the workload and should be increased. I will push a change to do this.

}

return err
err = os.Rename(tempfile, finalPath)
ulrfa (Contributor) commented:

Have benchmarks been conducted on how holding the lock during os.Rename affects performance with many concurrent Get and Put invocations?

I have been thinking about an option E, in addition to the alternatives A-D that we discussed in #267

E: Avoid the need to hold the lock during os.Rename by embedding both the hash and the size in the path returned by cacheFilePath, so that concurrent Put requests with the same hash but different sizes could not cause size inconsistencies depending on the order in which their os.Rename calls are scheduled.

mostynb (Collaborator, Author) commented Aug 16, 2020:

If the cache item filenames contain the size, then it would be difficult to service requests (like HTTP GET/HEAD) that do not specify a size.

I think we need the file renaming and insertion into the index to be atomic for a given key. Otherwise overlapping Put calls can leave the file on disk out of sync with the index.

Aside: I just noticed that os.Rename() is not atomic, it calls os.Lstat() and then syscall.Rename(). We may as well just call syscall.Rename() directly.

mostynb (Collaborator, Author) added:

Hmm, another idea... what if we write to a tempfile in the dir, then rename it to something which indicates that we have finished writing to it, then obtain the lock and lru.Add an item to the LRU which contains the final filename and the same key as now? lru.Add could then return the previous filename if there was one. Then we release the lock and remove the old file that was returned (if there was one).

On startup, if there are multiple "finished" files for a given key we keep the one with the most recent atime and delete the others.

But I think this change might be best done in a followup PR, if calling syscall.Rename turns out to be too slow.

ulrfa (Contributor) replied:

> If the cache item filenames contain the size, then it would be difficult to service requests (like HTTP GET/HEAD) that do not specify a size.

For PUT we know the size after writeAndCloseFile has finished with the temporary file. And for GET/HEAD with an unspecified size, perhaps we could get the size from the LRU, using the hash as the key?

> I think we need the file renaming and insertion into the index to be atomic for a given key. Otherwise overlapping Put calls can leave the file on disk out of sync with the index.

I'm thinking about whether embedding both hash and size in the final file name could avoid the risk of the file on disk and the index getting out of sync.

> Aside: I just noticed that os.Rename() is not atomic, it calls os.Lstat() and then syscall.Rename(). We may as well just call syscall.Rename() directly.

Good finding!

if err != nil {
return err
}

if c.proxy != nil {
rc, err := os.Open(tf.Name())
ulrfa (Contributor) commented:

Smart that you use the temporary file, instead of the final destination file, for the proxy, so that no lock is needed.

@ulrfa (Contributor) commented Aug 19, 2020

> I don't have access to a setup for high concurrency testing at the moment, so I have been mostly checking for correctness.

At Ericsson we deploy bazel-remote on 72 core machines. I plan to allocate some of them during the coming weekend, and evaluate if holding the mutex during syscall.Rename and os.Open, affects build performance or not, when there are many concurrent GET and PUT.

The previous conversation is easily lost in the collapsed code review for outdated code above. @mostynb, do you have more reflections about embedding both hash and size in the file path, to potentially reduce the need for holding mutex during the syscalls?

@mostynb (Collaborator, Author) commented Aug 19, 2020

> At Ericsson we deploy bazel-remote on 72 core machines. I plan to allocate some of them during the coming weekend, and evaluate if holding the mutex during syscall.Rename and os.Open affects build performance or not, when there are many concurrent GET and PUT.

Thanks, that would be really useful.

> The previous conversation is easily lost in the collapsed code review for outdated code above. @mostynb, do you have more reflections about embedding both hash and size in the file path, to potentially reduce the need for holding the mutex during the syscalls?

Unfortunately I think it might be racy, eg:

  • abc-111 is in the index and on disk
  • request 1: add abc-222 to disk
  • request 1: add abc-222 to index
  • request 2: add abc-111 to disk
  • request 1: remove abc-111 from disk (no lock)
  • request 2: add abc-111 to the index
  • request 2: remove abc-222 from disk (no lock)

A more complicated version might work, where we write to files of the form {hash}.tmp{unique} then move them to {hash}.done{unique} and then insert them into the index (including a field with the {unique} string). If we're replacing an item in the index then we remove the old {hash}.done{otherunique} file from disk after releasing the lock. That way there is much lower chance of collision between racing requests.

But I think this change is large enough that we should land it in the safer form (removing the files while holding the lock), and then follow up with an optimisation if it's too slow. Changing the on-disk format in this way probably requires a major version bump.

@ulrfa (Contributor) commented Aug 19, 2020

Perhaps a {hash}-{size} named file would not have to be removed by a later PUT of same hash but different size. Instead a new entry could be added to the LRU’s double linked list, and the original entry could still be in the double linked list and be evicted the normal way. That might involve having hash+size as key in the LRU, just as the gRPC protocol seems to do conceptually. And for HTTP with unspecified size, complement with an additional map in LRU with only hash as key, towards the most recent regardless of size. I think that could avoid the problematic race and still not require holding mutex during syscall.Rename.

However, I hope to get a chance to benchmark during the weekend. If no significant performance penalty, then we don’t need to bother about this and can keep it simple. :-)
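The (hash, size) keying idea above could be sketched as an index like the following: entries are keyed by hash plus size so that Puts of the same hash with different sizes coexist, with a secondary map from hash to the most recent entry for size-less (HTTP) lookups. This is a thought experiment matching the discussion, not code from the PR.

```go
package main

import "fmt"

type entryKey struct {
	hash string
	size int64
}

type index struct {
	entries map[entryKey]string // (hash, size) -> file path on disk
	latest  map[string]entryKey // hash -> most recently added (hash, size)
}

func newIndex() *index {
	return &index{
		entries: make(map[entryKey]string),
		latest:  make(map[string]entryKey),
	}
}

func (ix *index) add(hash string, size int64, path string) {
	k := entryKey{hash, size}
	ix.entries[k] = path
	ix.latest[hash] = k
}

// get services lookups with a known size directly, and size-less lookups
// (size < 0, as with HTTP GET/HEAD) via the secondary map.
func (ix *index) get(hash string, size int64) (string, bool) {
	if size < 0 {
		k, ok := ix.latest[hash]
		if !ok {
			return "", false
		}
		return ix.entries[k], true
	}
	path, ok := ix.entries[entryKey{hash, size}]
	return path, ok
}

func main() {
	ix := newIndex()
	ix.add("abc", 111, "/cas/abc-111")
	ix.add("abc", 222, "/cas/abc-222") // different size, no clobbering
	fmt.Println(ix.get("abc", 111))    // the old entry is still reachable
	fmt.Println(ix.get("abc", -1))     // size-less lookup sees the latest
}
```

Because neither add removes the other's file, no rename needs to happen under the lock; stale entries would age out through normal LRU eviction.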

Instead of adding placeholder "uncommitted" items to the LRU, reserve space,
then write the incoming blob to a randomly named tempfile (and verify it if
possible). If everything went well, unreserve the space, add it to the LRU
index and move the tempfile to the final location.

This allows concurrent uploads (or proxy downloads) of the same blob, which
might be slightly wasteful but meets all our consistency requirements. We
can try to optimise this later, if needed.

The "commit" now happens in the main part of the function and the deferred
cleanup function is much simpler.

We no longer have the "uncommitted item evicted" issue to consider, but we
might run into cases where there are inserts that fail because there is too
much space reserved.

Fixes buchgr#267, buchgr#318.
@mostynb mostynb force-pushed the concurrent_inserts branch from 2c826a6 to 983d785 Compare August 19, 2020 20:26
@ulrfa (Contributor) commented Aug 23, 2020

I benchmarked three alternative implementations using mutexes in different ways:

  • mutex rename = Almost identical to 983d785 from mostynb:concurrent_inserts
  • mutex minimal = As mutex rename, but releases the mutex during syscall.Rename in commit.
  • mutex open + rename = As mutex rename, but keeps the mutex locked also during os.Open in availableOrTryProxy.

Each implementation is benchmarked in two different scenarios. Each scenario is executed using up to 87 bazel clients started in parallel to generate the load, each client on a separate machine. The bazel-remote server is deployed on a machine with 72 logical processors (2 x Xeon 6154) and 2 x 10 Gbit network interfaces (link aggregation).

Scenario A: 0% Cache Hits

  • All clients doing same build, but with unique dummy –action_env to trigger cache misses.
  • Each bazel client send:
    ~7000 HTTP GET requests for non existing AC entries,
    ~30 000 HTTP PUT requests for uploading AC and CAS entries.
    No HTTP HEAD. No gRPC.
  • Nothing evicted from cache during benchmark.
  • Bazel-remote scales really well if not holding mutexes during system calls.
  • The graph indicates that holding mutexes during syscall.Rename starts to cause lock contention issues from 50 concurrent bazel clients, increasing build times by 70% at 87 clients.
  • The SSD seems to handle the load. Measured write IOPS peak at 65,000, sustained 35,000. (But it could be interesting to see how much time f.sync() takes...)

(figure: bazelremote_mutex_graph_A)

Scenario B: 100% Cache Hits

  • Each bazel client send:
    ~36 000 HTTP GET requests downloading AC and CAS entries.
    No HTTP PUT, No HTTP HEAD. No gRPC.
  • Almost no disk accesses, all data already in kernel’s RAM file system cache.
  • The 2 x 10 Gbit network interfaces become saturated from ~30 concurrent clients.

(figure: bazelremote_mutex_graph_B)

@mostynb (Collaborator, Author) commented Aug 24, 2020

Thanks for running these benchmarks. Here are some initial questions/notes:

  • Did you hit any errors during testing?
  • In the hot-cache/100% cache hit scenario, it looks like "mutex rename" performs pretty well. And since the network connection is saturated we might need to wait for something like feature request: blob level compression bazelbuild/remote-apis#147 to improve on this significantly.
  • There's clearly room for improvement over "mutex rename" in the cold-cache/0% cache hit scenario, but hopefully this isn't a common case to be in.
  • Re f.Sync(), there's an old issue about this being slow on Mac hardware (Don't Sync() on every upload #67). I had a prototype for working around this, but it has gone stale.

@ulrfa (Contributor) commented Aug 24, 2020

Thanks!

I believe the logic is correct, and I did not notice any errors (related to this commit) during the load test.

If you decide to merge 983d785 now, despite the performance degradation, then perhaps embedding both hash and file size in file path strings could be considered the next time the on-disk format is changed for other reasons. (Or the degradation could be addressed in other ways.)

For the hot-cache/100% cache hit scenario, with saturated network interface, I agree that something like bazelbuild/remote-apis#147 and/or bazelbuild/bazel#6862 would be needed.

@mostynb (Collaborator, Author) commented Aug 24, 2020

> Thanks!

> I believe the logic is correct, and did not notice any errors (related to this commit) during the load test.

> If you decide to merge 983d785 now, despite the performance degradation, then it could perhaps be considered to embed both hash and file size in file path strings, the next time the on-disk format is changed for other reasons? (or address the degradation in other ways)

This PR is pretty large already, I think it's best to land it with the potential performance hit but continue working on improvements to see if we can win some of that back. This is surely faster than a failed build.

We can change the on-disk format later if needed (and probably bump the version number to 2.0.0 in that case).

> For the hot-cache/100% cache hit scenario, with saturated network interface, I agree that something like bazelbuild/remote-apis#147 and/or bazelbuild/bazel#6862 would be needed.

We can continue optimising after we get compression support (builds-without-the-bytes are already supported of course).

@mostynb mostynb merged commit feb1333 into buchgr:master Aug 24, 2020
@mostynb mostynb deleted the concurrent_inserts branch August 24, 2020 22:29
@ulrfa (Contributor) commented Aug 25, 2020

Thanks @mostynb! Sounds good!
