Add a blob existence cache to RemoteExecutionCache #13166
Conversation
Bazel currently performs a full findMissingBlob call for every input of every action that is not an immediate cache hit. This causes a significant amount of network traffic, resulting in long build times on low-bandwidth, high-latency connections (e.g., working from home). In particular, for a build of TensorFlow with remote execution over a slow-ish network, we see multi-minute delays on actions due to these findMissingBlob calls, and the client is not able to saturate a 50-machine remote execution cluster.

This change carefully de-duplicates findMissingBlob calls as well as file uploads to the remote execution cluster. It is based on a previous, simpler change that did not de-duplicate *concurrent* calls. However, we have found that concurrent calls with the same inputs are common - e.g., an action completes and unblocks a large number of actions that have significant overlap in their input sets (say, protoc link completion unblocking all protoc compile actions) - and we were still seeing multi-minute delays on action execution.

With this change, a TensorFlow build with remote execution is down to ~20 minutes on a 50-machine cluster, and is able to fully saturate the cluster. We have also seen significant improvements for remote Android app builds.

Progress on bazelbuild#12113.

Change-Id: I2b86076881f99d3d4a3ad8196abf9e2b5f6e0aa7
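The de-duplication of concurrent calls described above can be illustrated with a per-digest in-flight future map. This is a hedged sketch with hypothetical names, not the actual Bazel implementation:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch (hypothetical names, not the real Bazel code): concurrent
// callers asking about the same digest share a single in-flight future
// instead of each issuing their own findMissingBlob network call.
class FindMissingDeduplicator {
  private final ConcurrentHashMap<String, CompletableFuture<Boolean>> inFlight =
      new ConcurrentHashMap<>();
  private final AtomicInteger networkCalls = new AtomicInteger();

  // Returns a future for "is this blob missing remotely?". The factory runs
  // at most once per digest; later callers reuse the same future.
  CompletableFuture<Boolean> isMissing(String digest) {
    return inFlight.computeIfAbsent(digest, d -> {
      networkCalls.incrementAndGet();
      // Stand-in for the real findMissingBlobs RPC.
      return CompletableFuture.completedFuture(true);
    });
  }

  int networkCalls() { return networkCalls.get(); }
}
```

With this shape, the second and later requests for a digest never touch the network, which is the effect the PR description claims for overlapping input sets.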
(Force-pushed from 42f2977 to b985984)
@coeuvre, can you take a look or find someone else who can?
Thanks for the PR!
```java
      futures.add(uploadBlob(context, missingDigest, merkleTree, additionalInputs));
    }
  } else {
    // This code deduplicates findMissingDigests calls as well as file uploads. It works by
```
Why do we need to distinguish owned and unowned futures?

I think we are doing too much here. Can we have a cache layer between `RemoteExecutionCache` and `RemoteCacheClient` which is used to deduplicate the uploads and which owns the futures? All the futures would then be unowned from the perspective of `RemoteExecutionCache`. e.g.:

```java
class UploadCache {
  ListenableFuture<Void> upload(MerkleTree merkleTree, Map<Digest, Message> additionalInputs);
}
```

We could also hide the `findMissingDigests` call inside `UploadCache`, and potentially skip the call entirely if the requested blobs are already uploaded or their uploads are in progress.
`AsyncTaskCache`, which was added recently to fix #12972, could be used to implement the cache.
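For illustration, the core idea behind an `AsyncTaskCache`-style helper can be reduced to a keyed single-flight map. This is a minimal sketch with made-up names; Bazel's real `AsyncTaskCache` has a richer API (cancellation, eviction, etc.):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Minimal sketch (hypothetical names): for a given key, start the task only
// if no task for that key is already in flight; otherwise return the
// in-flight future so duplicate requests share one execution.
class KeyedSingleFlight<K, V> {
  private final ConcurrentHashMap<K, CompletableFuture<V>> tasks =
      new ConcurrentHashMap<>();

  CompletableFuture<V> executeIfNot(K key, Supplier<CompletableFuture<V>> task) {
    // computeIfAbsent guarantees the task factory runs at most once per key.
    return tasks.computeIfAbsent(key, k -> task.get());
  }
}
```

An upload cache keyed by digest would call something like `executeIfNot(digest, () -> doUpload(digest))`, and concurrent requests for the same digest would transparently share one upload.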
FYI, we already have deduplication logic in `ByteStreamUploader`, but it isn't enabled because `forceUpload` is set to `true` here. That said, that deduplication is only available for the gRPC cache client, whereas this PR enables deduplication for all cache protocols. (It is currently only relevant for remote execution, which only supports the gRPC protocol.)
> FYI, we have the deduplication logic at `ByteStreamUploader` but it isn't enabled since `forceUpload` is set to `true` here. That being said, the deduplication is only available for the gRPC cache client. This PR allows deduplication for all cache protocols. (This is only available for remote execution, which only supports the gRPC protocol.)
I'm prototyping remote execution support with the HTTP cache protocol as well, because that allows reusing HTTP load-balancing infrastructure, and I get performance on par with gRPC. I therefore very much appreciate that this PR is generic and also works for HTTP.
I don't think `AsyncTaskCache` would be ideal here. The problem is that the `findMissingBlobs` calls are bulk calls: we have to deduplicate the individual entries, but we still want to perform bulk calls for performance. That requires passing a collection in and out of this class.

I think `ByteStreamUploader` is not a good place for a cache. The API to check for CAS entries is `findMissingBlobs`, which is a bulk API, while `ByteStreamUploader` has a single-file upload API - it can't automatically turn a sequence of single-file uploads into one bulk check call. The caching has to live outside of `ByteStreamUploader` and itself needs a bulk API so it can call `findMissingBlobs` efficiently. I also don't like that it unconditionally caches knowledge about an upload for the duration of the build - if the CAS loses entries faster than the build time, Bazel gets confused and fails for no good reason.

The distinction between owned and unowned futures is this: any number of calls can enter this method concurrently, and each call owns some of the digests it sees, but not all of them. Each call then makes one bulk `findMissingBlobs` call for all the digests it owns, and uploads the ones that are missing remotely; for the unowned digests it simply waits on the futures registered by the other concurrent calls.
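The owned/unowned split described in the comment above can be sketched roughly as follows. Names are hypothetical and the real code operates on `Digest` objects and Merkle trees rather than strings; this only illustrates the claiming-plus-bulk-call pattern:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Rough sketch (hypothetical names): each caller "owns" the digests it was
// first to register, performs ONE bulk check-and-upload covering all of
// them, and merely waits on other callers' futures for the rest.
class BulkDeduplicator {
  private final ConcurrentHashMap<String, CompletableFuture<Void>> inFlight =
      new ConcurrentHashMap<>();

  CompletableFuture<Void> ensurePresent(
      Set<String> digests, Function<Set<String>, CompletableFuture<Void>> bulkCheckAndUpload) {
    Set<String> owned = new HashSet<>();
    List<CompletableFuture<Void>> waits = new ArrayList<>();
    for (String d : digests) {
      CompletableFuture<Void> fresh = new CompletableFuture<>();
      CompletableFuture<Void> existing = inFlight.putIfAbsent(d, fresh);
      if (existing == null) {
        owned.add(d);        // we registered first: this digest is ours to handle
        waits.add(fresh);
      } else {
        waits.add(existing); // another caller is handling it: just wait
      }
    }
    if (!owned.isEmpty()) {
      // Stand-in for one bulk findMissingBlobs + upload for all owned digests.
      bulkCheckAndUpload.apply(owned).whenComplete((v, e) -> {
        for (String d : owned) {
          CompletableFuture<Void> f = inFlight.get(d);
          if (e == null) f.complete(null); else f.completeExceptionally(e);
        }
      });
    }
    return CompletableFuture.allOf(waits.toArray(new CompletableFuture[0]));
  }
}
```

Note how the bulk call still covers an arbitrary number of digests per RPC, which is the property the comment argues a per-key cache like `AsyncTaskCache` cannot provide on its own.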
I'm happy to remove the flag and always enable this code.
Thanks for the clarification. I now agree with you that `ByteStreamUploader` cannot be used to deduplicate `findMissingBlobs` calls.
Feel free to ignore: `AsyncTaskCache` could still be used to deduplicate the uploads themselves, just like what you achieved with the `ConcurrentHashMap` combined with the deduplication logic.
> I'm happy to remove the flag and always enable this code.
Please remove the flag!
Can you please also add test cases for the deduplication?
```diff
@@ -132,7 +132,7 @@ public ExecutionResult execute(
     additionalInputs.put(actionDigest, action);
     additionalInputs.put(commandHash, command);

-    remoteCache.ensureInputsPresent(context, merkleTree, additionalInputs);
+    remoteCache.ensureInputsPresent(context, merkleTree, additionalInputs, true);
```
Why set `true` here?
```java
      help =
          "If set to true, Bazel deduplicates calls to the remote service to find missing "
              + "files and also deduplicates uploads.")
  public boolean experimentalRemoteDeduplicateUploads;
```
We have too many flags now. Is there a concern that requires introducing this flag and disabling the feature by default? I think people will always want this enabled.