Add a blob existence cache to RemoteExecutionCache #13166
Conversation
Bazel currently performs a full findMissingBlob call for every input of every action that is not an immediate cache hit. This causes a significant amount of network traffic, resulting in long build times on low-bandwidth, high-latency connections (e.g., working from home). In particular, for a build of TensorFlow with remote execution over a slow-ish network, we see multi-minute delays on actions due to these findMissingBlob calls, and the client is not able to saturate a 50-machine remote execution cluster.

This change carefully de-duplicates findMissingBlob calls as well as file uploads to the remote execution cluster. It is based on a previous, simpler change that did not de-duplicate *concurrent* calls. However, we have found that concurrent calls with the same inputs are common - e.g., an action completes and unblocks a large number of actions that have significant overlap in their input sets (say, protoc link completion unblocking all protoc compile actions) - and we were still seeing multi-minute delays on action execution.

With this change, a TensorFlow build with remote execution is down to ~20 minutes on a 50-machine cluster, and is able to fully saturate the cluster. We have also seen significant improvements for remote Android app builds.

Progress on bazelbuild#12113.

Change-Id: I2b86076881f99d3d4a3ad8196abf9e2b5f6e0aa7
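The de-duplication of concurrent calls described above can be illustrated with a per-digest in-flight future map. This is a hedged sketch with hypothetical names, not the actual Bazel implementation:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch (hypothetical names, not the real Bazel code): concurrent
// callers asking about the same digest share a single in-flight future
// instead of each issuing their own findMissingBlob network call.
class FindMissingDeduplicator {
  private final ConcurrentHashMap<String, CompletableFuture<Boolean>> inFlight =
      new ConcurrentHashMap<>();
  private final AtomicInteger networkCalls = new AtomicInteger();

  // Returns a future for "is this blob missing remotely?". The factory runs
  // at most once per digest; later callers reuse the same future.
  CompletableFuture<Boolean> isMissing(String digest) {
    return inFlight.computeIfAbsent(digest, d -> {
      networkCalls.incrementAndGet();
      // Stand-in for the real findMissingBlobs RPC.
      return CompletableFuture.completedFuture(true);
    });
  }

  int networkCalls() { return networkCalls.get(); }
}
```

With this shape, the second and later requests for a digest never touch the network, which is the effect the PR description claims for overlapping input sets.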
(Force-pushed from 42f2977 to b985984)
@coeuvre, can you take a look or find someone else who can?
Thanks for the PR!
```java
      futures.add(uploadBlob(context, missingDigest, merkleTree, additionalInputs));
    }
  } else {
    // This code deduplicates findMissingDigests calls as well as file uploads. It works by
```
Why do we need to distinguish owned and unowned futures?

I think we are doing too much here. Can we have a cache layer between `RemoteExecutionCache` and `RemoteCacheClient` which is used to deduplicate the uploads and which owns the futures? All the futures would then be unowned from the perspective of `RemoteExecutionCache`. e.g.:

```java
class UploadCache {
  ListenableFuture<Void> upload(MerkleTree merkleTree, Map<Digest, Message> additionalInputs);
}
```

We could also hide the `findMissingDigests` call inside `UploadCache`, and potentially skip the call entirely if the requested blobs are already uploaded or their uploads are in progress.
`AsyncTaskCache`, which was added recently to fix #12972, could be used to implement the cache.
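For illustration, the core idea behind an `AsyncTaskCache`-style helper can be reduced to a keyed single-flight map. This is a minimal sketch with made-up names; Bazel's real `AsyncTaskCache` has a richer API (cancellation, eviction, etc.):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Minimal sketch (hypothetical names): for a given key, start the task only
// if no task for that key is already in flight; otherwise return the
// in-flight future so duplicate requests share one execution.
class KeyedSingleFlight<K, V> {
  private final ConcurrentHashMap<K, CompletableFuture<V>> tasks =
      new ConcurrentHashMap<>();

  CompletableFuture<V> executeIfNot(K key, Supplier<CompletableFuture<V>> task) {
    // computeIfAbsent guarantees the task factory runs at most once per key.
    return tasks.computeIfAbsent(key, k -> task.get());
  }
}
```

An upload cache keyed by digest would call something like `executeIfNot(digest, () -> doUpload(digest))`, and concurrent requests for the same digest would transparently share one upload.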
FYI, we already have deduplication logic in `ByteStreamUploader`, but it isn't enabled because `forceUpload` is set to `true` here. That said, that deduplication is only available for the gRPC cache client, whereas this PR enables deduplication for all cache protocols. (It is currently only relevant for remote execution, which only supports the gRPC protocol.)
> FYI, we have the deduplication logic at `ByteStreamUploader` but it isn't enabled since `forceUpload` is set to `true` here. That being said, the deduplication is only available for the gRPC cache client. This PR allows deduplication for all cache protocols. (This is only available for remote execution, which only supports the gRPC protocol.)
I'm prototyping remote execution support with the HTTP cache protocol as well, because that allows reusing HTTP load-balancing infrastructure, and I get performance on par with gRPC. I therefore very much appreciate that this PR is generic and also works for HTTP.
I don't think `AsyncTaskCache` would be ideal here. The problem is that the `findMissingBlobs` calls are bulk calls: we have to deduplicate the individual entries, but we still want to perform bulk calls for performance. That requires passing a collection in and out of this class.

I think `ByteStreamUploader` is not a good place for a cache. The API to check for CAS entries is `findMissingBlobs`, which is a bulk API, while `ByteStreamUploader` has a single-file upload API - it can't automatically turn a sequence of single-file uploads into one bulk check call. The caching has to live outside of `ByteStreamUploader` and itself needs a bulk API so it can call `findMissingBlobs` efficiently. I also don't like that it unconditionally caches knowledge about an upload for the duration of the build - if the CAS loses entries faster than the build time, Bazel gets confused and fails for no good reason.

The distinction between owned and unowned futures is this: any number of calls can enter this method concurrently, and each call owns some of the digests it sees, but not all of them. Each call then makes one bulk `findMissingBlobs` call for all the digests it owns, and uploads the ones that are missing remotely; for the unowned digests it simply waits on the futures registered by the other concurrent calls.
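The owned/unowned split described in the comment above can be sketched roughly as follows. Names are hypothetical and the real code operates on `Digest` objects and Merkle trees rather than strings; this only illustrates the claiming-plus-bulk-call pattern:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Rough sketch (hypothetical names): each caller "owns" the digests it was
// first to register, performs ONE bulk check-and-upload covering all of
// them, and merely waits on other callers' futures for the rest.
class BulkDeduplicator {
  private final ConcurrentHashMap<String, CompletableFuture<Void>> inFlight =
      new ConcurrentHashMap<>();

  CompletableFuture<Void> ensurePresent(
      Set<String> digests, Function<Set<String>, CompletableFuture<Void>> bulkCheckAndUpload) {
    Set<String> owned = new HashSet<>();
    List<CompletableFuture<Void>> waits = new ArrayList<>();
    for (String d : digests) {
      CompletableFuture<Void> fresh = new CompletableFuture<>();
      CompletableFuture<Void> existing = inFlight.putIfAbsent(d, fresh);
      if (existing == null) {
        owned.add(d);        // we registered first: this digest is ours to handle
        waits.add(fresh);
      } else {
        waits.add(existing); // another caller is handling it: just wait
      }
    }
    if (!owned.isEmpty()) {
      // Stand-in for one bulk findMissingBlobs + upload for all owned digests.
      bulkCheckAndUpload.apply(owned).whenComplete((v, e) -> {
        for (String d : owned) {
          CompletableFuture<Void> f = inFlight.get(d);
          if (e == null) f.complete(null); else f.completeExceptionally(e);
        }
      });
    }
    return CompletableFuture.allOf(waits.toArray(new CompletableFuture[0]));
  }
}
```

Note how the bulk call still covers an arbitrary number of digests per RPC, which is the property the comment argues a per-key cache like `AsyncTaskCache` cannot provide on its own.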
I'm happy to remove the flag and always enable this code.
Thanks for the clarification. I now agree with you that `ByteStreamUploader` cannot be used to deduplicate `findMissingBlobs` calls.
Feel free to ignore: `AsyncTaskCache` could still be used to deduplicate the uploads themselves, just like what you achieved with the `ConcurrentHashMap` combined with the deduplication logic.
> I'm happy to remove the flag and always enable this code.
Please remove the flag!
Can you please also add test cases for the deduplication?
```diff
@@ -132,7 +132,7 @@ public ExecutionResult execute(
     additionalInputs.put(actionDigest, action);
     additionalInputs.put(commandHash, command);

-    remoteCache.ensureInputsPresent(context, merkleTree, additionalInputs);
+    remoteCache.ensureInputsPresent(context, merkleTree, additionalInputs, true);
```
Why set `true` here?
```java
      help =
          "If set to true, Bazel deduplicates calls to the remote service to find missing "
              + "files and also deduplicates uploads.")
  public boolean experimentalRemoteDeduplicateUploads;
```
We have too many flags now. Is there a concern that requires introducing this flag and disabling the feature by default? I think people will always want this enabled.