Reuse input map and merkle tree for tools with lots of inputs #10875
Comments
I've been profiling our Bazel and remote execution when building C/C++ code. We have experimented with a packaged C/C++ toolchain, but 10000 extra input files made Bazel 2.2.0 gives me average self time of (listed numbers above 1 ms):
I tried to modify and cache parts of MerkleTree.build, but I find myself in the fundamental problem of always having to loop through all the files. Just that loop (10k inputs) will be too slow for 100k actions. Instead, I tried to create a quick hash directly on the Further, I use the hash for The result is then, given a populated
Perfect! All the cpu time has gone. As the Now,
What is Maybe the best way would be to cache the |
@moroten Great analysis! One option that we considered was building a merkle tree based on the nestedset structure, and merging them at subsequent levels. Did you try that? |
@ulfjack My assumption is that the NestedSet structure doesn't resemble a directory structure. The dependency graph might also include diamond structures. Do you want to make use of the fact that in most cases, each level of the NestedSet should only contain files from a single directory? I have not been able to come up with a good approach for this. Another test I did was to call |
I've been optimizing and profiling Bazel regarding this topic. My test example is running about 50000 actions of which most are compiling C/C++ code using remote execution with For each test, I run My baseline of Bazel 2.2.0 runs the execution phase in 34 s. Looking at the profile below shows that most CPU is spent on The fix is to calculate a fast hash, based on NestedSets/depsets. The fast hash of an action is stored in the action cache and maps to the real action key, which is based on the Merkle tree. (The action cache storage is used in a hacky way.) A build based on Bazel 3.0.0rc2 shows that the CPU time has been converted into network latency, which is good, but the execution phase still takes 34 s. To remove the latency, I added an in-memory cache with eviction after one hour. This cuts the execution phase to 13 s. Most of the profile graph was empty, so I added more profiling points, see the red selection in the image below. (This test is based on Bazel 3.1.0rc1. The screen dump still covers 10 ms)
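The fast-hash lookup with a one-hour in-memory cache described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the branch: `FastKeyCache`, `action_key_for` and the injectable `clock` are hypothetical names, and the fast hash is assumed to be an opaque string already derived from the NestedSets.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class FastKeyCache:
    """Hypothetical in-memory map from a cheap fast hash to the real action key."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, fast_hash: str) -> Optional[str]:
        entry = self._entries.get(fast_hash)
        if entry is None:
            return None
        stored_at, action_key = entry
        if self._clock() - stored_at > self._ttl:
            # Entry is older than the TTL: evict it and report a miss.
            del self._entries[fast_hash]
            return None
        return action_key

    def put(self, fast_hash: str, action_key: str) -> None:
        self._entries[fast_hash] = (self._clock(), action_key)


def action_key_for(fast_hash: str, cache: FastKeyCache,
                   compute_real_key: Callable[[], str]) -> str:
    """Return the Merkle-based action key, computing it only on a cache miss."""
    cached = cache.get(fast_hash)
    if cached is not None:
        return cached
    key = compute_real_key()  # the expensive path, e.g. building the Merkle tree
    cache.put(fast_hash, key)
    return key
```

The expensive key computation runs once per distinct fast hash within the TTL window; everything else is a dictionary lookup.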
C++ dependency discovery is complicated. Google internally uses the include scanner (which is open source, but not hooked up in Bazel), and that is a hard requirement for C++ modules (to reduce the number of inputs based on which modules are actually necessary rather than shipping the full set of transitive inputs). I seem to remember that there is a way to disable .d file parsing, but keep in mind that it's problematic to do that if you use dynamic execution (the action doesn't know whether it's used with local or remote execution, or both). It's also incompatible with include scanning (which is different from what shows up as dependency discovery in the profile). This would require a C++ expert to look into it. |
For reference in my investigation, the sizes of the input lists are distributed as follows among the 44479 actions:
Below follows another view of my comment with profiling pictures above, but with summarized numbers from the profiling. After removing all the actions with more than 10000 inputs in my C/C++ build, I did some more profiling. The numbers below were measured using --jobs=1 and with an in-memory AC+CAS inside Bazel. The AC+CAS was fully populated to avoid an idling CPU due to network latency. An adapted Bazel 3.0 gave Using the fast cache based on depsets, Saving 30% of the CPU time for a fully cached build is not bad. The time outside the action execution seems to be spent updating the build graph, i.e. code related to "discover inputs" and the Skyframe framework (which does depset.toList() in a few places). Also, plotting the number of inputs against MerkleTree.build duration gave roughly 0.1 ms/action + 0.002 ms/input in CPU time.
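As a rough worked example of that linear fit, the sketch below plugs the two measured coefficients into a small cost model. The function names and the example action and input counts are illustrative assumptions; only the 0.1 ms/action and 0.002 ms/input figures come from the measurement above.

```python
PER_ACTION_MS = 0.1   # fixed cost per action, from the fit above
PER_INPUT_MS = 0.002  # additional cost per input file, from the fit above


def merkle_build_ms(num_inputs: int) -> float:
    """Estimated MerkleTree.build CPU time for one action, in milliseconds."""
    return PER_ACTION_MS + PER_INPUT_MS * num_inputs


def total_cpu_seconds(num_actions: int, inputs_per_action: int) -> float:
    """Estimated total CPU time if every action has the same number of inputs."""
    return num_actions * merkle_build_ms(inputs_per_action) / 1000.0


# Illustrative numbers only: one action with 10,000 inputs, then 50,000 such actions.
print(f"{merkle_build_ms(10_000):.1f} ms per action")
print(f"{total_cpu_seconds(50_000, 10_000):.0f} s in total")
```

Under this model the per-input term dominates as soon as an action has more than about 50 inputs, which is why large shared toolchains weigh so heavily.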
I've uploaded a branch with the code I'm testing right now (rebased version from 3.1.0, hopefully compiles): https://github.com/moroten/bazel/commits/optimize-remote-execution |
@moroten thanks! Can you also provide the repository you used to benchmark your fork? |
@buchgr Unfortunately no. It is a proprietary repository with mostly C and C++ code. When testing a bit more, it seems like activating the in-memory action cache gives a good speedup (roughly 40%), but additionally activating the in-memory CAS for .d files does not add much more (roughly 10%). This is with a low latency connection to the Buildbarn server.
The |
Nice! Skyframe tracks the files modified between builds. A nice optimization to make for incremental builds would be to not call findMissingBlobs() eagerly but instead proactively upload the modified files and assume that all other files already exist remotely. |
@buchgr Does your idea of using Skyframe data hold when "discarding analysis cache"? In our CI builds, we are switching configuration (e.g. dbg vs. opt). (Very annoying when we know that our outputs go to different folders anyway.)
I don't know. Probably not. You would need to try it out. I was more thinking about incremental interactive builds anyway. Do you find that optimizing |
I believe that the analysis cache does not include the file system cache, so it should work (for source files; output files are special in another way, so that shouldn't be a problem). Reducing the number of findMissingBlobs entries seems like a nice improvement - ideally, it would integrate with Skyframe somehow, but I don't see how that could be done. |
Hi guys, I looked into this extensively a couple years ago. My conclusion was that generalised merkle tree caching doesn't really work. There aren't that many directories that are entirely shared between many actions, and the cache gets really really big. I did however look into whether you could have so-called "input sets", which would be additional pre-cached sets of inputs supplied to action execution alongside the regular set of inputs. Rules would explicitly request caching for these based on knowledge that they would be large, frequently used sets. I think the idea was to do it for toolchains, runfiles, and middlemen. Unfortunately the doc with the investigation contains internal information (since it related to our internal build distribution system) so I can't link it here verbatim. I can link any googler that wants it, and I am also happy to discuss general numbers. At a glance it looked like we could achieve something like 50% reduction in repeated inputs. |
That's what I was trying to suggest in the original issue when I said "each tool's input map and merkle tree should be reused between actions that depend on the same tool".
Maybe I'm misunderstanding how merkle trees are used, but I don't think we need generalized caching. Couldn't you add a |
That's the idea, but one thing you may possibly be missing is that our remote execution protocols only support a single tree. They would need to support multiple. If not, you'd have to merge the input sets with the other inputs, which is relatively expensive. |
As a sort of follow up to our specific problem that led me to file the original issue, I used vercel/ncc to bundle the node script and all of its dependencies into a single file. The average duration for |
@JaredNeil do you have something like |
We generate it at build time. It takes a little over 60s, but since it's basically always cached, that's not a big problem for us. The |
@JaredNeil thank you so much for sharing these. In my local tests (admittedly in a synthetic test repo) I'm seeing pretty significant improvements. This is a great trick! |
Would it be possible to speed up computation of the input digest by splitting the input into disjoint NestedSets? Big libraries, or tools, would each land in a single disjoint set, and the digest would need to be computed only once for those. Conceptually, for each NestedSet we could compute the common path of all files within the set. To compute the disjoint sets, we could start with all input NestedSets and stop at those that have a unique path. NestedSets that may contribute files to the same directories need to be expanded into their children, and the check, and possibly the expansion, is performed again. I will give it a try, but let me know if there are obvious issues with this solution. @tomlu's post above suggests that in practice it may not give enough benefit, but in principle it could be comparable to @moroten's solution with the fast digest.
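One possible reading of the expansion procedure proposed above is sketched below. It is a toy model, not Bazel code: `NestedSet` here is a hypothetical stand-in with direct file paths and child sets, and two sets are treated as "possibly contributing to the same directories" when one common directory is a prefix of the other.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class NestedSet:
    """Hypothetical stand-in for a depset: direct file paths plus child sets."""
    direct: Tuple[str, ...] = ()
    transitive: Tuple["NestedSet", ...] = ()


def flatten(s: NestedSet) -> Set[str]:
    files = set(s.direct)
    for child in s.transitive:
        files |= flatten(child)
    return files


def common_dir(files: Set[str]) -> Tuple[str, ...]:
    """Longest directory prefix shared by all files (as path components)."""
    parts = [f.split("/")[:-1] for f in files]
    prefix = parts[0]
    for p in parts[1:]:
        n = 0
        while n < len(prefix) and n < len(p) and prefix[n] == p[n]:
            n += 1
        prefix = prefix[:n]
    return tuple(prefix)


def overlaps(a: Tuple[str, ...], b: Tuple[str, ...]) -> bool:
    """True if one directory is a prefix of (or equal to) the other."""
    n = min(len(a), len(b))
    return a[:n] == b[:n]


def split_disjoint(roots: List[NestedSet]) -> Tuple[List[NestedSet], List[NestedSet]]:
    """Expand clashing sets into children until no expandable clash remains.

    Returns (disjoint sets, leftover leaf sets that still share a directory).
    """
    work = list(dict.fromkeys(s for s in roots if flatten(s)))
    while True:
        dirs = [common_dir(flatten(s)) for s in work]
        clashing = [
            any(i != j and overlaps(dirs[i], dirs[j]) for j in range(len(work)))
            for i in range(len(work))
        ]
        if not any(c and s.transitive for c, s in zip(clashing, work)):
            disjoint = [s for c, s in zip(clashing, work) if not c]
            leftover = [s for c, s in zip(clashing, work) if c]
            return disjoint, leftover
        expanded: List[NestedSet] = []
        for c, s in zip(clashing, work):
            if c and s.transitive:
                if s.direct:
                    expanded.append(NestedSet(direct=s.direct))
                expanded.extend(s.transitive)
            else:
                expanded.append(s)
        # Deduplicate so that diamond-shaped graphs do not clash with themselves.
        work = list(dict.fromkeys(x for x in expanded if flatten(x)))
```

The sketch flattens every set on each round, so it says nothing about whether the approach is fast enough in practice; it only shows how the stop-or-expand rule plays out.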
It sounds like an interesting idea, but if I'm thinking of our code structure it would probably not work very well. All code includes I hope to get time to experiment with bazelbuild/remote-apis#141 during August, implementing patches for Bazel and Buildbarn. |
MerkleTree calculations are now cached for each node in the input NestedSets. This drastically improves the speed when checking for cache hit. One example reduced the calculation time from 78 ms to 3 ms for 3000 inputs. This caching can be disabled using --remote_cache_merkle_trees=false which will reduce the memory footprint. The caching is discarded after each build to free up memory, the cache setup time is negligible. Fixes bazelbuild#10875.
MerkleTree calculations are now cached for each node in the input NestedSets (depsets). This drastically improves the speed when checking for remote cache hits. One example reduced the Merkle tree calculation time from 78 ms to 3 ms for 3000 inputs. This caching can be disabled using --remote_cache_merkle_trees=false which will reduce the memory footprint. The caching is discarded after each build to free up memory, the cache setup time is negligible. Fixes bazelbuild#10875.
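The idea of caching one tree per NestedSet node and merging the cached trees can be sketched as below. This is a simplified model under stated assumptions, not the implementation from the change: `NestedSet`, `MerkleCache`, `merge` and `digest` are hypothetical names, trees are plain nested dictionaries, and the digest format is invented for the example rather than the remote execution protocol's directory encoding.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class NestedSet:
    """Hypothetical depset node: (path, content digest) pairs plus child sets."""
    direct: Tuple[Tuple[str, str], ...] = ()
    transitive: Tuple["NestedSet", ...] = ()


def merge(a: dict, b: dict) -> dict:
    """Merge two directory trees without mutating either input."""
    out = dict(a)
    for name, value in b.items():
        if isinstance(out.get(name), dict) and isinstance(value, dict):
            out[name] = merge(out[name], value)
        else:
            out[name] = value
    return out


class MerkleCache:
    """Caches one directory tree per NestedSet node, keyed by object identity."""

    def __init__(self) -> None:
        self._trees: Dict[int, Tuple[NestedSet, dict]] = {}
        self.misses = 0

    def tree_for(self, node: NestedSet) -> dict:
        hit = self._trees.get(id(node))
        if hit is not None:
            return hit[1]
        self.misses += 1
        tree: dict = {}
        for path, file_digest in node.direct:
            *dirs, name = path.split("/")
            cursor = tree
            for d in dirs:
                cursor = cursor.setdefault(d, {})
            cursor[name] = file_digest
        for child in node.transitive:
            tree = merge(tree, self.tree_for(child))
        # Keep a reference to the node so its id() stays valid for the cache's lifetime.
        self._trees[id(node)] = (node, tree)
        return tree


def digest(tree: dict) -> str:
    """Toy Merkle digest: hash of sorted (kind, name, child digest) entries."""
    h = hashlib.sha256()
    for name in sorted(tree):
        value = tree[name]
        kind = "d" if isinstance(value, dict) else "f"
        child = digest(value) if isinstance(value, dict) else value
        h.update(f"{kind}:{name}:{child}\n".encode())
    return h.hexdigest()
```

When many actions share the same tool depset, the tool's tree is built once and every later action only builds its own small tree and merges.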
The experiments have resulted in #13879 which is caching Merkle trees built from each |
When --experimental_remote_merkle_tree_cache is set, Merkle tree calculations are cached for each node in the input NestedSets (depsets). This drastically improves the speed when checking for remote cache hits. One example reduced the Merkle tree calculation time from 78 ms to 3 ms for 3000 inputs. The memory footprint of the cache is controlled by --experimental_remote_merkle_tree_cache_size. The caching is discarded after each build to free up memory; the cache setup time is negligible. Fixes bazelbuild#10875. Closes bazelbuild#13879. PiperOrigin-RevId: 405793372
When --experimental_remote_merkle_tree_cache is set, Merkle tree calculations are cached for each node in the input NestedSets (depsets). This drastically improves the speed when checking for remote cache hits. One example reduced the Merkle tree calculation time from 78 ms to 3 ms for 3000 inputs. The memory foot print of the cache is controlled by --experimental_remote_merkle_tree_cache_size. The caching is discarded after each build to free up memory, the cache setup time is negligible. Fixes #10875. Closes #13879. PiperOrigin-RevId: 405793372 Co-authored-by: Fredrik Medley <fredrik.medley@gmail.com>
Description of the feature request:
Create a separate input map and merkle tree for each tool used in an action, then combine those to calculate the final action digest. Each tool's input map and merkle tree should be reused between actions that depend on the same tool.
Feature requests: what underlying problem are you trying to solve with this feature?
These two function calls cause remote caching to be really slow for actions with lots of inputs. Having a way to re-use the parts that don't change between multiple actions that use the same tool could make these significantly faster.
Tools written in JavaScript will often have a high number of runfiles (>10,000) because they depend on the node_modules folder. This is also sometimes true of Python tools.
Have you found anything relevant by searching the web?
Any other information, logs, or outputs that you want to share?
In our specific case, we have an action per source file that uses a nodejs_binary tool. So we have 15,000 actions with the same ~10,000 runfiles for the tool and 1 unique input file. Each of these takes ~200 ms to calculate the cache key, so that's 50 CPU-minutes of work just to check if the actions are cached.
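The 50 CPU-minutes figure follows directly from the numbers in the report, as the short calculation below shows. The second function is a what-if under an assumption that is not in the report: that the tool's tree is computed once (200 ms) and each action then pays only a small residual cost, for which 1 ms is a purely hypothetical placeholder.

```python
def cache_key_cpu_minutes(num_actions: int, ms_per_action: float) -> float:
    """Total CPU time spent computing cache keys, in minutes."""
    return num_actions * ms_per_action / 1000.0 / 60.0


def with_reuse_cpu_minutes(num_actions: int, shared_ms_once: float,
                           residual_ms_per_action: float) -> float:
    """Same total if the shared part is computed once (hypothetical model)."""
    total_ms = shared_ms_once + num_actions * residual_ms_per_action
    return total_ms / 1000.0 / 60.0


# Reported case: 15,000 actions at ~200 ms each.
print(cache_key_cpu_minutes(15_000, 200))

# Hypothetical: 200 ms once for the tool, then an assumed 1 ms per action.
print(round(with_reuse_cpu_minutes(15_000, 200, 1.0), 2))
```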