
Make Bazel more responsive and use less memory when --jobs is high #17120

Conversation

EdSchouten
Contributor

@EdSchouten EdSchouten commented Jan 3, 2023

When using Bazel in combination with a larger remote execution cluster, it's not uncommon to call it with something like --jobs=512. We have observed that this is currently problematic for a couple of reasons:

  1. It causes Bazel to launch 512 local threads, each being responsible for running one action remotely. All of these local threads may spend a lot of time in buildRemoteAction(), generating input roots in the form of Merkle trees.

    As the local system tends to have fewer than 512 CPUs, all of these threads will unnecessarily compete with each other. One practical downside of that is that interrupting Bazel using ^C takes a very long time, as it first wants to complete the computation of all 512 Merkle trees.

    Let's put a semaphore in place, limiting the number of concurrent Merkle tree computations to the number of CPU cores available. By making the semaphore interruptible, Bazel will immediately stop processing the 512 - nCPU actions that were waiting for the semaphore. (A minimal sketch of this pattern follows the description below.)

  2. Related to the above, Bazel will end up keeping 512 Merkle trees in memory throughout all stages of execution. This makes sense, as we may get cache misses, requiring us to upload the input root afterwards. Or the execution of a remote action may fail, requiring us to upload the input root.

    That said, generally speaking these cases are fairly uncommon. Most builds have relatively high cache hit rates and execution retries only happen rarely. It is therefore not worth keeping these Merkle trees in memory constantly. We only need it when computing the action digest for GetActionResult(), and while uploading it into the CAS.

  3. AbstractSpawnStrategy.getInputMapping() has some smartness to memoize its results. This makes a lot of sense for local execution, where the input mapping is used in a couple of places. For remote caching/execution it is not evident that this is a good idea. Assuming you end up having a remote cache hit, you don't need it.

    Let's make the memoization optional, only using it in cases where we do local execution (which may also happen when you get a cache miss when doing remote caching without remote execution).

Similar changes against Bazel 5.x have allowed me to successfully do builds of a large monorepo with --jobs=512 and the default heap size limits, whereas I would normally see occasional OOM behaviour even when providing --host_jvm_args=-Xmx64g.
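
For illustration, here is a minimal sketch of the throttling pattern from point 1, using a plain java.util.concurrent.Semaphore whose acquire() is interruptible. Class and method names are illustrative only, not the actual Bazel code:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

final class MerkleTreeThrottle {
  // Bound concurrent Merkle tree computations by the local core count, not by --jobs.
  private final Semaphore permits =
      new Semaphore(Runtime.getRuntime().availableProcessors());

  <T> T withPermit(Callable<T> computeMerkleTree) throws Exception {
    // acquire() throws InterruptedException, so a ^C that interrupts the thread
    // releases waiting actions immediately instead of letting them start another
    // expensive Merkle tree computation.
    permits.acquire();
    try {
      return computeMerkleTree.call();
    } finally {
      permits.release();
    }
  }
}
```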

@EdSchouten EdSchouten force-pushed the eschouten/20230103-memory-usage branch 3 times, most recently from 8ff3784 to d75826b on January 3, 2023 14:04
@EdSchouten EdSchouten marked this pull request as ready for review January 3, 2023 14:17
@EdSchouten EdSchouten requested a review from a team as a code owner January 3, 2023 14:17
@brentleyjones
Contributor

FYI, a workaround for 2 is to use the Merkle cache. We've seen it reduce the memory needed from over 64 GB down to 17 GB.

@coeuvre
Member

coeuvre commented Jan 3, 2023

Thanks for bringing this up!

We are exploring using virtual threads to eliminate the --jobs flag, so that we only need as many platform threads as there are cores and use virtual threads to execute actions. With that, 1) shouldn't be a problem. In the meantime, I am open to using the semaphore as a workaround.

For 2) and 3), we are saving memory in the common cases but sacrificing performance in the rare ones, which sounds reasonable but probably needs more data, e.g. what's the build time regression for a cold build?
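
For reference, a rough illustration of the virtual-thread idea on Java 21: one cheap virtual thread per action instead of one platform thread per --jobs slot. This is only a sketch of the concept, not Bazel's actual executor wiring:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class VirtualThreadActionRunner {
  static void runActions(Iterable<Runnable> actions) {
    // One virtual thread per action; the JVM multiplexes them onto a small pool of
    // carrier (platform) threads, roughly sized to the number of cores.
    try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
      for (Runnable action : actions) {
        executor.submit(action);
      }
    } // close() waits for all submitted actions to finish.
  }
}
```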

@sgowroji sgowroji added team-Performance Issues for Performance teams awaiting-user-response Awaiting a response from the author labels Jan 3, 2023
@EdSchouten
Contributor Author

EdSchouten commented Jan 3, 2023

FYI, a workaround for 2 is to use the Merkle cache. We've seen it reduce the memory needed from over 64 GB down to 17 GB.

I also tried this option, but it didn't yield significant improvements for us (in terms of either memory usage or CPU time). Great to hear it works for you, though!

We are exploring using virtual threads to eliminate the --jobs flag, so that we only need as many platform threads as there are cores and use virtual threads to execute actions. With that, 1) shouldn't be a problem. In the meantime, I am open to using the semaphore as a workaround.

👍

For 2) and 3), we are saving memory in the common cases but sacrificing performance in the rare ones, which sounds reasonable but probably needs more data, e.g. what's the build time regression for a cold build?

I wasn't able to measure any regression in wall time, because for a cold build the build time is mostly dominated by execution taking place on the remote side. So it wastes more CPU cycles on the client, but only in cases where the client would otherwise have been idle.

Let me see what I can do to address the remaining test failures.

@fmeum
Collaborator

fmeum commented Jan 3, 2023

I wasn't able to measure any regression in wall time, because for a cold build the build time is mostly dominated by execution taking place on the remote side. So it wastes more CPU cycles on the client, but only in cases where the client would otherwise have been idle.

Just so that point isn't forgotten: It might be interesting to collect measurements for dynamic execution (local + remote) as that will have the client do actual work in addition to orchestrating the remote executor.

@EdSchouten
Contributor Author

Just so that point isn't forgotten: It might be interesting to collect measurements for dynamic execution (local + remote) as that will have the client do actual work in addition to orchestrating the remote executor.

I would really appreciate it if anyone using dynamic execution is willing to test this. We never use it on our end, because remote execution is inherently faster for us (since we use bb_clientd). I thus have no frame of reference. Sorry!

@EdSchouten EdSchouten force-pushed the eschouten/20230103-memory-usage branch from d75826b to 537f650 on January 3, 2023 17:59
@sgowroji sgowroji added awaiting-review PR is awaiting review from an assigned reviewer and removed awaiting-user-response Awaiting a response from the author labels Jan 4, 2023
Member

@coeuvre coeuvre left a comment

I hate flags, but to be on the safe side, can we add an experimental flag for 2 and 3 and have it disabled by default? That also lets others benchmark the impact.

There is also internal code relying on 3; making it conditional will help when importing.

@coeuvre coeuvre added awaiting-user-response Awaiting a response from the author team-Remote-Exec Issues and PRs for the Execution (Remote) team and removed awaiting-review PR is awaiting review from an assigned reviewer team-Performance Issues for Performance teams labels Jan 4, 2023
@EdSchouten EdSchouten force-pushed the eschouten/20230103-memory-usage branch from 537f650 to 8cba3d8 on January 9, 2023 09:54
@EdSchouten EdSchouten force-pushed the eschouten/20230103-memory-usage branch from 8cba3d8 to a66590d on January 9, 2023 10:56
@EdSchouten
Contributor Author

I hate flags, but to be on the safe side, can we add an experimental flag for 2 and 3 and have it disabled by default? That also lets others benchmark the impact.

Sure! I have just added an --experimental_remote_discard_merkle_trees flag. When set, it keeps RemoteAction.merkleTree unset and passes willAccessRepeatedly == false to getInputMapping().
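
A minimal sketch of how such a flag could gate both behaviours (names here are illustrative, not copied from the patch):

```java
final class DiscardMerkleTreesPolicy {
  private final boolean discardMerkleTrees; // --experimental_remote_discard_merkle_trees

  DiscardMerkleTreesPolicy(boolean discardMerkleTrees) {
    this.discardMerkleTrees = discardMerkleTrees;
  }

  /** Whether a remote action should keep its Merkle tree after the action digest is computed. */
  boolean retainMerkleTree() {
    return !discardMerkleTrees;
  }

  /** Value to pass as willAccessRepeatedly when calling getInputMapping(). */
  boolean memoizeInputMapping() {
    return !discardMerkleTrees;
  }
}
```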

Member

@coeuvre coeuvre left a comment

Thanks! I am importing the PR and making the corresponding changes to internal code. If something goes wrong internally, I will report back.

@meisterT
Member

Interestingly I made this related change yesterday: 3d29b2e

Side note: IMHO it would have been better to split this PR into three separate PRs (no action required).

@EdSchouten EdSchouten deleted the eschouten/20230103-memory-usage branch January 19, 2023 08:21
@exoson
Contributor

exoson commented Jan 20, 2023

Could this be cherry-picked to Bazel 6.1.0?

@coeuvre
Member

coeuvre commented Jan 20, 2023

@bazel-io fork 6.1.0

@coeuvre
Member

coeuvre commented Jan 20, 2023

Sure. I believe this is a safe change that brings a lot of value to remote execution. I would also like folks to be able to play around with the new flag sooner and maybe report back their experience/benchmarks here.

coeuvre pushed a commit to coeuvre/bazel that referenced this pull request Feb 2, 2023
ShreeM01 added a commit that referenced this pull request Feb 7, 2023
hvadehra pushed a commit that referenced this pull request Feb 14, 2023
@brentleyjones
Contributor

brentleyjones commented Dec 8, 2023

Similar to #19924 (comment), I filed #20478.

cc: @coeuvre @werkt

merkleTree = buildInputMerkleTree(spawn, context, toolSignature);
}

remoteExecutionCache.ensureInputsPresent(
Contributor

This looks wrong because this makes a remote call, and this is now limited to N parallel remote calls. This should have been outside the semaphore.

Member

My understanding was that this limits the number of Merkle trees in flight, preventing OOMs.

@EdSchouten was this the intention?

also cc @tjgq (and leaving a reference to #21378 here)

Contributor

I'm specifically talking about the ensureInputsPresent call, which performs remote calls, and is now limited.

Contributor Author

@EdSchouten EdSchouten Apr 4, 2024

Exactly, it has to be limited to make sure we don't OOM because memory fills up with Merkle trees. Yes, this does mean that the concurrency of outgoing network calls is limited as well.

One way to make this less problematic would be to implement buildInputMerkleTree() in such a way that it streams the directories it creates, with FindMissingBlobs() called in batches. But because buildInputMerkleTree() is not built like that, we currently don't have a lot of options here.

I guess I never observed this to be problematic for performance, because Buildbarn's FindMissingBlobs() implementation is sufficiently fast. It typically completes in ~2 milliseconds, even if 50k digests are provided.
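
A rough sketch of the batching idea mentioned above, assuming a caller-supplied function that wraps the actual FindMissingBlobs() call; this is not the existing Bazel code path:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

final class BatchedMissingBlobs {
  /**
   * Splits a large digest list into fixed-size batches so missing-blob checks can start
   * (and intermediate state can be released) before the whole Merkle tree has been walked.
   */
  static <D> Set<D> findMissingInBatches(
      List<D> digests, int batchSize, Function<List<D>, Set<D>> findMissingBlobs) {
    Set<D> missing = new HashSet<>();
    for (int start = 0; start < digests.size(); start += batchSize) {
      int end = Math.min(start + batchSize, digests.size());
      missing.addAll(findMissingBlobs.apply(digests.subList(start, end)));
    }
    return missing;
  }
}
```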

Contributor

@werkt werkt Apr 4, 2024

Limiting this by the # of cores is nonsensical when the concern is memory exhaustion; I might as well limit by the number 42. You're no more or less likely to create memory pressure with this based on the number of cores on a host.

We should limit by a heap pool size, charge by the weight of the merkle tree, and if it turns out that portions of that tree are shared across merkle trees (not sure if that optimization exists), then it should be the net weight of all unique merkle tree components.

Measuring this in milliseconds means that you're blocked on it - for a remote that responds in 1ms, the throughput cap is N * 1000, decided arbitrarily. Utilizing the network also means that there's a potential for async resource exhaustion there that will impact the runtime (but has nothing to do with remediating memory). The case where this was discovered was pathologically small input trees, --jobs=10000 with sub-second execution times, where bandwidth consumption, queued request processing on the channel, and the like could drastically increase the time in the critical section, through no fault of any remote service.
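
A sketch of the alternative suggested here: admit Merkle tree computations against a byte budget rather than a core count. The budget size and the per-tree weight estimate are assumptions for illustration:

```java
final class MerkleTreeMemoryBudget {
  private final long capacityBytes;
  private long usedBytes;

  MerkleTreeMemoryBudget(long capacityBytes) {
    this.capacityBytes = capacityBytes;
  }

  /** Blocks (interruptibly) until the estimated weight of a Merkle tree fits in the budget. */
  synchronized void acquire(long estimatedBytes) throws InterruptedException {
    // Always admit at least one in-flight tree, even if it alone exceeds the budget.
    while (usedBytes > 0 && usedBytes + estimatedBytes > capacityBytes) {
      wait();
    }
    usedBytes += estimatedBytes;
  }

  synchronized void release(long estimatedBytes) {
    usedBytes -= estimatedBytes;
    notifyAll();
  }
}
```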

Contributor Author

@EdSchouten EdSchouten Apr 20, 2024

Limiting it by the number of cores is a better heuristic than limiting it by the --jobs flag, the reason being that the amount of RAM a system has tends to be proportional to the number of cores. The RAM on my system is not necessarily proportional to the number of worker threads my build cluster running on {AWS,GCP,...} has.

I'm fine with whatever the limit is, as long as there is a sane one. Ulf mentioned that this code should not have been covered by a semaphore. This is something I object to, given that the Merkle tree computation code is not written in such a way that it can return partial Merkle trees.

Member

I am working on async execution (running actions in virtual threads; check --experimental_async_execution in HEAD), which might allow us to remove this limit because the size of the underlying thread pool equals the number of cores.
