
Make Bazel more responsive and use less memory when --jobs is high #17120

Conversation

EdSchouten
Contributor

@EdSchouten EdSchouten commented Jan 3, 2023

When using Bazel in combination with a larger remote execution cluster, it's not uncommon to call it with something like --jobs=512. We have observed that this is currently problematic for a couple of reasons:

  1. It causes Bazel to launch 512 local threads, each being responsible for running one action remotely. All of these local threads may spend a lot of time in buildRemoteAction(), generating input roots in the form of Merkle trees.

    As the local system tends to have fewer than 512 CPUs, all of these threads will unnecessarily compete with each other. One practical downside of that is that interrupting Bazel using ^C takes a very long time, as it first wants to complete the computation of all 512 Merkle trees.

    Let's put a semaphore in place, limiting the number of concurrent Merkle tree computations to the number of CPU cores available. By making the semaphore interruptible, Bazel will immediately stop processing the 512 - nCPU actions that were waiting for the semaphore. (A minimal sketch of this pattern follows the description below.)

  2. Related to the above, Bazel will end up keeping 512 Merkle trees in memory throughout all stages of execution. This makes sense, as we may get cache misses, requiring us to upload the input root afterwards. Or the execution of a remote action may fail, requiring us to upload the input root.

    That said, generally speaking these cases are fairly uncommon. Most builds have relatively high cache hit rates and execution retries only happen rarely. It is therefore not worth keeping these Merkle trees in memory constantly. We only need it when computing the action digest for GetActionResult(), and while uploading it into the CAS.

  3. AbstractSpawnStrategy.getInputMapping() has some smartness to memoize its results. This makes a lot of sense for local execution, where the input mapping is used in a couple of places. For remote caching/execution it is not evident that this is a good idea. Assuming you end up having a remote cache hit, you don't need it.

    Let's make the memoization optional, only using it in cases where we do local execution (which may also happen when you get a cache miss when doing remote caching without remote execution).

Similar changes against Bazel 5.x have allowed me to successfully do builds of a large monorepo with --jobs=512 and the default heap size limits, whereas I would normally see occasional OOM behaviour even when providing --host_jvm_args=-Xmx64g.
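
For illustration, here is a minimal sketch of the throttling pattern from point 1, using a plain java.util.concurrent.Semaphore whose acquire() is interruptible. Class and method names are illustrative only, not the actual Bazel code:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

final class MerkleTreeThrottle {
  // Bound concurrent Merkle tree computations by the local core count, not by --jobs.
  private final Semaphore permits =
      new Semaphore(Runtime.getRuntime().availableProcessors());

  <T> T withPermit(Callable<T> computeMerkleTree) throws Exception {
    // acquire() throws InterruptedException, so a ^C that interrupts the thread
    // releases waiting actions immediately instead of letting them start another
    // expensive Merkle tree computation.
    permits.acquire();
    try {
      return computeMerkleTree.call();
    } finally {
      permits.release();
    }
  }
}
```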

@EdSchouten EdSchouten force-pushed the eschouten/20230103-memory-usage branch 3 times, most recently from 8ff3784 to d75826b on January 3, 2023 14:04
@EdSchouten EdSchouten marked this pull request as ready for review January 3, 2023 14:17
@EdSchouten EdSchouten requested a review from a team as a code owner January 3, 2023 14:17
@brentleyjones
Contributor

FYI, a workaround for 2 is to use the Merkle cache. We've seen it reduce the memory needed from over 64 GB down to 17 GB.

@coeuvre
Member

coeuvre commented Jan 3, 2023

Thanks for bringing this up!

We are exploring using virtual threads to eliminate the --jobs flag, so that we only need as many platform threads as there are cores and use virtual threads to execute actions. With that, 1) shouldn't be a problem. In the meantime, I am open to using the semaphore as a workaround.

For 2) and 3), we are saving memory in the common cases but sacrificing performance in the rare ones, which sounds reasonable but probably needs more data, e.g. what's the build time regression for a cold build?
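
For reference, a rough illustration of the virtual-thread idea on Java 21: one cheap virtual thread per action instead of one platform thread per --jobs slot. This is only a sketch of the concept, not Bazel's actual executor wiring:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class VirtualThreadActionRunner {
  static void runActions(Iterable<Runnable> actions) {
    // One virtual thread per action; the JVM multiplexes them onto a small pool of
    // carrier (platform) threads, roughly sized to the number of cores.
    try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
      for (Runnable action : actions) {
        executor.submit(action);
      }
    } // close() waits for all submitted actions to finish.
  }
}
```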

@sgowroji sgowroji added team-Performance Issues for Performance teams awaiting-user-response Awaiting a response from the author labels Jan 3, 2023
@EdSchouten
Contributor Author

EdSchouten commented Jan 3, 2023

FYI, a workaround for 2 is to use the Merkle cache. We've seen it reduce the memory needed from over 64 GB down to 17 GB.

I also tried this option, but it didn't yield significant improvements for us (in terms of either memory usage or CPU time). Great to hear it works for you, though!

We are exploring using virtual threads to eliminate the --jobs flag, so that we only need as many platform threads as there are cores and use virtual threads to execute actions. With that, 1) shouldn't be a problem. In the meantime, I am open to using the semaphore as a workaround.

👍

For 2) and 3), we are saving memory in the common cases but sacrificing performance in the rare ones, which sounds reasonable but probably needs more data, e.g. what's the build time regression for a cold build?

I wasn't able to measure any regression in wall time, because for a cold build the build time is mostly dominated by execution taking place on the remote side. So it wastes more CPU cycles on the client, but only in cases where the client would otherwise have been idle.

Let me see what I can do to address the remaining test failures.

@fmeum
Collaborator

fmeum commented Jan 3, 2023

I wasn't able to measure any regression in wall time, because for a cold build the build time is mostly dominated by execution taking place on the remote side. So it wastes more CPU cycles on the client, but only in cases where the client would otherwise have been idle.

Just so that point isn't forgotten: It might be interesting to collect measurements for dynamic execution (local + remote) as that will have the client do actual work in addition to orchestrating the remote executor.

@EdSchouten
Contributor Author

Just so that point isn't forgotten: It might be interesting to collect measurements for dynamic execution (local + remote) as that will have the client do actual work in addition to orchestrating the remote executor.

I would really appreciate it if anyone using dynamic execution is willing to test this. We never use it on our end, because remote execution is inherently faster for us (since we use bb_clientd). I thus have no frame of reference. Sorry!

@EdSchouten EdSchouten force-pushed the eschouten/20230103-memory-usage branch from d75826b to 537f650 on January 3, 2023 17:59
@sgowroji sgowroji added awaiting-review PR is awaiting review from an assigned reviewer and removed awaiting-user-response Awaiting a response from the author labels Jan 4, 2023
Member

@coeuvre coeuvre left a comment

I hate flags, but to be on the safe side, can we add an experimental flag for 2 and 3 and have it disabled by default? That also lets others benchmark the impact.

There is also internal code relying on 3; making it conditional will help when importing.

@coeuvre coeuvre added awaiting-user-response Awaiting a response from the author team-Remote-Exec Issues and PRs for the Execution (Remote) team and removed awaiting-review PR is awaiting review from an assigned reviewer team-Performance Issues for Performance teams labels Jan 4, 2023
@EdSchouten EdSchouten force-pushed the eschouten/20230103-memory-usage branch from 537f650 to 8cba3d8 on January 9, 2023 09:54
@EdSchouten EdSchouten force-pushed the eschouten/20230103-memory-usage branch from 8cba3d8 to a66590d on January 9, 2023 10:56
@EdSchouten
Contributor Author

I hate flags, but to be on the safe side, can we add an experimental flag for 2 and 3 and have it disabled by default? That also lets others benchmark the impact.

Sure! I have just added an --experimental_remote_discard_merkle_trees flag. When set, it keeps RemoteAction.merkleTree unset and passes willAccessRepeatedly == false to getInputMapping().
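
A minimal sketch of how such a flag could gate both behaviours (names here are illustrative, not copied from the patch):

```java
final class DiscardMerkleTreesPolicy {
  private final boolean discardMerkleTrees; // --experimental_remote_discard_merkle_trees

  DiscardMerkleTreesPolicy(boolean discardMerkleTrees) {
    this.discardMerkleTrees = discardMerkleTrees;
  }

  /** Whether a remote action should keep its Merkle tree after the action digest is computed. */
  boolean retainMerkleTree() {
    return !discardMerkleTrees;
  }

  /** Value to pass as willAccessRepeatedly when calling getInputMapping(). */
  boolean memoizeInputMapping() {
    return !discardMerkleTrees;
  }
}
```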

Member

@coeuvre coeuvre left a comment

Thanks! I am importing the PR and making the corresponding changes to internal code. If something goes wrong internally, I will report back.

@meisterT
Member

Interestingly I made this related change yesterday: 3d29b2e

Side note: IMHO it would have been better to split this PR into three separate PRs (no action required).

@EdSchouten EdSchouten deleted the eschouten/20230103-memory-usage branch January 19, 2023 08:21
@exoson
Contributor

exoson commented Jan 20, 2023

Could this be cherry-picked to Bazel 6.1.0?

@coeuvre
Member

coeuvre commented Jan 20, 2023

@bazel-io fork 6.1.0

@coeuvre
Member

coeuvre commented Jan 20, 2023

Sure. I believe this is a safe change that brings a lot of value to remote execution. I would also like folks to be able to play around with the new flag sooner and maybe report back their experience/benchmarks here.

coeuvre pushed a commit to coeuvre/bazel that referenced this pull request Feb 2, 2023
ShreeM01 added a commit that referenced this pull request Feb 7, 2023
hvadehra pushed a commit that referenced this pull request Feb 14, 2023
@brentleyjones
Contributor

brentleyjones commented Dec 8, 2023

Similar to #19924 (comment), I filed #20478.

cc: @coeuvre @werkt

merkleTree = buildInputMerkleTree(spawn, context, toolSignature);
}

remoteExecutionCache.ensureInputsPresent(
Contributor

This looks wrong because this makes a remote call, and this is now limited to N parallel remote calls. This should have been outside the semaphore.

Member

My understanding was that this limits the number of Merkle trees in flight, preventing OOMs.

@EdSchouten was this the intention?

also cc @tjgq (and leaving a reference to #21378 here)

Contributor

I'm specifically talking about the ensureInputsPresent call, which performs remote calls, and is now limited.

Contributor Author

@EdSchouten EdSchouten Apr 4, 2024

Exactly, it has to be limited to make sure we don't OOM because memory fills up with Merkle trees. Yes, this does mean that the concurrency of outgoing network calls is limited as well.

One way to make this less problematic would be to implement buildInputMerkleTree() in such a way that it streams the directories it creates, with FindMissingBlobs() called in batches. But because buildInputMerkleTree() is not built like that, we currently don't have a lot of options here.

I guess I never observed this to be problematic for performance, because Buildbarn's FindMissingBlobs() implementation is sufficiently fast. It typically completes in ~2 milliseconds, even if 50k digests are provided.
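
A rough sketch of the batching idea mentioned above, assuming a caller-supplied function that wraps the actual FindMissingBlobs() call; this is not the existing Bazel code path:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

final class BatchedMissingBlobs {
  /**
   * Splits a large digest list into fixed-size batches so missing-blob checks can start
   * (and intermediate state can be released) before the whole Merkle tree has been walked.
   */
  static <D> Set<D> findMissingInBatches(
      List<D> digests, int batchSize, Function<List<D>, Set<D>> findMissingBlobs) {
    Set<D> missing = new HashSet<>();
    for (int start = 0; start < digests.size(); start += batchSize) {
      int end = Math.min(start + batchSize, digests.size());
      missing.addAll(findMissingBlobs.apply(digests.subList(start, end)));
    }
    return missing;
  }
}
```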

Contributor

@werkt werkt Apr 4, 2024

Limiting this by the # of cores is nonsensical when the concern is memory exhaustion; I might as well limit by the number 42. You're no more or less likely to create memory pressure with this based on the number of cores on a host.

We should limit by a heap pool size, charge by the weight of the merkle tree, and if it turns out that portions of that tree are shared across merkle trees (not sure if that optimization exists), then it should be the net weight of all unique merkle tree components.

Measuring this in milliseconds means that you're blocked on it - for a remote that responds in 1ms, the throughput cap is N * 1000, decided arbitrarily. Utilizing the network also means that there's a potential for async resource exhaustion there that will impact the runtime (but has nothing to do with remediating memory). The case where this was discovered was pathologically small input trees, --jobs=10000 with sub-second execution times, where bandwidth consumption, queued request processing on the channel, and the like could drastically increase the time in the critical section, through no fault of any remote service.
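
A sketch of the alternative suggested here: admit Merkle tree computations against a byte budget rather than a core count. The budget size and the per-tree weight estimate are assumptions for illustration:

```java
final class MerkleTreeMemoryBudget {
  private final long capacityBytes;
  private long usedBytes;

  MerkleTreeMemoryBudget(long capacityBytes) {
    this.capacityBytes = capacityBytes;
  }

  /** Blocks (interruptibly) until the estimated weight of a Merkle tree fits in the budget. */
  synchronized void acquire(long estimatedBytes) throws InterruptedException {
    // Always admit at least one in-flight tree, even if it alone exceeds the budget.
    while (usedBytes > 0 && usedBytes + estimatedBytes > capacityBytes) {
      wait();
    }
    usedBytes += estimatedBytes;
  }

  synchronized void release(long estimatedBytes) {
    usedBytes -= estimatedBytes;
    notifyAll();
  }
}
```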

Contributor Author

@EdSchouten EdSchouten Apr 20, 2024

Limiting it by the number of cores is a better heuristic than limiting it by the --jobs flag, the reason being that the amount of RAM a system has tends to be proportional to the number of cores. The RAM on my system is not necessarily proportional to the number of worker threads my build cluster running on {AWS,GCP,...} has.

I'm fine with whatever the limit is, as long as there is a sane one. Ulf mentioned that this code should not have been covered by a semaphore. This is something I object to, given that the Merkle tree computation code is not written in such a way that it can return partial Merkle trees.

Member

I am working on async execution (running actions in virtual threads; check --experimental_async_execution in HEAD), which might allow us to remove this limit because the size of the underlying thread pool equals the number of cores.
