
Slow action cache checks - please revert 1e17348da7e45c00cb474390a3b8ed3103b6b5cf #19924

Open
ulfjack opened this issue Oct 23, 2023 · 16 comments
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Performance Issues for Performance teams type: bug

Comments

@ulfjack
Contributor

ulfjack commented Oct 23, 2023

Description of the bug:

@meisterT enabled the action cache throttle in 1e17348, but the commit description doesn't include any benchmark results or other data supporting the claim that nobody would want this disabled. I had to go back to a fairly old commit in our own repo, but the throttle appears to have a significant impact on build times for us:

with the throttle enabled: 1m 31s
with the throttle disabled: 1m 19s

I wasn't able to get a cleaner signal, but we can clearly see the "acquiring semaphore" pieces in the profile:

throttle enabled: [profile screenshot "with-throttling"]

throttle disabled: [profile screenshot "without-throttling"]

From the profiles, the action cache checks with throttling run until ~60s into the build, while without throttling they finish by ~40s.
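
To quantify the difference beyond screenshots, a small script along these lines can total the time spent in the throttle from the JSON profile written with --profile. This is only a sketch: it assumes the profile is in Chrome trace event format with a traceEvents array and microsecond dur fields, and that the throttle shows up as complete events whose name contains "acquiring semaphore"; the exact event name may vary between Bazel versions.

```python
# Sketch: sum the time spent in "acquiring semaphore" events in a Bazel JSON
# profile. Assumes Chrome trace event format; the event name is taken from the
# screenshots above and may differ in other Bazel versions.
import gzip
import json
import sys

def load_events(path):
    # The profile may be written gzip-compressed depending on the file name.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        return json.load(f).get("traceEvents", [])

def semaphore_time_ms(events, needle="acquiring semaphore"):
    # Complete events ("X") carry a duration in microseconds.
    return sum(e.get("dur", 0) for e in events
               if e.get("ph") == "X" and needle in e.get("name", "").lower()) / 1000.0

if __name__ == "__main__":
    print(f"total throttle time: {semaphore_time_ms(load_events(sys.argv[1])):.1f} ms")
```

Running this against the two profiles turns the comparison into a single number instead of eyeballing the timelines.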

Which category does this issue belong to?

Performance

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

7.0.0-pre.20230530.3

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

It looks like commit 1e17348 removed the flag that could be used to work around the issue.

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@ulfjack
Contributor Author

ulfjack commented Oct 23, 2023

This was running on an arm64 instance, inside a container restricted to 4 cores. I tried to make sure that it would get local action cache hits only, but there are a few tests that don't pass on arm64 at that commit (I had to use an old commit because the repo no longer builds with a bazel binary this old).

@iancha1992 iancha1992 added the team-Performance Issues for Performance teams label Oct 23, 2023
@ulfjack
Contributor Author

ulfjack commented Oct 27, 2023

@werkt

@werkt
Contributor

werkt commented Oct 27, 2023

This is similar to my experience with #17120. I hope we can sort some of these out and put protections in place to avoid unnecessary performance regressions in the future.

@meisterT
Member

@ulfjack for the one benchmark you showed, what is the action cache size on disk? Is this on public code, so I can repro? What is the number of jobs (compared with the number of cores, which seems to be 4)?

@werkt in what environment did you experience the slowdown?

In general, from what I have seen, in most cases it has been a no-op performance-wise, and the larger the action cache and the higher the number of jobs, the bigger the improvement.

@ulfjack
Contributor Author

ulfjack commented Oct 27, 2023

That is from our internal repo, I'm afraid. I can try to repro with Bazel source, but I'm out for a week. This ran with --jobs=100.

I suspect that it's hashing output files while holding the lock, which could be I/O bound.
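
To make that suspicion concrete, here is a minimal sketch (hypothetical code, not Bazel's actual implementation) of why it would hurt: if output hashing happens while the throttle semaphore is held, the critical section becomes I/O bound and effective parallelism collapses to the semaphore size, whereas hashing before acquiring the semaphore keeps only the in-memory lookup throttled.

```python
# Hypothetical sketch of the suspected pattern; not Bazel's actual code.
import hashlib
import os
import threading

CORES = os.cpu_count() or 4
check_semaphore = threading.Semaphore(CORES)  # throttle sized to the core count

def digest(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def check_action_cache_slow(output_paths, lookup):
    # Anti-pattern: I/O-bound hashing runs inside the throttled section, so at
    # most CORES files are being hashed at any time.
    with check_semaphore:
        return lookup([digest(p) for p in output_paths])

def check_action_cache_fast(output_paths, lookup):
    # Hash outside the throttled section; only the cheap lookup is throttled.
    digests = [digest(p) for p in output_paths]
    with check_semaphore:
        return lookup(digests)
```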

@ulfjack
Contributor Author

ulfjack commented Oct 27, 2023

(I also tried to repro with x64, but the setup I was using was too flaky. I think the next time I will just run bazel build/test twice in a row with a bazel shutdown inbetween.)

@werkt
Contributor

werkt commented Oct 29, 2023

@werkt in what environment did you experience the slowdown?

My issue (which is tangential to @ulfjack's issue here) was that N (= processor count) threads calling ensureInputsPresent wait for remote requests (FMB/Writes) to complete, which blocks all other threads attempting buildRemoteAction. This effectively limited remote saturation to approximately 4x the processor count. I used https://github.com/werkt/bazel-stress to generate synthetic, parallel-capable load, and for --jobs=5000, my state on a 16-core client was as follows:
16 threads holding remoteActionBuildingSemaphore leases during ensureInputsPresent, in remote request stacks
4936 threads waiting for remoteActionBuildingSemaphore leases during buildRemoteActions
48 threads performing other non-remoteActionBuildingSemaphore activities

Decreasing --jobs leaves the 16 and 48 counts intact, only changing the number of remaining buildRemoteAction waits.

remoteActionBuildingSemaphore is being used to regulate both CPU and RAM pressure; Merkle-tree or generalized RAM estimation should be used to regulate the latter, distinct from the former, in low-overhead Merkle-tree situations (bazel-stress uses minimal inputs and only measures action throughput, so it represents a pathological case with the lowest possible memory overhead).
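
As a standalone illustration of that cap (hypothetical code, not Bazel's implementation): when a semaphore sized to the local core count is held across blocking remote calls, the number of in-flight requests stays near the core count no matter how high --jobs is set.

```python
# Hypothetical illustration: a core-count semaphore held across blocking
# "remote" calls caps concurrency at the core count regardless of job count.
import os
import threading
import time

CORES = os.cpu_count() or 16
building_semaphore = threading.Semaphore(CORES)
lock = threading.Lock()
in_flight = 0
peak = 0

def build_remote_action():
    global in_flight, peak
    with building_semaphore:      # held while waiting on the network
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.05)          # stand-in for blocking remote requests
        with lock:
            in_flight -= 1

threads = [threading.Thread(target=build_remote_action) for _ in range(500)]  # "--jobs=500"
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"peak concurrent remote calls: {peak} (capped near {CORES})")
```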

@meisterT
Member

meisterT commented Nov 8, 2023

Ulf, is this on Apple Silicon or Linux Arm64? Did you have any luck reproducing this on a publicly available example?

@meisterT
Member

I have created a PR to revert said commit, as we are close to creating the final RC for Bazel 7 and it seems this investigation will take a bit more time.

I have tried a bit more myself and have not seen any slowdown, so answers to the questions above would help us understand why this is happening.

copybara-service bot pushed a commit that referenced this issue Nov 13, 2023
This reverts commit 1e17348.

This was requested in #19924.

Closes #20162.

PiperOrigin-RevId: 581897901
Change-Id: Ifea2330c45c97db4454ffdcc31b7b7af640cd659
bazel-io pushed a commit to bazel-io/bazel that referenced this issue Nov 13, 2023
This reverts commit 1e17348.

This was requested in bazelbuild#19924.

Closes bazelbuild#20162.

PiperOrigin-RevId: 581897901
Change-Id: Ifea2330c45c97db4454ffdcc31b7b7af640cd659
keertk pushed a commit that referenced this issue Nov 13, 2023
…lag." (#20164)

This reverts commit 1e17348.

This was requested in #19924.

Closes #20162.

Commit
1f75299

PiperOrigin-RevId: 581897901
Change-Id: Ifea2330c45c97db4454ffdcc31b7b7af640cd659

Co-authored-by: Tobias Werth <twerth@google.com>
@ulfjack
Contributor Author

ulfjack commented Nov 13, 2023

I ran this on both Linux x64 and arm64, but didn't get a clean sample on x64 (due to how I set it up).

@meisterT meisterT added P2 We'll consider working on this in future. (Assignee optional) and removed untriaged labels Nov 14, 2023
@ulfjack
Contributor Author

ulfjack commented Nov 14, 2023

I unsuccessfully tried to repro on x64 yesterday.

@brentleyjones
Contributor

We've just encountered a build where, with --jobs=2000, this semaphore took over 157s before the build was able to effectively start. Thankfully we can turn the flag off, but ideally the default should be reverted back to false, maybe for Bazel 7.1?

@meisterT
Member

How large was the local action cache in this case? Can you share the (perhaps redacted) blaze trace?

If you have ways to repro this, please let me know. I have not seen any case myself where the current flag setting was slower, and lots where it was faster, so I am wondering what's different.

@meisterT
Member

The one profile that I saw had 2000 jobs and only 3 cores, which is suspicious, so I tried to reproduce locally with a large artificial build (fully cached) and was not able to.

If anyone has a public repro (even an artificial one) of the slowdown you are seeing here, I would like to see it. In all cases where I tested, the semaphore was wall-time neutral or an improvement.

While looking into this, I saw that changing jobs from 2000 to 50 sped up the build significantly. I assume you use a high number of jobs because of remote caching and execution. In general, I hope that @coeuvre's work on the threadpool overhaul and async execution will make tuning jobs unnecessary.
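
For anyone who wants to attempt a public, artificial repro, a generator along these lines (the script and target names are made up, and it assumes an existing Bazel workspace) produces a large package of trivial, fully cacheable actions; build it once to warm the action cache, shut down the server, and rebuild with the throttle on and off.

```python
# Hypothetical generator for a synthetic, fully cacheable package.
import os

def write_synthetic_package(root=".", num_targets=5000):
    pkg = os.path.join(root, "gen")
    os.makedirs(pkg, exist_ok=True)
    rules = [
        f'genrule(name = "t{i}", outs = ["t{i}.txt"], cmd = "echo {i} > $@")'
        for i in range(num_targets)
    ]
    with open(os.path.join(pkg, "BUILD"), "w") as f:
        f.write("\n".join(rules) + "\n")

if __name__ == "__main__":
    write_synthetic_package()
```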

@ulfjack
Contributor Author

ulfjack commented Dec 20, 2023

I'm currently trying to update our codebase to 7.0.0, which has the flag again. Unfortunately, I'm seeing a bunch of failures that I haven't tracked down yet.

@ulfjack
Contributor Author

ulfjack commented Jan 11, 2024

I managed to upgrade to 7.0.0.
