
Slow action cache checks - please revert 1e17348da7e45c00cb474390a3b8ed3103b6b5cf #19924

Open
ulfjack opened this issue Oct 23, 2023 · 16 comments
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Performance Issues for Performance teams type: bug

Comments

@ulfjack
Contributor

ulfjack commented Oct 23, 2023

Description of the bug:

@meisterT enabled the action cache throttle in 1e17348, but the commit description doesn't include any benchmark results or other data supporting the claim that nobody would want this disabled. I had to go back to a fairly old commit in our own repo, but the throttle appears to have a significant impact on build times for us:

with the throttle enabled: 1m 31s
with the throttle disabled: 1m 19s

I wasn't able to get a cleaner signal, but we can clearly see the "acquiring semaphore" pieces in the profile:

throttle enabled: [profile screenshot "with-throttling"]

throttle disabled: [profile screenshot "without-throttling"]

From the profiles, the action cache checks with throttling run until ~60s into the build, while without throttling they finish by ~40s.
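
To quantify the difference beyond screenshots, a small script along these lines can total the time spent in the throttle from the JSON profile written with --profile. This is only a sketch: it assumes the profile is in Chrome trace event format with a traceEvents array and microsecond dur fields, and that the throttle shows up as complete events whose name contains "acquiring semaphore"; the exact event name may vary between Bazel versions.

```python
# Sketch: sum the time spent in "acquiring semaphore" events in a Bazel JSON
# profile. Assumes Chrome trace event format; the event name is taken from the
# screenshots above and may differ in other Bazel versions.
import gzip
import json
import sys

def load_events(path):
    # The profile may be written gzip-compressed depending on the file name.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        return json.load(f).get("traceEvents", [])

def semaphore_time_ms(events, needle="acquiring semaphore"):
    # Complete events ("X") carry a duration in microseconds.
    return sum(e.get("dur", 0) for e in events
               if e.get("ph") == "X" and needle in e.get("name", "").lower()) / 1000.0

if __name__ == "__main__":
    print(f"total throttle time: {semaphore_time_ms(load_events(sys.argv[1])):.1f} ms")
```

Running this against the two profiles turns the comparison into a single number instead of eyeballing the timelines.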

Which category does this issue belong to?

Performance

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

7.0.0-pre.20230530.3

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

It looks like commit 1e17348 removed the flag that could be used to work around the issue.

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@ulfjack
Contributor Author

ulfjack commented Oct 23, 2023

This was running on an arm64 instance, inside a container restricted to 4 cores. I tried to make sure that it would get local action cache hits only, but there are a few tests that don't pass on arm64 at that commit (I had to use an old commit because the repo no longer builds with a bazel binary this old).

@iancha1992 iancha1992 added the team-Performance Issues for Performance teams label Oct 23, 2023
@ulfjack
Contributor Author

ulfjack commented Oct 27, 2023

@werkt

@werkt
Contributor

werkt commented Oct 27, 2023

This is similar to my experience with #17120. I hope we can sort some of these out and put protections in place to avoid unnecessary performance regressions in the future.

@meisterT
Member

@ulfjack for the one benchmark you showed, what is the action cache size on disk? Is this on public code, so I can repro? What is the number of jobs (compared with the number of cores, which seems to be 4)?

@werkt in what environment did you experience the slowdown?

In general, from what I have seen, in most cases it has been a no-op performance-wise, and the larger the action cache and the higher the number of jobs, the bigger the improvement.

@ulfjack
Contributor Author

ulfjack commented Oct 27, 2023

That is from our internal repo, I'm afraid. I can try to repro with Bazel source, but I'm out for a week. This ran with --jobs=100.

I suspect that it's hashing output files while holding the lock, which could be I/O bound.
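
To make that suspicion concrete, here is a minimal sketch (hypothetical code, not Bazel's actual implementation) of why it would hurt: if output hashing happens while the throttle semaphore is held, the critical section becomes I/O bound and effective parallelism collapses to the semaphore size, whereas hashing before acquiring the semaphore keeps only the in-memory lookup throttled.

```python
# Hypothetical sketch of the suspected pattern; not Bazel's actual code.
import hashlib
import os
import threading

CORES = os.cpu_count() or 4
check_semaphore = threading.Semaphore(CORES)  # throttle sized to the core count

def digest(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def check_action_cache_slow(output_paths, lookup):
    # Anti-pattern: I/O-bound hashing runs inside the throttled section, so at
    # most CORES files are being hashed at any time.
    with check_semaphore:
        return lookup([digest(p) for p in output_paths])

def check_action_cache_fast(output_paths, lookup):
    # Hash outside the throttled section; only the cheap lookup is throttled.
    digests = [digest(p) for p in output_paths]
    with check_semaphore:
        return lookup(digests)
```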

@ulfjack
Contributor Author

ulfjack commented Oct 27, 2023

(I also tried to repro with x64, but the setup I was using was too flaky. I think the next time I will just run bazel build/test twice in a row with a bazel shutdown inbetween.)

@werkt
Contributor

werkt commented Oct 29, 2023

@werkt in what environment did you experience the slowdown?

My issue (which is tangential to @ulfjack's issue here) was that N (= processor count) threads calling ensureInputsPresent wait for remote requests (FMB/Writes) to complete, which blocks all other threads attempting buildRemoteAction. This effectively limited remote saturation to approximately 4x the processor count. I used https://github.com/werkt/bazel-stress to generate synthetic, parallel-capable load, and for --jobs=5000, my state on a 16-core client was as follows:
16 threads holding remoteActionBuildingSemaphore leases during ensureInputsPresent, in remote request stacks
4936 threads waiting for remoteActionBuildingSemaphore leases during buildRemoteActions
48 threads performing other non-remoteActionBuildingSemaphore activities

Decreasing --jobs leaves the 16 and 48 counts intact, only changing the number of remaining buildRemoteAction waits.

remoteActionBuildingSemaphore is being used to regulate both CPU and RAM pressure; Merkle-tree or generalized RAM estimation should be used to regulate the latter, distinct from the former, in low-overhead Merkle-tree situations (bazel-stress uses minimal inputs and only measures action throughput, so it represents a pathological case with the lowest possible memory overhead).
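
As a standalone illustration of that cap (hypothetical code, not Bazel's implementation): when a semaphore sized to the local core count is held across blocking remote calls, the number of in-flight requests stays near the core count no matter how high --jobs is set.

```python
# Hypothetical illustration: a core-count semaphore held across blocking
# "remote" calls caps concurrency at the core count regardless of job count.
import os
import threading
import time

CORES = os.cpu_count() or 16
building_semaphore = threading.Semaphore(CORES)
lock = threading.Lock()
in_flight = 0
peak = 0

def build_remote_action():
    global in_flight, peak
    with building_semaphore:      # held while waiting on the network
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.05)          # stand-in for blocking remote requests
        with lock:
            in_flight -= 1

threads = [threading.Thread(target=build_remote_action) for _ in range(500)]  # "--jobs=500"
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"peak concurrent remote calls: {peak} (capped near {CORES})")
```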

@meisterT
Member

meisterT commented Nov 8, 2023

Ulf, is this on Apple Silicon or Linux Arm64? Did you have any luck reproducing this on a publicly available example?

@meisterT
Member

I have created a PR to revert said commit, as we are close to creating the final RC for Bazel 7 and it seems this investigation will take a bit more time.

I have tried a bit more myself and have not seen any slowdown, so answers to the questions above would help us understand why this is happening.

copybara-service bot pushed a commit that referenced this issue Nov 13, 2023
This reverts commit 1e17348.

This was requested in #19924.

Closes #20162.

PiperOrigin-RevId: 581897901
Change-Id: Ifea2330c45c97db4454ffdcc31b7b7af640cd659
bazel-io pushed a commit to bazel-io/bazel that referenced this issue Nov 13, 2023
This reverts commit 1e17348.

This was requested in bazelbuild#19924.

Closes bazelbuild#20162.

PiperOrigin-RevId: 581897901
Change-Id: Ifea2330c45c97db4454ffdcc31b7b7af640cd659
keertk pushed a commit that referenced this issue Nov 13, 2023
…lag." (#20164)

This reverts commit 1e17348.

This was requested in #19924.

Closes #20162.

Commit
1f75299

PiperOrigin-RevId: 581897901
Change-Id: Ifea2330c45c97db4454ffdcc31b7b7af640cd659

Co-authored-by: Tobias Werth <twerth@google.com>
@ulfjack
Contributor Author

ulfjack commented Nov 13, 2023

I ran this on both Linux x64 and arm64, but didn't get a clean sample on x64 (due to how I set it up).

@meisterT meisterT added P2 We'll consider working on this in future. (Assignee optional) and removed untriaged labels Nov 14, 2023
@ulfjack
Contributor Author

ulfjack commented Nov 14, 2023

I unsuccessfully tried to repro on x64 yesterday.

@brentleyjones
Contributor

We've just encountered a build where, with --jobs=2000, this semaphore took over 157s before the build was able to effectively start. Thankfully we can turn the flag off, but ideally the default should be reverted back to false, maybe for Bazel 7.1?

@meisterT
Member

How large was the local action cache in this case? Can you share the (perhaps redacted) blaze trace?

If you have ways to repro this, please let me know. I have not seen any case myself where the current flag setting was slower, and lots where it was faster, so I am wondering what's different.

@meisterT
Member

The one profile that I saw had 2000 jobs and only 3 cores, which is suspicious, so I tried to reproduce locally with a large artificial build (fully cached) and was not able to.

If anyone has a public repro (even an artificial one) of the slowdown you are seeing here, I would like to see it. In all cases where I tested, the semaphore was wall-time neutral or an improvement.

While looking into this, I saw that changing jobs from 2000 to 50 sped up the build significantly. I assume you use a high number of jobs because of remote caching and execution. In general, I hope that @coeuvre's work on the threadpool overhaul and async execution will make tuning jobs unnecessary.
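
For anyone who wants to attempt a public, artificial repro, a generator along these lines (the script and target names are made up, and it assumes an existing Bazel workspace) produces a large package of trivial, fully cacheable actions; build it once to warm the action cache, shut down the server, and rebuild with the throttle on and off.

```python
# Hypothetical generator for a synthetic, fully cacheable package.
import os

def write_synthetic_package(root=".", num_targets=5000):
    pkg = os.path.join(root, "gen")
    os.makedirs(pkg, exist_ok=True)
    rules = [
        f'genrule(name = "t{i}", outs = ["t{i}.txt"], cmd = "echo {i} > $@")'
        for i in range(num_targets)
    ]
    with open(os.path.join(pkg, "BUILD"), "w") as f:
        f.write("\n".join(rules) + "\n")

if __name__ == "__main__":
    write_synthetic_package()
```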

@ulfjack
Contributor Author

ulfjack commented Dec 20, 2023

I'm currently trying to update our codebase to 7.0.0, which has the flag again. Unfortunately, I'm seeing a bunch of failures that I haven't tracked down yet.

@ulfjack
Contributor Author

ulfjack commented Jan 11, 2024

I managed to upgrade to 7.0.0.
