Bazel stuck during remote execution #21626
Comments
scratch_45.txt |
I recommend not using |
Let us try removing this! |
Should it time out even with |
Still happening even without
|
bazel_server_dump.txt |
FYI @werkt this is my colleague and we're using Buildfarm |
From both stack dumps, it seems like Bazel is waiting for the remote execution to finish. Did this happen with Bazel 6? Do the logs from the remote server tell you anything? |
We are only starting to experiment with RBE, so unfortunately we don't have a pre-7.0 baseline. Regarding whether the remote server was doing something: I've checked the BuildFarm workers and there were no active tasks there. |
Another suspicious stack frame is:
It seems like the Bazel server has trouble sending messages to the client. Can you share the list of flags your build uses? And maybe try disabling BEP/BES? |
BEP is async and this is confirmed to be working properly in an async way starting from 7.0
Here is the full list of remote-related configs
|
From the stack trace above, Bazel encountered errors when uploading blobs to BES/BEP. The errors were not printed because the Bazel server got stuck sending the message to the client. I suggested disabling BES/BEP because doing so tells us whether the upload error is the reason Bazel got stuck. |
I'll try to repro it without BES/BEP in the next couple of days. The issue should still be fixed, though, even if it is caused by BEP/BES ;) We are specifically asking for What I remember from server logs earlier: there were no traces of any uploads in other Bazel server logs. |
This time the RBE server might not have been fully healthy, yet in that case I would expect Bazel to fail after a reasonable timeout. |
Right, we are still investigating the root cause, and I am trying to eliminate as many variables as possible. From the new stack trace:
it seems like Bazel is stuck uploading inputs to the remote server. Since you set |
It seems related to |
As you can see here #21626 (comment) we don't have this enabled. I still can turn it off if required though. |
Ok... I see it is enabled by default. Let me turn that off
|
What to do with BES/BEP and |
Let's keep them off for now. |
This flag is 8.x only, based on 294c904 I got
for
So I'd say this is definitely not this property. |
Although the flag was not available for 7.0.1, the feature is still there and on. The flag is available in Bazel 7.1.0. Can you use that version instead and try? |
@coeuvre were you able to identify the issue? How else can we help here? |
I was able to reproduce a hang, but the stack trace is different. I am not sure whether they share the same root cause. In the meantime, can you check whether your build hangs with |
Something changed in our RBE server config and this no longer happens :( Still, the fact that it was so consistent before means that more resilience on the client side wouldn't hurt. |
I think I am seeing the same issue as reported here, on 7.1.1.
Some relevant flags
As per last suggestion, I will try with --remote_download_all |
@joeljeske Your build is endeavoring to build the merkle trees of 199 actions, slowly, it seems, but not obviously stuck. 64 (presumably your # of cores, throttled to the remote action building semaphore) are proceeding in this task. You do have one evaluator |
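The throttling described in the comment above can be illustrated with a small sketch. This is not Bazel's implementation; the function name `run_throttled`, the default numbers (199 actions, 64 permits, taken from the comment) and the sleep used as a stand-in for Merkle tree building are all assumptions made for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_throttled(num_actions=199, max_concurrent=64, work=None):
    """Run num_actions tasks while letting at most max_concurrent proceed at once.

    Hypothetical helper: `work` stands in for building one action's Merkle tree.
    Returns (peak concurrency observed, number of tasks completed).
    """
    if work is None:
        work = lambda i: time.sleep(0.001)  # placeholder workload

    semaphore = threading.Semaphore(max_concurrent)
    lock = threading.Lock()
    state = {"running": 0, "peak": 0, "done": 0}

    def build_one(i):
        with semaphore:  # blocks while all permits are taken
            with lock:
                state["running"] += 1
                state["peak"] = max(state["peak"], state["running"])
            try:
                work(i)
            finally:
                with lock:
                    state["running"] -= 1
                    state["done"] += 1

    # One thread per action, so the semaphore is the only throttle.
    with ThreadPoolExecutor(max_workers=num_actions) as pool:
        list(pool.map(build_one, range(num_actions)))
    return state["peak"], state["done"]
```

With such a throttle, a thread dump taken mid-build shows up to 64 threads doing work and the rest parked on the semaphore, which looks slow but is not a deadlock.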
Understood! Thank you for the analysis. I'm curious, then, why I don't observe the same behavior on Bazel 6.4.0. I consistently see slow build speeds on 7.1.1 that were not present before the upgrade. Do you have any ideas? |
In the interest of not hijacking this issue, can you file a separate one @joeljeske for a performance regression between 6.4.0 and 7.1.1? |
Can you patch eda0fe4 and check whether this issue is still reproducible? |
@coeuvre will this be merged to 7.2? Patching is hard and we have not even fully moved to 7.1 yet... I'd rather try to repro this with 7.0 like before and then switch to "head" of 7.2 to see whether the issue is there. My CAS state is changing and evictions and cleanup have not been happening recently. This is why I'm currently unable to repro. Yet we are verifying a CAS-related issue with BuildFarm, and I hope to get to the point where the CAS is mutating again so the issue triggers for us. |
Yes, the fix will be merged into 7.2. No pressure! I will keep this issue open (but with a lower priority) for the time being. Feel free to close it once you can verify the fix. |
and when it's missing, treat it as remote cache eviction. Also revert the workaround for bazelbuild#19513. Fixes bazelbuild#21777. Potential fix for bazelbuild#21626 and bazelbuild#21778. Closes bazelbuild#21825. PiperOrigin-RevId: 619877088 Change-Id: Ib1204de8440b780e5a6ee6a563a87da08f196ca5
@coeuvre could you please tell us where we can get a custom Bazelisk-ready version that contains the fix? Rolling releases seem to include only 7.0 and 8.0 pre-releases, and I don't see any 7.1 or 7.2 nightly builds here https://bazel.build/release/rolling |
and when it's missing, treat it as remote cache eviction. Also revert the workaround for #19513. Fixes #21777. Potential fix for #21626 and #21778. Closes #21825. PiperOrigin-RevId: 619877088 Change-Id: Ib1204de8440b780e5a6ee6a563a87da08f196ca5 Commit eda0fe4 Co-authored-by: Chi Wang <chiwang@google.com>
Tried
this does not seem OK. It doesn't look like https://github.com/bazelbuild/bazel/pull/21941/files is merged to @iancha1992 @coeuvre could you please suggest which |
A fix for this issue has been included in Bazel 7.2.0 RC1. Please test out the release candidate and report any issues as soon as possible. |
Hopefully this includes a fix for Bazel getting stuck using remote execution. bazelbuild/bazel#21626 Change-Id: I0ae977bc492facc0ebfe8571660a3b02cfb14960 Former-commit-id: 5d91fa83c02fb7c5d995498cb1ecb98edcd65c1e
This issue has recurred under bazel 7.2. I have a stuck java process awaiting countdown latches for all ensureInputsPresent running. |
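The difference between waiting on a latch forever and waiting with a deadline can be sketched as follows. This is only an illustration of the client-side timeout idea raised in this thread, not Bazel's code: `CountDownLatch` here is a minimal Python stand-in for the Java class of the same name, and `await_uploads` is a hypothetical helper.

```python
import threading


class CountDownLatch:
    """Minimal Python stand-in for java.util.concurrent.CountDownLatch."""

    def __init__(self, count):
        self._count = count
        self._cond = threading.Condition()

    def count_down(self):
        with self._cond:
            if self._count > 0:
                self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait(self, timeout=None):
        """Return True once the count reaches zero, False if timeout expires first.

        With timeout=None this blocks indefinitely, which is the hang pattern.
        """
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout)


def await_uploads(latch, timeout_seconds):
    """Hypothetical helper: fail loudly instead of hanging when uploads never finish."""
    if not latch.wait(timeout_seconds):
        raise TimeoutError(
            "inputs were not uploaded within %.1f seconds" % timeout_seconds
        )
```

A caller that passes a finite timeout gets an error it can report or retry, whereas a caller that waits without one stays parked for as long as a single count-down is lost.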
Description of the bug:
Bazel
build
current state
When this issue is not observed, this run takes less than a minute, if not less than 10s.
Which category does this issue belong to?
No response
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
There is no easy way to reproduce it, but a noticeable percentage of our RBE runs get stuck forever without Bazel finishing properly.
Here is the thread dump of the Bazel server; there is no activity on the RBE-side workers at the same time.
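When comparing several such dumps, a small script can help summarize them. The sketch below assumes the dump is in the usual jstack text layout (a `java.lang.Thread.State:` line per thread, with thread blocks separated by blank lines); the function names are hypothetical and nothing here is specific to Bazel.

```python
import re
from collections import Counter

# Matches lines such as "   java.lang.Thread.State: WAITING (parking)".
STATE_RE = re.compile(r"^\s*java\.lang\.Thread\.State:\s+(\w+)")


def summarize_thread_states(dump_text):
    """Count threads per state (RUNNABLE, WAITING, ...) in a jstack-style dump."""
    counts = Counter()
    for line in dump_text.splitlines():
        match = STATE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_threads_with_frame(dump_text, needle):
    """Count thread blocks whose stack mentions `needle`.

    Assumes thread blocks are separated by blank lines.
    """
    blocks = re.split(r"\n\s*\n", dump_text)
    return sum(1 for block in blocks if needle in block)
```

For example, calling `count_threads_with_frame(dump, "ensureInputsPresent")` on two dumps taken a few minutes apart shows whether the same number of threads is still parked in that method.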
Which operating system are you running Bazel on?
5.14.0-362.18.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Jan 3 15:54:45 EST 2024 x86_64 x86_64 x86_64 GNU/Linux
What is the output of bazel info release?
release 7.0.1
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse HEAD?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
We do have the following in .bazelrc