bazel build hung with jobs waiting on blockingAwait() from RemoteExecutionCache.ensureInputsPresent
#19513
Comments
The build hangs might be caused by the fact that many actions in the new build happen to depend on a lot of things, but in general, the blockingAwait call could use a timeout and some logging to unstick the whole build.
I work with @jacobmou on this and we were able to figure out the triggering condition and mitigate the problem for our build. I believe that there is still a bug somewhere, and hope that some of this information about what triggered the hangs helps in root-causing the issue. We saw these hangs most often on a configuration that updated our hermetic python toolchain to a new version, specifically on bazel invocations that were building a large part of the repository. Smaller invocations involving testing 10-100 targets would not exhibit the hang. Further investigation showed that our new python tar.gz was improperly built and contained multiple copies of some files (before/after listings of the archive are omitted here).
These files were exposed to bazel via a custom simplified rule based on py_runtime_pair that returned ToolchainInfo with a PyRuntimeInfo provider. Fixing the duplicated files mitigated the hangs for our build. It does still raise the concern that there is some threshold for toolchain/artifact size that could be crossed elsewhere and lead to these same remote execution hangs.
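For anyone wanting to check a toolchain archive for the duplicate-entry condition described above, here is a minimal, illustrative sketch. It is not part of the rule or of Bazel; it assumes Apache Commons Compress is on the classpath and uses a placeholder archive path.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.GZIPInputStream;
import org.apache.commons.compress.archivers.ArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

public final class DuplicateTarEntries {
  public static void main(String[] args) throws IOException {
    // Placeholder path: point this at the toolchain archive to check.
    String path = args.length > 0 ? args[0] : "python-toolchain.tar.gz";
    Set<String> seen = new HashSet<>();
    try (TarArchiveInputStream tar =
        new TarArchiveInputStream(
            new GZIPInputStream(new BufferedInputStream(new FileInputStream(path))))) {
      ArchiveEntry entry;
      while ((entry = tar.getNextEntry()) != null) {
        // Report any entry name that appears more than once in the archive.
        if (!seen.add(entry.getName())) {
          System.out.println("duplicate entry: " + entry.getName());
        }
      }
    }
  }
}
```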
Hi @coeuvre, I saw the ticket was assigned to you! Any thoughts about adding the timeout to the blockingAwait() call?
I think it's reasonable to add a timeout as specified by --remote_timeout.
Yes, I agree. We do have many actions with a lot of inputs, but it's still very surprising that an extra 100MB of inputs would cause the whole build to hang and get stuck in this state.
as a workaround for #19513. PiperOrigin-RevId: 569152388 Change-Id: I51e64f4708fc62ca3078290231e4195a081855df
Added the timeout in 95d6ccc as a workaround for now.
The timeout is unnecessarily constraining - all uploads now need to complete within the timeout measured from the start of the action's upload, with no allowance for ongoing progress, and I believe the cause of this problem is a bug in AsyncTaskCache. I've updated #21626 with my current test and workaround. I will find the fault in AsyncTaskCache and correct it soon.
and when it's missing, treat it as remote cache eviction. Also revert the workaround for #19513. Fixes #21777. Potential fix for #21626 and #21778. Closes #21825. PiperOrigin-RevId: 619877088 Change-Id: Ib1204de8440b780e5a6ee6a563a87da08f196ca5 Commit eda0fe4 Co-authored-by: Chi Wang <chiwang@google.com>
This should have been fixed by eda0fe4. If not, please reopen.
Description of the bug:
We have been using bazel 5.3.1 + cherry-picked #16819 and remote execution (Buildfarm, no BwoB) in our CI/CD system for years, and they have been pretty stable in general. Unfortunately, when we added a new build variant (which is pretty different from any existing builds), we started to see build timeouts very consistently, with the bazel daemon hanging.
We only hit the issue when running the build with remote execution (Buildfarm).
We tried a couple of different bazel releases, including 5.4.1 and 6.3.2, but still hit the same issue.
We also collected jstacks from the hanging build:
jstack-bazel-server-hang-6.3.2.txt
jstack-bazel-server-hang-5.4.1.txt
From the jstack, apparently many threads are in WAITING (parking) status. We are also able to get grpc logs by using --experimental_remote_grpc_log, but the log files are too large to post here.

Which category does this issue belong to?
Remote Execution
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
I'm not able to reproduce the issue with smaller builds.
Which operating system are you running Bazel on?
x86_64 Linux
What is the output of bazel info release?
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
The issue we hit is very similar to the one described in #16445, but we tried newer bazel releases that contain the fix #16819 from @coeuvre, like 5.4.1 and 6.3.2, and still hit the issue.
Any other information, logs, or outputs that you want to share?
One thing that could be missing here is to add a timeout to the blockingAwait call in the ensureInputsPresent function, like the following. This would unstick the build after --remote_timeout, fall these actions back to local execution, and allow the build to complete, but we were not able to root-cause the issue so far.
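A minimal sketch of what such a timeout could look like, assuming the upload work is exposed as an RxJava3 Completable (as the blockingAwait call suggests). The method and parameter names below are placeholders for illustration, not the actual Bazel code:

```java
import io.reactivex.rxjava3.core.Completable;
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.TimeUnit;

final class EnsureInputsPresentTimeoutSketch {
  // Waits for the given uploads, but gives up after `remoteTimeout` instead of
  // blocking forever. blockingAwait(timeout, unit) returns false on timeout.
  static void awaitUploads(Completable uploads, Duration remoteTimeout) throws IOException {
    boolean completed =
        uploads.blockingAwait(remoteTimeout.toMillis(), TimeUnit.MILLISECONDS);
    if (!completed) {
      // Surface an error so the action can fail (or fall back to local
      // execution) rather than hanging the whole build.
      throw new IOException(
          "Timed out waiting for remote input uploads after "
              + remoteTimeout.getSeconds()
              + "s");
    }
  }
}
```

With this shape, a stuck upload turns into an action-level error after --remote_timeout instead of a build-wide hang.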