
bazel build hung with jobs waiting on blockingAwait() from RemoteExecutionCache.ensureInputsPresent #19513

Closed
jacobmou opened this issue Sep 13, 2023 · 8 comments
Labels: P2 (We'll consider working on this in future. Assignee optional), team-Remote-Exec (Issues and PRs for the Execution (Remote) team), type: bug

Comments

@jacobmou

jacobmou commented Sep 13, 2023

Description of the bug:

We have been using bazel 5.3.1 + cherry-picked #16819 and remote execution (Buildfarm, no BwoB) in our CI/CD system for years, and they have been pretty stable in general. Unfortunately, when we added a new build variant (which is quite different from any existing builds), we started to see build timeouts very consistently, with the bazel daemon hanging like:

(19:09:37) [188,650 / 188,787] 7948 / 31730 tests; ...; 2481s remote ... (128 actions, 0 running)
(19:18:34) [188,650 / 188,787] 7948 / 31730 tests; ...; 3018s remote ... (128 actions, 0 running)
(19:28:46) [188,650 / 188,787] 7948 / 31730 tests; ...; 3629s remote ... (128 actions, 0 running)
(19:40:37) [188,650 / 188,787] 7948 / 31730 tests; ...; 4341s remote ... (128 actions, 0 running)
(19:54:37) [188,650 / 188,787] 7948 / 31730 tests; ...; 5181s remote ... (128 actions, 0 running)
# Received cancellation signal, interrupting
Terminated

We only hit the issue when running the build with remote execution (Buildfarm).

We tried a couple of different bazel releases including:

  1. bazel 5.3.1 release + cherry-pick [5.4.0] Fix hanging issue when Bazel failed to upload action inputs #16819
  2. 5.4.1
  3. 6.3.2

but still hit the same issue.

We also collected jstacks from the hung builds:
jstack-bazel-server-hang-6.3.2.txt
jstack-bazel-server-hang-5.4.1.txt

From the jstacks, many threads are apparently in WAITING (parking) state, like:

    java.lang.Thread.State: WAITING (parking)
   at jdk.internal.misc.Unsafe.park(java.base@11.0.6/Native Method)
   - parking to wait for  <0x00007f0d38580c38> (a java.util.concurrent.CountDownLatch$Sync)
   at java.util.concurrent.locks.LockSupport.park(java.base@11.0.6/Unknown Source)
   at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.6/Unknown Source)
   at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.6/Unknown Source)
   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.6/Unknown Source)
   at java.util.concurrent.CountDownLatch.await(java.base@11.0.6/Unknown Source)
   at io.reactivex.rxjava3.internal.observers.BlockingMultiObserver.blockingGet(BlockingMultiObserver.java:86)
   at io.reactivex.rxjava3.core.Completable.blockingAwait(Completable.java:1468)
   at com.google.devtools.build.lib.remote.RemoteExecutionCache.ensureInputsPresent(RemoteExecutionCache.java:101)
   at com.google.devtools.build.lib.remote.RemoteExecutionService.uploadInputsIfNotPresent(RemoteExecutionService.java:1350)
   at com.google.devtools.build.lib.remote.RemoteSpawnRunner.lambda$exec$2(RemoteSpawnRunner.java:251)
   at com.google.devtools.build.lib.remote.RemoteSpawnRunner$$Lambda$1322/0x00007ee9aa845040.call(Unknown Source)
   at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:244)

We were also able to capture grpc logs using --experimental_remote_grpc_log, but the log files are too large to post here.

Which category does this issue belong to?

Remote Execution

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I'm not able to reproduce the issue with smaller builds.

Which operating system are you running Bazel on?

x86_64 Linux

What is the output of bazel info release?

  1. bazel 5.3.1 release + cherry-pick [5.4.0] Fix hanging issue when Bazel failed to upload action inputs #16819
  2. 5.4.1
  3. 6.3.2

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

The issue we hit is very similar to the one described in #16445, but we tried newer Bazel releases that contain the fix #16819 from @coeuvre, like 5.4.1 and 6.3.2, and still hit the issue.

Any other information, logs, or outputs that you want to share?

One thing that could be missing here is a timeout on the blockingAwait call in the ensureInputsPresent function, like the following. This would unstick the build after --remote_timeout, let the affected actions fall back to local execution, and allow the build to complete; however, we have not been able to root-cause the issue so far.

 --- a/src/main/java/com/google/devtools/build/lib/remote/RemoteExecutionCache.java
 +++ b/src/main/java/com/google/devtools/build/lib/remote/RemoteExecutionCache.java
 @@ -28,6 +28,7 @@ import build.bazel.remote.execution.v2.Directory;
  import com.google.common.base.Throwables;
  import com.google.common.collect.ImmutableList;
  import com.google.common.collect.ImmutableSet;
 +import com.google.common.flogger.GoogleLogger;
  import com.google.common.util.concurrent.ListenableFuture;
  import com.google.devtools.build.lib.profiler.Profiler;
  import com.google.devtools.build.lib.profiler.SilentCloseable;
 @@ -50,12 +51,15 @@ import io.reactivex.rxjava3.core.SingleEmitter;
  import io.reactivex.rxjava3.disposables.Disposable;
  import io.reactivex.rxjava3.subjects.AsyncSubject;
  import java.io.IOException;
 +import java.nio.channels.InterruptedByTimeoutException;
  import java.util.List;
  import java.util.Map;
 +import java.util.concurrent.TimeUnit;
  import java.util.concurrent.atomic.AtomicReference;

  /** A {@link RemoteCache} with additional functionality needed for remote execution. */
  public class RemoteExecutionCache extends RemoteCache {
 +  static final GoogleLogger logger = GoogleLogger.forEnclosingClass();

    public RemoteExecutionCache(
        RemoteCacheClient protocolImpl,
 @@ -98,7 +102,12 @@ public class RemoteExecutionCache extends RemoteCache {
              .flatMapPublisher(this::waitForUploadTasks);

      try {
 -      mergeBulkTransfer(uploads).blockingAwait();
 +      // Bound the blockingAwait call by --remote_timeout so a stalled upload
 +      // fails instead of hanging the build indefinitely.
 +      if (!mergeBulkTransfer(uploads).blockingAwait(options.remoteTimeout.getSeconds(), TimeUnit.SECONDS)) {
 +        logger.atInfo().log("blockingAwait() timed out after --remote_timeout=%s seconds", options.remoteTimeout.getSeconds());
 +        throw new InterruptedByTimeoutException();
 +      }
      } catch (RuntimeException e) {
        Throwable cause = e.getCause();
        if (cause != null) {
@jacobmou
Author

jacobmou commented Sep 13, 2023

The hangs might be caused by the fact that many actions in the new build happen to depend on a lot of inputs, but in general the blockingAwait call could use a timeout and some logging to unstick the whole build.

@iancha1992 iancha1992 added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Sep 13, 2023
@alex-torok
Contributor

I work with @jacobmou on this, and we were able to figure out the triggering condition and mitigate the problem for our build. I believe there is still a bug somewhere, and I hope some of this information about what triggered the hangs helps in root-causing the issue.

We saw these hangs most often on a configuration that updated our hermetic python toolchain to a new version, specifically on bazel invocations that built a large part of the repository. Smaller invocations testing 10-100 targets would not exhibit the hang.

Further investigation showed that our new python tar.gz was improperly built and had multiple copies of libpython3. This bloated the python toolchain by roughly 100MB:

before:

15M     python3.7/.libs/libpython3-2ce1.7m.so.1.0

After:

16M     python3.10/.libs/libpython3-2897.10.so.1.0
16M     python3.10/.libs/libpython3-5d54.10.so.1.0
16M     python3.10/.libs/libpython3-81f7.10.so.1.0
16M     python3.10/.libs/libpython3-e36b.10.so.1.0
19M     python3.10/.libs/libpython3-252f.10.so.1.0
19M     python3.10/.libs/libpython3-7608.10.so.1.0
19M     python3.10/.libs/libpython3-ea3e.10.so.1.0

These files were exposed to bazel via a custom simplified rule based on py_runtime_pair that returned ToolchainInfo with a PyRuntimeInfo provider.

Upon fixing the duplicated libpython3 files, the hangs disappeared.

It does still raise the concern that there is some threshold of toolchain/artifact size that could be crossed elsewhere and lead to these same remote execution hangs.

@coeuvre coeuvre self-assigned this Sep 19, 2023
@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) and removed untriaged labels Sep 19, 2023
@jacobmou
Author

Hi @coeuvre, I saw the ticket was assigned to you! Any thoughts on adding the timeout to the blockingAwait() call?

@coeuvre
Member

coeuvre commented Sep 26, 2023

I think it's reasonable to add a timeout, as specified by --remote_timeout, to the blockingAwait(). But I am more worried about the underlying issue: it looks like Bazel was not able to upload the extra 100MB of inputs.
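
As a minimal illustration of the two overloads involved (a standalone RxJava 3 sketch, not Bazel code; the class name and the Completable.never() stand-in are just for demonstration): the no-argument blockingAwait() parks the calling thread forever if the Completable never terminates, while the timed overload returns false so the caller can fail fast.

import io.reactivex.rxjava3.core.Completable;
import java.util.concurrent.TimeUnit;

public class BlockingAwaitDemo {
  public static void main(String[] args) {
    // Stand-in for an upload that never completes.
    Completable stuckUpload = Completable.never();

    // stuckUpload.blockingAwait();  // would park the calling thread indefinitely
    boolean finished = stuckUpload.blockingAwait(2, TimeUnit.SECONDS);
    System.out.println("finished within timeout: " + finished);  // prints false
  }
}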

@jacobmou
Author

Yes, I agree. We do have many actions with a lot of inputs, but it's still very surprising that an extra 100MB of inputs would cause the whole build to hang and get stuck in this state.
I can try to dig a little more into what's going on and hopefully gain more insight into the issue. In the meantime, @coeuvre, what do you think about landing a patch like the one above that sets a timeout on the blockingAwait() call?

copybara-service bot pushed a commit that referenced this issue Sep 28, 2023
as a workaround for #19513.

PiperOrigin-RevId: 569152388
Change-Id: I51e64f4708fc62ca3078290231e4195a081855df
@coeuvre
Member

coeuvre commented Sep 28, 2023

Added the timeout in 95d6ccc as a workaround for now.

@werkt
Contributor

werkt commented Mar 16, 2024

The timeout is unnecessarily constraining: all uploads now need to complete within the timeout, measured from when the action's upload is created, with no credit for progress. I believe the cause of this problem is a bug in the AsyncTaskCache.

I've updated #21626 with my current test and workaround. I will find the fault in AsyncTaskCache and correct it soon.
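
To illustrate the concern, here is a hypothetical sketch (not Bazel's actual upload code; the ProgressAwareAwait class, the onBytesUploaded callback, and the polling loop are assumptions for illustration): a single timed blockingAwait bounds the whole merged upload from the moment it starts, so a large but steadily progressing transfer can still be cut off, whereas a stall detector that resets on progress only fails when nothing moves for the whole window.

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

/** Hypothetical stall detector: fails only when no upload progress is seen for the window. */
final class ProgressAwareAwait {
  private final AtomicLong lastProgressNanos = new AtomicLong(System.nanoTime());

  /** Would be called by upload progress callbacks whenever bytes are written. */
  void onBytesUploaded(long bytes) {
    lastProgressNanos.set(System.nanoTime());
  }

  /** Blocks until done is true; returns false only if no progress is seen for windowSeconds. */
  boolean await(BooleanSupplier done, long windowSeconds) throws InterruptedException {
    long windowNanos = TimeUnit.SECONDS.toNanos(windowSeconds);
    while (!done.getAsBoolean()) {
      if (System.nanoTime() - lastProgressNanos.get() > windowNanos) {
        return false;  // stalled: the transfer may be large, but nothing has moved recently
      }
      Thread.sleep(100);
    }
    return true;
  }
}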

copybara-service bot pushed a commit that referenced this issue Mar 28, 2024
and when it's missing, treat it as remote cache eviction.

Also revert the workaround for #19513.

Fixes #21777.
Potential fix for #21626 and #21778.

Closes #21825.

PiperOrigin-RevId: 619877088
Change-Id: Ib1204de8440b780e5a6ee6a563a87da08f196ca5
iancha1992 added a commit that referenced this issue Apr 2, 2024
and when it's missing, treat it as remote cache eviction.

Also revert the workaround for #19513.

Fixes #21777.
Potential fix for #21626 and #21778.

Closes #21825.

PiperOrigin-RevId: 619877088
Change-Id: Ib1204de8440b780e5a6ee6a563a87da08f196ca5

Commit eda0fe4

Co-authored-by: Chi Wang <chiwang@google.com>
@tjgq
Contributor

tjgq commented Aug 29, 2024

This should have been fixed by eda0fe4. If not, please reopen.

@tjgq tjgq closed this as completed Aug 29, 2024