Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bazel stuck during remote execution #21626

Open
dkashyn-sfdc opened this issue Mar 8, 2024 · 48 comments
Open

Bazel stuck during remote execution #21626

dkashyn-sfdc opened this issue Mar 8, 2024 · 48 comments
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@dkashyn-sfdc
Copy link

dkashyn-sfdc commented Mar 8, 2024

Description of the bug:

Bazel build current state

[65,456 / 65,465] PackageZip ...rcejar.jar; 20129s remote

when there is no such issue observed this run takes less than a minute if not less than 10s.

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

There is no easy way to reproduce it and some noticeable % of our RBE runs are stuck forever without Bazel finishing properly.

Here is the thread dump of Bazel server and there is no activity on RBE side workers at the same time

Which operating system are you running Bazel on?

5.14.0-362.18.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Jan 3 15:54:45 EST 2024 x86_64 x86_64 x86_64 GNU/Linux

What is the output of bazel info release?

release 7.0.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

We do have the following in .bazelrc

common:remote --jobs=300
common:remote --remote_timeout=3600
@dkashyn-sfdc
Copy link
Author

scratch_45.txt
here is the thread dump of Bazel server when this is happening

@benjaminp
Copy link
Collaborator

I recommend not using --experimental_remote_execution_keepalive.

@dkashyn-sfdc
Copy link
Author

Let us try out to remove this!

@dkashyn-sfdc
Copy link
Author

Should it TO even with keep alive though?

@sgowroji sgowroji added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Mar 9, 2024
@dkashyn-sfdc
Copy link
Author

Still happening even without keep alive

 cat .bazelrc |grep keep
#common:remote --experimental_remote_execution_keepalive
[65,553 / 65,563] .... 1966s remote

@dkashyn-sfdc
Copy link
Author

bazel_server_dump.txt
another dump with keep_alive off

@jasonschroeder-sfdc
Copy link
Contributor

FYI @werkt this is my colleague and we're using Buildfarm

@coeuvre
Copy link
Member

coeuvre commented Mar 12, 2024

From both stack dumps, it seems like bazel is waiting for the remote execution to be done. Did it happen for Bazel 6? Do the logs from remote server tell anything?

@dkashyn-sfdc
Copy link
Author

We are only starting to experiment with RBE so we don't have pre-7.0 baseline, unfortunately.

Regarding remote server doing something: I've checked BuildFarm workers and there were no active tasks there.

@coeuvre
Copy link
Member

coeuvre commented Mar 13, 2024

Another suspicious stack frame is:


"remote-executor-15" #3372 [22233] prio=5 os_prio=0 cpu=238.42ms elapsed=12851.39s allocated=100633K defined_classes=0 tid=0x00007fb6ac44d050 nid=22233 in Object.wait()  [0x00007fb7562ac000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait0(java.base@21.0.1/Native Method)
	- waiting on <no object reference available>
	at java.lang.Object.wait(java.base@21.0.1/Object.java:366)
	at java.lang.Object.wait(java.base@21.0.1/Object.java:339)
	at com.google.devtools.build.lib.server.GrpcServerImpl$BlockingStreamObserver.onNext(GrpcServerImpl.java:175)
	- locked <0x000000042f67b158> (a com.google.devtools.build.lib.server.GrpcServerImpl$BlockingStreamObserver)
	at com.google.devtools.build.lib.server.GrpcServerImpl$RpcOutputStream.write(GrpcServerImpl.java:253)
	at com.google.devtools.build.lib.util.io.DelegatingOutErr$DelegatingOutputStream.write(DelegatingOutErr.java:89)
	at java.io.BufferedOutputStream.flushBuffer(java.base@21.0.1/BufferedOutputStream.java:125)
	at java.io.BufferedOutputStream.implFlush(java.base@21.0.1/BufferedOutputStream.java:252)
	at java.io.BufferedOutputStream.flush(java.base@21.0.1/BufferedOutputStream.java:240)
	at com.google.devtools.build.lib.runtime.UiEventHandler$FullyBufferedOutputStream.flush(UiEventHandler.java:146)
	at com.google.devtools.build.lib.util.io.AnsiTerminal.flush(AnsiTerminal.java:192)
	at com.google.devtools.build.lib.runtime.UiEventHandler.handleLocked(UiEventHandler.java:373)
	- locked <0x000000042f67b2f0> (a com.google.devtools.build.lib.runtime.UiEventHandler)
	at com.google.devtools.build.lib.runtime.UiEventHandler.handleInternal(UiEventHandler.java:443)
	at com.google.devtools.build.lib.runtime.UiEventHandler.handle(UiEventHandler.java:471)
	at com.google.devtools.build.lib.events.Reporter.handle(Reporter.java:127)
	at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.reportUploadError(ByteStreamBuildEventArtifactUploader.java:342)
	at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.lambda$uploadLocalFiles$5(ByteStreamBuildEventArtifactUploader.java:371)
	at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader$$Lambda/0x00007fb75c9f5568.apply(Unknown Source)
	at io.reactivex.rxjava3.internal.operators.single.SingleResumeNext$ResumeMainSingleObserver.onError(SingleResumeNext.java:73)
	at io.reactivex.rxjava3.internal.operators.completable.CompletableToSingle$ToSingle.onError(CompletableToSingle.java:73)
	at io.reactivex.rxjava3.internal.operators.completable.CompletableCreate$Emitter.tryOnError(CompletableCreate.java:91)
	at io.reactivex.rxjava3.internal.operators.completable.CompletableCreate$Emitter.onError(CompletableCreate.java:77)
	at com.google.devtools.build.lib.remote.util.RxFutures$OnceCompletableOnSubscribe$1.onFailure(RxFutures.java:102)
	at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1119)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@21.0.1/ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@21.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.runWith(java.base@21.0.1/Thread.java:1596)
	at java.lang.Thread.run(java.base@21.0.1/Thread.java:1583)

It seems like the Bazel server has trouble sending message to the client. Can you share the list of flags your build uses? And maybe try disabling BEP/BES?

@dkashyn-sfdc
Copy link
Author

BEP is async and this is confirmed to be working properly in an async way starting from 7.0

build --bes_backend=grpcs://...
build --bes_upload_mode=fully_async
build --build_event_text_file_upload_mode=fully_async
build --build_event_binary_file_upload_mode=fully_async
build --build_event_json_file_upload_mode=fully_async

Here is the full list of remote-related configs

common:remote_sfw_uswest2 --remote_executor=grpcs://bazel-buildfarm....
common:remote_sfw_uswest2 --remote_cache=grpcs://bazel-buildfarm....
common:remote_sfw_uswest2 --config=remote

# remote defaults.
common:remote --define=EXECUTOR=remote
common:remote --experimental_remote_cache_compression  # https://github.com/buildbuddy-io/buildbuddy/blob/master/.bazelrc#L32
common:remote --jobs=300
common:remote --remote_timeout=3600
common:remote --build_metadata=RBE=True
common:remote --platforms=@rbe_default//config:platform
common:remote --host_platform=@rbe_default//config:platform
common:remote --extra_execution_platforms=@rbe_default//config:platform
common:remote --crosstool_top=@rbe_default//cc:toolchain
common:remote --extra_toolchains=@rbe_default//config:cc-toolchain
common:remote --disk_cache=
common:remote --cpu=k8 --host_cpu=k8 # force tools to use intel x86_64 CPU (in case you are trying on an Apple Mac)
#common:remote --experimental_remote_execution_keepalive

@coeuvre
Copy link
Member

coeuvre commented Mar 13, 2024

From the stack trace above, Bazel encountered errors when uploading blobs to BES/BEP. The errors were not printed out because Bazel server stuck at sending the message to client.

I suggested to try disabling BES/BEP because by doing so we know whether this (the upload error) is the reason why Bazel stuck.

@dkashyn-sfdc
Copy link
Author

I'll try to repro it without BES/BEP in next couple of days. The issue still should be fixed though even if this is because of BEP/BES ;) We specifically asking for async.

What I remember from server logs earlier: there were no traces of any uploads in other Bazel server logs.

@dkashyn-sfdc
Copy link
Author

❯ cat .bazelrc | grep -e bes -e event
#build --bes_backend=grpcs://...
#build --bes_upload_mode=fully_async
#build --build_event_text_file_upload_mode=fully_async
#build --build_event_binary_file_upload_mode=fully_async
#build --build_event_json_file_upload_mode=fully_async

I see
image

rbe_another_dump.txt

This time RBE server might be not really healthy yet in this case I would expect Bazel to fail after a reasonable timeout.

@coeuvre
Copy link
Member

coeuvre commented Mar 14, 2024

I'll try to repro it without BES/BEP in next couple of days. The issue still should be fixed though even if this is because of BEP/BES ;) We specifically asking for async.

Right, we are still investigate what is the root cause and I am trying to remove the variances as much as possible.

From the new stack trace:

skyframe-evaluator-execution-150" #12989 [603421] daemon prio=5 os_prio=0 cpu=3342.37ms elapsed=1629.24s allocated=1920M defined_classes=0 tid=0x00007f7c5c24ee30 nid=603421 waiting on condition  [0x00007f7b13432000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@21.0.1/Native Method)
	- parking to wait for  <0x000000056776f8d0> (a java.util.concurrent.CountDownLatch$Sync)
	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@21.0.1/LockSupport.java:269)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@21.0.1/AbstractQueuedSynchronizer.java:756)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@21.0.1/AbstractQueuedSynchronizer.java:1126)
	at java.util.concurrent.CountDownLatch.await(java.base@21.0.1/CountDownLatch.java:276)
	at io.reactivex.rxjava3.internal.observers.BlockingMultiObserver.blockingAwait(BlockingMultiObserver.java:134)
	at io.reactivex.rxjava3.core.Completable.blockingAwait(Completable.java:1497)
	at com.google.devtools.build.lib.remote.RemoteExecutionCache.ensureInputsPresent(RemoteExecutionCache.java:117)
	at com.google.devtools.build.lib.remote.RemoteExecutionService.uploadInputsIfNotPresent(RemoteExecutionService.java:1451)
	at com.google.devtools.build.lib.remote.RemoteSpawnRunner.lambda$exec$2(RemoteSpawnRunner.java:272)
	at com.google.devtools.build.lib.remote.RemoteSpawnRunner$$Lambda/0x00007f7d0c88b4e8.call(Unknown Source)
	at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:245)
	at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:127)
	at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:116)
	at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:265)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:159)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:119)
        ...

it seems like Bazel stuck at uploading inputs to remote server. Since you set --remote_timeout=3600, Bazel will only time out after 1 hour. Can you try with a lower value to check whether Bazel can time out?

@dkashyn-sfdc
Copy link
Author

dkashyn-sfdc commented Mar 14, 2024

image

rbe_yet_another_dump.txt

@coeuvre
Copy link
Member

coeuvre commented Mar 14, 2024

It seems relate to --experimental_throttle_remote_action_building. Can you try with --noexperimental_throttle_remote_action_building?

@dkashyn-sfdc
Copy link
Author

As you can see here #21626 (comment) we don't have this enabled. I still can turn it off if required though.

@dkashyn-sfdc
Copy link
Author

Ok... I see it is enabled by default. Let me turn that off

  @Option(
      name = "experimental_throttle_remote_action_building",
      defaultValue = "true",

@dkashyn-sfdc
Copy link
Author

What to do with BES/BEP and keep_alive that were turned off earlier ?

@coeuvre
Copy link
Member

coeuvre commented Mar 14, 2024

Let's keep them off for now.

@dkashyn-sfdc
Copy link
Author

This flag is 8.x only, based on 294c904

I got

ERROR: --noexperimental_throttle_remote_action_building :: Unrecognized option: --noexperimental_throttle_remote_action_building

for

cat .bazelversion 
7.0.1

So I'd say this is definitely not this property.

@coeuvre
Copy link
Member

coeuvre commented Mar 14, 2024

So I'd say this is definitely not this property.

Although the flag was not available for 7.0.1, the feature is still there and on.

The flag is available in Bazel 7.1.0. Can you use that version instead and try?

@dkashyn-sfdc
Copy link
Author

@coeuvre were you able to identify the issue? How else we can help here?

@coeuvre
Copy link
Member

coeuvre commented Mar 20, 2024

I was able to reproduce a hang but the stack trace is different. I am not sure whether they share the same root cause.

In the mean time, can you check whether your build hangs with --remote_download_all? (If it doesn't, then I know the root cause)

@dkashyn-sfdc
Copy link
Author

Something has changed in our RBE server config for this not to happen anymore :(

Yet the fact that it was so consistent before means that some more resilience on client side won't hurt.

@joeljeske
Copy link
Contributor

joeljeske commented Mar 27, 2024

I think I am seeing the same issue as reported here, on 7.1.1.

2024-03-26 21:28:30 CDT	(02:28:30) [755,647 / 761,751] 16720 / 24275 tests, 1 failed; [Prepa] <redacted target>; 2857s ... (200 actions, 0 running)
2024-03-26 21:29:58 CDT	(02:29:58) [756,389 / 762,222] 16768 / 24275 tests, 1 failed; [Prepa] <redacted target>; 2945s ... (200 actions, 2 running)
2024-03-26 21:31:04 CDT	(02:31:04) [764,176 / 769,524] 17127 / 24275 tests, 1 failed; [Prepa] <redacted target>; 3011s ... (200 actions, 1 running)
2024-03-26 21:32:04 CDT	(02:32:04) [766,330 / 771,249] 17205 / 24275 tests, 1 failed; [Prepa] <redacted target>; 3071s ... (200 actions, 0 running)
2024-03-26 21:33:35 CDT	(02:33:35) [766,330 / 771,249] 17205 / 24275 tests, 1 failed; [Prepa] <redacted target>; 3161s ... (200 actions, 0 running)

Some relevant flags

--jobs 200
--remote_download_minimal
--experimental_throttle_remote_action_building  # retained default value
--bes_backend=grpc://127.0.0.1:5555
--bes_upload_mode=FULLY_ASYNC
--remote_timeout=3600

threads.txt

As per last suggestion, I will try with --remote_download_all

@werkt
Copy link
Contributor

werkt commented Mar 27, 2024

@joeljeske Your build is endeavoring to build the merkle trees of 199 actions, slowly, it seems, but not obviously stuck. 64 (presumably your # of cores, throttled to the remote action building semaphore) are proceeding in this task.

You do have one evaluator skyframe-evaluator-40 waiting for a subprocess, but I don't think your build is shot, just a lot of work going on. Unless you're out of heap and bouncing against the GC ceiling, this will proceed, and definitely doesn't look like the hangs in ensureInputsPresent that indicate a lack of signaling to upload procedures.

@joeljeske
Copy link
Contributor

Understood! Thank you for the analysis. I’m curious then, why I don’t I see observe the same behavior on Bazel 6.4.0. I can consistently see slow build speeds on 7.1.1 that are not present before the upgrade.

Do you have any ideas?

@werkt
Copy link
Contributor

werkt commented Mar 27, 2024

In the interest of not hijacking this issue, can you file a separate one @joeljeske against a performance regression between 6.4.0. and 7.1.1?

copybara-service bot pushed a commit that referenced this issue Mar 28, 2024
and when it's missing, treat it as remote cache eviction.

Also revert the workaround for #19513.

Fixes #21777.
Potential fix for #21626 and #21778.

Closes #21825.

PiperOrigin-RevId: 619877088
Change-Id: Ib1204de8440b780e5a6ee6a563a87da08f196ca5
@coeuvre
Copy link
Member

coeuvre commented Mar 28, 2024

Can you patch eda0fe4 and check whether this issue is still reproducible?

@dkashyn-sfdc
Copy link
Author

@coeuvre will this be merged to 7.2? Patching is hard and we not even fully moved to 7.1 yet... I'd rather try to repro this with 7.0 like before and then switch to "head" of 7.2 to see whether the issue is there.

My CAS state is changing and evictions and cleanup not happening recently. This is why I'm unable to repro currently. Yet we are verifying some CAS-related issue with BuildFarm and I hope to get to the point when CAS is mutating again to trigger the issue for us.

@coeuvre
Copy link
Member

coeuvre commented Mar 28, 2024

Yes, the fix will be merged into 7.2.

No pressure! I will keep this issue open (but with a lower priority) for the time being. Feel free to close it once you can verify the fix.

@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) and removed P1 I'll work on this now. (Assignee required) labels Mar 28, 2024
iancha1992 pushed a commit to iancha1992/bazel that referenced this issue Mar 28, 2024
and when it's missing, treat it as remote cache eviction.

Also revert the workaround for bazelbuild#19513.

Fixes bazelbuild#21777.
Potential fix for bazelbuild#21626 and bazelbuild#21778.

Closes bazelbuild#21825.

PiperOrigin-RevId: 619877088
Change-Id: Ib1204de8440b780e5a6ee6a563a87da08f196ca5
@dkashyn-sfdc
Copy link
Author

@coeuvre could you please tell where we can get the custom Bazelisk-ready version that contains the fix? Rolling releases seems having only 7.0 and 8.0 pre-releases and I don't see any 7.1 and 7.2 nightly builds here https://bazel.build/release/rolling

iancha1992 added a commit that referenced this issue Apr 2, 2024
and when it's missing, treat it as remote cache eviction.

Also revert the workaround for #19513.

Fixes #21777.
Potential fix for #21626 and #21778.

Closes #21825.

PiperOrigin-RevId: 619877088
Change-Id: Ib1204de8440b780e5a6ee6a563a87da08f196ca5

Commit
eda0fe4

Co-authored-by: Chi Wang <chiwang@google.com>
iancha1992 pushed a commit to iancha1992/bazel that referenced this issue Apr 9, 2024
and when it's missing, treat it as remote cache eviction.

Also revert the workaround for bazelbuild#19513.

Fixes bazelbuild#21777.
Potential fix for bazelbuild#21626 and bazelbuild#21778.

Closes bazelbuild#21825.

PiperOrigin-RevId: 619877088
Change-Id: Ib1204de8440b780e5a6ee6a563a87da08f196ca5
iancha1992 added a commit that referenced this issue Apr 9, 2024
and when it's missing, treat it as remote cache eviction.

Also revert the workaround for #19513.

Fixes #21777.
Potential fix for #21626 and #21778.

Closes #21825.

PiperOrigin-RevId: 619877088
Change-Id: Ib1204de8440b780e5a6ee6a563a87da08f196ca5

Commit
eda0fe4

Co-authored-by: Chi Wang <chiwang@google.com>
@dkashyn-sfdc
Copy link
Author

Tried 8.0.0-pre.20240404.3 and still see

[106,946 / 106,964] some target here....; 5713s remote, remote-cache, local

this seems not ok. It doesn't look that https://github.com/bazelbuild/bazel/pull/21941/files is merged to master though... So probably this is expected.

@iancha1992 @coeuvre could you please suggest which bazelisk builds can be used to test this change?

@iancha1992
Copy link
Member

A fix for this issue has been included in Bazel 7.2.0 RC1. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.2.0rc1. Thanks!

github-actions bot pushed a commit to chachako/android-studio that referenced this issue Jun 13, 2024
Hopefully this includes a fix for Bazel getting stuck using remote execution. bazelbuild/bazel#21626

Change-Id: I0ae977bc492facc0ebfe8571660a3b02cfb14960

Former-commit-id: 5d91fa83c02fb7c5d995498cb1ecb98edcd65c1e
@werkt
Copy link
Contributor

werkt commented Jul 17, 2024

This issue has recurred under bazel 7.2. I have a stuck java process awaiting countdown latches for all ensureInputsPresent running.

stuck-7.2.0.jstack.txt

@werkt werkt reopened this Jul 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

No branches or pull requests

9 participants