Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Actions stuck in "active" and thus "[Prepa]" #13985

Closed
DavidANeil opened this issue Sep 13, 2021 · 4 comments
Closed

Actions stuck in "active" and thus "[Prepa]" #13985

DavidANeil opened this issue Sep 13, 2021 · 4 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@DavidANeil
Copy link
Contributor

Description of the problem:

The UiStateTracker sometimes reports actions in "Preparing" status (shortened to "[Prepa]") indefinitely.
Because this "stuck" action will be the longest running, it will often be the only displayed action (when --curses=no).
For example, here is what a sample CI run looks like:

[0 / 69] [Prepa] BazelWorkspaceStatusAction stable-status.txt
[15,599 / 17,556] 5 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 13s ... (296 actions, 90 running)
[26,933 / 50,441] 34 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 29s ... (344 actions, 268 running)
[55,044 / 92,565] 89 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 46s ... (438 actions, 275 running)
[57,666 / 93,495] 101 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 66s ... (546 actions, 312 running)
[58,588 / 93,662] 111 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 89s ... (577 actions, 323 running)
[58,915 / 93,691] 111 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 116s ... (577 actions, 324 running)
[59,156 / 93,935] 121 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 147s ... (572 actions, 321 running)
[59,582 / 94,344] 149 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 182s ... (577 actions, 323 running)
[60,425 / 95,053] 168 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 223s ... (578 actions, 325 running)
[61,058 / 95,161] 203 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 270s ... (577 actions, 323 running)
[62,440 / 96,339] 351 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 324s ... (577 actions, 324 running)
[63,079 / 96,913] 420 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 386s ... (578 actions, 324 running)
[64,779 / 114,781] 485 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 457s ... (581 actions, 326 running)
[65,352 / 115,200] 543 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 539s ... (579 actions, 323 running)
[68,085 / 115,383] 589 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 633s ... (581 actions, 323 running)
[76,800 / 115,433] 652 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 741s ... (571 actions, 305 running)
[95,197 / 115,457] 654 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 866s ... (404 actions, 137 running)
[116,032 / 116,315] 828 / 877 tests; [Prepa] Compiling TypeScript (prodmode) //ts/lib:lib; 1009s ... (162 actions, 102 running)

Which is just ... not useful.

Note that actual action execution is not affected by this. This is purely a visual bug.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

This bug appears to be caused by a race condition, so is not easily reproducible.
I reckon it is a case where an action is added to the activeActions map, then removed after it completes, then re-added by a call to computeIfAbsent being called in a racy manner.

What operating system are you running Bazel on?

Linux 20.04

What's the output of bazel info release?

release 4.2.0

I have not been able to reproduce this issue on 4.1.0

@brentleyjones
Copy link
Contributor

This issue also exists in the latest rolling release.

@bjacklyn
Copy link

cc @coeuvre

Also seeing this after upgrading to 4.2.1 from 4.0.0. These are the changes that have gone in since the 4.1.0 release:

$ git log HEAD ^4.1.0 --oneline src/main/java/com/google/devtools/build/lib/runtime/UiStateTracker.java
bf5e2b8ec0 Remote: Async upload (Part 8)
b51b31dbe7 Remote: Fix a race when reporting action progresses.
0f812eb5e5 Remote: Display download progress when actions are downloading outputs from remote cache.
f0983dfddb Remote: Report checking cache status before the action is scheduled to run remotely.

It looks like there was an attempt at fixing the UI bug in b51b31dbe7.

By adding the below patch you can reproduce this issue:

id 'File:[[<execution_root>]bazel-out/k8-opt/.../foo' is already finished, and event Downloading foo, 5.2 KiB / 5.2 KiB is being processed
...

The problem seems to be like @DavidANeil mentioned -- actionProgress(ActionProgressEvent event) is probably called in ThreadA before actionCompletion(ActionCompletionEvent event) in ThreadB but the ThreadA goes to sleep and when it revives it calls getActionState() which calls activeActions.computeIfAbsent() which re-adds the actions that was just removed by ThreadB. The other possibility is events are being fired out-of-order.

I wonder if the right fix is for UiEventHandler to queue and only fire 1 in-progress event per actionId and wait for each event completion before firing the next. Presumably computeIfAbsent is being used because of this exact problem -- events can be processed out-of-order so actionStarted() cannot be assumed to have already ran & created an action in activeActions.

Patch:

$ git diff
diff --git a/src/main/java/com/google/devtools/build/lib/runtime/UiStateTracker.java b/src/main/java/com/google/devtools/build/lib/runtime/UiStateTracker.java
index eef7ea5c64..99ae5e36d2 100644
--- a/src/main/java/com/google/devtools/build/lib/runtime/UiStateTracker.java
+++ b/src/main/java/com/google/devtools/build/lib/runtime/UiStateTracker.java
@@ -361,6 +361,8 @@ final class UiStateTracker {
 
   private final Map<Artifact, ActionState> activeActions;
 
+  private final Map<Artifact, Boolean> finishedActions;
+
   // running downloads are identified by the original URL they were trying to access.
   private final Deque<String> runningDownloads;
   private final Map<String, Long> downloadNanoStartTimes;
@@ -396,6 +398,7 @@ final class UiStateTracker {
 
   UiStateTracker(Clock clock, int targetWidth) {
     this.activeActions = new ConcurrentHashMap<>();
+    this.finishedActions = new ConcurrentHashMap<>();
 
     this.actionsCompleted = new AtomicInteger();
     this.testActions = new ConcurrentHashMap<>();
@@ -582,6 +585,9 @@ final class UiStateTracker {
     ActionExecutionMetadata action = event.action();
     Artifact actionId = event.action().getPrimaryOutput();
     long now = clock.nanoTime();
+    if (finishedActions.containsKey(actionId)) {
+        System.out.println(String.format("id '%s' is already finished, and event %s is being processed", actionId, event.progress()));
+    }
     getActionState(action, actionId, now).onProgressEvent(event, now);
   }
 
@@ -589,6 +595,7 @@ final class UiStateTracker {
     Action action = event.getAction();
     Artifact actionId = action.getPrimaryOutput();
     checkNotNull(activeActions.remove(actionId), "%s not active after %s", actionId, event);
+    finishedActions.put(actionId, true);
 
     // As callers to the experimental state tracker assume we will fully report the new state once
     // informed of an action completion, we need to make sure the progress receiver is aware of the
@@ -604,6 +611,7 @@ final class UiStateTracker {
     Artifact actionId = action.getPrimaryOutput();
 
     checkNotNull(activeActions.remove(actionId), "%s not active after %s", actionId, event);
+    finishedActions.put(actionId, true);
 
     if (action.getOwner() != null) {
       Label owner = action.getOwner().getLabel();

@bjacklyn
Copy link

Also for anyone who OK with compiling bazel from source and applying a patch (patch based off 4.2.1), this fixes it for me:

$ git diff
diff --git a/src/main/java/com/google/devtools/build/lib/runtime/UiStateTracker.java b/src/main/java/com/google/devtools/build/lib/runtime/UiStateTracker.java
index eef7ea5c64..9e5570e8a7 100644
--- a/src/main/java/com/google/devtools/build/lib/runtime/UiStateTracker.java
+++ b/src/main/java/com/google/devtools/build/lib/runtime/UiStateTracker.java
@@ -526,6 +526,11 @@ final class UiStateTracker {
     return activeActions.computeIfAbsent(actionId, (key) -> new ActionState(action, nanoTimeNow));
   }
 
+  @Nullable
+  private ActionState getActionStateIfPresent(Artifact actionId) {
+    return activeActions.get(actionId);
+  }
+
   void actionStarted(ActionStartedEvent event) {
     Action action = event.getAction();
     Artifact actionId = action.getPrimaryOutput();
@@ -582,7 +587,10 @@ final class UiStateTracker {
     ActionExecutionMetadata action = event.action();
     Artifact actionId = event.action().getPrimaryOutput();
     long now = clock.nanoTime();
-    getActionState(action, actionId, now).onProgressEvent(event, now);
+    ActionState actionState = getActionStateIfPresent(actionId);
+    if (actionState != null) {
+        actionState.onProgressEvent(event, now);
+    }
   }
 
   void actionCompletion(ActionScanningCompletedEvent event) {

@philwo philwo added P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug labels Sep 21, 2021
@philwo
Copy link
Member

philwo commented Sep 21, 2021

@bjacklyn Would you mind sending your patch as a pull request?

@bjacklyn bjacklyn mentioned this issue Nov 24, 2021
9 tasks
bjacklyn pushed a commit to bjacklyn/bazel that referenced this issue Nov 24, 2021
Fixed bazelbuild#13985

UiStateTracker can process ActionProgress events after an ActionCompletion event
has been fired. This has the effect of recreating the action object in the activeActions
Map because Map.computeIfAbsent() is being used to retrieve the action.

Therefore, a new method getActionStateIfPresent is created which will return
the action if it exists, otherwise return null, and actionProgress() uses this
method so as to not inadverntantly recreate actions after they have already
completed and been removed from the Map.

Closes bazelbuild#14020.

PiperOrigin-RevId: 398165993
(cherry picked from commit 639f89d)
meteorcloudy pushed a commit that referenced this issue Nov 25, 2021
Fixed #13985

UiStateTracker can process ActionProgress events after an ActionCompletion event
has been fired. This has the effect of recreating the action object in the activeActions
Map because Map.computeIfAbsent() is being used to retrieve the action.

Therefore, a new method getActionStateIfPresent is created which will return
the action if it exists, otherwise return null, and actionProgress() uses this
method so as to not inadverntantly recreate actions after they have already
completed and been removed from the Map.

Closes #14020.

PiperOrigin-RevId: 398165993
(cherry picked from commit 639f89d)

Co-authored-by: Brandon Jacklyn <brandonjacklyn@gmail.com>
mcy added a commit to mcy/opentitan-1 that referenced this issue Apr 21, 2022
See bazelbuild/bazel#13985

Signed-off-by: Miguel Young de la Sota <mcyoung@google.com>
mcy added a commit to mcy/opentitan-1 that referenced this issue Apr 21, 2022
See bazelbuild/bazel#13985

Signed-off-by: Miguel Young de la Sota <mcyoung@google.com>
mcy added a commit to mcy/opentitan-1 that referenced this issue Apr 21, 2022
See bazelbuild/bazel#13985

Signed-off-by: Miguel Young de la Sota <mcyoung@google.com>
timothytrippel pushed a commit to lowRISC/opentitan that referenced this issue Apr 23, 2022
See bazelbuild/bazel#13985

Signed-off-by: Miguel Young de la Sota <mcyoung@google.com>
jeoongp pushed a commit to jeoongp/opentitan that referenced this issue May 7, 2022
See bazelbuild/bazel#13985

Signed-off-by: Miguel Young de la Sota <mcyoung@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants