Harden runaway split detection #13262

losipiuk · 2022-07-20T16:06:50Z

Improve detection of runaway splits and the related task-killing code to
ensure that we do not kill a thread which we suppose is hung, but which moved on
to execute on behalf of another query just before we issued the kill command.

Description

Bugfix (for a very low-probability race condition)

Related issues, pull requests, and links

Improvement on top of #12392

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

losipiuk · 2022-07-20T16:07:37Z

cc: @arhimondr @phd3 @leetcode-1533

leetcode-1533 · 2022-07-20T19:25:23Z

This fixes a regression from PR #12392.

When I moved from directly interrupting the thread to failing the associated task, the communication between TaskExecutor's TaskRunner class and the interrupting thread got removed.

martint · 2022-07-20T19:32:26Z

core/trino-main/src/main/java/io/trino/execution/SqlTaskManager.java

+                        // described by RunningSplitInfo.
+                        // There is still a chance that we may observe the stacktrace from execution of new split before thread name is set in io.trino.execution.executor.TaskExecutor.TaskRunner.run()
+                        // Yet, we assume that such stacktrace would not be classified as "stack" by stuckSplitStackTracePredicate.
+                        boolean splitAssignmentDidNotChange = splitInfo.getThread().getName().startsWith(splitInfo.getThreadId());


Relying on thread names is very brittle. There's no contract for thread names, no guarantee they are going to be the same or different for a given set of conditions, etc.

There are other possible ways to do this. Some random ideas:

* Create our own version of Thread that contains additional information that we can inspect
* Drive the process of finding stuck threads from the list of all running splits instead of trying to infer the split by inspecting the threads
* Maintain an external mapping of split <-> thread

Also, for any approach that relies on looking at the thread and then checking if it corresponds to the split we want to kill, there will always be a small window for a race condition unless we force proper synchronization of these checks with split-to-thread assignments.
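
The first idea above could look roughly like the following sketch. `SplitAwareThread` and all of its methods are hypothetical names used only for illustration and are not part of Trino; the point is that the split identity lives in a typed field rather than being encoded in the thread name. A check such as `isRunning` still leaves the small window described above between the check and the action unless it is synchronized with split assignment.

```java
import java.util.Optional;

// Hypothetical illustration only; this class does not exist in Trino.
final class SplitAwareThread
        extends Thread
{
    // Written by the runner thread itself, read by the monitoring thread.
    private volatile String currentSplitId;

    SplitAwareThread(Runnable runnable)
    {
        super(runnable);
    }

    // Called by the runner when it starts processing a split.
    void startSplit(String splitId)
    {
        currentSplitId = splitId;
    }

    // Called by the runner when it is done with the current split.
    void finishSplit()
    {
        currentSplitId = null;
    }

    Optional<String> currentSplitId()
    {
        return Optional.ofNullable(currentSplitId);
    }

    // True only if the thread is assigned to the expected split at the moment
    // of the call. The answer may be stale by the time the caller acts on it.
    boolean isRunning(String expectedSplitId)
    {
        return expectedSplitId.equals(currentSplitId);
    }
}
```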

losipiuk · 2022-07-21

Yeah, I agree this is somewhat brittle. In my defence, there is a test which (at least partially) verifies it works ;).
A simple change, keeping the same principle, would be to have a thread <-> split-execution-id mapping, where split-execution-id is incremented any time the split assigned to a thread changes.

And yeah, there is still a race-condition window. I mention that in the comment.
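
A minimal sketch of the split-execution-id idea, assuming a hypothetical `SplitExecutionIds` helper (not an existing Trino class): the runner bumps a per-thread counter whenever it picks up a new split, and the detector compares the id it observed earlier with the current one before acting.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only; names are assumptions, not Trino APIs.
final class SplitExecutionIds
{
    private final ConcurrentHashMap<Thread, AtomicLong> idsByThread = new ConcurrentHashMap<>();

    // Called by the runner each time the thread is assigned a new split.
    // Returns the new split-execution-id for that thread.
    long splitAssigned(Thread thread)
    {
        return idsByThread.computeIfAbsent(thread, ignored -> new AtomicLong()).incrementAndGet();
    }

    // Current split-execution-id of the thread, or 0 if it never ran a split.
    long current(Thread thread)
    {
        AtomicLong id = idsByThread.get(thread);
        return id == null ? 0 : id.get();
    }

    // True if the thread has not been assigned another split since observedId
    // was read. The check and the subsequent kill are still not atomic.
    boolean assignmentUnchanged(Thread thread, long observedId)
    {
        return current(thread) == observedId;
    }
}
```

Compared with a timestamp in the thread name, the counter leaves thread names stable, but it does not close the window between the check and the kill.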

martint · 2022-07-20T19:33:29Z

core/trino-main/src/main/java/io/trino/execution/executor/TaskExecutor.java

@@ -479,7 +479,7 @@ public void run()
                        return;
                    }

-                    String threadId = split.getTaskHandle().getTaskId() + "-" + split.getSplitId();
+                    String threadId = split.getTaskHandle().getTaskId() + "-" + split.getSplitId() + "-" + System.nanoTime();


This may negatively impact tools that gather thread dumps and attempt to group threads by name for analysis purposes.

arhimondr · 2022-07-20T23:46:40Z

@martint @losipiuk Since the right fix is rather complex, I wonder whether it makes sense to fix it at all. In the worst case, what could happen is that, with a very tiny probability, a task running stuck splits other than those processing regular expressions with JONI may get killed. Somehow it doesn't feel like a big deal.

leetcode-1533 · 2022-07-21T00:06:15Z

Can we take a look at #13272?

Basically, the lock maintains a lookup from each thread to the split that the thread is processing.
So if, based on certain predicates (such as the split's wall time or the stack trace of the split's thread), the code decides to fail the split's task (by calling the callback function), it will only do so while the thread is still processing THE SPLIT.

There are some drawbacks: the interrupting thread now holds the lock for the entire "failTask()" call, and therefore for all the callback functions that run during the state transition. We decided to do so because we want to make sure we are failing the task instead of directly interrupting the thread. But it will only hold the lock that long if the split is indeed identified as "stuck", which is rare.

I also want to highlight that, because TaskExecutor is a dependency of SqlTaskManager, I have to let SqlTaskManager pass predicates and consumers into TaskExecutor so that it can process its runningSplits.
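
To illustrate the idea in plain Java (this is not the code from #13272; `GuardedSplitRegistry` and its method names are assumptions made for this sketch): split assignment and the fail decision take the same lock, so the callback can only run while the thread is still assigned to the split that was judged stuck. The predicate and consumer are passed in by the caller, mirroring the dependency direction described above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical illustration only; not the implementation from #13272.
final class GuardedSplitRegistry
{
    private final Object lock = new Object();
    private final Map<Thread, String> splitByThread = new HashMap<>();

    // Called by the runner thread when it picks up a split.
    void splitStarted(Thread thread, String splitId)
    {
        synchronized (lock) {
            splitByThread.put(thread, splitId);
        }
    }

    // Called by the runner thread when it is done with the split.
    void splitFinished(Thread thread, String splitId)
    {
        synchronized (lock) {
            splitByThread.remove(thread, splitId);
        }
    }

    // Invokes failTask only if the thread is still assigned to splitId and the
    // predicate classifies the split as stuck. The lock is held for the whole
    // callback, which is the drawback mentioned above.
    boolean failIfStillRunning(Thread thread, String splitId, Predicate<String> isStuck, Consumer<String> failTask)
    {
        synchronized (lock) {
            if (!splitId.equals(splitByThread.get(thread)) || !isStuck.test(splitId)) {
                return false;
            }
            failTask.accept(splitId);
            return true;
        }
    }
}
```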

losipiuk · 2022-07-21T09:50:21Z

> @martint @losipiuk Since the right fix is rather complex, I wonder whether it makes sense to fix it at all. In the worst case, what could happen is that, with a very tiny probability, a task running stuck splits other than those processing regular expressions with JONI may get killed. Somehow it doesn't feel like a big deal.

Surely it is not a big deal, but it triggers my OCD :) If we are not going to pursue a proper fix, I prefer what you proposed when we talked about it yesterday: drop the stack trace analysis altogether and just kill any task where split processing takes more than 10 minutes.
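
The simpler alternative amounts to a pure wall-time check. A rough sketch, where `WallTimeOnlyPolicy` is a hypothetical name and the 10-minute value is passed in by the caller rather than taken from any real Trino configuration:

```java
import java.time.Duration;

// Hypothetical illustration only; not an existing Trino class.
final class WallTimeOnlyPolicy
{
    private final Duration threshold;

    WallTimeOnlyPolicy(Duration threshold)
    {
        this.threshold = threshold;
    }

    // Decides purely on elapsed time, with no stack trace inspection.
    // Both arguments are expected to come from System.nanoTime().
    boolean shouldFailTask(long splitStartNanos, long nowNanos)
    {
        return nowNanos - splitStartNanos > threshold.toNanos();
    }
}
```

With `new WallTimeOnlyPolicy(Duration.ofMinutes(10))`, any split running longer than ten minutes would cause its task to be failed, regardless of what the thread is doing.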

losipiuk · 2022-07-21T10:21:47Z

Convinced this is not the way to go. Closing.

phd3 · 2022-07-21T19:58:58Z

superseded by #13272

cla-bot bot added the cla-signed label Jul 20, 2022

losipiuk requested review from phd3 and arhimondr July 20, 2022 16:07

losipiuk added 2 commits July 20, 2022 18:14

Make thread name for split runner threads depend on time

5684f90

Harden runaway split detection

d55973a

Improve detection of runaway splits and the related task-killing code to ensure that we do not kill a thread which we suppose is hung, but which moved on to execute on behalf of another query just before we issued the kill command.

losipiuk force-pushed the lo/harden-runnaway-split-detection branch from e63e97a to d55973a Compare July 20, 2022 16:14

arhimondr approved these changes Jul 20, 2022

martint reviewed Jul 20, 2022

leetcode-1533 mentioned this pull request Jul 21, 2022

Use lock to ensure consistency during runaway split detection #13272

Closed

losipiuk closed this Jul 21, 2022
