-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Harden runaway split detection #13262
Harden runaway split detection #13262
Conversation
Improve detection of runaway splits and related task killing code to ensure that we do not kill a thread which we suppose hung, but moved to execute on behalf of another query, just before we issue kill command.
e63e97a
to
d55973a
Compare
This is fixing a regression from the PR: #12392. When I moved from directly interrupting the thread to failing the associate task. The communication between TaskExecutor's TaskRunner class and the interrupting thread got removed. |
// described by RunningSplitInfo. | ||
// There is still a chance that we may observe the stacktrace from execution of new split before thread name is set in io.trino.execution.executor.TaskExecutor.TaskRunner.run() | ||
// Yet, we assume that such stacktrace would not be classified as "stack" by stuckSplitStackTracePredicate. | ||
boolean splitAssignmentDidNotChange = splitInfo.getThread().getName().startsWith(splitInfo.getThreadId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relying on thread names is very brittle. There's no contract for thread names, no guarantee they are going to be the same or different for a given set of conditions, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are other possible ways to do this. Some random ideas:
- Create our own version of Thread that contains additional information that we can inspect
- Drive the process of finding stuck threads from the list of all running splits instead of trying to infer the split by inspecting the threads
- Maintain an external mapping of split <-> thread
Also, for any approach that relies on looking at the thread and then checking if it corresponds to the split we want to kill, there will always be a small window for a race condition unless we force proper synchronization of these checks with split-to-thread assignments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I agree this is somewhat brittle. For my defence - there is test which (at least partially) verifies it works ;).
The simple change to that, keeping the same solution principle, would be to have thread <-> split-execution-id mapping, where split-execution-id would be incremented any time split assigned to a thread changes.
And yeah - there is a race-condition window still - I mention that in the comment.
@@ -479,7 +479,7 @@ public void run() | |||
return; | |||
} | |||
|
|||
String threadId = split.getTaskHandle().getTaskId() + "-" + split.getSplitId(); | |||
String threadId = split.getTaskHandle().getTaskId() + "-" + split.getSplitId() + "-" + System.nanoTime(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may negatively impact tools that gather thread dumps and attempt to group threads by name for analysis purposes.
@martint @losipiuk Since the right fix is rather complex I wonder whether it makes sense to fix it at all. In the worst case what could happen is that with a very tiny probability a task running stuck splits other than processing regular expressions with JONI may get killed. Somehow It doesn't feel like a big deal. |
Can we take a look? Basically what the lock does is maintaining a lookup for thread-> split that the thread processing. There are some drawbacks: the interrupt thread now will hold the lock for I also want to highlight, due to TaskExectuor is a dependency for SqlTaskManager. I have to let the SqlTaskManager to pass in predicates // consumers for TaskExecutor to process its runningSplits. |
Surely it is not a big deal - but it triggers my OCD :) If we are not to pursue proper fix, I like more what you proposed when we talked about it yesterday, to drop the stack trace analysis at all. And just kill any task where split processing takes more than 10 mins. |
Convinced this not the way to go. Closing. |
superseded by #13272 |
Improve detection of runaway splits and related task killing code to
ensure that we do not kill a thread which we suppose hung, but moved to
execute on behalf of another query, just before we issue kill command.
Description
Bugfix (for a very low probablity race condition)
Related issues, pull requests, and links
Improvement on top of #12392
Documentation
(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
(x) No release notes entries required.
( ) Release notes entries required with the following suggested text: