Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Harden runaway split detection #13262

Closed

Conversation

losipiuk
Copy link
Member

Improve detection of runaway splits and related task killing code to
ensure that we do not kill a thread which we suppose hung, but moved to
execute on behalf of another query, just before we issue kill command.

Description

Bugfix (for a very low probablity race condition)

Related issues, pull requests, and links

Improvement on top of #12392

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Jul 20, 2022
@losipiuk
Copy link
Member Author

cc: @arhimondr @phd3 @leetcode-1533

@losipiuk losipiuk requested review from phd3 and arhimondr July 20, 2022 16:07
losipiuk added 2 commits July 20, 2022 18:14
Improve detection of runaway splits and related task killing code to
ensure that we do not kill a thread which we suppose hung, but moved to
execute on behalf of another query, just before we issue kill command.
@losipiuk losipiuk force-pushed the lo/harden-runnaway-split-detection branch from e63e97a to d55973a Compare July 20, 2022 16:14
@leetcode-1533
Copy link
Contributor

leetcode-1533 commented Jul 20, 2022

This is fixing a regression from the PR: #12392.

When I moved from directly interrupting the thread to failing the associate task. The communication between TaskExecutor's TaskRunner class and the interrupting thread got removed.

// described by RunningSplitInfo.
// There is still a chance that we may observe the stacktrace from execution of new split before thread name is set in io.trino.execution.executor.TaskExecutor.TaskRunner.run()
// Yet, we assume that such stacktrace would not be classified as "stack" by stuckSplitStackTracePredicate.
boolean splitAssignmentDidNotChange = splitInfo.getThread().getName().startsWith(splitInfo.getThreadId());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relying on thread names is very brittle. There's no contract for thread names, no guarantee they are going to be the same or different for a given set of conditions, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are other possible ways to do this. Some random ideas:

  • Create our own version of Thread that contains additional information that we can inspect
  • Drive the process of finding stuck threads from the list of all running splits instead of trying to infer the split by inspecting the threads
  • Maintain an external mapping of split <-> thread

Also, for any approach that relies on looking at the thread and then checking if it corresponds to the split we want to kill, there will always be a small window for a race condition unless we force proper synchronization of these checks with split-to-thread assignments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I agree this is somewhat brittle. For my defence - there is test which (at least partially) verifies it works ;).
The simple change to that, keeping the same solution principle, would be to have thread <-> split-execution-id mapping, where split-execution-id would be incremented any time split assigned to a thread changes.

And yeah - there is a race-condition window still - I mention that in the comment.

@@ -479,7 +479,7 @@ public void run()
return;
}

String threadId = split.getTaskHandle().getTaskId() + "-" + split.getSplitId();
String threadId = split.getTaskHandle().getTaskId() + "-" + split.getSplitId() + "-" + System.nanoTime();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may negatively impact tools that gather thread dumps and attempt to group threads by name for analysis purposes.

@arhimondr
Copy link
Contributor

@martint @losipiuk Since the right fix is rather complex I wonder whether it makes sense to fix it at all. In the worst case what could happen is that with a very tiny probability a task running stuck splits other than processing regular expressions with JONI may get killed. Somehow It doesn't feel like a big deal.

@leetcode-1533
Copy link
Contributor

leetcode-1533 commented Jul 21, 2022

Can we take a look?
#13272.

Basically what the lock does is maintaining a lookup for thread-> split that the thread processing.
So if for certain predicates(such as because of split's walltime, split's reference thread stacktrace), the code decided to fail the split's task(by calling the callback function). The code will only do so, when the thread is still processing THE SPLIT.

There are some drawbacks: the interrupt thread now will hold the lock for
the entire "failTask()",(so holding the lock for all the callback functions during the state transition. We decided to do so because we want to make sure we are failing the task instead of directly interrupting the thread.) But it will only do so if the split is indeed identified as "stuck", which is rare.

I also want to highlight, due to TaskExectuor is a dependency for SqlTaskManager. I have to let the SqlTaskManager to pass in predicates // consumers for TaskExecutor to process its runningSplits.

@losipiuk
Copy link
Member Author

@martint @losipiuk Since the right fix is rather complex I wonder whether it makes sense to fix it at all. In the worst case what could happen is that with a very tiny probability a task running stuck splits other than processing regular expressions with JONI may get killed. Somehow It doesn't feel like a big deal.

Surely it is not a big deal - but it triggers my OCD :) If we are not to pursue proper fix, I like more what you proposed when we talked about it yesterday, to drop the stack trace analysis at all. And just kill any task where split processing takes more than 10 mins.

@losipiuk
Copy link
Member Author

Convinced this not the way to go. Closing.

@losipiuk losipiuk closed this Jul 21, 2022
@phd3
Copy link
Member

phd3 commented Jul 21, 2022

superseded by #13272

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants