Enable Trino to detect and fail queries that have tasks stuck in long running JONI parsing. #12392

Merged
merged 2 commits into from
Jul 20, 2022

Conversation

leetcode-1533
Contributor

@leetcode-1533 leetcode-1533 commented May 14, 2022

Description

Enable Trino to detect and fail tasks that are stuck in long running JONI parsing.

Is this change a fix, improvement, new feature, refactoring, or other?

Fix

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Query Engine

How would you describe this change to a non-technical end user or system administrator?

Trino is a time-shared, multi-tenant system. For context switching, Trino relies on thread pool workers cooperatively yielding to the scheduling logic. Unlike an operating system, it has no hard interrupt signal that forces a process to context switch.

Most split processing finishes within a relatively short interval, but JONI processing is an exception.

Furthermore, some Trino features, such as gathering statistics, rely on a callback (splitFinished(), executed after the yield) that runs after the context switch. Because no context switch happens in this case, Trino cannot gather accurate CPU usage for that split.

This PR allows Trino to detect and fail tasks that are stuck in long-running JONI parsing.
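
To make the cooperative model concrete, here is a minimal Java sketch. It is not Trino's TaskExecutor; the names `CooperativeSlicing`, `Step`, and `processFor` are hypothetical and exist only for illustration. The runner consults the clock only between steps, so a single step that does not return (for example a very long regular-expression match) keeps the thread well past its quantum:

```java
import java.util.Queue;
import java.util.function.LongSupplier;

// Hypothetical illustration of cooperative time slicing; not Trino code.
final class CooperativeSlicing
{
    /** A unit of work that runs to completion once started; it cannot be preempted. */
    interface Step
    {
        void run();
    }

    /**
     * Runs queued steps until the quantum is used up, then returns so the caller
     * can schedule something else. The clock is only read between steps, so one
     * slow step overruns the quantum by however long that step takes.
     *
     * @return time actually spent in this slice, in the clock's units
     */
    static long processFor(Queue<Step> steps, long quantum, LongSupplier clock)
    {
        long start = clock.getAsLong();
        while (!steps.isEmpty() && clock.getAsLong() - start < quantum) {
            steps.poll().run();
        }
        return clock.getAsLong() - start;
    }
}
```

With a quantum of 1000 time units and steps that each take 400 units, the slice yields after the third step. A single step that takes 600000 units holds the thread for all 600000 units, because nothing can force it to yield.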

Documentation

( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

Enable Trino to detect and fail queries that have tasks stuck in long running JONI parsing.

@cla-bot

cla-bot bot commented May 14, 2022

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: yluan.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check whether your git client is configured with an email to sign commits: git config --list | grep email
  2. If not, set it up using: git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@github-actions github-actions bot added the docs label May 14, 2022
@leetcode-1533 leetcode-1533 changed the title Backport PrestoDB 16111 and add back changes from 74b79aaba2b7654d2ce… Enable TaskExecutor to detect and interrupt run away task runner threads: Backport PrestoDB 16111 and add back changes from 74b79aaba2b7654d2ce… May 14, 2022
@leetcode-1533 leetcode-1533 changed the title Enable TaskExecutor to detect and interrupt run away task runner threads: Backport PrestoDB 16111 and add back changes from 74b79aaba2b7654d2ce… Enable TaskExecutor to detect and interrupt run away task runner threads: Backport PrestoDB 16111 May 14, 2022
@cla-bot cla-bot bot added the cla-signed label May 14, 2022
@cla-bot cla-bot bot removed the cla-signed label May 14, 2022
@cla-bot cla-bot bot added the cla-signed label May 14, 2022
@findepi findepi changed the title Enable TaskExecutor to detect and interrupt run away task runner threads: Backport PrestoDB 16111 Enable TaskExecutor to detect and interrupt run away task runner threads May 16, 2022
Member

@losipiuk losipiuk left a comment

A few editorial comments, but my major concern is that there is no synchronization between the code that interrupts split runner threads and the threads themselves. It very much looks like we may interrupt a thread that has already moved on to another split.

@@ -69,6 +69,7 @@

private Duration statusRefreshMaxWait = new Duration(1, TimeUnit.SECONDS);
private Duration infoUpdateInterval = new Duration(3, TimeUnit.SECONDS);
private Duration interruptRunawaySplitsTimeout = new Duration(600, TimeUnit.SECONDS);

Member

We also have a warning about the runaway splits, currently hardcoded. IMO we should validate the two thresholds are coherent (i.e. warning <= timeout), which would require making the warning configurable too

Contributor Author

Got it, I will make "LONG_SPLIT_WARNING_THRESHOLD" configurable.

Contributor Author

@leetcode-1533 leetcode-1533 May 20, 2022

Agreed, we should enforce warning <= timeout: if a split's thread has been running longer than the timeout, the system has already interrupted the split, so a later warning would never fire.

Contributor Author

I also feel that "/v1/maxActiveSplits" is unnecessary, since this PR prints the runaway splits in the log.
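
The coherence check discussed in this thread (warning <= timeout) could be sketched as below. This is an illustration only: it uses `java.time.Duration` instead of the `Duration` type that appears in the diff, and the class and field names are hypothetical rather than taken from the PR.

```java
import java.time.Duration;

// Hypothetical holder for the two thresholds; names are illustrative only.
final class RunawaySplitThresholds
{
    final Duration warningThreshold;
    final Duration interruptTimeout;

    RunawaySplitThresholds(Duration warningThreshold, Duration interruptTimeout)
    {
        // A warning scheduled after the interrupt would never be reached,
        // so that combination is rejected up front.
        if (warningThreshold.compareTo(interruptTimeout) > 0) {
            throw new IllegalArgumentException(
                    "warning threshold " + warningThreshold + " must not exceed interrupt timeout " + interruptTimeout);
        }
        this.warningThreshold = warningThreshold;
        this.interruptTimeout = interruptTimeout;
    }
}
```

Validating at construction time means an incoherent pair fails at startup instead of silently disabling the warning.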

MultilevelSplitQueue splitQueue,
Ticker ticker)
{
checkArgument(runnerThreads > 0, "runnerThreads must be at least 1");
checkArgument(guaranteedNumberOfDriversPerTask > 0, "guaranteedNumberOfDriversPerTask must be at least 1");
checkArgument(maximumNumberOfDriversPerTask > 0, "maximumNumberOfDriversPerTask must be at least 1");
checkArgument(guaranteedNumberOfDriversPerTask <= maximumNumberOfDriversPerTask, "guaranteedNumberOfDriversPerTask cannot be greater than maximumNumberOfDriversPerTask");
checkArgument(interruptRunawaySplitsTimeout.getValue(TimeUnit.SECONDS) >= 1.0, "interruptRunawaySplitsTimeout must be at least 1 second");

Member

is 1s based off of SPLIT_RUN_QUANTA? if so let's just use that

Contributor Author

ok.

@leetcode-1533
Contributor Author

leetcode-1533 commented May 19, 2022

Hey, I found that this issue is very similar to #7213, in which we want to limit the total time for SQLQueryExecution. In that case the thread is reused between queries; in our case the thread is reused between tasks. In both cases, I abstracted the following programming pattern:

  1. When assigning a task to the thread:

AtomicReference<Thread> sharedThread = new AtomicReference<>(Thread.currentThread());

  2. After step 1, when processing:

try {
	// process the assigned task on the shared thread
}
finally {
	synchronized (executorObject) {
		sharedThread.set(null);
		// clear any pending interrupt so it does not leak into the next task
		Thread.interrupted();
	}
}

  3. When interrupting:

if (canInterrupt()) {
	synchronized (executorObject) {
		// Double-check once inside the critical section, to avoid a race in the
		// short interval between the first check and entering the critical section.
		if (canInterrupt()) {
			Thread thread = sharedThread.get();
			if (thread != null) {
				thread.interrupt();
			}
		}
	}
}

From my personal understanding, the AtomicReference could also be replaced by a volatile field.
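
A runnable version of this pattern might look like the sketch below. It is not the code merged in this PR: `InterruptibleSlot`, `interruptIfRunning`, and `demo` are hypothetical names. The sketch also makes two assumptions of its own. It records which task is registered, so an interrupt aimed at an earlier task cannot hit a thread that has already moved on. And because every access happens under the same lock, it uses plain fields rather than an AtomicReference or a volatile field.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of the register / interrupt / unregister pattern.
final class InterruptibleSlot
{
    private final Object lock = new Object();
    // Both fields are only read and written while holding `lock`.
    private Thread thread;
    private Runnable currentTask;

    /** Runs one task on the calling (pooled) thread. */
    void run(Runnable task)
    {
        synchronized (lock) {
            thread = Thread.currentThread();
            currentTask = task;
        }
        try {
            task.run();
        }
        finally {
            synchronized (lock) {
                // Unregister first, then clear any interrupt already delivered,
                // so the flag cannot leak into the next task on this thread.
                thread = null;
                currentTask = null;
                Thread.interrupted();
            }
        }
    }

    /** Interrupts the runner only if it is still executing the given task. */
    boolean interruptIfRunning(Runnable task)
    {
        synchronized (lock) {
            if (thread == null || currentTask != task) {
                return false;
            }
            thread.interrupt();
            return true;
        }
    }

    /** Small self-check: returns true if the pattern behaved as described. */
    static boolean demo()
    {
        InterruptibleSlot slot = new InterruptibleSlot();
        CountDownLatch started = new CountDownLatch(1);
        boolean[] flagLeaked = {true};
        // Stand-in for a stuck split: spins until its thread is interrupted.
        Runnable stuck = () -> {
            started.countDown();
            while (!Thread.currentThread().isInterrupted()) {
                Thread.onSpinWait();
            }
        };
        Thread worker = new Thread(() -> {
            slot.run(stuck);
            flagLeaked[0] = Thread.currentThread().isInterrupted();
        });
        worker.start();
        try {
            started.await();
            boolean sent = slot.interruptIfRunning(stuck);
            worker.join();
            boolean stale = slot.interruptIfRunning(stuck);
            return sent && !stale && !flagLeaked[0];
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Calling `InterruptibleSlot.demo()` exercises the three steps: the first interrupt reaches the stuck task, the interrupt flag is cleared before the thread is reused, and a second interrupt for the finished task is ignored.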

@leetcode-1533 leetcode-1533 changed the title Enable TaskExecutor to detect and interrupt run away task runner threads Enable Trino to detect and fail query that has tasks stuck in long running JONI parsing. Jul 13, 2022
@leetcode-1533 leetcode-1533 changed the title Enable Trino to detect and fail query that has tasks stuck in long running JONI parsing. Enable Trino to detect and fail queries that have tasks stuck in long running JONI parsing. Jul 13, 2022
@leetcode-1533 leetcode-1533 force-pushed the tempBranch branch 3 times, most recently from 14a554d to 8c35f16 Compare July 13, 2022 04:35
Contributor

@jhlodin jhlodin left a comment

Docs LGTM, just need to word wrap at 80 characters

docs/src/main/sphinx/admin/properties-task.rst (outdated)
Member

@phd3 phd3 left a comment

lgtm, mostly nits

Contributor

@arhimondr arhimondr left a comment

Generally looks good to me % nits.

Please update the commit message according to the guideline: https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#format-git-commit-messages

implements Comparable<RunningSplitInfo>
{
private final long startTime;
private final String threadId;
private final Thread thread;
private boolean printed;
private final PrioritizedSplitRunner split;

Contributor

nit: Store the task id and split info directly.

Contributor

Please respond (let me know if you missed it or not going to address)

Contributor

Discussed offline. The split info cannot be cached as it changes as the split execution progresses.

@leetcode-1533 leetcode-1533 force-pushed the tempBranch branch 2 times, most recently from adb169e to 93e5f09 Compare July 18, 2022 23:54
@leetcode-1533
Contributor Author

leetcode-1533 commented Jul 19, 2022

Hi, I have addressed all the comments.

I also changed the minimum value for the configs. Because the check is asynchronous, a wall-time-based check is inaccurate when the timeout is on a similar scale to the time quota. It can only detect splits that run significantly longer than the split's 1 second time quota.
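
A wall-time detection pass of the kind described here might be sketched as follows. This is an assumption-laden illustration, not the merged implementation: `RunawayScan` and `findRunaway` are hypothetical names, and the map of split ids to start times stands in for whatever bookkeeping the executor actually keeps. Since such a scan presumably runs on its own schedule, a split is only flagged at the first scan after it crosses the timeout, which fits the observation that timeouts close to the 1 second quota cannot be detected reliably.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical detection pass; names and data layout are illustrative only.
final class RunawayScan
{
    /** Returns ids of splits whose wall time exceeds the timeout as of nowNanos. */
    static List<String> findRunaway(Map<String, Long> startNanosBySplit, long nowNanos, long timeoutNanos)
    {
        List<String> runaway = new ArrayList<>();
        for (Map.Entry<String, Long> entry : startNanosBySplit.entrySet()) {
            // Wall time, not CPU time: the split has not yielded, so no
            // per-split CPU accounting is available yet.
            if (nowNanos - entry.getValue() > timeoutNanos) {
                runaway.add(entry.getKey());
            }
        }
        return runaway;
    }
}
```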

Contributor

@arhimondr arhimondr left a comment

LGTM % comments

@leetcode-1533 leetcode-1533 force-pushed the tempBranch branch 2 times, most recently from c8d31e7 to 89d7dee Compare July 19, 2022 19:50
@phd3 phd3 merged commit 72ac665 into trinodb:master Jul 20, 2022
@github-actions github-actions bot added this to the 391 milestone Jul 20, 2022
@phd3
Member

phd3 commented Jul 20, 2022

Merged, thanks!
