Fix memory leak when query fails with permission error #12270

Merged
viczhang861 merged 2 commits into prestodb:master from the cleanFailedQuery branch on Feb 25, 2019


Contributor

@viczhang861 viczhang861 commented Jan 28, 2019

When a permission error is thrown during creation of the QueryExecution object,
the transaction metadata is not marked as inactive, resulting in a memory leak.

The existing error handling via TransactionMetadata::fail does not work
when the transaction ID in the session object is empty.

A heap dump was taken to verify that the previously leaked objects become
unreachable from GC roots after this fix.
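To make the failure mode concrete, here is a minimal, self-contained model of it. It assumes a toy registry in place of Presto's real TransactionManager; `ToyTransactionRegistry` and `LeakScenario` are hypothetical names, not Presto classes. The point it illustrates is that a failure handler keyed off the session's `Optional` transaction ID does nothing when that `Optional` is empty, so the transaction that was actually begun stays registered.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a transaction registry; not Presto's TransactionManager.
final class ToyTransactionRegistry {
    private final Map<Long, String> activeTransactions = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong();

    // Begins a transaction and keeps its metadata until fail() removes it.
    long begin() {
        long id = nextId.incrementAndGet();
        activeTransactions.put(id, "metadata for transaction " + id);
        return id;
    }

    void fail(long transactionId) {
        activeTransactions.remove(transactionId);
    }

    int activeCount() {
        return activeTransactions.size();
    }
}

final class LeakScenario {
    // Models the reported bug: a transaction is begun while the query is being created,
    // but the failure handler only consults the session's transaction ID.
    static void createQuery(ToyTransactionRegistry registry, Optional<Long> sessionTransactionId) {
        registry.begin();
        try {
            // simulated permission error thrown while creating the query execution
            throw new SecurityException("Access Denied");
        }
        catch (SecurityException e) {
            // ifPresent is a no-op for Optional.empty(), so nothing is cleaned up
            sessionTransactionId.ifPresent(registry::fail);
        }
    }
}
```

Calling `LeakScenario.createQuery(registry, Optional.empty())` leaves one entry in the registry, which is the model's equivalent of the leaked transaction metadata.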

@viczhang861 viczhang861 self-assigned this Jan 28, 2019
@mbasmanova mbasmanova requested review from wenleix and removed request for wenleix January 28, 2019 20:31
@viczhang861 viczhang861 changed the title Fix memory leak when query fails with permission error [WIP] Fix memory leak when query fails with permission error Jan 28, 2019
Contributor

@elonazoulay elonazoulay left a comment

Looks good, can you add a test?

Member

@arhimondr arhimondr left a comment

@picrinite Looks good, thanks for figuring this out.

@elonazoulay Yeah, I think we should add a test. What we could try is to create a new integration test class with a distributed query runner, run different scenarios against that query runner, and check that the transaction manager is empty after each scenario.
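The scenario-based test proposed above can be sketched roughly as follows. This is an in-memory model only: `ScenarioRunner` stands in for a distributed query runner and `TransactionLeakCheck` for the test class, and both names are hypothetical. Each scenario runs inside an auto-commit transaction; expected query failures are swallowed, and after every scenario the harness records any transactions that are still active.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a distributed query runner; not a Presto class.
final class ScenarioRunner {
    private final Set<Long> activeTransactions = ConcurrentHashMap.newKeySet();
    private final AtomicLong nextId = new AtomicLong();

    // Runs one query body inside an auto-commit transaction and always releases it.
    void execute(Runnable queryBody) {
        long id = nextId.incrementAndGet();
        activeTransactions.add(id);
        try {
            queryBody.run();
        }
        finally {
            activeTransactions.remove(id); // cleanup must happen on success and on failure
        }
    }

    int activeTransactionCount() {
        return activeTransactions.size();
    }
}

final class TransactionLeakCheck {
    // Runs every scenario and returns the scenarios after which the runner
    // still held at least one active transaction (empty map = no leaks).
    static Map<String, Integer> runAll(ScenarioRunner runner, Map<String, Runnable> scenarios) {
        Map<String, Integer> leaks = new LinkedHashMap<>();
        scenarios.forEach((name, body) -> {
            try {
                runner.execute(body);
            }
            catch (RuntimeException expectedQueryFailure) {
                // failing scenarios are expected; only leftover transactions matter here
            }
            int remaining = runner.activeTransactionCount();
            if (remaining != 0) {
                leaks.put(name, remaining);
            }
        });
        return leaks;
    }
}
```

In a real integration test the scenarios would be SQL statements (a successful query, one failing with a permission error, one failing with a semantic error), and the check would query the transaction manager instead of a toy set.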

queryStateMachine.addStateChangeListener(newState -> {
    QUERY_STATE_LOG.debug("Query %s is %s", queryStateMachine.getQueryId(), newState);
    // mark finished or failed transaction as inactive
    if (newState.isDone()) {
Member

@picrinite Did you check that this listener is triggered when the analysis fails (the case with permission error)?

The last time we discussed it, we were thinking about wrapping the analysis in a try-catch, because we thought that the state change listener might not always be triggered. Is that not the case?
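For comparison, the try-catch alternative mentioned above might look like this minimal sketch. `GuardedQueryFactory`, `createQuery`, and `setInactive` are hypothetical names chosen for illustration, not Presto APIs. The analysis step is passed in as a `Supplier` that may throw (for example, on a permission error); the catch block marks the transaction inactive and then rethrows, so the caller still sees the original failure.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch of wrapping analysis in a try-catch; not a Presto class.
final class GuardedQueryFactory {
    private final Set<Long> activeTransactions = ConcurrentHashMap.newKeySet();

    // Registers the transaction, runs the (possibly failing) analysis, and
    // guarantees the transaction is released if analysis throws.
    <T> T createQuery(long transactionId, Supplier<T> analysis) {
        activeTransactions.add(transactionId);
        try {
            return analysis.get();
        }
        catch (RuntimeException e) {
            setInactive(transactionId); // cleanup on the failure path
            throw e;                    // preserve the original error for the caller
        }
    }

    void setInactive(long transactionId) {
        activeTransactions.remove(transactionId);
    }

    boolean isActive(long transactionId) {
        return activeTransactions.contains(transactionId);
    }
}
```

On success the transaction intentionally stays active here, since normal query completion would release it later. The drawback of this approach is that every code path that can fail before the listener exists needs its own guard.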

Contributor Author

@arhimondr The first commit makes the "setInactive" job safer, but it is not triggered on a permission error.
The second commit guarantees that the "setInactive" job is triggered when there is a permission error, but some tests fail and I don't know how to reproduce them. Help appreciated!
Regarding an automated test: cleanup in the transaction manager is an asynchronous job with a delay (the current default is 5 minutes), so it is not deterministic.

Contributor Author

I have added an integration test; it fails without the current fix.

@viczhang861 viczhang861 force-pushed the cleanFailedQuery branch 2 times, most recently from 3efda78 to bced22d January 29, 2019 21:06
@viczhang861 viczhang861 changed the title [WIP] Fix memory leak when query fails with permission error Fix memory leak when query fails with permission error Jan 29, 2019
@viczhang861 viczhang861 force-pushed the cleanFailedQuery branch 2 times, most recently from d287858 to 28d7fbf February 4, 2019 18:29
Contributor

@nezihyigitbasi nezihyigitbasi left a comment

Merge StateChangeListener for QueryStateMachine

  • IIUC, this commit just moves the listener registration into QueryStateMachine, right after the state machine is created. If that's correct, we can update the commit title to "Register state change listener in QueryStateMachine" or something similar.
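The pattern described in this commit can be modeled in a few lines. This is a simplified sketch under stated assumptions: `ToyQueryStateMachine` and `ToyQueryState` are hypothetical, and unlike Presto's state machines (which notify listeners through an executor) this toy invokes listeners synchronously on the calling thread. The key idea is that the cleanup listener is attached inside the factory method, before the caller can run any step that might throw, so a query that fails during creation still marks its transaction inactive.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical query states; isDone() mirrors the check used in the listener.
enum ToyQueryState {
    QUEUED, RUNNING, FINISHED, FAILED;

    boolean isDone() {
        return this == FINISHED || this == FAILED;
    }
}

// Hypothetical, simplified state machine; not Presto's QueryStateMachine.
final class ToyQueryStateMachine {
    private final List<Consumer<ToyQueryState>> listeners = new CopyOnWriteArrayList<>();
    private volatile ToyQueryState state = ToyQueryState.QUEUED;
    private volatile Throwable failureCause;

    private ToyQueryStateMachine() {}

    // Factory: registers the transaction cleanup listener immediately after creation.
    static ToyQueryStateMachine begin(long transactionId, Set<Long> activeTransactions) {
        ToyQueryStateMachine machine = new ToyQueryStateMachine();
        activeTransactions.add(transactionId);
        machine.addStateChangeListener(newState -> {
            if (newState.isDone()) {
                activeTransactions.remove(transactionId); // mark finished or failed transaction inactive
            }
        });
        return machine;
    }

    void addStateChangeListener(Consumer<ToyQueryState> listener) {
        listeners.add(listener);
    }

    // Listeners run synchronously on the calling thread in this sketch.
    void transitionTo(ToyQueryState newState) {
        state = newState;
        for (Consumer<ToyQueryState> listener : listeners) {
            listener.accept(newState);
        }
    }

    void transitionToFailed(Throwable cause) {
        failureCause = cause;
        transitionTo(ToyQueryState.FAILED);
    }

    ToyQueryState getState() {
        return state;
    }

    Throwable getFailureCause() {
        return failureCause;
    }
}
```

Because registration happens in the same place that creates the state machine, no caller can forget it or fail before reaching it, which is why this generalizes beyond permission errors.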

Contributor

@nezihyigitbasi nezihyigitbasi left a comment

Fix memory leak when query fails with permission error

  • The commit message only mentions permission errors, but the fix is generic, so I think we can make the commit title more generic.

@@ -496,6 +489,11 @@ public ThreadPoolExecutorMBean getManagementExecutor()
return queryManagementExecutorMBean;
}

public TransactionManager getTransactionManager()
Contributor

Annotate with @VisibleForTesting.

@@ -56,6 +60,30 @@ public void tearDown()
queryRunner = null;
}

@Test(timeOut = 100_000L)
Contributor

Why not 60s? 100s is a bit on the long side.

QueryInfo queryInfo = sqlQueryManager.getFullQueryInfo(queryId);
assertEquals(queryInfo.getState(), FAILED);
assertNotNull(queryInfo.getFailureInfo());
TimeUnit.MILLISECONDS.sleep(50);
Contributor

Why is there a sleep call here? Does the correctness of the test depend on this call?

Contributor

Can we get a future for that async metadata cleanup job and wait until it's done? Sleeping is fragile; this test can fail arbitrarily depending on how the async job and the thread running this test interleave.
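If the cleanup were asynchronous, the reviewer's suggestion could be modeled as below: the cleanup method returns a `CompletableFuture`, and the test blocks on it with a bounded timeout instead of sleeping for a guessed duration. `AsyncTransactionCleaner`, `cleanUpAsync`, and `awaitCompletion` are hypothetical names for this sketch, not Presto APIs.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical async cleanup job that exposes its completion as a future.
final class AsyncTransactionCleaner implements AutoCloseable {
    private final Set<Long> activeTransactions = ConcurrentHashMap.newKeySet();
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    void register(long transactionId) {
        activeTransactions.add(transactionId);
    }

    // Schedules removal on a background thread; the future completes once it has run.
    CompletableFuture<Void> cleanUpAsync(long transactionId) {
        return CompletableFuture.runAsync(() -> activeTransactions.remove(transactionId), executor);
    }

    boolean isActive(long transactionId) {
        return activeTransactions.contains(transactionId);
    }

    // Blocks until the future completes, failing fast instead of hanging if it never does.
    static void awaitCompletion(CompletableFuture<?> future, long timeoutSeconds) {
        try {
            future.get(timeoutSeconds, TimeUnit.SECONDS);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for cleanup", e);
        }
        catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("cleanup did not complete", e);
        }
    }

    @Override
    public void close() {
        executor.shutdownNow();
    }
}
```

Waiting on the future gives a happens-before edge between the cleanup and the assertion, so the check no longer depends on thread scheduling.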

Contributor Author

@nezihyigitbasi After further investigation with @arhimondr, we confirmed that sleep() is not necessary because the cleanup is synchronous. I have removed it from the test.

Register the transaction cleanup job immediately after the QueryStateMachine
object is created, so that cleanup still runs when query creation fails.
Member

@arhimondr arhimondr left a comment

LGTM % comment

When a semantic error (including a permission error) is thrown
during creation of the SqlQueryExecution object, the transaction
metadata is not cleaned up, resulting in a memory leak.

The current error handling via TransactionMetadata::fail does not
work if there is no existing transaction ID in the session object.
@viczhang861 viczhang861 merged commit b7d3c85 into prestodb:master Feb 25, 2019
@viczhang861 viczhang861 deleted the cleanFailedQuery branch February 25, 2019 18:57
@viczhang861 viczhang861 restored the cleanFailedQuery branch February 26, 2019 23:04
@viczhang861 viczhang861 deleted the cleanFailedQuery branch March 4, 2019 15:52
natashasehgal added a commit to natashasehgal/presto that referenced this pull request Feb 5, 2025
Summary: Pull Request resolved: facebookincubator/velox#12270

Differential Revision: D69208299
5 participants