
Ensure tasks are killed for fault-tolerant queries when cluster out of memory #11800

Merged
merged 13 commits into trinodb:master from the lo/improve-oom-killing-task-retries branch on Apr 9, 2022

Conversation

losipiuk
Member

@losipiuk losipiuk commented Apr 5, 2022

Description

Ensure that the OOM killer, which triggers when the cluster is out of memory, does not kill the whole query if the query has task-level retries enabled.
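For illustration only, here is a minimal sketch of the intended behavior; RetryPolicy, TaskHandle, killTask and killQuery are hypothetical placeholder names, not the actual Trino classes or methods:

import java.util.Comparator;
import java.util.List;

class OomKillerPolicySketch
{
    enum RetryPolicy { NONE, TASK }

    record TaskHandle(String taskId, long memoryReservationBytes) {}

    void onClusterOutOfMemory(String queryId, RetryPolicy retryPolicy, List<TaskHandle> queryTasks)
    {
        if (retryPolicy == RetryPolicy.TASK) {
            // Fault-tolerant query: kill only the most memory-hungry task so the
            // engine can retry it, instead of failing the whole query.
            queryTasks.stream()
                    .max(Comparator.comparingLong(TaskHandle::memoryReservationBytes))
                    .ifPresent(task -> killTask(queryId, task.taskId()));
        }
        else {
            // No task-level retries: killing the whole query remains the fallback.
            killQuery(queryId);
        }
    }

    void killTask(String queryId, String taskId) { /* ask the worker to fail the task */ }

    void killQuery(String queryId) { /* fail the query on the coordinator */ }
}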

Is this change a fix, improvement, new feature, refactoring, or other?

improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

engine

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

@cla-bot cla-bot bot added the cla-signed label Apr 5, 2022
@losipiuk losipiuk requested review from arhimondr and linzebing April 5, 2022 15:40
log.debug("Last killed target is still not gone: %s", lastKillTarget);
}
nanosSince(lastTimeNotOutOfMemory).compareTo(killOnOutOfMemoryDelay) > 0 &&
lastKillTarget.isEmpty() && nanosSince(lastTimeKillCompleted).compareTo(killOnOutOfMemoryAfterKillDelay) > 0) {
Contributor

Why is the extra delay needed? Doesn't isLastKillTargetGone protect the OOM killer from being invoked prematurely?

Member Author

It is problematic when we are working with a query that uses task-level fault tolerance.

The list of tasks and the pool memory info are harvested on the worker side independently (they are held by two different structures). So you can get the memory counter saying "pool is full", with high memory usage attributed to query Q, while the list of tasks for query Q on the node is empty (the tasks were just killed and are already marked as failed).

In this scenario we previously ended up killing the whole query, even though the query was running with task-based retries (the information that a query uses task-based retries was derived from the existence of a task list for the query in MemoryInfo.tasksMemoryInfo).
With the last commit from this PR ("Do not kill whole queries with task retries enabled on oom") we would no longer kill the whole fault-tolerant query. But without the extra delay, we would still assume the node is blocked on memory and kill something else on this node. Hence the extra delay is useful here.
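To make this concrete, here is a hedged sketch of the resulting guard, reconstructed from the diff fragments quoted in this review; the field names mirror those fragments, while the method shape and the use of java.time.Duration and Optional are simplifications, not the exact Trino code:

import java.time.Duration;
import java.util.Optional;

final class OomKillerGateSketch
{
    // Returns true only when it looks safe to invoke the OOM killer (again).
    static boolean shouldCallOomKiller(
            boolean clusterOutOfMemory,
            Duration sinceLastNotOutOfMemory,          // nanosSince(lastTimeNotOutOfMemory)
            Duration killOnOutOfMemoryDelay,
            Optional<?> lastKillTarget,
            Duration sinceLastKillCompleted,           // nanosSince(lastTimeKillCompleted)
            Duration killOnOutOfMemoryAfterKillDelay)
    {
        return clusterOutOfMemory
                // the cluster has been out of memory for at least killOnOutOfMemoryDelay
                && sinceLastNotOutOfMemory.compareTo(killOnOutOfMemoryDelay) > 0
                // no kill is currently in flight (the previous target is gone)
                && lastKillTarget.isEmpty()
                // the extra post-kill delay has elapsed, giving workers time to report
                // memory-pool state consistent with the tasks that were just killed
                && sinceLastKillCompleted.compareTo(killOnOutOfMemoryAfterKillDelay) > 0;
    }
}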

Contributor

I'm trying to think about a scenario when nodes are not getting blocked on memory all at once. I'm afraid that the extra delay may slow down scheduling for large clusters (100 - 1000 nodes).

From what I understand, the last commit from this PR should prevent an entire fault-tolerant query from being killed. However, there's still a chance that a task might get killed unnecessarily if the memory pool reports that it is still blocked while the tasks are already finished. I wonder how difficult it would be to return consistent information, where the memory pool reservation is consistent with the list of tasks?

Member Author

I'm trying to think about a scenario when nodes are not getting blocked on memory all at once. I'm afraid that the extra delay may slow down scheduling for large clusters (100 - 1000 nodes).

If we consider cluster OOM a common thing, the delay may certainly slow things down. The question is whether we should consider it a common thing. Task killing itself also slows things down, so we should rather try to minimize that.

From what I understand, the last commit from this PR should prevent an entire fault-tolerant query from being killed. However, there's still a chance that a task might get killed unnecessarily if the memory pool reports that it is still blocked while the tasks are already finished. I wonder how difficult it would be to return consistent information, where the memory pool reservation is consistent with the list of tasks?

It is exactly as you write.
It does not look like it is possible to get these two pieces of information consistently. They come from very different places, with no shared synchronization between them:

  • the memory info we get from MemoryPoolInfo.getInfo (via LocalMemoryManager)
  • the task info we get from SqlTaskManager
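As an illustration of why the two views can diverge (a simplified sketch, not the worker's actual reporting code; the record and field names are invented):

import java.util.List;
import java.util.Map;

final class WorkerStatusSketch
{
    // Snapshot of the memory pool, analogous to what the local memory manager exposes.
    record MemoryPoolSnapshot(long reservedBytes, Map<String, Long> queryMemoryReservations) {}

    // Combined status reported to the coordinator.
    record WorkerStatus(MemoryPoolSnapshot memoryPool, List<String> activeTaskIds) {}

    static WorkerStatus buildStatus(MemoryPoolSnapshot poolView, List<String> taskManagerView)
    {
        // The two arguments are read from independently synchronized components
        // (the memory pool vs. the task manager). Tasks can fail or finish between
        // the two reads, so the pool may still show reservations for a query whose
        // tasks no longer appear in the task list, which is exactly the
        // inconsistency described above.
        return new WorkerStatus(poolView, List.copyOf(taskManagerView));
    }
}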

Member Author

Should be good now.
PTAL @arhimondr

}
nanosSince(lastTimeNotOutOfMemory).compareTo(killOnOutOfMemoryDelay) > 0 &&
lastKillTarget.isEmpty() && nanosSince(lastTimeKillCompleted).compareTo(killOnOutOfMemoryAfterKillDelay) > 0) {
callOomKiller(runningQueries);
Contributor

The sequence of operations in the process method looks very weird.

I would rather expect it to be something like this:

// update memory reservation on each node
updateNodes();

// update memory pool
updateMemoryPool(Iterables.size(runningQueries));

... logic related to enforcing limits and freeing up memory pools ...

Do you think it might be the reason for these weird inconsistencies?

Member Author

I think it does not matter; updateNodes() is asynchronous anyway.

}

return areTasksGone(lastKillTarget.getTasks(), runningQueries);
return areTasksGone(lastKillTarget.get().getTasks(), runningQueries);
Contributor

Ideally the implementation of this method should be based on the information about the memory pool, as the decision to kill tasks is based on that view. What do you think?

Member Author

Technically it does not matter much, I think, as the coordinator learns that a task is dead only after it is actually killed on the worker. I can change that, but I think it is a low-priority cleanup and will not update this PR.

Contributor

This may technically introduce another inconsistency. On workers, tasks are first transitioned to the FAILED state, and the cleanup of active operators happens asynchronously. Thus it is possible for a worker to report that a task is done, yet it may take some time to clean up the memory pool. Hard to say how much it matters in practice; it feels safer though. isQueryGone is also implemented based on the ClusterMemoryPool, which is based on the MemoryInfo. I wonder if it makes sense to make it consistent.
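A hedged sketch of what basing the "killed tasks are gone" check on the memory-pool view could look like; the names are illustrative, and the actual change landed as the commit mentioned later in this thread:

import java.util.Map;
import java.util.Set;

final class KilledTasksGoneSketch
{
    // taskMemoryReservations: task id -> reserved bytes, as reported by the worker's memory pool.
    static boolean areTasksGone(Set<String> killedTaskIds, Map<String, Long> taskMemoryReservations)
    {
        // A killed task counts as gone only once the memory pool no longer reports a
        // reservation for it, keeping this check consistent with the view the OOM
        // killer used when choosing what to kill.
        return killedTaskIds.stream().noneMatch(taskMemoryReservations::containsKey);
    }
}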

Member Author

Should be OK now, though I did not test it yet. Will take a look tomorrow, as I do not think it is covered by automation.

Member Author

PTAL at last commit.

@losipiuk losipiuk force-pushed the lo/improve-oom-killing-task-retries branch from d60e113 to f0c7fe0 on April 6, 2022 08:06
@losipiuk losipiuk force-pushed the lo/improve-oom-killing-task-retries branch 2 times, most recently from 45528bc to f936248 on April 7, 2022 19:32
Contributor

@arhimondr arhimondr left a comment

Looks good % small comments


@@ -88,7 +89,26 @@ public synchronized MemoryPoolInfo getInfo()
}
memoryAllocations.put(entry.getKey(), allocations);
}
return new MemoryPoolInfo(maxBytes, reservedBytes, reservedRevocableBytes, queryMemoryReservations, memoryAllocations, queryRevocableMemoryReservations);

Map<String, Long> stringKeyedTaskMemoryReservations = taskMemoryReservations.entrySet().stream()
Member

Why stringKeyed?

Member Author

MemoryPoolInfo is in the SPI and does not see TaskId.
To be cleaned up as a follow-up. See https://trinodb.slack.com/archives/CP1MUNEUX/p1649269291183389 if you are interested.
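For illustration, a sketch of the re-keying this implies; TaskIdLike stands in for the engine-side TaskId, and the exact string format and collector used in Trino may differ:

import java.util.Map;

import static java.util.stream.Collectors.toMap;

final class StringKeyedReservationsSketch
{
    // Placeholder for the engine-side TaskId, which the SPI cannot reference.
    record TaskIdLike(String queryId, int stageId, int partitionId)
    {
        @Override
        public String toString()
        {
            return queryId + "." + stageId + "." + partitionId;
        }
    }

    static Map<String, Long> toStringKeyed(Map<TaskIdLike, Long> taskMemoryReservations)
    {
        // Re-key by the task id's string form so the SPI-level MemoryPoolInfo can
        // carry per-task reservations without depending on the TaskId class.
        return taskMemoryReservations.entrySet().stream()
                .collect(toMap(entry -> entry.getKey().toString(), Map.Entry::getValue));
    }
}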

@losipiuk losipiuk force-pushed the lo/improve-oom-killing-task-retries branch 2 times, most recently from 7a3e8c6 to cd5f888 on April 8, 2022 10:40
@losipiuk losipiuk force-pushed the lo/improve-oom-killing-task-retries branch from cd5f888 to 73f7ef6 on April 8, 2022 19:31
@losipiuk
Member Author

losipiuk commented Apr 8, 2022

I needed to make a fix to the "Use data from MemoryPoolInfo to determine if killed tasks are gone" commit. Will merge after CI.

@losipiuk losipiuk force-pushed the lo/improve-oom-killing-task-retries branch from 73f7ef6 to 8993061 on April 9, 2022 07:35
@losipiuk losipiuk merged commit a18f6e5 into trinodb:master Apr 9, 2022
@github-actions github-actions bot added this to the 377 milestone Apr 9, 2022