Ensure tasks are killed for fault-tolerant queries when cluster out of memory #11800
Conversation
core/trino-main/src/main/java/io/trino/memory/ClusterMemoryManager.java
log.debug("Last killed target is still not gone: %s", lastKillTarget);
}
nanosSince(lastTimeNotOutOfMemory).compareTo(killOnOutOfMemoryDelay) > 0 &&
        lastKillTarget.isEmpty() && nanosSince(lastTimeKillCompleted).compareTo(killOnOutOfMemoryAfterKillDelay) > 0) {
Why is the extra delay needed? Doesn't isLastKillTargetGone protect the OOM killer from being invoked prematurely?
It is problematic when we are working with a query with task-level fault tolerance.
The list of tasks and the pool memory info are harvested on the worker side independently (they are held by two different structures).
So you can get a memory counter saying "pool is full" and high memory usage for query Q, while the list of tasks for query Q on the node is empty (because the tasks were just killed and are already marked as failed).
In this scenario we previously ended up killing the whole query, even though the query was running with task-based retries (the information that the query uses task-based retries was derived from the existence of the query's task list in MemoryInfo.tasksMemoryInfo).
With the last commit from this PR (Do not kill whole queries with task retries enabled on oom) we would no longer kill the whole fault-tolerant query. But without the extra delay, we would still assume the node is blocked on memory and kill something else on this node. Hence the extra delay is useful here.
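The gating being discussed can be condensed into a small sketch. This is a hypothetical model, not the actual `ClusterMemoryManager` code: the class `OomGate`, its parameters, and the durations used below are illustrative, though the condition names (`killOnOutOfMemoryDelay`, `killOnOutOfMemoryAfterKillDelay`) come from the diff above.

```java
import java.time.Duration;

// Hypothetical condensed model of the OOM-killer gating: only fire when
// (a) the cluster has been out of memory longer than killOnOutOfMemoryDelay,
// (b) no previous kill target is still pending, and
// (c) enough time passed since the last completed kill
//     (killOnOutOfMemoryAfterKillDelay) for worker memory stats to catch up.
class OomGate
{
    static boolean shouldCallOomKiller(
            Duration sinceNotOutOfMemory,
            Duration killOnOutOfMemoryDelay,
            boolean killTargetPending,
            Duration sinceLastKillCompleted,
            Duration killOnOutOfMemoryAfterKillDelay)
    {
        return sinceNotOutOfMemory.compareTo(killOnOutOfMemoryDelay) > 0
                && !killTargetPending
                && sinceLastKillCompleted.compareTo(killOnOutOfMemoryAfterKillDelay) > 0;
    }

    public static void main(String[] args)
    {
        // OOM for 10s, no pending target, but last kill completed only 2s ago:
        // the extra delay keeps the killer from acting on stale memory stats.
        boolean tooSoonAfterKill = shouldCallOomKiller(
                Duration.ofSeconds(10), Duration.ofSeconds(5),
                false,
                Duration.ofSeconds(2), Duration.ofSeconds(30));
        System.out.println("kill when 2s after last kill: " + tooSoonAfterKill);

        // Same situation 60s after the last kill completed: the killer may fire.
        boolean afterDelay = shouldCallOomKiller(
                Duration.ofSeconds(10), Duration.ofSeconds(5),
                false,
                Duration.ofSeconds(60), Duration.ofSeconds(30));
        System.out.println("kill when 60s after last kill: " + afterDelay);
    }
}
```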
I'm trying to think about a scenario where nodes are not getting blocked on memory all at once. I'm afraid that the extra delay may slow down scheduling for large clusters (100 - 1000 nodes).
From what I understand the last commit from this PR should prevent an entire fault-tolerant query from being killed. However, there's still a chance that a task might get killed unnecessarily if the memory pool reports that it is still blocked but the tasks are already finished. I wonder how difficult it would be to return consistent information, where the memory pool reservation is consistent with the list of tasks?
I'm trying to think about a scenario where nodes are not getting blocked on memory all at once. I'm afraid that the extra delay may slow down scheduling for large clusters (100 - 1000 nodes).

If we consider cluster OOM a common thing, the delay may surely slow things down. The question is whether we should consider it a common thing. Task killing itself also slows things down, so we should rather try to minimize that.

From what I understand the last commit from this PR should prevent an entire fault-tolerant query from being killed. However, there's still a chance that a task might get killed unnecessarily if the memory pool reports that it is still blocked but the tasks are already finished. I wonder how difficult it would be to return consistent information, where the memory pool reservation is consistent with the list of tasks?
It is exactly as you write.
It does not look like it is possible to get these two pieces of information consistently. They come from very different places with no shared synchronization paradigms:
- memory info we get from MemoryPoolInfo.getInfo (via LocalMemoryManager)
- task info we get from SqlTaskManager
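The race described above can be illustrated with a toy model. This is purely hypothetical code (the `WorkerReport` record and its fields are invented for illustration, not Trino types): because the pool snapshot and the task-list snapshot are captured independently, a single report can show query "Q" still holding pool memory while its task list is already empty.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration: the memory pool snapshot and the task list
// snapshot are taken independently on the worker, so a combined report can
// show query "Q" still holding pool memory while its tasks are already gone
// (they were killed between the two snapshots).
class InconsistentWorkerReport
{
    record WorkerReport(Map<String, Long> poolBytesByQuery, List<String> tasksOfQueryQ) {}

    public static void main(String[] args)
    {
        // Snapshot 1: pool accounting, taken before the kill was reflected.
        Map<String, Long> pool = Map.of("Q", 4_000_000_000L);
        // Snapshot 2: task list, taken after Q's tasks were marked as failed.
        List<String> tasks = List.of();

        WorkerReport report = new WorkerReport(pool, tasks);
        boolean poolSaysBlocked = report.poolBytesByQuery().getOrDefault("Q", 0L) > 0;
        boolean hasTasks = !report.tasksOfQueryQ().isEmpty();
        System.out.println("pool shows Q using memory: " + poolSaysBlocked);
        System.out.println("task list shows Q running: " + hasTasks);
    }
}
```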
Should be good now.
PTAL @arhimondr
}
nanosSince(lastTimeNotOutOfMemory).compareTo(killOnOutOfMemoryDelay) > 0 &&
        lastKillTarget.isEmpty() && nanosSince(lastTimeKillCompleted).compareTo(killOnOutOfMemoryAfterKillDelay) > 0) {
callOomKiller(runningQueries);
The sequence of operations in the process method looks very weird.
I would rather expect it to be something like this
// update memory reservation on each node
updateNodes();
// update memory pool
updateMemoryPool(Iterables.size(runningQueries));
... logic related to enforcing limits and freeing up memory pools ...
Do you think it might be the reason for these weird inconsistencies?
I think it does not matter; updateNodes() is asynchronous anyway.
}
return areTasksGone(lastKillTarget.getTasks(), runningQueries);
return areTasksGone(lastKillTarget.get().getTasks(), runningQueries);
Ideally the implementation of this method should be based on the information about the memory pool, as the decision to kill tasks is based on that view. What do you think?
Technically it does not matter much, I think, as the coordinator learns that a task is dead only after it is actually killed on the worker. I can change that, but I think it is a low-priority cleanup and will not update this PR.
This may technically introduce another inconsistency. On workers, tasks are first transitioned to the FAILED state, and the cleanup of active operators happens asynchronously. Thus it is possible for a worker to report that a task is done, yet it may take some time to clean up the memory pool. Hard to say how much it matters in practice. It feels safer though. The isQueryGone is also implemented based on the ClusterMemoryPool, which is based on the MemoryInfo. I wonder if it makes sense to make it consistent.
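The two candidate definitions of "the kill target is gone" can be sketched side by side. This is illustrative code only (the class and method names below are invented, and task reservations are modeled as a plain map): the worker may report a task as done while its operators are still being cleaned up, so the task-list check and the memory-pool check can briefly disagree.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of two ways to decide a kill target is "gone":
// by the reported task list (tasks no longer present) or by the memory
// pool view (no remaining reservation for the killed tasks).
class KillTargetGoneChecks
{
    static boolean goneByTaskList(Set<String> killedTasks, Set<String> reportedTasks)
    {
        // Gone when none of the killed tasks are still reported as running.
        return killedTasks.stream().noneMatch(reportedTasks::contains);
    }

    static boolean goneByMemoryPool(Set<String> killedTasks, Map<String, Long> taskReservations)
    {
        // Gone when the pool no longer shows any reservation for the killed tasks.
        return killedTasks.stream()
                .mapToLong(task -> taskReservations.getOrDefault(task, 0L))
                .sum() == 0;
    }

    public static void main(String[] args)
    {
        Set<String> killed = Set.of("q1.0.1");
        // Task already reported as failed, but its memory is not yet released:
        Set<String> reported = Set.of();
        Map<String, Long> reservations = Map.of("q1.0.1", 512L);
        System.out.println("gone by task list: " + goneByTaskList(killed, reported));
        System.out.println("gone by memory pool: " + goneByMemoryPool(killed, reservations));
    }
}
```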
Should be ok now. Though I did not test it yet. Will take a look tomorrow as I do not think it is covered by automation.
PTAL at last commit.
core/trino-main/src/main/java/io/trino/memory/MemoryManagerConfig.java
Looks good % small comments
.../trino-main/src/main/java/io/trino/memory/TotalReservationOnBlockedNodesLowMemoryKiller.java
@@ -88,7 +89,26 @@ public synchronized MemoryPoolInfo getInfo()
}
memoryAllocations.put(entry.getKey(), allocations);
}
return new MemoryPoolInfo(maxBytes, reservedBytes, reservedRevocableBytes, queryMemoryReservations, memoryAllocations, queryRevocableMemoryReservations);

Map<String, Long> stringKeyedTaskMemoryReservations = taskMemoryReservations.entrySet().stream()
Why stringKeyed?
MemoryPoolInfo is in the SPI and does not see TaskId.
To be cleaned up as a followup. See: https://trinodb.slack.com/archives/CP1MUNEUX/p1649269291183389 if you are interested
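The re-keying being asked about can be sketched as follows. Since MemoryPoolInfo lives in the SPI and cannot depend on the engine's TaskId class, the TaskId-keyed reservations are re-keyed by their string form. The `TaskId` record below is a stand-in for the real engine class, invented for this self-contained example.

```java
import java.util.Map;
import java.util.stream.Collectors;

// Minimal sketch of the conversion: MemoryPoolInfo is in the SPI and cannot
// depend on the engine's TaskId, so reservations keyed by TaskId are re-keyed
// by the task id's string form before being published.
class StringKeyedReservations
{
    // Stand-in for the real engine TaskId class.
    record TaskId(String queryId, int stageId, int taskId)
    {
        @Override
        public String toString()
        {
            return queryId + "." + stageId + "." + taskId;
        }
    }

    public static void main(String[] args)
    {
        Map<TaskId, Long> taskMemoryReservations = Map.of(new TaskId("q1", 0, 1), 1024L);
        Map<String, Long> stringKeyedTaskMemoryReservations = taskMemoryReservations.entrySet().stream()
                .collect(Collectors.toMap(entry -> entry.getKey().toString(), Map.Entry::getValue));
        System.out.println(stringKeyedTaskMemoryReservations);
    }
}
```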
For queries which run with task-level retries, it makes sense to kill tasks immediately after we encounter a cluster-level out-of-memory scenario. The cost of killing tasks is small (the whole query survives) and we are not introducing artificial latency.
I needed to make a fix to
The information is now present as part of MemoryPoolInfo which is wrapped by MemoryInfo
Description
Ensure that the OOM killer, which triggers when the cluster is out of memory, will not kill the whole query if the query has task-level retries enabled.
improvement
engine
Documentation
(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
(x) No release notes entries required.
( ) Release notes entries required with the following suggested text: