
Ensure tasks are killed for fault-tolerant queries when cluster out of memory #11800

Merged
merged 13 commits into trinodb:master from the lo/improve-oom-killing-task-retries branch on Apr 9, 2022

Conversation

losipiuk
Member

@losipiuk losipiuk commented Apr 5, 2022

Description

Ensure that the OOM killer, which triggers when the cluster is out of memory, does not kill the whole query if the query has task-level retries enabled.
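For illustration only, here is a minimal sketch of the intended behavior; RetryPolicy, TaskHandle, killTask and killQuery are hypothetical placeholder names, not the actual Trino classes or methods:

import java.util.Comparator;
import java.util.List;

class OomKillerPolicySketch
{
    enum RetryPolicy { NONE, TASK }

    record TaskHandle(String taskId, long memoryReservationBytes) {}

    void onClusterOutOfMemory(String queryId, RetryPolicy retryPolicy, List<TaskHandle> queryTasks)
    {
        if (retryPolicy == RetryPolicy.TASK) {
            // Fault-tolerant query: kill only the most memory-hungry task so the
            // engine can retry it, instead of failing the whole query.
            queryTasks.stream()
                    .max(Comparator.comparingLong(TaskHandle::memoryReservationBytes))
                    .ifPresent(task -> killTask(queryId, task.taskId()));
        }
        else {
            // No task-level retries: killing the whole query remains the fallback.
            killQuery(queryId);
        }
    }

    void killTask(String queryId, String taskId) { /* ask the worker to fail the task */ }

    void killQuery(String queryId) { /* fail the query on the coordinator */ }
}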

Is this change a fix, improvement, new feature, refactoring, or other?

improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

engine

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

@cla-bot cla-bot bot added the cla-signed label Apr 5, 2022
@losipiuk losipiuk requested review from arhimondr and linzebing April 5, 2022 15:40
log.debug("Last killed target is still not gone: %s", lastKillTarget);
}
nanosSince(lastTimeNotOutOfMemory).compareTo(killOnOutOfMemoryDelay) > 0 &&
lastKillTarget.isEmpty() && nanosSince(lastTimeKillCompleted).compareTo(killOnOutOfMemoryAfterKillDelay) > 0) {
Contributor

Why is the extra delay needed? Doesn't isLastKillTargetGone protect the OOM killer from being invoked prematurely?

Member Author

It is problematic when we are working with a query that uses task-level fault tolerance.

The list of tasks and the pool memory info are harvested on the worker side independently (they are held by two different structures). So you can get the memory counter saying "pool is full", with high memory usage attributed to query Q, while the list of tasks for query Q on the node is empty (the tasks were just killed and are already marked as failed).

In this scenario we previously ended up killing the whole query, even though the query was running with task-based retries (the information that a query uses task-based retries was derived from the existence of a task list for the query in MemoryInfo.tasksMemoryInfo).
With the last commit from this PR ("Do not kill whole queries with task retries enabled on oom") we would no longer kill the whole fault-tolerant query. But without the extra delay, we would still assume the node is blocked on memory and kill something else on this node. Hence the extra delay is useful here.
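To make this concrete, here is a hedged sketch of the resulting guard, reconstructed from the diff fragments quoted in this review; the field names mirror those fragments, while the method shape and the use of java.time.Duration and Optional are simplifications, not the exact Trino code:

import java.time.Duration;
import java.util.Optional;

final class OomKillerGateSketch
{
    // Returns true only when it looks safe to invoke the OOM killer (again).
    static boolean shouldCallOomKiller(
            boolean clusterOutOfMemory,
            Duration sinceLastNotOutOfMemory,          // nanosSince(lastTimeNotOutOfMemory)
            Duration killOnOutOfMemoryDelay,
            Optional<?> lastKillTarget,
            Duration sinceLastKillCompleted,           // nanosSince(lastTimeKillCompleted)
            Duration killOnOutOfMemoryAfterKillDelay)
    {
        return clusterOutOfMemory
                // the cluster has been out of memory for at least killOnOutOfMemoryDelay
                && sinceLastNotOutOfMemory.compareTo(killOnOutOfMemoryDelay) > 0
                // no kill is currently in flight (the previous target is gone)
                && lastKillTarget.isEmpty()
                // the extra post-kill delay has elapsed, giving workers time to report
                // memory-pool state consistent with the tasks that were just killed
                && sinceLastKillCompleted.compareTo(killOnOutOfMemoryAfterKillDelay) > 0;
    }
}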

Contributor

I'm trying to think about a scenario when nodes are not getting blocked on memory all at once. I'm afraid that the extra delay may slow down scheduling for large clusters (100 - 1000 nodes).

From what I understand, the last commit from this PR should prevent an entire fault-tolerant query from being killed. However, there's still a chance that a task might get killed unnecessarily if the memory pool reports that it is still blocked while the tasks are already finished. I wonder how difficult it would be to return consistent information, where the memory pool reservation is consistent with the list of tasks?

Member Author

I'm trying to think about a scenario when nodes are not getting blocked on memory all at once. I'm afraid that the extra delay may slow down scheduling for large clusters (100 - 1000 nodes).

If we consider cluster OOM a common thing, the delay may certainly slow things down. The question is whether we should consider it a common thing. Task killing itself also slows things down, so we should rather try to minimize that.

From what I understand, the last commit from this PR should prevent an entire fault-tolerant query from being killed. However, there's still a chance that a task might get killed unnecessarily if the memory pool reports that it is still blocked while the tasks are already finished. I wonder how difficult it would be to return consistent information, where the memory pool reservation is consistent with the list of tasks?

It is exactly as you write.
It does not look like it is possible to get these two pieces of information consistently. They come from very different places, with no shared synchronization between them:

  • the memory info we get from MemoryPoolInfo.getInfo (via LocalMemoryManager)
  • the task info we get from SqlTaskManager
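As an illustration of why the two views can diverge (a simplified sketch, not the worker's actual reporting code; the record and field names are invented):

import java.util.List;
import java.util.Map;

final class WorkerStatusSketch
{
    // Snapshot of the memory pool, analogous to what the local memory manager exposes.
    record MemoryPoolSnapshot(long reservedBytes, Map<String, Long> queryMemoryReservations) {}

    // Combined status reported to the coordinator.
    record WorkerStatus(MemoryPoolSnapshot memoryPool, List<String> activeTaskIds) {}

    static WorkerStatus buildStatus(MemoryPoolSnapshot poolView, List<String> taskManagerView)
    {
        // The two arguments are read from independently synchronized components
        // (the memory pool vs. the task manager). Tasks can fail or finish between
        // the two reads, so the pool may still show reservations for a query whose
        // tasks no longer appear in the task list, which is exactly the
        // inconsistency described above.
        return new WorkerStatus(poolView, List.copyOf(taskManagerView));
    }
}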

Member Author

Should be good now.
PTAL @arhimondr

}
nanosSince(lastTimeNotOutOfMemory).compareTo(killOnOutOfMemoryDelay) > 0 &&
lastKillTarget.isEmpty() && nanosSince(lastTimeKillCompleted).compareTo(killOnOutOfMemoryAfterKillDelay) > 0) {
callOomKiller(runningQueries);
Contributor

The sequence of operations in the process method looks very weird.

I would rather expect it to be something like this:

// update memory reservation on each node
updateNodes();

// update memory pool
updateMemoryPool(Iterables.size(runningQueries));

... logic related to enforcing limits and freeing up memory pools ...

Do you think it might be the reason for these weird inconsistencies?

Member Author

I think it does not matter; updateNodes() is asynchronous anyway.

}

return areTasksGone(lastKillTarget.getTasks(), runningQueries);
return areTasksGone(lastKillTarget.get().getTasks(), runningQueries);
Contributor

Ideally the implementation of this method should be based on the information about the memory pool, as the decision to kill tasks is based on that view. What do you think?

Member Author

Technically it does not matter much, I think, as the coordinator learns that a task is dead only after it is actually killed on the worker. I can change that, but I think it is a low-priority cleanup and will not update this PR.

Contributor

This may technically introduce another inconsistency. On workers, tasks are first transitioned to the FAILED state, and the cleanup of active operators happens asynchronously. Thus it is possible for a worker to report that a task is done, yet it may take some time to clean up the memory pool. Hard to say how much it matters in practice; it feels safer though. isQueryGone is also implemented based on the ClusterMemoryPool, which is based on the MemoryInfo. I wonder if it makes sense to make it consistent.
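A hedged sketch of what basing the "killed tasks are gone" check on the memory-pool view could look like; the names are illustrative, and the actual change landed as the commit mentioned later in this thread:

import java.util.Map;
import java.util.Set;

final class KilledTasksGoneSketch
{
    // taskMemoryReservations: task id -> reserved bytes, as reported by the worker's memory pool.
    static boolean areTasksGone(Set<String> killedTaskIds, Map<String, Long> taskMemoryReservations)
    {
        // A killed task counts as gone only once the memory pool no longer reports a
        // reservation for it, keeping this check consistent with the view the OOM
        // killer used when choosing what to kill.
        return killedTaskIds.stream().noneMatch(taskMemoryReservations::containsKey);
    }
}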

Member Author

Should be OK now, though I did not test it yet. Will take a look tomorrow, as I do not think it is covered by automation.

Member Author

PTAL at last commit.

@losipiuk losipiuk force-pushed the lo/improve-oom-killing-task-retries branch from d60e113 to f0c7fe0 on April 6, 2022 08:06
@losipiuk losipiuk force-pushed the lo/improve-oom-killing-task-retries branch 2 times, most recently from 45528bc to f936248 on April 7, 2022 19:32
Contributor

@arhimondr arhimondr left a comment

Looks good % small comments


@@ -88,7 +89,26 @@ public synchronized MemoryPoolInfo getInfo()
}
memoryAllocations.put(entry.getKey(), allocations);
}
return new MemoryPoolInfo(maxBytes, reservedBytes, reservedRevocableBytes, queryMemoryReservations, memoryAllocations, queryRevocableMemoryReservations);

Map<String, Long> stringKeyedTaskMemoryReservations = taskMemoryReservations.entrySet().stream()
Member

Why stringKeyed?

Member Author

MemoryPoolInfo is in the SPI and does not see TaskId.
To be cleaned up as a follow-up. See https://trinodb.slack.com/archives/CP1MUNEUX/p1649269291183389 if you are interested.
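For illustration, a sketch of the re-keying this implies; TaskIdLike stands in for the engine-side TaskId, and the exact string format and collector used in Trino may differ:

import java.util.Map;

import static java.util.stream.Collectors.toMap;

final class StringKeyedReservationsSketch
{
    // Placeholder for the engine-side TaskId, which the SPI cannot reference.
    record TaskIdLike(String queryId, int stageId, int partitionId)
    {
        @Override
        public String toString()
        {
            return queryId + "." + stageId + "." + partitionId;
        }
    }

    static Map<String, Long> toStringKeyed(Map<TaskIdLike, Long> taskMemoryReservations)
    {
        // Re-key by the task id's string form so the SPI-level MemoryPoolInfo can
        // carry per-task reservations without depending on the TaskId class.
        return taskMemoryReservations.entrySet().stream()
                .collect(toMap(entry -> entry.getKey().toString(), Map.Entry::getValue));
    }
}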

@losipiuk losipiuk force-pushed the lo/improve-oom-killing-task-retries branch 2 times, most recently from 7a3e8c6 to cd5f888 on April 8, 2022 10:40
@losipiuk losipiuk force-pushed the lo/improve-oom-killing-task-retries branch from cd5f888 to 73f7ef6 on April 8, 2022 19:31
@losipiuk
Member Author

losipiuk commented Apr 8, 2022

I needed to make a fix to the "Use data from MemoryPoolInfo to determine if killed tasks are gone" commit. Will merge after CI.

@losipiuk losipiuk force-pushed the lo/improve-oom-killing-task-retries branch from 73f7ef6 to 8993061 on April 9, 2022 07:35
@losipiuk losipiuk merged commit a18f6e5 into trinodb:master Apr 9, 2022
@github-actions github-actions bot added this to the 377 milestone Apr 9, 2022