
Add table stats cache #13047

Merged: findepi merged 5 commits into trinodb:master from findepi/lxynov/caching-tablestats on Jul 21, 2022
Conversation

@findepi (Member) commented Jun 30, 2022

#12196 with most comments applied

Already reviewed at #12196

cc @lxynov @alexjo2144 @sopel39 @losipiuk

@cla-bot added the cla-signed label on Jun 30, 2022
@findepi (Member, Author) commented Jun 30, 2022

#12196 with most comments applied

This one I didn't apply: #12196 (comment)

Already reviewed at #12196

Review isn't required here.

@findepi mentioned this pull request on Jun 30, 2022
lxynov added 2 commits on June 30, 2022 at 12:37:

- Before the change, the planner cached table stats within an `IterativeOptimizer` run (as part of `Memo`). After the change, there is another cache that spans the whole optimization process.
- The overloads were introduced only temporarily.
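A minimal, self-contained sketch of that lifecycle, with simplified stand-in types (the real Trino classes and the `PlanOptimizer` signatures carry more arguments than shown): one query-scoped caching provider is created up front and reused across every optimizer pass, so repeated stats requests for the same table reach the connector only once.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Stand-ins for the io.trino.metadata / io.trino.cost types; names are illustrative only.
final class QueryScopedStatsCacheSketch
{
    record TableStatistics(double rowCount) {}

    static final class CachingTableStatsProvider
    {
        // Stand-in for metadata.getTableStatistics(session, tableHandle)
        private final Function<String, TableStatistics> metadataLookup;
        private final Map<String, TableStatistics> cache = new HashMap<>();

        CachingTableStatsProvider(Function<String, TableStatistics> metadataLookup)
        {
            this.metadataLookup = metadataLookup;
        }

        TableStatistics getTableStatistics(String table)
        {
            // Query-scoped cache: later optimizer passes reuse earlier lookups
            return cache.computeIfAbsent(table, metadataLookup);
        }
    }

    public static void main(String[] args)
    {
        AtomicInteger connectorCalls = new AtomicInteger();
        CachingTableStatsProvider tableStatsProvider = new CachingTableStatsProvider(table -> {
            connectorCalls.incrementAndGet();
            return new TableStatistics(1_000);
        });

        // Before this change, each IterativeOptimizer run kept its own Memo-scoped cache;
        // here the same provider is shared by every "pass" over the plan for one query.
        for (int pass = 0; pass < 3; pass++) {
            tableStatsProvider.getTableStatistics("orders");
        }
        System.out.println("connector stats calls: " + connectorCalls.get()); // prints 1
    }
}
```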
@findepi force-pushed the findepi/lxynov/caching-tablestats branch from cbe6f26 to 1b94f09 on June 30, 2022
@findepi (Member, Author) commented Jun 30, 2022

This one I didn't apply: #12196 (comment)

Per my comment in the original PR (dc8228a#r911207504), I decided to keep the current code as-is.

@findepi (Member, Author) commented Jul 1, 2022

test (plugin/trino-iceberg)

Error:  java.lang.OutOfMemoryError: Java heap space
Error:  Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "VM Pause Meter"
Error:  Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "http-worker-8680"
Error:  Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "VM Pause Meter"
Error:  Terminating due to java.lang.OutOfMemoryError: Java heap space

 PlanNodeStatsEstimate calculateStats(
         PlanNode node,
         StatsProvider sourceStats,
         Lookup lookup,
         Session session,
-        TypeProvider types);
+        TypeProvider types,
+        TableStatsProvider tableStatsProvider);
@sopel39 (Member) commented Jul 18, 2022

Why does TableStatsProvider need to be a first-class citizen in StatsCalculator? It seems it should be local to TableScanStatsRule.

@findepi (Member, Author) replied

TableScanStatsRule is effectively a singleton. CachingTableStatsProvider's lifecycle is "during plan optimization", so it's query-scoped. We need to pass the TableStatsProvider instance created in the optimizer back to the TableScanStatsRule rule.

The other option would be to have stats rules query-scoped and create a new StatsCalculator for each query, but this is IMO generally not a good idea.
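A compact illustration of that constraint, using hypothetical stand-in types rather than the actual Trino classes: because the rule instance is shared by all queries, the query-scoped provider has to arrive as a `doCalculate` argument instead of being stored in a field of the rule.

```java
import java.util.Optional;

final class SingletonStatsRuleSketch
{
    record TableHandle(String name) {}
    record TableStatistics(double rowCount) {}

    interface TableStatsProvider
    {
        TableStatistics getTableStatistics(TableHandle table);
    }

    // Effectively a singleton: it must not hold any query-scoped state in fields.
    static final class TableScanStatsRule
    {
        Optional<TableStatistics> doCalculate(TableHandle table, TableStatsProvider tableStatsProvider)
        {
            // The provider was created by the optimizer for the current query and is passed in per call.
            return Optional.ofNullable(tableStatsProvider.getTableStatistics(table));
        }
    }
}
```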

@sopel39 (Member) commented Jul 19, 2022

> The other option would be to have stats rules query-scoped and create a new StatsCalculator for each query, but this is IMO generally not a good idea.

Why? Maybe we can just preserve cached stats across planner/optimizer rules?

The root cause here is really that we have a lot of iterative optimizer instances. If we just had one, then there would be no caching problem.

Anyway, I'm more in favor of preserving cached stats (maybe with weak references) across optimizer executions rather than adding artificial components to a clean interface. This is a similar issue to caching expression optimizer/analyzer results.

@findepi (Member, Author) replied

> Why? Maybe we can just preserve cached stats across planner/optimizer rules?

That's what the PR is doing.

> The root cause here is really that we have a lot of iterative optimizer instances. If we just had one, then there would be no caching problem.

I agree. Memo would do all of the work for now. We can remove this new concept once we get to that stage.

> Anyway, I'm more in favor of preserving cached stats (maybe with weak references) across optimizer executions rather than adding artificial components to a clean interface. This is a similar issue to caching expression optimizer/analyzer results.

I am not following. @sopel39, what are you suggesting?

@sopel39 (Member) replied

> I am not following. @sopel39, what are you suggesting?

I suggest introducing some kind of context to doCalculate rather than adding another explicit parameter. I think @gaurav8297 has a similar proposal for expression optimizer/analyzer results.

@gaurav8297 (Member) commented Jul 19, 2022

Yes, I've implemented it as part of this PR: #12016

Basically, like we have io.trino.sql.planner.iterative.Rule.Context for iterative rules, we could have the same thing for StatsRule.
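A hypothetical sketch of that alternative (not what this PR implements): bundle the per-query inputs into a single context object, analogous to `io.trino.sql.planner.iterative.Rule.Context`, so that adding a new input such as the table stats provider does not require another `doCalculate` parameter. All names below are illustrative.

```java
// Empty marker interfaces standing in for the real Trino types.
interface StatsProvider {}
interface Lookup {}
interface Session {}
interface TypeProvider {}
interface TableStatsProvider {}

// Hypothetical context for stats rules, mirroring the shape of Rule.Context.
interface StatsCalculationContext
{
    StatsProvider sourceStats();
    Lookup lookup();
    Session session();
    TypeProvider types();
    TableStatsProvider tableStatsProvider();
}

// A rule would then be called as doCalculate(node, context) instead of receiving
// six separate arguments, and future inputs extend the context rather than the signature.
```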

@findepi (Member, Author) replied

> I suggest introducing some kind of context to doCalculate rather than adding another explicit parameter.

We still need to pass a parameter.

I agree we could introduce a more generic context as well. I don't see a need for it here yet; it can be a follow-up.

@@ -55,13 +51,13 @@ public Pattern<TableScanNode> getPattern()
     }

     @Override
-    protected Optional<PlanNodeStatsEstimate> doCalculate(TableScanNode node, StatsProvider sourceStats, Lookup lookup, Session session, TypeProvider types)
+    protected Optional<PlanNodeStatsEstimate> doCalculate(TableScanNode node, StatsProvider sourceStats, Lookup lookup, Session session, TypeProvider types, TableStatsProvider tableStatsProvider)
A reviewer (Member) commented

Why can't StatsProvider handle caching? It does already, right? Something doesn't add up here.

@findepi (Member, Author) replied

That was my initial idea as well. See #12196 (comment).

@findepi requested a review from sopel39 on July 19, 2022
private final Metadata metadata;
private final Session session;

private final Map<TableHandle, TableStatistics> cache = new WeakHashMap<>();
A reviewer (Member) commented

Nit: why do you need WeakHashMap if CTSP is created per query? Do you expect that TableHandle objects will be GCed often while the query is running? It is possible for sure, but I am not sure it happens often in practice.

@findepi (Member, Author) replied

During optimizations we can create a large number of TableHandle objects (e.g. subsequent pushdowns), and old TableHandle objects may not be reachable in the plan. I don't think "large number" would be a common situation (there isn't an unlimited number of pushdown opportunities for a given query), but I still don't think we should be using strong references here.
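A minimal sketch consistent with the fields shown above and with this rationale, using simplified stand-in types (Session is omitted, and the handle type is a hypothetical identity-based class): statistics are keyed by TableHandle in a WeakHashMap, so a handle that was superseded during pushdown and is no longer referenced from the plan can be garbage-collected along with its cached entry.

```java
import java.util.Map;
import java.util.WeakHashMap;

final class WeakTableStatsCacheSketch
{
    // Identity-based stand-in for io.trino.metadata.TableHandle (no equals/hashCode override).
    static final class TableHandle
    {
        final String connectorHandle;

        TableHandle(String connectorHandle)
        {
            this.connectorHandle = connectorHandle;
        }
    }

    record TableStatistics(double rowCount) {}

    // Stand-in for the relevant slice of io.trino.metadata.Metadata.
    interface Metadata
    {
        TableStatistics getTableStatistics(TableHandle handle);
    }

    static final class CachingTableStatsProvider
    {
        private final Metadata metadata;
        // Weak keys: an entry survives only while something else (e.g. the current plan)
        // still holds a strong reference to its TableHandle.
        private final Map<TableHandle, TableStatistics> cache = new WeakHashMap<>();

        CachingTableStatsProvider(Metadata metadata)
        {
            this.metadata = metadata;
        }

        TableStatistics getTableStatistics(TableHandle handle)
        {
            TableStatistics stats = cache.get(handle);
            if (stats == null) {
                stats = metadata.getTableStatistics(handle);
                cache.put(handle, stats);
            }
            return stats;
        }
    }
}
```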

@losipiuk (Member) left a comment

LGTM. Why not merge commits together?

@findepi (Member, Author) commented Jul 21, 2022

> LGTM. Why not merge commits together?

The commits were separated per my request in #12196; this is to separate interesting logical changes from mechanical ones.

@findepi (Member, Author) commented Jul 21, 2022

This is an important change (see #11708 and #13198 for some rationale), and the idea to introduce a context seems optional to me; it can definitely be taken care of as a follow-up if needed. Let me merge this as is; hopefully no one feels bad about that.

@findepi merged commit 53bd064 into trinodb:master on Jul 21, 2022
@findepi deleted the findepi/lxynov/caching-tablestats branch on Jul 21, 2022
@findepi mentioned this pull request on Jul 21, 2022
@github-actions added this to the 391 milestone on Jul 21, 2022