Add table stats cache #13047
Conversation
This one I didn't apply: #12196 (comment). Review isn't required here.
Before the change, the planner cached table stats only within a single `IterativeOptimizer` run (as part of `Memo`). After the change, there is an additional cache that spans the whole optimization process.
The overloads were introduced only temporarily.
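For illustration, a minimal sketch of what such a query-scoped cache could look like. The names (`CachingTableStatsProvider`, `TableStatsProvider.getTableStatistics`, `Metadata.getTableStatistics`) follow the discussion below, but this is an assumption about the shape of the code, not a verbatim copy of the PR:

```java
import java.util.Map;
import java.util.WeakHashMap;

import static java.util.Objects.requireNonNull;

// Sketch only: Metadata, Session, TableHandle, TableStatistics and
// TableStatsProvider stand in for the corresponding Trino planner types.
public class CachingTableStatsProvider
        implements TableStatsProvider
{
    private final Metadata metadata;
    private final Session session;

    // Weak keys: TableHandle instances that fall out of the plan
    // (e.g. superseded by pushdowns) can be collected along with
    // their cached statistics.
    private final Map<TableHandle, TableStatistics> cache = new WeakHashMap<>();

    public CachingTableStatsProvider(Metadata metadata, Session session)
    {
        this.metadata = requireNonNull(metadata, "metadata is null");
        this.session = requireNonNull(session, "session is null");
    }

    @Override
    public TableStatistics getTableStatistics(TableHandle handle)
    {
        // Fetch stats from the connector at most once per distinct handle
        // for the lifetime of the query.
        return cache.computeIfAbsent(handle, h -> metadata.getTableStatistics(session, h));
    }
}
```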
Force-pushed from cbe6f26 to 1b94f09
Per my comment in the original PR: dc8228a#r911207504
     */
     PlanNodeStatsEstimate calculateStats(
             PlanNode node,
             StatsProvider sourceStats,
             Lookup lookup,
             Session session,
-            TypeProvider types);
+            TypeProvider types,
+            TableStatsProvider tableStatsProvider);
Why does TableStatsProvider need to be a first-class citizen in StatsCalculator? It seems it should be local to TableScanStatsRule.
TableScanStatsRule is effectively a singleton. CachingTableStatsProvider's lifecycle is "during plan optimization", so it's query-scoped. We need to pass the TableStatsProvider instance created in the optimizer back to the TableScanStatsRule (see the toy model below).
The other option would be to make the stats rules query-scoped and create a new StatsCalculator for each query, but IMO that is generally not a good idea.
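As a toy model of that constraint (all types here are simplified and hypothetical, not Trino's actual classes): the rule object is shared across queries, so the query-scoped provider has to arrive as a call argument rather than as a field:

```java
import java.util.HashMap;
import java.util.Map;

public class SingletonRuleDemo
{
    // Hypothetical, simplified provider interface.
    interface TableStatsProvider
    {
        long rowCount(String table);
    }

    // Singleton rule: it must not hold per-query state, so the
    // provider is threaded in through the calculate() call.
    static class TableScanStatsRule
    {
        long calculate(String table, TableStatsProvider provider)
        {
            return provider.rowCount(table);
        }
    }

    public static void main(String[] args)
    {
        TableScanStatsRule rule = new TableScanStatsRule(); // shared across queries

        // One caching provider per query, passed back into the shared rule.
        Map<String, Long> cache = new HashMap<>();
        TableStatsProvider provider = table -> cache.computeIfAbsent(table, SingletonRuleDemo::fetchStats);

        System.out.println(rule.calculate("orders", provider)); // fetches
        System.out.println(rule.calculate("orders", provider)); // cache hit
    }

    private static long fetchStats(String table)
    {
        System.out.println("fetching stats for " + table);
        return 42;
    }
}
```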
The other option would be to make the stats rules query-scoped and create a new StatsCalculator for each query, but IMO that is generally not a good idea.

Why? Maybe we can just preserve cached stats across planner/optimizer rules?
The root cause here is really that we have a lot of IterativeOptimizer instances. If we just had one, there would be no caching problem.
Anyway, I'm more in favor of preserving cached stats (maybe with weak references) across optimizer executions rather than adding artificial components to a clean interface. This is a similar issue to caching expression optimizer/analyzer results.
Why? Maybe we can just preserve cached stats across planner/optimizer rules?

That's what this PR is doing.

The root cause here is really that we have a lot of IterativeOptimizer instances. If we just had one, there would be no caching problem.

I agree. With a single optimizer, Memo would do all of the work, and we could remove this new concept once we get to that stage.

Anyway, I'm more in favor of preserving cached stats (maybe with weak references) across optimizer executions rather than adding artificial components to a clean interface. This is a similar issue to caching expression optimizer/analyzer results.

I am not following. @sopel39, what are you suggesting?
I am not following. @sopel39, what are you suggesting?

I suggest introducing some kind of context for doCalculate rather than adding another explicit parameter. I think @gaurav8297 has a similar proposal for the expression optimizer/analyzer results.
Yes, I've implemented it as part of this PR: #12016
Basically, just as we have io.trino.sql.planner.iterative.Rule.Context for iterative rules, we could have the same thing for StatsRule; a hedged sketch of that shape follows.
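For concreteness, this is a hypothetical shape modeled on Rule.Context, not code from either PR:

```java
// Hypothetical context for stats rules, mirroring
// io.trino.sql.planner.iterative.Rule.Context. The referenced types
// (StatsProvider, Lookup, Session, TypeProvider, TableStatsProvider)
// are Trino planner types.
public interface StatsCalculationContext
{
    StatsProvider sourceStats();

    Lookup lookup();

    Session session();

    TypeProvider types();

    TableStatsProvider tableStatsProvider();
}
```

With that in place, doCalculate would take the node plus a single context argument instead of a growing parameter list.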
I suggest introducing some kind of context for doCalculate rather than adding another explicit parameter.

We still need to pass a parameter. I agree we could introduce a more generic context as well, but I don't see a need for it here yet; it can be a follow-up.
@@ -55,13 +51,13 @@ public Pattern<TableScanNode> getPattern()
     }

     @Override
-    protected Optional<PlanNodeStatsEstimate> doCalculate(TableScanNode node, StatsProvider sourceStats, Lookup lookup, Session session, TypeProvider types)
+    protected Optional<PlanNodeStatsEstimate> doCalculate(TableScanNode node, StatsProvider sourceStats, Lookup lookup, Session session, TypeProvider types, TableStatsProvider tableStatsProvider)
Why can't StatsProvider handle the caching? It already does, right? Something doesn't add up here.
That was my initial idea as well; see #12196 (comment).
    private final Metadata metadata;
    private final Session session;

    private final Map<TableHandle, TableStatistics> cache = new WeakHashMap<>();
Nit: why do you need a WeakHashMap if CachingTableStatsProvider is created per query? Do you expect TableHandle objects to be GCed often while the query is running? It's certainly possible, but I'm not sure it happens often in practice.
During optimization we can create a large number of TableHandle objects (e.g. via subsequent pushdowns), and old TableHandle objects may no longer be reachable from the plan. I don't think a "large number" would be a common situation (there isn't an unlimited number of pushdown opportunities for a given query), but I still don't think we should be using strong references here. See the demonstration below.
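A small, runnable demonstration of why weak keys matter here (the TableHandle is modeled by a plain Object, and GC is best-effort, so the exact output may vary):

```java
import java.util.Map;
import java.util.WeakHashMap;

public class WeakCacheDemo
{
    public static void main(String[] args)
            throws InterruptedException
    {
        Map<Object, String> cache = new WeakHashMap<>();

        Object handle = new Object(); // stands in for a TableHandle
        cache.put(handle, "cached table statistics");
        System.out.println("entries before GC: " + cache.size()); // 1

        // A pushdown replaces the handle in the plan; the old handle
        // becomes unreachable, so only the weak key still refers to it.
        handle = null;
        System.gc();
        Thread.sleep(100); // give the collector a moment (best effort)

        System.out.println("entries after GC: " + cache.size()); // typically 0
    }
}
```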
LGTM. Why not merge the commits together?

The commits were separated per my request in #12196.

This is #12196 with most comments applied.
Already reviewed at #12196.
cc @lxynov @alexjo2144 @sopel39 @losipiuk