-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide estimates for partial and final topN in TopNStatsRule #15428
Conversation
293d496
to
f0eda09
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -47,8 +47,13 @@ protected Optional<PlanNodeStatsEstimate> doCalculate(TopNNode node, StatsProvid | |||
PlanNodeStatsEstimate sourceStats = statsProvider.getStats(node.getSource()); | |||
double rowCount = sourceStats.getOutputRowCount(); | |||
|
|||
if (node.getStep() != TopNNode.Step.SINGLE) { | |||
return Optional.empty(); | |||
/* CreatePartialTopN rule runs after ReorderJoins but before DetermineJoinDistributionType and DetermineSemiJoinDistributionType. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead, can we reorder them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my understanding, providing an estimate is a more flexible solution than moving rules around so that unestimated things are added to the plan after CBO rules.
We gain estimates in the output of EXPLAIN query and avoid adding another ordering requirement in PlanOptimizers by providing an estimate here.
* We use the conservative estimate of sourceStats for partial topN so that final topN can be accurately estimated. | ||
*/ | ||
if (node.getStep() == TopNNode.Step.PARTIAL) { | ||
return Optional.of(sourceStats); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could assume some minimal reasonable split size.
For example, "TOP 5" is going to reduce data heavily (except for a pretty uncommon situation where splits return up to 5 rows), and "TOP 1000000" is not going to reduce data much (splits are often less than million rows).
BTW, the comment here seems to assume partial TopN is applied directly in the source stage, which may or may not be the same.
Does it matter? Can we improve the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't think partial TopN
will always be in source stage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the comment here to clarify that we're talking about an example scenario, not implying that partialTopN can only be in source stage.
For HASH stages maybe it could be estimated as (input / (hash partitions count * task concurrency)) * N
, but I don't think we know the stage partitioning at this point.
I've updated the partial topN estimate calculation to assume 1 million rows input.
could you rebase please? |
3aa3c01
to
7ff3142
Compare
7ff3142
to
de46ef9
Compare
@@ -47,8 +48,16 @@ protected Optional<PlanNodeStatsEstimate> doCalculate(TopNNode node, StatsProvid | |||
PlanNodeStatsEstimate sourceStats = statsProvider.getStats(node.getSource()); | |||
double rowCount = sourceStats.getOutputRowCount(); | |||
|
|||
if (node.getStep() != TopNNode.Step.SINGLE) { | |||
return Optional.empty(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think this comment will be obsolete pretty quickly.
@@ -28,6 +28,7 @@ | |||
public class TopNStatsRule |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this PR fix any existing cost-based use-cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It solves the problem reported in https://trinodb.slack.com/archives/C0305TQ05KL/p1671020850646689?thread_ts=1669977747.493179&cid=C0305TQ05KL
Description
Provide estimates for partial and final topN in TopNStatsRule
Additional context and related issues
https://trinodb.slack.com/archives/C0305TQ05KL/p1671020850646689?thread_ts=1669977747.493179&cid=C0305TQ05KL
Release notes
( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text: