Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide estimates for partial and final topN in TopNStatsRule #15428

Merged
merged 1 commit into from
Jan 10, 2023

Conversation

raunaqmorarka
Copy link
Member

@raunaqmorarka raunaqmorarka commented Dec 16, 2022

Description

Provide estimates for partial and final topN in TopNStatsRule

Additional context and related issues

https://trinodb.slack.com/archives/C0305TQ05KL/p1671020850646689?thread_ts=1669977747.493179&cid=C0305TQ05KL

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# General
* Improve planner estimates for queries containing outer joins over a topN operation. ({issue}`15428`)

Copy link
Member

@losipiuk losipiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -47,8 +47,13 @@ protected Optional<PlanNodeStatsEstimate> doCalculate(TopNNode node, StatsProvid
PlanNodeStatsEstimate sourceStats = statsProvider.getStats(node.getSource());
double rowCount = sourceStats.getOutputRowCount();

if (node.getStep() != TopNNode.Step.SINGLE) {
return Optional.empty();
/* CreatePartialTopN rule runs after ReorderJoins but before DetermineJoinDistributionType and DetermineSemiJoinDistributionType.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead, can we reorder them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my understanding, providing an estimate is a more flexible solution than moving rules around so that unestimated things are added to the plan after CBO rules.
We gain estimates in the output of EXPLAIN query and avoid adding another ordering requirement in PlanOptimizers by providing an estimate here.

Comment on lines 53 to 56
* We use the conservative estimate of sourceStats for partial topN so that final topN can be accurately estimated.
*/
if (node.getStep() == TopNNode.Step.PARTIAL) {
return Optional.of(sourceStats);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could assume some minimal reasonable split size.
For example, "TOP 5" is going to reduce data heavily (except for a pretty uncommon situation where splits return up to 5 rows), and "TOP 1000000" is not going to reduce data much (splits are often less than million rows).

BTW, the comment here seems to assume partial TopN is applied directly in the source stage, which may or may not be the same.
Does it matter? Can we improve the comment?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think partial TopN will always be in source stage

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the comment here to clarify that we're talking about an example scenario, not implying that partialTopN can only be in source stage.
For HASH stages maybe it could be estimated as (input / (hash partitions count * task concurrency)) * N, but I don't think we know the stage partitioning at this point.
I've updated the partial topN estimate calculation to assume 1 million rows input.

@sopel39
Copy link
Member

sopel39 commented Dec 22, 2022

could you rebase please?

@raunaqmorarka raunaqmorarka force-pushed the topn-stats branch 2 times, most recently from 3aa3c01 to 7ff3142 Compare January 7, 2023 16:18
@@ -47,8 +48,16 @@ protected Optional<PlanNodeStatsEstimate> doCalculate(TopNNode node, StatsProvid
PlanNodeStatsEstimate sourceStats = statsProvider.getStats(node.getSource());
double rowCount = sourceStats.getOutputRowCount();

if (node.getStep() != TopNNode.Step.SINGLE) {
return Optional.empty();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think this comment will be obsolete pretty quickly.

@@ -28,6 +28,7 @@
public class TopNStatsRule
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this PR fix any existing cost-based use-cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raunaqmorarka raunaqmorarka merged commit c065de4 into trinodb:master Jan 10, 2023
@raunaqmorarka raunaqmorarka deleted the topn-stats branch January 10, 2023 14:56
@github-actions github-actions bot added this to the 406 milestone Jan 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants