Add streaming TopN rank() implementation #6333

erichwang · 2020-12-14T10:55:02Z

No description provided.

findepi · 2020-12-14T12:40:55Z

presto-main/src/main/java/io/prestosql/SystemSessionProperties.java

@@ -103,7 +103,7 @@
    public static final String MAX_RECURSION_DEPTH = "max_recursion_depth";
    public static final String USE_MARK_DISTINCT = "use_mark_distinct";
    public static final String PREFER_PARTIAL_AGGREGATION = "prefer_partial_aggregation";
-    public static final String OPTIMIZE_TOP_N_ROW_NUMBER = "optimize_top_n_row_number";
+    public static final String OPTIMIZE_TOP_N_RANKING = "optimize_top_n_ranking";


We should have equivalent of @LegacyConfig for session toggles.

@findepi, are you saying that we have an equivalent of @LegacyConfig, and that we should use it for this PR? Or that someone should implement an equivalent at some point?

I am not aware that we have one.

Yea, that's what I understood too, but wasn't sure if I missed something. It sounds like a reasonable feature request. Would you think that this blocks any of this PR? If not, we can file an issue for that.

I don't think it blocks here. But would be nice to have. Would you be able to implement this?

I might take a look at this if I get some time, but no guarantees
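A minimal sketch of what such a legacy alias for session toggles could look like, assuming a simple lookup from old property names to their replacements. The class `LegacySessionPropertySketch` and its `resolve` method are hypothetical and are not part of the codebase; only the two property names come from the diff above.

```java
import java.util.Map;

// Hypothetical illustration only: maps a legacy session property name to its current name.
class LegacySessionPropertySketch {
    // Legacy name -> current name. The single entry mirrors the rename in this PR.
    private static final Map<String, String> LEGACY_NAMES =
            Map.of("optimize_top_n_row_number", "optimize_top_n_ranking");

    // Returns the current name for a legacy name, or the input unchanged if it is not legacy.
    static String resolve(String name) {
        return LEGACY_NAMES.getOrDefault(name, name);
    }

    public static void main(String[] args) {
        System.out.println(resolve("optimize_top_n_row_number"));
    }
}
```

With something along these lines, a client that still sets the old name would keep working after the rename.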

sopel39

Some initial comments (up to "Rename TopNRowNumber* => TopNRanking*")

presto-main/src/main/java/io/prestosql/util/LongLong2LongOpenCustomBigHashMap.java

presto-main/src/main/java/io/prestosql/operator/TopNRankingOperator.java

presto-main/src/main/java/io/prestosql/sql/planner/LocalExecutionPlanner.java

presto-main/src/main/java/io/prestosql/sql/planner/optimizations/PlanNodeDecorrelator.java

...main/java/io/prestosql/sql/planner/iterative/rule/PushPredicateThroughProjectIntoWindow.java

presto-main/src/main/java/io/prestosql/sql/planner/optimizations/WindowFilterPushDown.java

presto-main/src/main/java/io/prestosql/util/LongLong2LongOpenCustomBigHashMap.java

presto-main/src/main/java/io/prestosql/operator/GroupedTopNRankAccumulator.java

sopel39

Reviewed "Add GroupedTopNRankAccumulator for streaming rank"

presto-main/src/main/java/io/prestosql/operator/GroupedTopNRankAccumulator.java

presto-main/src/test/java/io/prestosql/operator/TestGroupedTopNRankAccumulator.java

sopel39

Only "Add optimizer capability to produce streaming topN rank() plans" for review

presto-main/src/main/java/io/prestosql/operator/GroupedTopNRankBuilder.java

presto-main/src/main/java/io/prestosql/operator/PageWithPositionEqualsAndHash.java

presto-main/src/main/java/io/prestosql/operator/GroupedTopNRankBuilder.java

presto-main/src/main/java/io/prestosql/operator/SimplePageWithPositionEqualsAndHash.java

presto-main/src/main/java/io/prestosql/operator/GroupedTopNRankBuilder.java

presto-main/src/test/java/io/prestosql/operator/TestGroupedTopNRankBuilder.java

presto-main/src/main/java/io/prestosql/sql/analyzer/FeaturesConfig.java

presto-main/src/main/java/io/prestosql/sql/planner/optimizations/WindowFilterPushDown.java

...main/java/io/prestosql/sql/planner/iterative/rule/PushPredicateThroughProjectIntoWindow.java

presto-main/src/main/java/io/prestosql/sql/planner/optimizations/WindowFilterPushDown.java

sopel39

great job! I've added comments, but overall it looks good!

presto-main/src/main/java/io/prestosql/operator/GroupedTopNRankAccumulator.java

presto-main/src/test/java/io/prestosql/operator/TestGroupedTopNRankBuilder.java

presto-main/src/test/java/io/prestosql/sql/planner/optimizations/TestWindowFilterPushDown.java

sopel39

lgtm % comment about NaN peer groups

presto-main/src/test/java/io/prestosql/operator/TestTopNRankingOperator.java

presto-main/src/main/java/io/prestosql/operator/SimplePageWithPositionEqualsAndHash.java

erichwang · 2020-12-30T00:45:11Z

Note: I dropped in a new commit, 5e4ff9e, to fix an existing bug in TopN row number that was copied into the rank code.

I also added a commit at the end, d8d24ae, to bring back compatibility with window NaN peer groups. This last commit can be dropped if #6472 lands first.

sopel39

Extract "Fix TopNRowNumberOperator incorrectly swapped types" as a separate PR

sopel39 · 2020-12-30T11:31:36Z

presto-main/src/main/java/io/prestosql/operator/GroupedTopNRankAccumulator.java

@@ -577,11 +577,11 @@ private IntegrityStats verifyHeapIntegrity(long groupId, long heapNodeIndex)
        verify(actualPeerGroupCount == peerGroupCount, "Recorded peer group count does not match actual");


Let's remove HACK from commit message. It's just maintaining current semantics. In fact, I would just squash it, but if it would make it easier to revert later on, we could keep it as separate commit

While it is maintaining current semantics, it is completely at odds with the established design invariants of this class's data structure and API, and it NEEDS to be rolled back asap for this class to become reasonable and cohesive. It happens to work today, but only accidentally. All comments and standard programming expectations on relationship between equals and compare are silently violated here in unexpected ways. The other classes don't have this problem because they use a single comparison method and data structure -- here we need two coordinated data structures that don't agree anymore, and hence my apprehension about this. It can be very error prone, which is why I added the integrity checks to the system to enforce these invariants.

> All comments and standard programming expectations on relationship between equals and compare are silently violated here in unexpected ways

The hard relationship is only between equals and hashCode; compare and equals do not have a strict relationship. Consider nulls, for example:

are nulls equal? no (equals(null, null)==false)

are nulls placed in same place in global ordering? yes (compare(null, null)==0)
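The null example above can be sketched in plain Java, assuming SQL-style equality where a null operand never compares equal, and an ordering that groups nulls together. The class and member names are hypothetical and do not correspond to any type in this PR.

```java
import java.util.Comparator;

// Hypothetical illustration of equals and compare disagreeing on nulls.
class NullSemanticsSketch {
    // Equality in the sense used above: any null operand means "not equal".
    static boolean equalsValue(Long a, Long b) {
        return a != null && b != null && a.longValue() == b.longValue();
    }

    // Ordering that places all nulls at the same position (after non-null values).
    static final Comparator<Long> ORDER = Comparator.nullsLast(Comparator.<Long>naturalOrder());

    public static void main(String[] args) {
        System.out.println(equalsValue(null, null));
        System.out.println(ORDER.compare(null, null));
    }
}
```

Two nulls are therefore "not equal" yet occupy the same place in the ordering, which is the mismatch being discussed.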

Sorry, what I mean is that the design of this specific data structure optimization has this requirement to make sense. In plain Java, you are correct, but in this case, the strategies need to be consistent for any input we are providing to this optimization.

Anyways, I can change the commit message, but we need to get out of this state asap, because I don't even trust myself adding more code here until the invariants are re-established.

sopel39 · 2020-12-30T11:32:54Z

presto-main/src/main/java/io/prestosql/operator/TopNRowNumberOperator.java

@@ -33,7 +33,6 @@
 import static com.google.common.base.Preconditions.checkState;


Please create separate PR. This commit is unrelated to rank improvements

@sopel39, this is a prerequisite for this refactor. The rank changes are dependent on these changes to function correctly in the Builder refactor.

sopel39 · 2020-12-30T11:33:46Z

presto-main/src/test/java/io/prestosql/operator/TestTopNRowNumberOperator.java

                .build();

        TopNRowNumberOperatorFactory operatorFactory = new TopNRowNumberOperatorFactory(
                0,
                new PlanNodeId("test"),
-                ImmutableList.of(BIGINT, DOUBLE),
+                ImmutableList.of(VARCHAR, DOUBLE),


Why would that fail with the previous TopNRowNumberOperator code? Was it because BIGINT and DOUBLE comparisons were compatible?

This did not fail before, which was the problem -- it should have. BIGINT and DOUBLE comparisons at the binary level are entirely the same, except when it comes to special Double values like NaN. It was previously using BIGINT to process DOUBLE data, and silently succeeding. VARCHAR makes this much more apparent.
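A minimal sketch in plain Java (not the engine's type operators) of how treating double bits as longs can go unnoticed for ordinary test data. It only checks, for a given pair, whether raw-bit ordering and raw-bit equality agree with `Double.compare` and IEEE `==`; which values diverge in the real operator depends on the comparison it uses, so treat this as an illustration. The class and method names are hypothetical.

```java
// Hypothetical illustration: compare doubles directly and via their raw long bits.
class DoubleAsLongSketch {
    // True when ordering by raw long bits matches Double.compare for this pair.
    static boolean bitOrderAgrees(double a, double b) {
        int asDouble = Integer.signum(Double.compare(a, b));
        int asLong = Integer.signum(Long.compare(
                Double.doubleToRawLongBits(a),
                Double.doubleToRawLongBits(b)));
        return asDouble == asLong;
    }

    // True when IEEE == and bitwise equality give the same answer for this pair.
    static boolean bitEqualityAgrees(double a, double b) {
        boolean ieee = a == b;
        boolean bits = Double.doubleToRawLongBits(a) == Double.doubleToRawLongBits(b);
        return ieee == bits;
    }

    public static void main(String[] args) {
        System.out.println(bitOrderAgrees(1.5, 2.5));
        System.out.println(bitEqualityAgrees(Double.NaN, Double.NaN));
        System.out.println(bitEqualityAgrees(0.0, -0.0));
    }
}
```

Test data made of small positive values passes under either interpretation, which is why a type that cannot be reinterpreted this way, such as VARCHAR, exposes the mix-up.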

sopel39 · 2020-12-30T11:40:23Z

Benchmarks comparison-rank.pdf

q67 (the slowest query, which takes 1/3 of the TPC-DS 1 TB wall time) is now fixed!

erichwang · 2020-12-31T19:28:57Z

Does anyone know about the web-ui-checks failure? Did I forget to update some files?

sopel39 · 2021-01-04T10:56:44Z

> Does anyone know about the web-ui-checks failure? Did I forget to update some files?

I'm pretty sure these are intermittent

TopNRowNumberOperator was previously incorrectly using the output type order for the SimplePageWithPositionComparator strategy, when the channels were all defined in terms of the input types. This issue is not visible in production code because the LocalExecutionPlanner always puts the outputs in the same order as the inputs, but this means that the current set of tests were accidentally correct. The tests have been updated to fail if this occurs.

LongLong2LongOpenCustomBigHashMap originally used the value zero to represent keys that haven't been mapped yet (fastutil calls these null keys in their code). However, this means that the custom HashStrategy will sometimes be asked to check equality on zero-valued keys, even though a zero-valued key may not exist from the strategy's perspective. To help callers disambiguate this situation, we now allow callers to configure the null key to be used at instance creation.
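The idea of a caller-configurable null key can be sketched with a much simpler structure: a fixed-capacity, linear-probing map with a single long key. This is not the LongLong2LongOpenCustomBigHashMap API; the class `SentinelLongMap` and all of its members are hypothetical, and the real class uses two long keys, a custom hash strategy, and big arrays.

```java
import java.util.Arrays;

// Hypothetical illustration: open-addressing long -> long map with a configurable empty marker.
class SentinelLongMap {
    private final long emptyKey; // caller-chosen value that marks unused slots
    private final long[] keys;
    private final long[] values;
    private int size;

    SentinelLongMap(int capacity, long emptyKey) {
        if (Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        this.emptyKey = emptyKey;
        this.keys = new long[capacity];
        this.values = new long[capacity];
        Arrays.fill(keys, emptyKey); // every slot starts out unused
    }

    void put(long key, long value) {
        if (key == emptyKey) {
            throw new IllegalArgumentException("key collides with the empty marker");
        }
        int slot = findSlot(key);
        if (keys[slot] == emptyKey) {
            // Keep one slot free so probing always terminates.
            if (size == keys.length - 1) {
                throw new IllegalStateException("map is full");
            }
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
    }

    long getOrDefault(long key, long defaultValue) {
        if (key == emptyKey) {
            return defaultValue;
        }
        int slot = findSlot(key);
        return keys[slot] == emptyKey ? defaultValue : values[slot];
    }

    // Linear probing: returns the slot holding the key, or the first unused slot.
    private int findSlot(long key) {
        int mask = keys.length - 1;
        int slot = Long.hashCode(key) & mask;
        while (keys[slot] != emptyKey && keys[slot] != key) {
            slot = (slot + 1) & mask;
        }
        return slot;
    }
}
```

For example, `new SentinelLongMap(8, -1L)` lets zero be stored as an ordinary key, since only -1 is reserved as the marker for unused slots.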

Renames:
GroupedTopNBuilder => GroupedTopNRowNumberBuilder
BenchmarkGroupedTopNBuilder => BenchmarkGroupedTopNRowNumberBuilder
TestGroupedTopNBuilder => TestGroupedTopNRowNumberBuilder

Generalizing TopNRowNumber components as a more generic top N ranking system to allow inclusions of rank and dense_rank

Provides the template to quickly enable streaming topn RANK and DENSE_RANK, but does not enable them yet.

WindowFilterPushDown was previously too loose in checking for rank bounds between 0 and N when comparing with a TopN operator. All rank values start at 1.

The default Window implementation uses equalsNullSafe rather than the expected IS NOT DISTINCT FROM semantics to determine peer groups. This means values such as NaN, positive/negative zero, and nested null structure types will be incorrectly treated as separate peer groups. We are putting in this temporary hack to retain compatibility with the current window behavior, but will need to revert this after it gets fixed.
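A minimal sketch of how the choice of peer test changes rank() output, assuming values are already sorted. The strict test here is plain IEEE `==` and only stands in for an equality that separates NaN values; it is not the engine's equalsNullSafe, and it does not reproduce the signed-zero or nested-null cases. The "not distinct" test is an assumed approximation of IS NOT DISTINCT FROM for doubles. All names are hypothetical.

```java
// Hypothetical illustration: rank() over sorted doubles with a pluggable peer test.
class RankPeersSketch {
    interface PeerTest {
        boolean samePeerGroup(double previous, double current);
    }

    // Stand-in for an equality under which NaN never equals NaN.
    static final PeerTest STRICT = (a, b) -> a == b;

    // Assumed IS NOT DISTINCT FROM behavior for doubles: NaN is a peer of NaN.
    static final PeerTest NOT_DISTINCT = (a, b) -> a == b || (Double.isNaN(a) && Double.isNaN(b));

    // Rows in the same peer group share a rank; a new group gets its 1-based position.
    static long[] rank(double[] sorted, PeerTest peers) {
        long[] ranks = new long[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            if (i > 0 && peers.samePeerGroup(sorted[i - 1], sorted[i])) {
                ranks[i] = ranks[i - 1];
            }
            else {
                ranks[i] = i + 1;
            }
        }
        return ranks;
    }
}
```

For the sorted input `{1.0, NaN, NaN}`, the strict test yields ranks 1, 2, 3, while the not-distinct test yields 1, 2, 2, so a top-N filter on rank can return different rows depending on which test is used.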

sopel39 · 2021-01-05T12:37:29Z

Flaky due to: https://github.com/trinodb/trino/pull/6333/checks?check_run_id=1646241915

sopel39 · 2021-01-05T12:39:58Z

merged, thanks!

cla-bot bot added the cla-signed label Dec 14, 2020

erichwang requested review from sopel39 and dain December 14, 2020 10:55

findepi reviewed Dec 14, 2020

sopel39 reviewed Dec 16, 2020

sopel39 reviewed Dec 16, 2020

sopel39 reviewed Dec 18, 2020

sopel39 reviewed Dec 18, 2020

sopel39 reviewed Dec 21, 2020

sopel39 reviewed Dec 23, 2020

sopel39 reviewed Dec 23, 2020

erichwang force-pushed the topnrank branch 4 times, most recently from 88c87fa to fca1c1a, December 29, 2020 01:38

sopel39 reviewed Dec 29, 2020

erichwang force-pushed the topnrank branch 2 times, most recently from 08f40f1 to d8d24ae, December 30, 2020 00:16

sopel39 approved these changes Dec 30, 2020

erichwang force-pushed the topnrank branch 2 times, most recently from 3f35fa6 to 50d4676, December 31, 2020 02:10

erichwang force-pushed the topnrank branch from 50d4676 to 2a40046, January 4, 2021 20:55

Remove extra space from TopNRowNumberOperator

182fac1

erichwang added 14 commits January 4, 2021 13:01

Fix typo in GroupedTopNBuilder

015ce17

Extract GroupedTopNBuilder as an interface

960b363

Renames:
GroupedTopNBuilder => GroupedTopNRowNumberBuilder
BenchmarkGroupedTopNBuilder => BenchmarkGroupedTopNRowNumberBuilder
TestGroupedTopNBuilder => TestGroupedTopNRowNumberBuilder

Rename TopNRowNumber* => TopNRanking*

309c208

Generalizing TopNRowNumber components as a more generic top N ranking system to allow inclusions of rank and dense_rank

Add planner scaffolding to enable streaming topn RANK and DENSE_RANK

1a626f0

Provides the template to quickly enable streaming topn RANK and DENSE_RANK, but does not enable them yet.

Add GroupedTopNRankAccumulator for streaming rank

b73dd12

Copy GroupedTopNRowNumberBuilder as new rank builder template

5643850

Properly implement GroupedTopNRankBuilder within stub

7de1cf1

Replace optimizer.optimize-top-n-row-number config with general version

6318b53

Add optimizer capability to produce streaming topN rank() plans

c98eb08

Tighten WindowFilterPushDown predicate domain check

ceabdd9

WindowFilterPushDown was previously too loose in checking for rank bounds between 0 and N when comparing with a TopN operator. All rank values start at 1.

Clean up for GroupedTopNRowNumberAccumulator

7a8c8af

erichwang force-pushed the topnrank branch from 2a40046 to fee1af8, January 4, 2021 21:01

sopel39 merged commit ade8af1 into trinodb:master Jan 5, 2021

sopel39 mentioned this pull request Jan 5, 2021

Release notes for 352 #6502

Closed

10 tasks

wendigo mentioned this pull request Jan 5, 2021

Fix master failure with out-of-date webui build #6517

Merged

erichwang deleted the topnrank branch January 7, 2021 22:12

erichwang mentioned this pull request Jan 12, 2021

Optimize rank function execution #1073

Closed

martint added this to the 352 milestone Jan 28, 2021

sopel39 mentioned this pull request Feb 5, 2021

Choose whether to use WindowOperator or TopNRowNumberOperator based on stats #5319

Closed

aaneja mentioned this pull request Aug 20, 2024

Add support for streaming TopN rank prestodb/presto#23477

Open

aaneja mentioned this pull request Dec 18, 2024

Optimizer improvements for TPCDS prestodb/presto#24276

Open
