
Implement segmented aggregation execution #17886

Merged: 1 commit, Jun 26, 2022
Conversation

@zacw7 (Member) commented Jun 15, 2022

Split from #17618

When running in segmented aggregation mode, once a segment is finished we need to close the aggregation builder, destroy the hash table, then reopen the aggregation builder and recreate the hash table. If the segments are very small, this process has to be repeated too many times, resulting in significant overhead.

To address the issue, we adjusted the design: for each page, we process all the data before the last segment in the page together (we don't know whether the last segment has more data in the next page, but we do know the segments before it are done), flush, and then process the last segment. In the next page we repeat the process: find where the last segment starts, process all the data before that point and flush, then process the last segment.

For example, say we have 3 pages:

page1 [1, 1, 1, 2, 2]
page2 [2, 3, 4, 5, 5]
page3 [6, 6, 7, 7, 7]

The segments will be [1, 1, 1], [2, 2, 2, 3, 4], [5, 5], [6, 6], [7, 7, 7].
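
A minimal sketch of this per-page flow (names follow the code snippets quoted later in this thread; wholePageInUnfinishedSegment and the simplified findLastSegmentStart signature are placeholders, not the PR's exact code):

    private void addInputForSegmentedAggregation(Page page)
    {
        int lastRowInPage = page.getPositionCount() - 1;
        if (wholePageInUnfinishedSegment(page)) {
            // e.g. page2 starting with [2, ...] after page1 ended in segment 2: keep aggregating.
            unfinishedWork = aggregationBuilder.processPage(page);
        }
        else {
            // Every segment before the last one is complete: aggregate that prefix and flush it,
            // then hold the last segment back until the next page shows whether it continues.
            int lastSegmentStart = findLastSegmentStart(page);
            unfinishedWork = aggregationBuilder.processPage(page.getRegion(0, lastSegmentStart));
            remainingPageForSegmentedAggregation = page.getRegion(lastSegmentStart, lastRowInPage - lastSegmentStart + 1);
        }
        // Remember the key of the last segment so the next page can be compared against it.
        firstUnfinishedSegment = page.getRegion(lastRowInPage, 1);
    }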

Benchmark:

Benchmark                                                (operatorType)  (rowsPerSegment)  Mode  Cnt   Score   Error  Units
BenchmarkHashAndSegmentedAggregationOperators.benchmark       segmented                 1  avgt   30  47.965 ± 3.014  ms/op
BenchmarkHashAndSegmentedAggregationOperators.benchmark       segmented                10  avgt   30  40.314 ± 2.642  ms/op
BenchmarkHashAndSegmentedAggregationOperators.benchmark       segmented               800  avgt   30  37.676 ± 0.924  ms/op
BenchmarkHashAndSegmentedAggregationOperators.benchmark       segmented            100000  avgt   30   4.399 ± 0.093  ms/op
BenchmarkHashAndSegmentedAggregationOperators.benchmark            hash                 1  avgt   30  14.379 ± 0.647  ms/op
BenchmarkHashAndSegmentedAggregationOperators.benchmark            hash                10  avgt   30  16.570 ± 1.233  ms/op
BenchmarkHashAndSegmentedAggregationOperators.benchmark            hash               800  avgt   30  16.395 ± 0.756  ms/op
BenchmarkHashAndSegmentedAggregationOperators.benchmark            hash            100000  avgt   30   5.453 ± 0.183  ms/op
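
For reference, the numbers above come from a JMH benchmark. A sketch of how such a benchmark is typically launched with the standard JMH runner (a hypothetical standalone runner class; Presto's benchmark classes usually embed an equivalent main method):

    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    public class BenchmarkRunner
    {
        public static void main(String[] args) throws RunnerException
        {
            // Run every @Benchmark method in the class in average-time mode, reporting ms/op.
            Options options = new OptionsBuilder()
                    .include(".*BenchmarkHashAndSegmentedAggregationOperators.*")
                    .mode(Mode.AverageTime)
                    .timeUnit(TimeUnit.MILLISECONDS)
                    .build();
            new Runner(options).run();
        }
    }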

Manual test (Input: 799,180,100,298 rows / 3.36 TB):

Configuration                  QueryID                       Splits  Latency  CPU          Memory     Per wall sec
Baseline                       20220615_223327_00006_cnhk9   94,921  10.90 s  18.20 hours  153.73 GB  310.40 GB
File Splittable Disabled       20220615_223515_00009_cnhk9    6,857  32.07 s  15.63 hours  126.55 GB  106.63 GB
Segmented Aggregation Enabled  20220615_223659_00013_cnhk9    2,057  50.83 s  24.44 hours   28.68 GB   67.47 GB

A latency increase was observed during testing, which is expected. To enable segmented aggregation, file splitting must be disabled to preserve the data order. As a result, far fewer splits are generated, which drastically decreases table scan concurrency, especially when there are many big files to scan.

== RELEASE NOTES ==

General Changes
* Add the ability to flush the aggregated data when at least one segment from the input has been exhausted. This can help reduce the memory footprint and improve the performance of aggregation when the data is already ordered by a subset of the group-by keys.
This can be enabled with the ``segmented_aggregation_enabled`` session property or the ``optimizer.segmented-aggregation-enabled`` configuration property.

Hive Changes
* Add support for segmented aggregation to reduce the memory footprint and improve query performance when the order-by keys are a subset of the group-by keys. This can be enabled with the ``order_based_execution_enabled`` session property or the ``hive.order-based-execution-enabled`` configuration property.

@zacw7 zacw7 force-pushed the seg-agg branch 2 times, most recently from e810b55 to f805c44 on June 15, 2022 23:06
@zacw7 zacw7 marked this pull request as ready for review June 15, 2022 23:31
@zacw7 zacw7 requested a review from a team as a code owner June 15, 2022 23:31
@zacw7 zacw7 force-pushed the seg-agg branch 3 times, most recently from b37f125 to 0062065 on June 16, 2022 02:11
@zacw7 zacw7 force-pushed the seg-agg branch 2 times, most recently from b0dcf74 to 96c95ce on June 16, 2022 15:58
@zacw7 zacw7 requested a review from kewang1024 June 16, 2022 15:59
// If the current segment ends in the current page, flush it with all the segments (if exist) except the last segment of the current page.
int lastSegmentStart = findLastSegmentStart(preGroupedHashStrategy.get(), page.extractChannels(preGroupedChannels));
unfinishedWork = aggregationBuilder.processPage(page.getRegion(0, lastSegmentStart));
remainingPage = page.getRegion(0, lastRowInPage - lastSegmentStart + 1);
Collaborator:
Shouldn't the remainingPage be page.getRegion(lastSegmentStart, lastRowInPage - lastSegmentStart + 1)?

Collaborator:
I think this indicates that our test coverage is not enough, since it didn't catch this issue; we need to enhance our test coverage.

Member Author:
Added all the tests I can think of so far. Correctness should now be well covered.

@zacw7 zacw7 force-pushed the seg-agg branch 6 times, most recently from 6701a7c to dd31cf4 on June 17, 2022 04:33
@kewang1024 (Collaborator):
I just tested again and found out why our unit test didn't catch the bug above: we only tested with one page. We need to add tests to cover multiple pages; let's test all the scenarios we discussed above.

Off the top of my head (but not limited to); a sketch of building such multi-page input follows below:
Test1: page 1 [1, 1, 1......1, 1], page 2 [2, 2, 2...2, 2], page 3 [2, 2, 3...3, 3]
Test2: page 1 [1, 1, 1......2, 2], page 2 [2, 2, 2...2, 2], page 3 [2, 3, 3...5, 5]
Test3: page 1 [1, 1, 1......1, 1], page 2 [1, 1, 1...1, 1], page 3 [1, 1, 1...1, 1]
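
A hedged sketch of how the first case might be assembled inside a test method (RowPagesBuilder is the page-building helper used by Presto's operator tests; operator wiring and assertions omitted):

    // Test1 (shortened): segment 2 spans the page 2 / page 3 boundary, and segment 3 starts mid-page.
    List<Page> input = RowPagesBuilder.rowPagesBuilder(BIGINT)
            .row(1L).row(1L).row(1L).pageBreak()   // page 1: all segment 1
            .row(2L).row(2L).row(2L).pageBreak()   // page 2: all segment 2
            .row(2L).row(2L).row(3L)               // page 3: segment 2 continues, then segment 3
            .build();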

@kewang1024 kewang1024 self-requested a review June 17, 2022 06:35
@kewang1024 (Collaborator) left a comment:
Another point: let's separate out an independent test class TestSegmentedHashAggregationOperator, given that our test cases are growing and I think we can expect more and more segmented-aggregation-specific tests to be added in the future. But it's up to you whether you want to do it in this PR.

// Record the last segment.
firstUnfinishedSegment = page.getRegion(lastRowInPage, 1);
}
else {
Contributor:
Handle the smaller branch first and return to avoid indents for the larger branch. Reducing indents can improve the readability of the code.
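
A generic illustration of the early-return pattern being suggested (simplified; not the PR's final code):

    public void addInput(Page page)
    {
        // Short branch first: plain hash aggregation, nothing segment-related to track.
        if (!preGroupedHashStrategy.isPresent()) {
            unfinishedWork = aggregationBuilder.processPage(page);
            return;
        }
        // The longer segmented-aggregation path continues here at a single indent level.
    }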

// If the current segment might have more data in the incoming pages, process the whole page.
unfinishedWork = aggregationBuilder.processPage(page);
}
else if (preGroupedHashStrategy.get().rowEqualsRow(0, firstUnfinishedSegment.extractChannels(preGroupedChannels), 0, page.extractChannels(preGroupedChannels))) {
Contributor:
If this is the first page, line 504 is executed and there is no need to make this comparison. Please try to avoid it in that case.

firstUnfinishedSegment = page.getRegion(0, 1);
}

if (preGroupedHashStrategy.get().rowEqualsRow(0, firstUnfinishedSegment.extractChannels(preGroupedChannels), lastRowInPage, page.extractChannels(preGroupedChannels))) {
Contributor:
There is an assumption that preGroupedChannels must be a subset of groupByChannels. It might be better to add a check for that.

Member Author:
I think this is guaranteed during planning. The operator should be able to trust the generated plan, right?

firstUnfinishedSegment = page.getRegion(0, 1);
}

if (preGroupedHashStrategy.get().rowEqualsRow(0, firstUnfinishedSegment.extractChannels(preGroupedChannels), lastRowInPage, page.extractChannels(preGroupedChannels))) {
Contributor:
extractChannels is called many times in this function. Let's call it once and reuse the result.
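
A sketch of that suggestion (the snippets later in this thread show the PR adopting a local variable named pageOnPreGroupedChannels for this):

    // Extract the pre-grouped channels once per page and reuse the result wherever the
    // page's pre-grouped key needs to be compared.
    Page pageOnPreGroupedChannels = page.extractChannels(preGroupedChannels);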

unfinishedWork = aggregationBuilder.processPage(page);
}
else if (preGroupedHashStrategy.get().rowEqualsRow(0, firstUnfinishedSegment.extractChannels(preGroupedChannels), 0, page.extractChannels(preGroupedChannels))) {
// If the current page starts with a new segment, flush before processing it.
Contributor:
IIUC, we are here if the first row in the page is the same as the last unfinished segment. In that case, the current page doesn't start with a new segment, right? Did I miss anything?

remainingPageForSegmentedAggregation = page.getRegion(lastSegmentStart, lastRowInPage - lastSegmentStart + 1);
}
// Record the last segment.
firstUnfinishedSegment = page.getRegion(lastRowInPage, 1);
Contributor:
It seems that lastRowInPage is never updated, so this firstUnfinishedSegment is always the last row in the page? What does that mean? Could you add member-field comments for the new fields introduced in this PR?

@yuanzhanhku (Contributor):
High level design looks good. Just left some comments on the implementation details.

@zacw7 (Member Author) commented Jun 18, 2022

High level design looks good. Just left some comments on the implementation details.

There are some design details Ke and I would like to clarify:
When running in segmented aggregation mode, once a segment is finished we need to close the aggregation builder, destroy the hash table, then reopen the aggregation builder and recreate the hash table. If the segments are very small, this process has to be repeated too many times, resulting in significant overhead.
To address the issue, we adjusted the design: for each page, we process all the data before the last segment in the page together (we don't know whether the last segment has more data in the next page, but we do know the segments before it are done), flush, and then process the last segment. In the next page we repeat the process: find where the last segment starts, process all the data before that point and flush, then process the last segment.

For example, say we have 3 pages:

page1 [1, 1, 1, 2, 2]
page2 [2, 3, 4, 5, 5]
page3 [6, 6, 7, 7, 7]

The segments will be [1, 1, 1], [2, 2, 2, 3, 4], [5, 5], [6, 6], [7, 7, 7].

@kewang1024 (Collaborator) left a comment:
A few NIT, otherwise LGTM

@kewang1024 (Collaborator) commented Jun 20, 2022

Let's open an issue as a follow-up: introduce a threshold config / session property to control the flush timing of segmented aggregation. With that, we can tune the config to balance memory and latency.

@zacw7 (Member Author) commented Jun 20, 2022

Let's open an issue as a follow-up: introduce a threshold config / session property to control the flush timing of segmented aggregation. With that, we can tune the config to balance memory and latency.

#17908

@zacw7 zacw7 requested a review from highker June 20, 2022 18:41
@highker highker requested a review from pgupta2 June 21, 2022 04:46
Comment on lines 62 to 65
private final List<Integer> groupByChannels;
private final List<Integer> preGroupedChannels;
Contributor:
Aren't these two serving the same purpose except that one is sorted and the other is not? Can we merge these two into one and use something else (like a flag) to indicate the difference?

Member Author:
The channels of preGroupedChannels are already sorted, while not all channels in groupByChannels are. preGroupedChannels is a subset of groupByChannels, so they don't contain exactly the same elements. I think it makes sense to keep them separate.

Contributor:
If that is the case, I highly recommend adding a comment to explain. Also, can we check that preGroupedChannels is fully contained in groupByChannels?
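
A minimal sketch of the suggested check, assuming it lands in the operator constructor (checkArgument is Guava's Preconditions, which Presto already uses):

    // Fail fast at construction time if the plan ever hands us a pre-grouped channel
    // that is not one of the group-by channels.
    checkArgument(groupByChannels.containsAll(preGroupedChannels),
            "preGroupedChannels must be a subset of groupByChannels");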

Comment on lines 345 to 348
if (preGroupedChannels.isEmpty()) {
preGroupedHashStrategy = Optional.empty();
}
else {
Contributor:
There are many if/else branches in the logic. Check my other comment below.

Comment on lines +500 to +497
if (!preGroupedHashStrategy.isPresent()) {
unfinishedWork = aggregationBuilder.processPage(page);
return;
}

// 2. segmented aggregation
if (firstUnfinishedSegment == null) {
Contributor:
Same here. Whether to use pre-grouped channel or not is actually determined at the beginning of the execution or planning phase. There is no need to check if/else for every page input. Instead, it would be good to abstract the design a bit. For example, can we have a base abstract hash operator with two implementations: Hash and Segmented? Then we can leave the branches within the different implementations.
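
A minimal sketch of the suggested shape (hypothetical class names; the shared lifecycle details are elided):

    abstract class AbstractHashAggregationOperator
    {
        // Shared builder lifecycle, memory accounting, and output logic would live here.
        protected abstract void addInput(Page page);
    }

    class PlainHashAggregationOperator extends AbstractHashAggregationOperator
    {
        @Override
        protected void addInput(Page page)
        {
            // Plain hash aggregation: feed the whole page to the aggregation builder.
        }
    }

    class SegmentedHashAggregationOperator extends AbstractHashAggregationOperator
    {
        @Override
        protected void addInput(Page page)
        {
            // Find the last segment start, aggregate and flush the prefix, hold the suffix.
        }
    }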

Member Author:
Whether to use pre-grouped channel or not is actually determined at the beginning of the execution or planning phase. There is no need to check if/else for every page input.

Great point.

can we have a base abstract hash operator with two implementations: Hash and Segmented.

I thought about this idea as well before implementing the PR. I'm not sure it's worth restructuring the whole operator, given that segmented aggregation is essentially still hash aggregation, just with a few tricks; the difference is not that big. WDYT? @kewang1024

// The whole page is in one segment.
if (preGroupedHashStrategy.get().rowEqualsRow(0, firstUnfinishedSegment.extractChannels(preGroupedChannels), 0, pageOnPreGroupedChannels)) {
// All rows in this page belong to the previous unfinished segment, process the whole page.
unfinishedWork = aggregationBuilder.processPage(page);
Contributor:
I assume we don't have to care about the pages for a segment once it's done, right? Do we need to close the aggregationBuilder to clear the memory after having processed a segment? It would be good to reflect the memory usage in your benchmark as well.

Member Author:
Do we need to close the aggregationBuilder to clear the memory after having processed a segment

Yes, that's how it is implemented currently. Once at least one segment has been fully processed, the aggregationBuilder will be closed and rebuilt if there are more segments to process.

to reflect the memory usage in your benchmark as well.

Regarding the memory usage, I thought the memory comparison was covered by the manual test attached in the PR description. Could you please elaborate a bit on how to benchmark memory usage? Any pointers I can refer to? Thanks!

@kewang1024 (Collaborator) left a comment:
  1. Segmented hash-based aggregation itself doesn't have much logic of its own and heavily shares the hash-based aggregation logic, so just separating it out would still look a bit hacky.
  2. The current hash aggregation operator is intertwined with a lot of other features (spilling, partial aggregation mode, etc.) in a hacky fashion. To do the refactor properly, we need a better design that refactors HashAggregationOperatorFactory as a whole, considering all the features currently inside this class, which could be a non-trivial amount of work.

What I think the next steps could be:

  1. Fix the current code to be more concise and reduce nested if statements.
  2. Systematically refactor HashAggregationOperatorFactory as a follow-up.

Comment on lines 473 to 478
if (remainingPageForSegmentedAggregation != null) {
// Running in segmented aggregation mode, reopen the aggregation builder and process the remaining page.
initializeAggregationBuilderIfNeeded();
unfinishedWork = aggregationBuilder.processPage(remainingPageForSegmentedAggregation);
remainingPageForSegmentedAggregation = null;
}
Collaborator:
Extract it to a private function to:

  1. reduce the nested if statements
  2. allow reusing this function when we introduce the threshold
    private void processRemainingPageForSegmentedAggregation()
    {
        // Running in segmented aggregation mode, reopen the aggregation builder and process the remaining page.
        if (remainingPageForSegmentedAggregation != null) {
            initializeAggregationBuilderIfNeeded();
            unfinishedWork = aggregationBuilder.processPage(remainingPageForSegmentedAggregation);
            remainingPageForSegmentedAggregation = null;
        }
    }

Comment on lines 345 to 348
if (preGroupedChannels.isEmpty()) {
preGroupedHashStrategy = Optional.empty();
}
else {
Collaborator:
NIT: make it more concise

preGroupedHashStrategy = preGroupedChannels.isEmpty() ? Optional.empty() : Optional.of(joinCompiler.compilePagesHashStrategyFactory(
                preGroupedChannels.stream().map(channel -> groupByTypes.get(channel)).collect(toImmutableList()),
                preGroupedChannels,
                Optional.empty()).createPagesHashStrategy(groupByTypes.stream().map(type -> ImmutableList.<Block>of()).collect(toImmutableList()), OptionalInt.empty()));

@kewang1024 kewang1024 requested review from highker and kewang1024 June 24, 2022 16:53
@zacw7 zacw7 force-pushed the seg-agg branch 2 times, most recently from 186bcfd to 315023f on June 24, 2022 18:07
@zacw7 (Member Author) commented Jun 24, 2022

Changing it back to linear search from binary search, as no improvement was observed and the linear version is more readable and clean. cc: @kewang1024

@kewang1024 (Collaborator):
Changing it back to linear search from binary search, as no improvement was observed and the linear version is more readable and clean. cc: @kewang1024

Attach the issue where we introduce a config to choose between linear and binary search; it could be a good bootcamp task.
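
For context, a hedged sketch of the linear version under discussion (simplified signature; the PR's version operates on the page's extracted pre-grouped channels):

    // Walk backwards from the last row until the pre-grouped key changes; the returned
    // position is the first row of the page's last segment.
    private static int findLastSegmentStart(PagesHashStrategy strategy, Page page)
    {
        int position = page.getPositionCount() - 1;
        while (position > 0 && strategy.rowEqualsRow(position - 1, page, position, page)) {
            position--;
        }
        return position;
    }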

Comment on lines 62 to 65
private final List<Integer> groupByChannels;
private final List<Integer> preGroupedChannels;
Contributor:
If that is the case, I highly recommend adding a comment to explain. Also, can we check that preGroupedChannels is fully contained in groupByChannels?

this.preGroupedHashStrategy = preGroupedChannels.isEmpty()
? Optional.empty()
: Optional.of(joinCompiler.compilePagesHashStrategyFactory(
preGroupedChannels.stream().map(channel -> groupByTypes.get(channel)).collect(toImmutableList()), preGroupedChannels, Optional.empty())
Contributor:
nit: groupByTypes::get

Member Author:
Checking whether preGroupedChannels is fully contained in groupByChannels might not be necessary; it has been checked thoroughly during the planning phase.

Contributor:
It would be really good to check the containment in the constructor. Planner and execution are usually separated (just like the Presto frontend and the Velox backend), and that modularization would break the streamlined assumption. Also, it would make the logic much clearer when we read the code.

@highker highker self-assigned this Jun 24, 2022
@highker highker merged commit 3ee8cb3 into prestodb:master Jun 26, 2022
@zacw7 zacw7 deleted the seg-agg branch June 27, 2022 17:17
@highker highker mentioned this pull request Jul 6, 2022