
Implement segmented aggregation execution #17886

Merged: 1 commit, Jun 26, 2022
Conversation

@zacw7 (Member) commented Jun 15, 2022

Split from #17618

When running in segmented aggregation mode, once a segment is finished we need to close the aggregation builder, destroy the hash table, then reopen the aggregation builder and recreate the hash table. If the segments are very small, this process has to be repeated too many times, resulting in significant overhead.

To address the issue, we adjusted the design: for each page, we process all the data before the last segment in the page together (we don't know whether the last segment has more data in the next page, but we do know the segments before it are done), flush, and then process the last segment. In the next page we repeat the process: find where the last segment starts, process all the data before that point and flush, then process the last segment.

For example, say we have 3 pages:

page1 [1, 1, 1, 2, 2]
page2 [2, 3, 4, 5, 5]
page3 [6, 6, 7, 7, 7]

The segments will be [1, 1, 1], [2, 2, 2, 3, 4], [5, 5], [6, 6], [7, 7, 7].
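
A minimal sketch of this per-page flow (names follow the code snippets quoted later in this thread; wholePageInUnfinishedSegment and the simplified findLastSegmentStart signature are placeholders, not the PR's exact code):

    private void addInputForSegmentedAggregation(Page page)
    {
        int lastRowInPage = page.getPositionCount() - 1;
        if (wholePageInUnfinishedSegment(page)) {
            // e.g. page2 starting with [2, ...] after page1 ended in segment 2: keep aggregating.
            unfinishedWork = aggregationBuilder.processPage(page);
        }
        else {
            // Every segment before the last one is complete: aggregate that prefix and flush it,
            // then hold the last segment back until the next page shows whether it continues.
            int lastSegmentStart = findLastSegmentStart(page);
            unfinishedWork = aggregationBuilder.processPage(page.getRegion(0, lastSegmentStart));
            remainingPageForSegmentedAggregation = page.getRegion(lastSegmentStart, lastRowInPage - lastSegmentStart + 1);
        }
        // Remember the key of the last segment so the next page can be compared against it.
        firstUnfinishedSegment = page.getRegion(lastRowInPage, 1);
    }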

Benchmark:

Benchmark                                                (operatorType)  (rowsPerSegment)  Mode  Cnt   Score   Error  Units
BenchmarkHashAndSegmentedAggregationOperators.benchmark       segmented                 1  avgt   30  47.965 ± 3.014  ms/op
BenchmarkHashAndSegmentedAggregationOperators.benchmark       segmented                10  avgt   30  40.314 ± 2.642  ms/op
BenchmarkHashAndSegmentedAggregationOperators.benchmark       segmented               800  avgt   30  37.676 ± 0.924  ms/op
BenchmarkHashAndSegmentedAggregationOperators.benchmark       segmented            100000  avgt   30   4.399 ± 0.093  ms/op
BenchmarkHashAndSegmentedAggregationOperators.benchmark            hash                 1  avgt   30  14.379 ± 0.647  ms/op
BenchmarkHashAndSegmentedAggregationOperators.benchmark            hash                10  avgt   30  16.570 ± 1.233  ms/op
BenchmarkHashAndSegmentedAggregationOperators.benchmark            hash               800  avgt   30  16.395 ± 0.756  ms/op
BenchmarkHashAndSegmentedAggregationOperators.benchmark            hash            100000  avgt   30   5.453 ± 0.183  ms/op
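
For reference, the numbers above come from a JMH benchmark. A sketch of how such a benchmark is typically launched with the standard JMH runner (a hypothetical standalone runner class; Presto's benchmark classes usually embed an equivalent main method):

    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    public class BenchmarkRunner
    {
        public static void main(String[] args) throws RunnerException
        {
            // Run every @Benchmark method in the class in average-time mode, reporting ms/op.
            Options options = new OptionsBuilder()
                    .include(".*BenchmarkHashAndSegmentedAggregationOperators.*")
                    .mode(Mode.AverageTime)
                    .timeUnit(TimeUnit.MILLISECONDS)
                    .build();
            new Runner(options).run();
        }
    }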

Manual test (Input: 799,180,100,298 rows / 3.36 TB):

Configuration                  QueryID                       Splits  Latency  CPU          Memory     Per wall sec
Baseline                       20220615_223327_00006_cnhk9   94,921  10.90 s  18.20 hours  153.73 GB  310.40 GB
File Splittable Disabled       20220615_223515_00009_cnhk9    6,857  32.07 s  15.63 hours  126.55 GB  106.63 GB
Segmented Aggregation Enabled  20220615_223659_00013_cnhk9    2,057  50.83 s  24.44 hours   28.68 GB   67.47 GB

A latency increase was observed during testing, which is expected. To enable segmented aggregation, file splitting must be disabled to preserve the data order. As a result, far fewer splits are generated, which drastically decreases table scan concurrency, especially when there are many big files to scan.

== RELEASE NOTES ==

General Changes
* Add the ability to flush the aggregated data when at least one segment from the input has been exhausted. This can help reduce the memory footprint and improve the performance of aggregation when the data is already ordered by a subset of the group-by keys.
This can be enabled with the ``segmented_aggregation_enabled`` session property or the ``optimizer.segmented-aggregation-enabled`` configuration property.

Hive Changes
* Add support for segmented aggregation to reduce the memory footprint and improve query performance when the order-by keys are a subset of the group-by keys. This can be enabled with the ``order_based_execution_enabled`` session property or the ``hive.order-based-execution-enabled`` configuration property.

@zacw7 zacw7 force-pushed the seg-agg branch 2 times, most recently from e810b55 to f805c44 on June 15, 2022 23:06
@zacw7 zacw7 marked this pull request as ready for review June 15, 2022 23:31
@zacw7 zacw7 requested a review from a team as a code owner June 15, 2022 23:31
@zacw7 zacw7 force-pushed the seg-agg branch 3 times, most recently from b37f125 to 0062065 on June 16, 2022 02:11
@zacw7 zacw7 force-pushed the seg-agg branch 2 times, most recently from b0dcf74 to 96c95ce on June 16, 2022 15:58
@zacw7 zacw7 requested a review from kewang1024 June 16, 2022 15:59
// If the current segment ends in the current page, flush it with all the segments (if exist) except the last segment of the current page.
int lastSegmentStart = findLastSegmentStart(preGroupedHashStrategy.get(), page.extractChannels(preGroupedChannels));
unfinishedWork = aggregationBuilder.processPage(page.getRegion(0, lastSegmentStart));
remainingPage = page.getRegion(0, lastRowInPage - lastSegmentStart + 1);
Collaborator:
Shouldn't the remainingPage be page.getRegion(lastSegmentStart, lastRowInPage - lastSegmentStart + 1)?

Collaborator:
I think this indicates that our test coverage is not enough, since it didn't catch this issue; we need to enhance our test coverage.

Member Author:
Added all the tests I can think of so far. Correctness should now be well covered.

@zacw7 zacw7 force-pushed the seg-agg branch 6 times, most recently from 6701a7c to dd31cf4 on June 17, 2022 04:33
@kewang1024 (Collaborator):
I just tested again and found out why our unit test didn't catch the bug above: we only tested with one page. We need to add tests to cover multiple pages; let's test all the scenarios we discussed above.

Off the top of my head (but not limited to); a sketch of building such multi-page input follows below:
Test1: page 1 [1, 1, 1......1, 1], page 2 [2, 2, 2...2, 2], page 3 [2, 2, 3...3, 3]
Test2: page 1 [1, 1, 1......2, 2], page 2 [2, 2, 2...2, 2], page 3 [2, 3, 3...5, 5]
Test3: page 1 [1, 1, 1......1, 1], page 2 [1, 1, 1...1, 1], page 3 [1, 1, 1...1, 1]
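
A hedged sketch of how the first case might be assembled inside a test method (RowPagesBuilder is the page-building helper used by Presto's operator tests; operator wiring and assertions omitted):

    // Test1 (shortened): segment 2 spans the page 2 / page 3 boundary, and segment 3 starts mid-page.
    List<Page> input = RowPagesBuilder.rowPagesBuilder(BIGINT)
            .row(1L).row(1L).row(1L).pageBreak()   // page 1: all segment 1
            .row(2L).row(2L).row(2L).pageBreak()   // page 2: all segment 2
            .row(2L).row(2L).row(3L)               // page 3: segment 2 continues, then segment 3
            .build();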

@kewang1024 kewang1024 self-requested a review June 17, 2022 06:35
@kewang1024 (Collaborator) left a comment:
Another point: let's separate out an independent test class TestSegmentedHashAggregationOperator, given that our test cases are growing and I think we can expect more and more segmented-aggregation-specific tests to be added in the future. But it's up to you whether you want to do it in this PR.

// Record the last segment.
firstUnfinishedSegment = page.getRegion(lastRowInPage, 1);
}
else {
Contributor:
Handle the smaller branch first and return to avoid indents for the larger branch. Reducing indents can improve the readability of the code.
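
A generic illustration of the early-return pattern being suggested (simplified; not the PR's final code):

    public void addInput(Page page)
    {
        // Short branch first: plain hash aggregation, nothing segment-related to track.
        if (!preGroupedHashStrategy.isPresent()) {
            unfinishedWork = aggregationBuilder.processPage(page);
            return;
        }
        // The longer segmented-aggregation path continues here at a single indent level.
    }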

// If the current segment might have more data in the incoming pages, process the whole page.
unfinishedWork = aggregationBuilder.processPage(page);
}
else if (preGroupedHashStrategy.get().rowEqualsRow(0, firstUnfinishedSegment.extractChannels(preGroupedChannels), 0, page.extractChannels(preGroupedChannels))) {
Contributor:
If this is the first page, line 504 is executed and there is no need to make this comparison. Please try to avoid it in that case.

firstUnfinishedSegment = page.getRegion(0, 1);
}

if (preGroupedHashStrategy.get().rowEqualsRow(0, firstUnfinishedSegment.extractChannels(preGroupedChannels), lastRowInPage, page.extractChannels(preGroupedChannels))) {
Contributor:
There is an assumption that preGroupedChannels must be a subset of groupByChannels. It might be better to add a check for that.

Member Author:
I think this is guaranteed during planning. The operator should be able to trust the generated plan, right?

firstUnfinishedSegment = page.getRegion(0, 1);
}

if (preGroupedHashStrategy.get().rowEqualsRow(0, firstUnfinishedSegment.extractChannels(preGroupedChannels), lastRowInPage, page.extractChannels(preGroupedChannels))) {
Contributor:
extractChannels is called many times in this function. Let's call it once and reuse the result.
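
A sketch of that suggestion (the snippets later in this thread show the PR adopting a local variable named pageOnPreGroupedChannels for this):

    // Extract the pre-grouped channels once per page and reuse the result wherever the
    // page's pre-grouped key needs to be compared.
    Page pageOnPreGroupedChannels = page.extractChannels(preGroupedChannels);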

unfinishedWork = aggregationBuilder.processPage(page);
}
else if (preGroupedHashStrategy.get().rowEqualsRow(0, firstUnfinishedSegment.extractChannels(preGroupedChannels), 0, page.extractChannels(preGroupedChannels))) {
// If the current page starts with a new segment, flush before processing it.
Contributor:
IIUC, we are here if the first row in the page is the same as the last unfinished segment. In that case, the current page doesn't start with a new segment, right? Did I miss anything?

remainingPageForSegmentedAggregation = page.getRegion(lastSegmentStart, lastRowInPage - lastSegmentStart + 1);
}
// Record the last segment.
firstUnfinishedSegment = page.getRegion(lastRowInPage, 1);
Contributor:
It seems that lastRowInPage is never updated, so this firstUnfinishedSegment is always the last row in the page? What does that mean? Could you add member-field comments for the new fields introduced in this PR?

@yuanzhanhku (Contributor):
High level design looks good. Just left some comments on the implementation details.

@zacw7 (Member Author) commented Jun 18, 2022

High level design looks good. Just left some comments on the implementation details.

There are some design details Ke and I would like to clarify:
When running in segmented aggregation mode, once a segment is finished we need to close the aggregation builder, destroy the hash table, then reopen the aggregation builder and recreate the hash table. If the segments are very small, this process has to be repeated too many times, resulting in significant overhead.
To address the issue, we adjusted the design: for each page, we process all the data before the last segment in the page together (we don't know whether the last segment has more data in the next page, but we do know the segments before it are done), flush, and then process the last segment. In the next page we repeat the process: find where the last segment starts, process all the data before that point and flush, then process the last segment.

For example, say we have 3 pages:

page1 [1, 1, 1, 2, 2]
page2 [2, 3, 4, 5, 5]
page3 [6, 6, 7, 7, 7]

The segments will be [1, 1, 1], [2, 2, 2, 3, 4], [5, 5], [6, 6], [7, 7, 7].

@kewang1024 (Collaborator) left a comment:
A few NIT, otherwise LGTM

@kewang1024 (Collaborator) commented Jun 20, 2022

Let's open an issue as a follow-up: introduce a threshold config / session property to control the flush timing of segmented aggregation. With that, we can tune the config to balance memory and latency.

@zacw7 (Member Author) commented Jun 20, 2022

Let's open an issue as a follow-up: introduce a threshold config / session property to control the flush timing of segmented aggregation. With that, we can tune the config to balance memory and latency.

#17908

@zacw7 zacw7 requested a review from highker June 20, 2022 18:41
@highker highker requested a review from pgupta2 June 21, 2022 04:46
Comment on lines 62 to 65
private final List<Integer> groupByChannels;
private final List<Integer> preGroupedChannels;
Contributor:
Aren't these two serving the same purpose except that one is sorted and the other is not? Can we merge these two into one and use something else (like a flag) to indicate the difference?

Member Author:
The channels of preGroupedChannels are already sorted, while not all channels in groupByChannels are. preGroupedChannels is a subset of groupByChannels, so they don't contain exactly the same elements. I think it makes sense to keep them separate.

Contributor:
If that is the case, I highly recommend adding a comment to explain. Also, can we check that preGroupedChannels is fully contained in groupByChannels?
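
A minimal sketch of the suggested check, assuming it lands in the operator constructor (checkArgument is Guava's Preconditions, which Presto already uses):

    // Fail fast at construction time if the plan ever hands us a pre-grouped channel
    // that is not one of the group-by channels.
    checkArgument(groupByChannels.containsAll(preGroupedChannels),
            "preGroupedChannels must be a subset of groupByChannels");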

Comment on lines 345 to 348
if (preGroupedChannels.isEmpty()) {
preGroupedHashStrategy = Optional.empty();
}
else {
Contributor:
There are many if/else branches in the logic. Check my other comment below.

Comment on lines +500 to +497
if (!preGroupedHashStrategy.isPresent()) {
unfinishedWork = aggregationBuilder.processPage(page);
return;
}

// 2. segmented aggregation
if (firstUnfinishedSegment == null) {
Contributor:
Same here. Whether to use pre-grouped channel or not is actually determined at the beginning of the execution or planning phase. There is no need to check if/else for every page input. Instead, it would be good to abstract the design a bit. For example, can we have a base abstract hash operator with two implementations: Hash and Segmented? Then we can leave the branches within the different implementations.
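
A minimal sketch of the suggested shape (hypothetical class names; the shared lifecycle details are elided):

    abstract class AbstractHashAggregationOperator
    {
        // Shared builder lifecycle, memory accounting, and output logic would live here.
        protected abstract void addInput(Page page);
    }

    class PlainHashAggregationOperator extends AbstractHashAggregationOperator
    {
        @Override
        protected void addInput(Page page)
        {
            // Plain hash aggregation: feed the whole page to the aggregation builder.
        }
    }

    class SegmentedHashAggregationOperator extends AbstractHashAggregationOperator
    {
        @Override
        protected void addInput(Page page)
        {
            // Find the last segment start, aggregate and flush the prefix, hold the suffix.
        }
    }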

Member Author:
Whether to use pre-grouped channel or not is actually determined at the beginning of the execution or planning phase. There is no need to check if/else for every page input.

Great point.

can we have a base abstract hash operator with two implementations: Hash and Segmented.

I thought about this idea as well before implementing the PR. I'm not sure it's worth restructuring the whole operator, given that segmented aggregation is essentially still hash aggregation, just with a few tricks; the difference is not that big. WDYT? @kewang1024

// The whole page is in one segment.
if (preGroupedHashStrategy.get().rowEqualsRow(0, firstUnfinishedSegment.extractChannels(preGroupedChannels), 0, pageOnPreGroupedChannels)) {
// All rows in this page belong to the previous unfinished segment, process the whole page.
unfinishedWork = aggregationBuilder.processPage(page);
Contributor:
I assume we don't have to care about the pages for a segment once it's done, right? Do we need to close the aggregationBuilder to clear the memory after having processed a segment? It would be good to reflect the memory usage in your benchmark as well.

Member Author:
Do we need to close the aggregationBuilder to clear the memory after having processed a segment

Yes, that's how it is implemented currently. Once at least one segment has been fully processed, the aggregationBuilder will be closed and rebuilt if there are more segments to process.

to reflect the memory usage in your benchmark as well.

Regarding the memory usage, I thought the memory comparison was covered by the manual test attached in the PR description. Could you please elaborate a bit on how to benchmark memory usage? Any pointers I can refer to? Thanks!

@kewang1024 (Collaborator) left a comment:
  1. Segmented hash-based aggregation itself doesn't have much logic of its own and heavily shares the hash-based aggregation logic, so just separating it out would still look a bit hacky.
  2. The current hash aggregation operator is intertwined with a lot of other features (spilling, partial aggregation mode, etc.) in a hacky fashion. To do the refactor properly, we need a better design that refactors HashAggregationOperatorFactory as a whole, considering all the features currently inside this class, which could be a non-trivial amount of work.

What I think the next steps could be:

  1. Fix the current code to be more concise and reduce nested if statements.
  2. Systematically refactor HashAggregationOperatorFactory as a follow-up.

Comment on lines 473 to 478
if (remainingPageForSegmentedAggregation != null) {
// Running in segmented aggregation mode, reopen the aggregation builder and process the remaining page.
initializeAggregationBuilderIfNeeded();
unfinishedWork = aggregationBuilder.processPage(remainingPageForSegmentedAggregation);
remainingPageForSegmentedAggregation = null;
}
Collaborator:
Extract it to a private function to:

  1. reduce the nested if statements
  2. allow reusing this function when we introduce the threshold
    private void processRemainingPageForSegmentedAggregation()
    {
        // Running in segmented aggregation mode, reopen the aggregation builder and process the remaining page.
        if (remainingPageForSegmentedAggregation != null) {
            initializeAggregationBuilderIfNeeded();
            unfinishedWork = aggregationBuilder.processPage(remainingPageForSegmentedAggregation);
            remainingPageForSegmentedAggregation = null;
        }
    }

Comment on lines 345 to 348
if (preGroupedChannels.isEmpty()) {
preGroupedHashStrategy = Optional.empty();
}
else {
Collaborator:
NIT: make it more concise

preGroupedHashStrategy = preGroupedChannels.isEmpty() ? Optional.empty() : Optional.of(joinCompiler.compilePagesHashStrategyFactory(
                preGroupedChannels.stream().map(channel -> groupByTypes.get(channel)).collect(toImmutableList()),
                preGroupedChannels,
                Optional.empty()).createPagesHashStrategy(groupByTypes.stream().map(type -> ImmutableList.<Block>of()).collect(toImmutableList()), OptionalInt.empty()));

@kewang1024 kewang1024 requested review from highker and kewang1024 June 24, 2022 16:53
@zacw7 zacw7 force-pushed the seg-agg branch 2 times, most recently from 186bcfd to 315023f on June 24, 2022 18:07
@zacw7 (Member Author) commented Jun 24, 2022

Changing it back to linear search from binary search, as no improvement was observed and the linear version is more readable and clean. cc: @kewang1024

@kewang1024 (Collaborator):
Changing it back to linear search from binary search, as no improvement was observed and the linear version is more readable and clean. cc: @kewang1024

Attach the issue where we introduce a config to choose between linear and binary search; it could be a good bootcamp task.
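
For context, a hedged sketch of the linear version under discussion (simplified signature; the PR's version operates on the page's extracted pre-grouped channels):

    // Walk backwards from the last row until the pre-grouped key changes; the returned
    // position is the first row of the page's last segment.
    private static int findLastSegmentStart(PagesHashStrategy strategy, Page page)
    {
        int position = page.getPositionCount() - 1;
        while (position > 0 && strategy.rowEqualsRow(position - 1, page, position, page)) {
            position--;
        }
        return position;
    }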

Comment on lines 62 to 65
private final List<Integer> groupByChannels;
private final List<Integer> preGroupedChannels;
Contributor:
If that is the case, I highly recommend adding a comment to explain. Also, can we check that preGroupedChannels is fully contained in groupByChannels?

this.preGroupedHashStrategy = preGroupedChannels.isEmpty()
? Optional.empty()
: Optional.of(joinCompiler.compilePagesHashStrategyFactory(
preGroupedChannels.stream().map(channel -> groupByTypes.get(channel)).collect(toImmutableList()), preGroupedChannels, Optional.empty())
Contributor:
nit: groupByTypes::get

Member Author:
Checking whether preGroupedChannels is fully contained in groupByChannels might not be necessary; it has been checked thoroughly during the planning phase.

Contributor:
It would be really good to check the containment in the constructor. Planner and execution are usually separated (just like the Presto frontend and the Velox backend), and that modularization would break the streamlined assumption. Also, it would make the logic much clearer when we read the code.

@highker highker self-assigned this Jun 24, 2022
@highker highker merged commit 3ee8cb3 into prestodb:master Jun 26, 2022
@zacw7 zacw7 deleted the seg-agg branch June 27, 2022 17:17
@highker highker mentioned this pull request Jul 6, 2022