Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent redundant collection rewriting in AggregationAnalyzer #15292

Merged

Conversation

skrzypo987
Copy link
Member

Description

See commit message

Additional context and related issues

Replaces #14983

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Section
* Improve performance of analysis of queries with a lot of GROUP BY clauses

@skrzypo987
Copy link
Member Author

@findepi @martint
This replaces #14983

@skrzypo987 skrzypo987 requested review from martint and findepi December 5, 2022 11:59
this.columnReferences = analysis.getColumnReferenceFields()
.entrySet().stream()
.collect(toImmutableMap(Map.Entry::getKey, entry -> entry.getValue().getFieldId()));
this.columnReferences = analysis.getColumnReferenceFields();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.columnReferences is (indirectly) mutable now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW looks like this place is the most important change in the commit.
How can we prevent it from being undone by an innocent refactor?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

analysis.getColumnReferenceFields()

returns an immutable map

How can we prevent it from being undone by an innocent refactor?

I don't quite follow why would anybody want to change that. I see no gain in this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we prevent it from being undone by an innocent refactor?

An automated benchmark for planner?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea long-term. But this exact problem emerges only for quite specific queries. Mine had 80k columns referenced and 556 group by clauses.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I don't like is that this now relies on getColumnReferenceFields returning an immutable map. If the implementation of that method changes, it could indirectly affect this usage without anyone noticing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can just make a sanity check, that the map is indeed immutable? If anyone changes the implementation it will fail fast

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can just make a sanity check, that the map is indeed immutable?

but it's not (it's just an unmodifiable wrapper)

also, you can't verify that it's what it is, since unmodifiableMap has no public class you can test for

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi Unfortunately you are right.
So what are the options?

@skrzypo987 skrzypo987 force-pushed the skrzypo/132-optimize-aggregation-analysis branch from a33efec to 4b61bba Compare December 5, 2022 12:34
@skrzypo987
Copy link
Member Author

@findepi AC
Thanks for the quick look

@skrzypo987 skrzypo987 force-pushed the skrzypo/132-optimize-aggregation-analysis branch from 4b61bba to 2505e40 Compare December 6, 2022 07:53
@skrzypo987
Copy link
Member Author

AC again.
Added a proper benchmark to BenchmarkPlanner class

@skrzypo987 skrzypo987 force-pushed the skrzypo/132-optimize-aggregation-analysis branch 3 times, most recently from 2368c85 to a482c1c Compare December 6, 2022 15:25
@findepi
Copy link
Member

findepi commented Dec 6, 2022

Added a proper benchmark to BenchmarkPlanner class

thank you!

do you have the results you can share?

@skrzypo987
Copy link
Member Author

Added a proper benchmark to BenchmarkPlanner class

thank you!

do you have the results you can share?

See commit message

Comment on lines 140 to 143
JOIN lineitem b ON a.l_orderkey = b.l_orderkey
JOIN lineitem c ON a.l_orderkey = c.l_orderkey
JOIN lineitem d ON a.l_orderkey = d.l_orderkey
JOIN lineitem e ON a.l_orderkey = e.l_orderkey
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a perfect case for optimizer to eliminate redundant reads from lineitem table, and read it only once.
To prevent such potential future optimization breaking your benchmark, eg use lineitem coming from different schemas.
(from planning time perspective, the scale factor (sf) factor does not matter)

.collect(toImmutableList());
for (Expression sortKey : sortKeys) {
if (!node.getArguments().contains(sortKey) && !fieldIds.contains(columnReferences.get(NodeRef.of(sortKey)))) {
ResolvedField field = columnReferences.get(NodeRef.of(sortKey));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why extracted?
previously the lookup was lazy, only when (!node.getArguments().contains(sortKey), now it's always.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now instead of X.contains(map.get(A)) we need to do

A a = map.getA();
return ( a == null ) || !X.contains(a.getB());

since the value type of the map changed.
But you are right, I replaced that with a double if

@skrzypo987 skrzypo987 force-pushed the skrzypo/132-optimize-aggregation-analysis branch from a482c1c to 99d63b0 Compare December 8, 2022 12:34
@skrzypo987
Copy link
Member Author

@findepi AC

@skrzypo987 skrzypo987 force-pushed the skrzypo/132-optimize-aggregation-analysis branch from 99d63b0 to 0fbcdef Compare December 8, 2022 17:19
skrzypo987 added 3 commits December 9, 2022 10:18
Make queries to be executed a parameter
Add a test case with huge number of column referenced in a query (86k) and
a lot of group by clauses. This benchmark highlights the quadratic complexity
of AggregationAnalyzer constructor, which will be fixed in subsequent commit.
Every time the `AggregationAnalyzer` class was initialized, all fields present
in analysis.getColumnReferenceFields() has been rewritten to the `columnReferences`
map which only trivially mapped values: entry -> entry.getValue().getFieldId().
This happened for every aggregation in the query. The problem is that
analysis.referencedFields scope is wider than that of AggregationAnalyzer and the
number of fields kept rising with time. The complexity of this is O(n^2) which led
to long analysis time after number of referenced fields got to tens of thousands.

This commit removes the problematic eager remapping of all values in favor of lazy
mapping when needed.

before:
BenchmarkPlanner.plan  GROUP_BY_WITH_MANY_REFERENCED_COLUMNS  CREATED  avgt   20  4835,674 ± 233,587  ms/op
after:
BenchmarkPlanner.plan  GROUP_BY_WITH_MANY_REFERENCED_COLUMNS  CREATED  avgt   20  863,968 ± 5,566  ms/op
@skrzypo987 skrzypo987 force-pushed the skrzypo/132-optimize-aggregation-analysis branch from 0fbcdef to 572cadb Compare December 9, 2022 08:51
@findepi findepi merged commit c68bc26 into trinodb:master Dec 9, 2022
@findepi findepi mentioned this pull request Dec 9, 2022
@github-actions github-actions bot added this to the 404 milestone Dec 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants