-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent redundant collection rewriting in AggregationAnalyzer #15292
Prevent redundant collection rewriting in AggregationAnalyzer #15292
Conversation
this.columnReferences = analysis.getColumnReferenceFields() | ||
.entrySet().stream() | ||
.collect(toImmutableMap(Map.Entry::getKey, entry -> entry.getValue().getFieldId())); | ||
this.columnReferences = analysis.getColumnReferenceFields(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this.columnReferences is (indirectly) mutable now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW looks like this place is the most important change in the commit.
How can we prevent it from being undone by an innocent refactor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
analysis.getColumnReferenceFields()
returns an immutable map
How can we prevent it from being undone by an innocent refactor?
I don't quite follow why would anybody want to change that. I see no gain in this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can we prevent it from being undone by an innocent refactor?
An automated benchmark for planner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea long-term. But this exact problem emerges only for quite specific queries. Mine had 80k columns referenced and 556 group by clauses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I don't like is that this now relies on getColumnReferenceFields
returning an immutable map. If the implementation of that method changes, it could indirectly affect this usage without anyone noticing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can just make a sanity check, that the map is indeed immutable? If anyone changes the implementation it will fail fast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can just make a sanity check, that the map is indeed immutable?
but it's not (it's just an unmodifiable wrapper)
also, you can't verify that it's what it is, since unmodifiableMap
has no public class you can test for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@findepi Unfortunately you are right.
So what are the options?
core/trino-main/src/main/java/io/trino/sql/analyzer/AggregationAnalyzer.java
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/sql/analyzer/AggregationAnalyzer.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/sql/analyzer/AggregationAnalyzer.java
Outdated
Show resolved
Hide resolved
a33efec
to
4b61bba
Compare
@findepi AC |
4b61bba
to
2505e40
Compare
AC again. |
2368c85
to
a482c1c
Compare
thank you! do you have the results you can share? |
See commit message |
core/trino-main/src/test/java/io/trino/sql/planner/BenchmarkPlanner.java
Show resolved
Hide resolved
core/trino-main/src/test/java/io/trino/sql/planner/BenchmarkPlanner.java
Outdated
Show resolved
Hide resolved
JOIN lineitem b ON a.l_orderkey = b.l_orderkey | ||
JOIN lineitem c ON a.l_orderkey = c.l_orderkey | ||
JOIN lineitem d ON a.l_orderkey = d.l_orderkey | ||
JOIN lineitem e ON a.l_orderkey = e.l_orderkey |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a perfect case for optimizer to eliminate redundant reads from lineitem table, and read it only once.
To prevent such potential future optimization breaking your benchmark, eg use lineitem coming from different schemas.
(from planning time perspective, the scale factor (sf) factor does not matter)
.collect(toImmutableList()); | ||
for (Expression sortKey : sortKeys) { | ||
if (!node.getArguments().contains(sortKey) && !fieldIds.contains(columnReferences.get(NodeRef.of(sortKey)))) { | ||
ResolvedField field = columnReferences.get(NodeRef.of(sortKey)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why extracted?
previously the lookup was lazy, only when (!node.getArguments().contains(sortKey)
, now it's always.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now instead of X.contains(map.get(A))
we need to do
A a = map.getA();
return ( a == null ) || !X.contains(a.getB());
since the value type of the map changed.
But you are right, I replaced that with a double if
a482c1c
to
99d63b0
Compare
@findepi AC |
core/trino-main/src/test/java/io/trino/sql/planner/BenchmarkPlanner.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/sql/analyzer/AggregationAnalyzer.java
Outdated
Show resolved
Hide resolved
99d63b0
to
0fbcdef
Compare
Make queries to be executed a parameter
Add a test case with huge number of column referenced in a query (86k) and a lot of group by clauses. This benchmark highlights the quadratic complexity of AggregationAnalyzer constructor, which will be fixed in subsequent commit.
Every time the `AggregationAnalyzer` class was initialized, all fields present in analysis.getColumnReferenceFields() has been rewritten to the `columnReferences` map which only trivially mapped values: entry -> entry.getValue().getFieldId(). This happened for every aggregation in the query. The problem is that analysis.referencedFields scope is wider than that of AggregationAnalyzer and the number of fields kept rising with time. The complexity of this is O(n^2) which led to long analysis time after number of referenced fields got to tens of thousands. This commit removes the problematic eager remapping of all values in favor of lazy mapping when needed. before: BenchmarkPlanner.plan GROUP_BY_WITH_MANY_REFERENCED_COLUMNS CREATED avgt 20 4835,674 ± 233,587 ms/op after: BenchmarkPlanner.plan GROUP_BY_WITH_MANY_REFERENCED_COLUMNS CREATED avgt 20 863,968 ± 5,566 ms/op
0fbcdef
to
572cadb
Compare
Description
See commit message
Additional context and related issues
Replaces #14983
Release notes
( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text: