Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize aggregation analyzer #16198

Conversation

skrzypo987
Copy link
Member

@skrzypo987 skrzypo987 commented Feb 21, 2023

Description

Some performance tweaks to make analysis faster for certain type of large queries.
This is a sort-of follow-up to #15292

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Section
* Speed up analysis time for queries with large number of group by clauses

// fields and expressions in the group by clause
private final Set<FieldId> groupingFields;
@Nullable
private Set<FieldId> groupingFields;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supplier<Set<FieldId>> with memoization?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too many fancy words for a performance-related change.
Simple if (x==null) will provide the best branch prediction.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if com.google.common.base.Suppliers#memoize wouldn't be worse, we should use that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guava's memoized supplier is thread-safe, meaning it does some sync logic before any access. This code does not need to be thread safe so there is no need for it IMO. With a simple nullable field we can pretty-much guarantee there will be no performance regression as it is effectively a no-op. With some additional logic, I am not that certain

"Grouping field %s should originate from %s", fieldId, sourceScope.getRelationType());
});
this.groupingFields.forEach(fieldId -> {
checkState(isFieldFromScope(fieldId, sourceScope),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But now we are applying this check lazily, would it fail at a different place post Analysis phase or a with a different error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a sanity check related strictly to this calculations, so it will not fire unless groupingFields fields is actually accessed.
And it will never fire unless somebody screws up the code is some other place.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That last sentence turned out to be false

groupingFieldsBuilder.add(resolvedField.getFieldId());
private Set<FieldId> getGroupingFields()
{
if (groupingFields == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we evaluate them lazily would there be a chance - where we won't use the groupingFIelds information ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only invocations are getGroupingFields().contains(fieldId) so it is always used, after init.

@skrzypo987 skrzypo987 force-pushed the skrzypo/267-optimize-aggregation-analyzer branch from f6a86eb to b8cf329 Compare February 21, 2023 11:04
// fields and expressions in the group by clause
private final Set<FieldId> groupingFields;
@Nullable
private Set<FieldId> groupingFields;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if com.google.common.base.Suppliers#memoize wouldn't be worse, we should use that

@@ -588,7 +596,7 @@ private boolean isGroupingKey(Expression node)
return true;
}

return groupingFields.contains(fieldId);
return getGroupingFields().contains(fieldId);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How common are queries when aggregation analyzer is constructed but getGroupingFields is not invoked?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have answer to that question.
It happens for "my query", which is absurdly complicated and I cannot share it here. I also struggle to make up any query that will suffer the same problems on the same scale.
That is why my approach here is to help in "my case" which may or may not be common and, what's more important, not add any regression.

skrzypo987 added 2 commits March 2, 2023 14:22
`groupingFields` set is expensive to construct for queries with a lot of
group by clauses and might not be needed during analysis.
We still need to iterate over `groupByExpressions` to carry out validation,
but the results do not need to be stored in a set.
This commit calculates it only when asked for, reducing analysis time for some
huge queries. Validation is done at the end, and it reuses already calculated
set, if it was created.
@skrzypo987 skrzypo987 force-pushed the skrzypo/267-optimize-aggregation-analyzer branch from b8cf329 to 278ba8b Compare March 2, 2023 14:49
@skrzypo987
Copy link
Member Author

@Praveen2112
It appeared that the sanity check that you asked about is indeed needed. And we do need to calculate groupingFields to make it work. We do not, however, need to store it in a set, if not used.
So I changed the last commit to validate this without constructing a set. PTAL

@skrzypo987
Copy link
Member Author

Closing this for now. I'll return to this topic, once i get a working benchmark repro.

@skrzypo987 skrzypo987 closed this Mar 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants