-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid relying on row-group row count for detecting only-null domain #15388
Conversation
f30d2c5
to
7f4026b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
lib/trino-parquet/src/main/java/io/trino/parquet/predicate/PredicateUtils.java
Show resolved
Hide resolved
ColumnChunkMetaData#getValueCount should be used to get total values count for a column instead of BlockMetadata#getRowCount because single row may contain multiple values for a nested column type. Currently row group pruning is not implemented for nested columns. This change fixes the logic for only-nulls domain detection in preparation for nested columns row group pruning.
7f4026b
to
f823780
Compare
@@ -118,10 +131,14 @@ public Optional<List<ColumnDescriptor>> getIndexLookupCandidates(long numberOfRo | |||
continue; | |||
} | |||
|
|||
Long columnValueCount = valueCounts.get(column); | |||
if (columnValueCount == null) { | |||
throw new IllegalArgumentException(format("Missing columnValueCount for column %s in %s", column, id)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this come up in the case where you add a column to a table and then insert new data? The old data files would not have the new column in the stats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case we reach here for that scenario, I assume that the above columnStatistics == null
check will help to bail out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, yep didn't see that. At least in Iceberg it is possible to configure collection of value counts but not min/max stats,. I guess in that case we'd still ignore them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya, without min/max stats but with nulls count and value count, the most we can do is prune a only-null row group for IS NOT NULL predicate and prune a non-nullable row group for IS NULL predicate.
we could consider doing that if that's a practically useful thing.
Description
ColumnChunkMetaData#getValueCount should be used to get total values count
for a column instead of BlockMetadata#getRowCount because single row may
contain multiple values for a nested column type.
Currently row group pruning is not implemented for nested columns.
This change fixes the logic for only-nulls domain detection in preparation
for nested columns row group pruning.
Additional context and related issues
#15163
Release notes
(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text: