Avoid relying on row-group row count for detecting only-null domain #15388

raunaqmorarka · 2022-12-13T20:58:44Z

Description

ColumnChunkMetaData#getValueCount should be used to get the total value count
for a column instead of BlockMetadata#getRowCount, because a single row may
contain multiple values for a nested column type.
Row group pruning is currently not implemented for nested columns.
This change fixes the only-null domain detection logic in preparation
for row group pruning on nested columns.
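
As an illustration of the mismatch described above, here is a minimal sketch. The class and method names (`OnlyNullCheck`, `isOnlyNull`, `isOnlyNullByRowCount`) are hypothetical and are not the code in this PR; the parameters stand in for the null count from column statistics, ColumnChunkMetaData#getValueCount, and BlockMetadata#getRowCount. Assume a row group with 2 rows where each row holds an array of 3 null elements, so the column chunk reports 6 values and 6 nulls.

```java
// Hypothetical helper, not part of Trino: contrasts the two comparisons.
final class OnlyNullCheck
{
    private OnlyNullCheck() {}

    // Compares the null count against the column chunk's value count.
    // A chunk holds only nulls when every stored value is null.
    static boolean isOnlyNull(long nullCount, long valueCount)
    {
        return nullCount == valueCount;
    }

    // Compares the null count against the row group's row count.
    // This is only equivalent for flat columns, where each row has one value.
    static boolean isOnlyNullByRowCount(long nullCount, long rowCount)
    {
        return nullCount == rowCount;
    }
}
```

With the assumed numbers, `isOnlyNull(6, 6)` detects the only-null chunk, while `isOnlyNullByRowCount(6, 2)` does not, because a nested column stores more values than the row group has rows.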

Additional context and related issues

#15163

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

skrzypo987

nice

alexjo2144 · 2022-12-19T19:53:15Z

lib/trino-parquet/src/main/java/io/trino/parquet/predicate/TupleDomainParquetPredicate.java

@@ -118,10 +131,14 @@ public Optional<List<ColumnDescriptor>> getIndexLookupCandidates(long numberOfRo
                continue;
            }

+            Long columnValueCount = valueCounts.get(column);
+            if (columnValueCount == null) {
+                throw new IllegalArgumentException(format("Missing columnValueCount for column %s in %s", column, id));


Would this come up in the case where you add a column to a table and then insert new data? The old data files would not have the new column in the stats.

raunaqmorarka · Dec 19, 2022

If we reach here in that scenario, I assume the columnStatistics == null check above will bail out first.

alexjo2144 · Dec 19, 2022

Gotcha, yep, didn't see that. At least in Iceberg it is possible to configure collection of value counts but not min/max stats. I guess in that case we'd still ignore them?

raunaqmorarka · Dec 19, 2022

Yes, without min/max stats but with null count and value count, the most we can do is prune an only-null row group for an IS NOT NULL predicate and prune a row group with no nulls for an IS NULL predicate.
We could consider doing that if it's practically useful.
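
The pruning rule described in the last reply can be sketched as follows. This is an assumption-laden illustration, not something implemented in this PR: `NullCountPruning`, `NullPredicate`, and `canPrune` are hypothetical names, and the sketch assumes that only a null count and a value count are available for the column chunk (no min/max stats).

```java
// Hypothetical sketch, not part of Trino: pruning decisions that are
// possible when only null count and value count are known.
final class NullCountPruning
{
    private NullCountPruning() {}

    enum NullPredicate { IS_NULL, IS_NOT_NULL }

    // Returns true when the row group cannot contain a matching value.
    static boolean canPrune(NullPredicate predicate, long nullCount, long valueCount)
    {
        if (predicate == NullPredicate.IS_NOT_NULL) {
            // Every value is null, so nothing satisfies IS NOT NULL.
            return nullCount == valueCount;
        }
        // IS_NULL: no value is null, so nothing satisfies IS NULL.
        return nullCount == 0;
    }
}
```

Any other predicate would still need min/max stats, so a sketch like this would leave such row groups unpruned.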

Remove unnecessary Predicate interface in parquet reader

8e4ff8f

cla-bot bot added the cla-signed label Dec 13, 2022

raunaqmorarka requested review from findepi, phd3 and alexjo2144 December 13, 2022 20:58

raunaqmorarka mentioned this pull request Dec 13, 2022

Implement predicate push down for parquet dereference column #15163

Merged

raunaqmorarka requested a review from skrzypo987 December 13, 2022 21:05

github-actions bot added the tests:hive label Dec 13, 2022

raunaqmorarka force-pushed the fix-row-count branch from f30d2c5 to 7f4026b Compare December 14, 2022 06:20

findepi requested review from martint and removed request for findepi December 14, 2022 08:14

skrzypo987 approved these changes Dec 14, 2022

lib/trino-parquet/src/main/java/io/trino/parquet/predicate/PredicateUtils.java

raunaqmorarka requested a review from ebyhr December 15, 2022 06:11

sopel39 approved these changes Dec 16, 2022

raunaqmorarka force-pushed the fix-row-count branch from 7f4026b to f823780 Compare December 16, 2022 15:08

raunaqmorarka merged commit b50b6ce into trinodb:master Dec 17, 2022

raunaqmorarka deleted the fix-row-count branch December 17, 2022 02:33

github-actions bot added this to the 404 milestone Dec 17, 2022

alexjo2144 reviewed Dec 19, 2022

colebow mentioned this pull request Dec 21, 2022

Add Trino 405 release notes #15139

Merged
