Avoid relying on row-group row count for detecting only-null domain #15388

Merged
merged 2 commits into trinodb:master from fix-row-count
Dec 17, 2022

Conversation

raunaqmorarka
Member

Description

ColumnChunkMetaData#getValueCount should be used to get the total value count
for a column instead of BlockMetadata#getRowCount, because a single row may
contain multiple values for a nested column type.
Row group pruning is not currently implemented for nested columns.
This change fixes the only-null domain detection logic in preparation
for row group pruning on nested columns.
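
To illustrate why the row count is the wrong denominator for a nested column, here is a minimal sketch. The class and method names are hypothetical and are not Trino or parquet-mr APIs. It assumes an array(integer) column where every list is non-null and non-empty, so the leaf column stores one value per list element.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; these names are not Trino or parquet-mr APIs.
final class OnlyNullDetectionSketch
{
    private OnlyNullDetectionSketch() {}

    // Two rows of an array(integer) column whose elements are all null.
    static List<List<Integer>> sampleRows()
    {
        return Arrays.asList(
                Arrays.<Integer>asList(null, null, null),
                Arrays.<Integer>asList(null, null));
    }

    // Values stored for the leaf column: one per list element
    // (assumes every list is non-null and non-empty, to keep the sketch simple).
    static long valueCount(List<List<Integer>> rows)
    {
        long count = 0;
        for (List<Integer> row : rows) {
            count += row.size();
        }
        return count;
    }

    // Number of null elements across all lists.
    static long nullCount(List<List<Integer>> rows)
    {
        long count = 0;
        for (List<Integer> row : rows) {
            for (Integer element : row) {
                if (element == null) {
                    count++;
                }
            }
        }
        return count;
    }

    // The column chunk holds only nulls when every stored value is null.
    static boolean isOnlyNull(long nullCount, long totalCount)
    {
        return totalCount > 0 && nullCount == totalCount;
    }
}
```

For the sample rows, the row count is 2 while both the value count and the null count are 5. Comparing the null count against the row count therefore misses the only-null case, whereas comparing it against the value count detects it.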

Additional context and related issues

#15163

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

Member

@skrzypo987 skrzypo987 left a comment

nice

@raunaqmorarka raunaqmorarka requested a review from ebyhr December 15, 2022 06:11
@raunaqmorarka raunaqmorarka merged commit b50b6ce into trinodb:master Dec 17, 2022
@raunaqmorarka raunaqmorarka deleted the fix-row-count branch December 17, 2022 02:33
@github-actions github-actions bot added this to the 404 milestone Dec 17, 2022
@@ -118,10 +131,14 @@ public Optional<List<ColumnDescriptor>> getIndexLookupCandidates(long numberOfRo
                continue;
            }

            Long columnValueCount = valueCounts.get(column);
            if (columnValueCount == null) {
                throw new IllegalArgumentException(format("Missing columnValueCount for column %s in %s", column, id));
Member

Would this come up in the case where you add a column to a table and then insert new data? The old data files would not have the new column in the stats.

Member Author

In that scenario, I expect the columnStatistics == null check above to bail out before we reach this point.
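
The ordering of the two checks can be sketched as follows. This is a simplified, hypothetical illustration rather than the code in this PR: statistics are keyed by a plain column name, and the class, method, and parameter names are assumptions made for the example.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the guard ordering; not the actual Trino code.
final class StatisticsGuardSketch
{
    private StatisticsGuardSketch() {}

    // Returns Optional.empty() when nothing can be concluded for the column,
    // for example because the file was written before the column existed.
    static Optional<Boolean> isOnlyNull(
            String column,
            Map<String, Long> nullCounts,
            Map<String, Long> valueCounts)
    {
        Long nullCount = nullCounts.get(column);
        if (nullCount == null) {
            // No statistics for this column: bail out before looking at value counts.
            return Optional.empty();
        }

        Long valueCount = valueCounts.get(column);
        if (valueCount == null) {
            // Statistics exist but the value count does not: treated as a caller error.
            throw new IllegalArgumentException(
                    String.format("Missing value count for column %s", column));
        }

        // Compare primitive values, not boxed references.
        return Optional.of(nullCount.longValue() == valueCount.longValue());
    }
}
```

With this ordering, a file that lacks statistics for a newly added column never reaches the value count lookup, so the exception is reserved for inconsistent inputs.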

Member

Gotcha, yep, I didn't see that. At least in Iceberg it is possible to configure collection of value counts but not min/max stats. I guess in that case we'd still ignore them?

Member Author

Yes, without min/max stats but with the null count and value count, the most we can do is prune an only-null row group for an IS NOT NULL predicate and prune a row group that contains no nulls for an IS NULL predicate.
We could consider doing that if it turns out to be practically useful.
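
As a rough sketch of what that limited pruning could look like, assuming only a null count and a value count are available for a flat column: the enum and method below are hypothetical names for illustration and are not part of Trino.

```java
// Hypothetical sketch: pruning decisions from null count and value count alone.
final class NullCountPruningSketch
{
    private NullCountPruningSketch() {}

    enum NullPredicate
    {
        IS_NULL,
        IS_NOT_NULL
    }

    // Returns true when the row group cannot contain a matching value
    // and can therefore be skipped.
    static boolean canPrune(NullPredicate predicate, long nullCount, long valueCount)
    {
        switch (predicate) {
            case IS_NOT_NULL:
                // Every value is null, so nothing satisfies IS NOT NULL.
                return nullCount == valueCount;
            case IS_NULL:
                // No value is null, so nothing satisfies IS NULL.
                return nullCount == 0;
            default:
                return false;
        }
    }
}
```

Any other predicate would still need min/max statistics, so a row group with a mix of null and non-null values is always kept under this scheme.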
