Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support writing Parquet encoding stats #9569

Merged
merged 2 commits into from
Oct 13, 2021

Conversation

alexjo2144
Copy link
Member

@alexjo2144 alexjo2144 commented Oct 8, 2021

Encoding stats are used by the reader to check if the dictionary pages can be used for predicate pushdown.

Fixes #9554

TODO: Still need to add a test, but I did manually test that there's dictionary based pushdown happening now that didn't happen before.

@cla-bot cla-bot bot added the cla-signed label Oct 8, 2021
@alexjo2144 alexjo2144 force-pushed the parquet-writer-encoding-stats branch from b4478e9 to fd0c063 Compare October 8, 2021 19:05
@alexjo2144 alexjo2144 changed the title Support Writing Parquet encoding stats Support writing Parquet encoding stats Oct 8, 2021
@alexjo2144 alexjo2144 requested review from findepi and phd3 October 8, 2021 19:08
Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. is it testable?

@@ -74,6 +78,8 @@

// column meta data stats
private final Set<Encoding> encodings;
private final Map<org.apache.parquet.format.Encoding, Integer> dataPagesWithEncoding;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how small can the smallest page be? can we overflow here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The field this in PageEncodingStats this gets used for is also an int, so it should be fine.

@@ -178,6 +186,14 @@ private ColumnMetaData getColumnMetaData()
totalCompressedSize,
-1);
columnMetaData.setStatistics(ParquetMetadataConverter.toParquetStatistics(columnStatistics));
ImmutableList.Builder<PageEncodingStats> pageEncodingStats = ImmutableList.builder();
dataPagesWithEncoding.entrySet().stream()
.map(encodingAndCount -> new PageEncodingStats(PageType.DATA_PAGE_V2, encodingAndCount.getKey(), encodingAndCount.getValue()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should V2 here change in #9497?
if so, let's add a comment it in that PR... (also, would be yet another reason to simplify and not try to have a toggle there)

cc @losipiuk

@findepi
Copy link
Member

findepi commented Oct 8, 2021

Issue: #9554

nit: in the PR description change this to "Fixes #9554" to have github link the two.

@alexjo2144 alexjo2144 force-pushed the parquet-writer-encoding-stats branch from fd0c063 to 313923b Compare October 8, 2021 20:42
@alexjo2144
Copy link
Member Author

For testing, I can add a test with something like:
CREATE TABLE test_table (n) WITH (format = 'parquet') AS VALUES 1, 1, 2, 2, 4, 4, 5, 5

And assert that SELECT count(*) FROM test_table WHERE n = 3 reads no rows, if that's worthwhile. I'm not sure I can write a lower level test easily though.

@findepi
Copy link
Member

findepi commented Oct 11, 2021

And assert that SELECT count(*) FROM test_table WHERE n = 3 reads no rows, if that's worthwhile. I'm not sure I can write a lower level test easily though.

better such test than nothing, imo

Encoding stats are used by the reader to check if the dictionary
pages can be used for predicate pushdown.
@alexjo2144 alexjo2144 force-pushed the parquet-writer-encoding-stats branch from 313923b to 6b958f7 Compare October 11, 2021 17:43
@alexjo2144
Copy link
Member Author

And assert that SELECT count(*) FROM test_table WHERE n = 3 reads no rows, if that's worthwhile. I'm not sure I can write a lower level test easily though.

better such test than nothing, imo

Added

@findepi
Copy link
Member

findepi commented Oct 12, 2021

test (:trino-accumulo) failure seems unrelated, but we likely don't have an issue for that

Error:  io.trino.plugin.accumulo.TestAccumuloConnectorTest.init  Time elapsed: 13.082 s  <<< FAILURE!
java.lang.RuntimeException: org.apache.accumulo.core.client.impl.AccumuloServerException: Error on server localhost:9997

@findepi
Copy link
Member

findepi commented Oct 13, 2021

CI #9617

@findepi findepi merged commit c8b861c into trinodb:master Oct 13, 2021
@findepi findepi mentioned this pull request Oct 13, 2021
12 tasks
@github-actions github-actions bot added this to the 364 milestone Oct 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Optimized Parquet writer is missing EncodingStats
2 participants