Produce DictionaryBlock when reading parquet dictionary encoded columns #15269
Conversation
I understand this is on hold due to a regression?
Review threads (resolved) on lib/trino-parquet/src/main/java/io/trino/parquet/reader/flat/FlatColumnReader.java
Force-pushed from d689d3a to 3a6744f
Approved; we probably need to address other dictionary issues first.
Force-pushed from 3a6744f to 7512eb0
Force-pushed from 7512eb0 to 1448f30
Review threads (resolved) on lib/trino-parquet/src/main/java/io/trino/parquet/ParquetReaderUtils.java and lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java
void readNullable(ValueDecoder<T> valueDecoder, boolean[] isNull, int offset, int nonNullCount, int chunkSize);

ColumnChunk createNonNullBlock();
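To make the `readNullable` contract above concrete, here is a minimal sketch of the scatter step such a method typically performs: decode `nonNullCount` values densely, then spread them into the output so null positions are skipped. The class and method names (and the `int[]` value type) are illustrative, not the actual Trino implementation.

```java
// Hypothetical illustration of the readNullable scatter pattern; not Trino code.
public final class NullableScatter
{
    /**
     * Scatter {@code values} (length nonNullCount) into
     * {@code output[offset .. offset + chunkSize)}, leaving positions
     * where {@code isNull} is true untouched.
     */
    public static void scatter(int[] values, boolean[] isNull, int[] output, int offset, int chunkSize)
    {
        int valueIndex = 0;
        for (int i = 0; i < chunkSize; i++) {
            if (!isNull[offset + i]) {
                // only non-null positions consume a decoded value
                output[offset + i] = values[valueIndex++];
            }
        }
    }
}
```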
Why does the method name indicate "block"?
The important thing inside ColumnChunk is a Block.
The existing class ColumnChunk is arguably misnamed: we're returning a batch of rows from a parquet column chunk in the form of a block, rather than the column chunk itself.
return new DataValuesBuffer<>(columnAdapter, batchSize);
}

private interface ValuesBuffer<T>
How is this different from ColumnAdapter? There seems to be some overlap.
This owns state (a buffer containing values for the current batch) while ColumnAdapter doesn't.
It also keeps track of the count of nulls in the current batch and uses that to produce an RLE block of nulls when all values are null, or a non-null block for a nullable column when the current batch had no nulls.
It's not doing anything type-specific, whereas ColumnAdapter is mainly about things that need to be done differently for various types.
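The null-tracking decision described above can be sketched as a small dispatch on the batch's null count. This is a hypothetical illustration of the logic (the names and the String return value are mine, not the PR's actual code, which produces Trino Block instances):

```java
// Illustrative sketch of the ValuesBuffer null-handling decision; not Trino code.
public final class NullTracking
{
    // Pick the kind of block a batch should produce, given its null count.
    public static String chooseBlockKind(int nullCount, int batchSize)
    {
        if (nullCount == batchSize) {
            return "RLE_NULL";  // entire batch is null: run-length-encoded null block
        }
        if (nullCount == 0) {
            return "NON_NULL";  // nullable column, but this batch had no nulls
        }
        return "NULLABLE";      // mixed batch: block carrying an isNull mask
    }
}
```

The all-null and no-null fast paths matter because they avoid materializing a per-position null mask for the common cases.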
Force-pushed from 1448f30 to 51ef12a
Force-pushed from dc577d2 to 31a0553
Attachments: Parquet dictionary block sf1k partitioned.pdf, Parquet dictionary block sf1k unpartitioned.pdf

TPC results showed regressions for some queries when we tried to use dictionary blocks for all types (despite restricting dictionary blocks to cases where the entire column chunk is dictionary encoded and the filtered row count is greater than the dictionary size).
Switch to producing DictionaryBlock from FlatColumnReader when a column is encoded fully using dictionary encoding and the count of rows read from the column chunk is larger than the size of the dictionary. We continue to resolve dictionary ids as before when there is a mix of dictionary and other encodings, to keep the implementation simpler. DictionaryBlock creation is restricted to variable-width data types, where dictionary processing is most beneficial.

| Benchmark | (benchmarkFileFormat) | (compression) | (dataSet) | Mode | Cnt | Score Before | Score After | Units |
|---|---|---|---|---|---|---|---|---|
| BenchmarkHiveFileFormat.read | TRINO_OPTIMIZED_PARQUET | NONE | LINEITEM | thrpt | 15 | 12.900 ± 0.209 | 21.332 ± 0.895 | ops/s |
| BenchmarkHiveFileFormat.read | TRINO_OPTIMIZED_PARQUET | NONE | VARCHAR_SMALL | thrpt | 15 | 12.210 ± 0.469 | 114.673 ± 3.483 | ops/s |
| BenchmarkHiveFileFormat.read | TRINO_OPTIMIZED_PARQUET | NONE | VARCHAR_LARGE | thrpt | 15 | 12.056 ± 0.737 | 116.865 ± 1.482 | ops/s |
| BenchmarkHiveFileFormat.read | TRINO_OPTIMIZED_PARQUET | NONE | VARCHAR_DICTIONARY | thrpt | 15 | 9.555 ± 0.729 | 164.582 ± 4.742 | ops/s |

Co-authored-by: Krzysztof Skrzypczynski <krzysztof.skrzypczynski@starburstdata.com>
Force-pushed from 31a0553 to e480db2
Description
Switch to producing DictionaryBlock from FlatColumnReader when a column
is encoded fully using dictionary encoding and the count of rows read
from the column chunk is larger than the size of the dictionary.
We continue to resolve dictionary ids as before when there is a mix of
dictionary and other encodings to keep the implementation simpler.
DictionaryBlock creation is restricted to variable width data types
where dictionary processing is most beneficial.
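The three conditions in the description can be summarized as a single predicate. This is a hedged sketch of that decision (names are illustrative; the actual check lives inside FlatColumnReader and operates on Trino's type and encoding objects):

```java
// Illustrative predicate for the DictionaryBlock decision described above; not Trino code.
public final class DictionaryDecision
{
    public static boolean produceDictionaryBlock(
            boolean chunkFullyDictionaryEncoded,  // no mix of dictionary and other encodings
            boolean variableWidthType,            // e.g. varchar, where dictionaries pay off most
            int rowsRead,                         // rows read from the column chunk
            int dictionarySize)
    {
        // Only worth it when rows outnumber distinct dictionary entries,
        // otherwise a plain block is as cheap and avoids indirection.
        return chunkFullyDictionaryEncoded
                && variableWidthType
                && rowsRead > dictionarySize;
    }
}
```

The `rowsRead > dictionarySize` guard reflects the benchmark findings: when few rows reference a large dictionary, wrapping them in a DictionaryBlock adds overhead without saving work.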
Additional context and related issues
Fixes #2020
Release notes
( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text: