
Produce DictionaryBlock when reading parquet dictionary encoded columns #15269

Merged (3 commits, Jan 6, 2023)

Conversation

@raunaqmorarka (Member) commented Dec 1, 2022

Description

Switch to producing DictionaryBlock from FlatColumnReader when a column
is fully dictionary encoded and the count of rows read from the column
chunk is larger than the size of the dictionary.
We continue to resolve dictionary ids as before when there is a mix of
dictionary and other encodings, to keep the implementation simpler.
DictionaryBlock creation is restricted to variable-width data types,
where dictionary processing is most beneficial.

Benchmark                     (benchmarkFileFormat)    (compression)           (dataSet)   Mode Cnt  Score Before    Score After      Units
BenchmarkHiveFileFormat.read  TRINO_OPTIMIZED_PARQUET           NONE            LINEITEM  thrpt  15  12.900 ± 0.209   21.332 ± 0.895  ops/s
BenchmarkHiveFileFormat.read  TRINO_OPTIMIZED_PARQUET           NONE       VARCHAR_SMALL  thrpt  15  12.210 ± 0.469  114.673 ± 3.483  ops/s
BenchmarkHiveFileFormat.read  TRINO_OPTIMIZED_PARQUET           NONE       VARCHAR_LARGE  thrpt  15  12.056 ± 0.737  116.865 ± 1.482  ops/s
BenchmarkHiveFileFormat.read  TRINO_OPTIMIZED_PARQUET           NONE  VARCHAR_DICTIONARY  thrpt  15   9.555 ± 0.729  164.582 ± 4.742  ops/s
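The decision described above can be sketched roughly as follows. This is a hypothetical simplification, not the actual FlatColumnReader code; the class and parameter names are illustrative only:

```java
// Hypothetical sketch of the DictionaryBlock decision described in this PR;
// names are illustrative, not the actual Trino FlatColumnReader API.
public final class DictionaryBlockDecision
{
    private DictionaryBlockDecision() {}

    /**
     * A DictionaryBlock is produced only when the whole column chunk is
     * dictionary encoded, the batch has more rows than the dictionary has
     * entries, and the type is variable width. Otherwise dictionary ids are
     * resolved eagerly into a flat block, keeping the mixed-encoding path
     * simple.
     */
    public static boolean produceDictionaryBlock(
            boolean chunkFullyDictionaryEncoded,
            long rowsReadFromChunk,
            int dictionarySize,
            boolean variableWidthType)
    {
        return chunkFullyDictionaryEncoded
                && rowsReadFromChunk > dictionarySize
                && variableWidthType;
    }
}
```

The `rowsReadFromChunk > dictionarySize` condition captures the intuition that a dictionary block only pays off when ids repeat: if the batch is smaller than the dictionary, eagerly resolving ids is at least as cheap.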

Additional context and related issues

Fixes #2020

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Hudi, Delta, Iceberg
* Improve performance of queries with filters or projections on low cardinality string columns stored in Parquet files. ({issue}`15269`)

@sopel39 (Member) commented Dec 1, 2022

I understand this is on hold due to regression?

@raunaqmorarka raunaqmorarka force-pushed the pqr-dict-block branch 4 times, most recently from d689d3a to 3a6744f Compare December 2, 2022 05:51
@sopel39 (Member) commented Dec 2, 2022

approved, we probably need to address other dictionary issues first


void readNullable(ValueDecoder<T> valueDecoder, boolean[] isNull, int offset, int nonNullCount, int chunkSize);

ColumnChunk createNonNullBlock();
Member:
Why does the method name indicate "block"?

Member Author:
The important thing inside ColumnChunk is the block.
The existing class ColumnChunk is not appropriately named: the reader returns a batch of rows from a Parquet column chunk in the form of a block, rather than the column chunk itself.

return new DataValuesBuffer<>(columnAdapter, batchSize);
}

private interface ValuesBuffer<T>
Member:
How is this different from ColumnAdapter? There seems to be some overlap.

Member Author:

This owns state (a buffer containing values for the current batch) while ColumnAdapter doesn't.
It also keeps track of the count of nulls in the current batch and uses that to produce an RLE block of nulls when all values are null, or a non-null block for a nullable column when the current batch had no nulls.
It does nothing type-specific, whereas ColumnAdapter is mainly about operations that need to be done differently for various types.
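The null bookkeeping described above can be sketched like this. It is a rough, hypothetical illustration with made-up names (NullAwareBuffer is a stand-in, not the actual Trino ValuesBuffer):

```java
// Rough sketch of the null bookkeeping described in the comment above;
// NullAwareBuffer is a hypothetical stand-in, not the actual Trino type.
final class NullAwareBuffer
{
    private final long[] values;
    private final boolean[] isNull;
    private int position;
    private int nullCount;

    NullAwareBuffer(int batchSize)
    {
        this.values = new long[batchSize];
        this.isNull = new boolean[batchSize];
    }

    void append(long value, boolean valueIsNull)
    {
        values[position] = value;
        isNull[position] = valueIsNull;
        if (valueIsNull) {
            nullCount++;
        }
        position++;
    }

    // Pick the cheapest block representation for the finished batch,
    // based only on the null count tracked while buffering.
    String blockShape()
    {
        if (position > 0 && nullCount == position) {
            return "RLE_NULL";   // every value was null: run-length encoded null block
        }
        if (nullCount == 0) {
            return "NON_NULL";   // no nulls: block without a null mask
        }
        return "NULLABLE";       // mixed: block carrying a null mask
    }
}
```

Tracking the null count while buffering means the all-null and no-null fast paths cost nothing extra at block-creation time.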

@raunaqmorarka raunaqmorarka force-pushed the pqr-dict-block branch 3 times, most recently from dc577d2 to 31a0553 Compare January 6, 2023 06:37
@raunaqmorarka (Member Author) commented:

Parquet dictionary block sf1k partitioned.pdf

Parquet dictionary block sf1k unpartitioned.pdf

TPC results showed regressions for some queries when dictionary blocks were used for all types (despite restricting dictionary blocks to cases where the entire column chunk is dictionary encoded and the filtered row count is greater than the dictionary size).
When restricted to variable-width types only, we see small improvements overall. This also makes the implementation consistent with the ORC reader, which produces dictionary blocks only for strings.

raunaqmorarka and others added 2 commits January 6, 2023 14:51
Co-authored-by: Krzysztof Skrzypczynski <krzysztof.skrzypczynski@starburstdata.com>
@raunaqmorarka raunaqmorarka merged commit a3b24b6 into trinodb:master Jan 6, 2023
@raunaqmorarka raunaqmorarka deleted the pqr-dict-block branch January 6, 2023 11:34
@github-actions github-actions bot added this to the 406 milestone Jan 6, 2023
Development

Successfully merging this pull request may close these issues.

Support DictionaryBlock for Parquet dictionary encoded columns