
Produce DictionaryBlock when reading parquet dictionary encoded columns #15269

Merged (3 commits, Jan 6, 2023)

Conversation

@raunaqmorarka (Member) commented Dec 1, 2022

Description

Switch to producing DictionaryBlock from FlatColumnReader when a column
is fully dictionary encoded and the count of rows read from the column
chunk is larger than the size of the dictionary.
We continue to resolve dictionary ids as before when there is a mix of
dictionary and other encodings, to keep the implementation simpler.
DictionaryBlock creation is restricted to variable-width data types,
where dictionary processing is most beneficial.

Benchmark                     (benchmarkFileFormat)    (compression)           (dataSet)   Mode Cnt  Score Before    Score After      Units
BenchmarkHiveFileFormat.read  TRINO_OPTIMIZED_PARQUET           NONE            LINEITEM  thrpt  15  12.900 ± 0.209   21.332 ± 0.895  ops/s
BenchmarkHiveFileFormat.read  TRINO_OPTIMIZED_PARQUET           NONE       VARCHAR_SMALL  thrpt  15  12.210 ± 0.469  114.673 ± 3.483  ops/s
BenchmarkHiveFileFormat.read  TRINO_OPTIMIZED_PARQUET           NONE       VARCHAR_LARGE  thrpt  15  12.056 ± 0.737  116.865 ± 1.482  ops/s
BenchmarkHiveFileFormat.read  TRINO_OPTIMIZED_PARQUET           NONE  VARCHAR_DICTIONARY  thrpt  15   9.555 ± 0.729  164.582 ± 4.742  ops/s
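The decision described above can be sketched roughly as follows. This is a hypothetical simplification, not the actual FlatColumnReader code; the class and parameter names are illustrative only:

```java
// Hypothetical sketch of the DictionaryBlock decision described in this PR;
// names are illustrative, not the actual Trino FlatColumnReader API.
public final class DictionaryBlockDecision
{
    private DictionaryBlockDecision() {}

    /**
     * A DictionaryBlock is produced only when the whole column chunk is
     * dictionary encoded, the batch has more rows than the dictionary has
     * entries, and the type is variable width. Otherwise dictionary ids are
     * resolved eagerly into a flat block, keeping the mixed-encoding path
     * simple.
     */
    public static boolean produceDictionaryBlock(
            boolean chunkFullyDictionaryEncoded,
            long rowsReadFromChunk,
            int dictionarySize,
            boolean variableWidthType)
    {
        return chunkFullyDictionaryEncoded
                && rowsReadFromChunk > dictionarySize
                && variableWidthType;
    }
}
```

The `rowsReadFromChunk > dictionarySize` condition captures the intuition that a dictionary block only pays off when ids repeat: if the batch is smaller than the dictionary, eagerly resolving ids is at least as cheap.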

Additional context and related issues

Fixes #2020

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Hudi, Delta, Iceberg
* Improve performance of queries with filters or projections on low cardinality string columns stored in Parquet files. ({issue}`15269`)

@sopel39 (Member) commented Dec 1, 2022

I understand this is on hold due to regression?

@raunaqmorarka raunaqmorarka force-pushed the pqr-dict-block branch 4 times, most recently from d689d3a to 3a6744f Compare December 2, 2022 05:51
@sopel39 (Member) commented Dec 2, 2022

approved, we probably need to address other dictionary issues first


void readNullable(ValueDecoder<T> valueDecoder, boolean[] isNull, int offset, int nonNullCount, int chunkSize);

ColumnChunk createNonNullBlock();
Member:
Why does the method name indicate "block"?

Member Author:
The important thing inside ColumnChunk is the block.
The existing class ColumnChunk is not appropriately named: the reader returns a batch of rows from a Parquet column chunk in the form of a block, rather than the column chunk itself.

return new DataValuesBuffer<>(columnAdapter, batchSize);
}

private interface ValuesBuffer<T>
Member:
How is this different from ColumnAdapter? There seems to be some overlap.

Member Author:

This owns state (a buffer containing values for the current batch) while ColumnAdapter doesn't.
It also keeps track of the count of nulls in the current batch and uses that to produce an RLE block of nulls when all values are null, or a non-null block for a nullable column when the current batch had no nulls.
It does nothing type-specific, whereas ColumnAdapter is mainly about operations that need to be done differently for various types.
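The null bookkeeping described above can be sketched like this. It is a rough, hypothetical illustration with made-up names (NullAwareBuffer is a stand-in, not the actual Trino ValuesBuffer):

```java
// Rough sketch of the null bookkeeping described in the comment above;
// NullAwareBuffer is a hypothetical stand-in, not the actual Trino type.
final class NullAwareBuffer
{
    private final long[] values;
    private final boolean[] isNull;
    private int position;
    private int nullCount;

    NullAwareBuffer(int batchSize)
    {
        this.values = new long[batchSize];
        this.isNull = new boolean[batchSize];
    }

    void append(long value, boolean valueIsNull)
    {
        values[position] = value;
        isNull[position] = valueIsNull;
        if (valueIsNull) {
            nullCount++;
        }
        position++;
    }

    // Pick the cheapest block representation for the finished batch,
    // based only on the null count tracked while buffering.
    String blockShape()
    {
        if (position > 0 && nullCount == position) {
            return "RLE_NULL";   // every value was null: run-length encoded null block
        }
        if (nullCount == 0) {
            return "NON_NULL";   // no nulls: block without a null mask
        }
        return "NULLABLE";       // mixed: block carrying a null mask
    }
}
```

Tracking the null count while buffering means the all-null and no-null fast paths cost nothing extra at block-creation time.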

@raunaqmorarka raunaqmorarka force-pushed the pqr-dict-block branch 3 times, most recently from dc577d2 to 31a0553 Compare January 6, 2023 06:37
@raunaqmorarka (Member Author) commented:

Parquet dictionary block sf1k partitioned.pdf

Parquet dictionary block sf1k unpartitioned.pdf

TPC results showed regressions for some queries when dictionary blocks were used for all types (despite restricting dictionary blocks to cases where the entire column chunk is dictionary encoded and the filtered row count is greater than the dictionary size).
When restricted to variable-width types only, we see small improvements overall. This also makes the implementation consistent with the ORC reader, which produces dictionary blocks only for strings.

raunaqmorarka and others added 2 commits January 6, 2023 14:51
Co-authored-by: Krzysztof Skrzypczynski <krzysztof.skrzypczynski@starburstdata.com>
@raunaqmorarka raunaqmorarka merged commit a3b24b6 into trinodb:master Jan 6, 2023
@raunaqmorarka raunaqmorarka deleted the pqr-dict-block branch January 6, 2023 11:34
@github-actions github-actions bot added this to the 406 milestone Jan 6, 2023
Development

Successfully merging this pull request may close these issues.

Support DictionaryBlock for Parquet dictionary encoded columns