Optimize DELTA_LENGTH_BYTE_ARRAY decoder in parquet #15897

Merged
merged 1 commit into from
Feb 1, 2023

@raunaqmorarka raunaqmorarka commented Jan 30, 2023

Description

Optimize DELTA_LENGTH_BYTE_ARRAY decoder in parquet

Benchmark                         (positionLength)                             (type)   Mode  Cnt    Before             After              Units
BenchmarkBinaryColumnReader.read    VARIABLE_0_100                          UNBOUNDED  thrpt   10    1.795 ±  0.793     87.765 ±    6.190  ops/s
BenchmarkBinaryColumnReader.read    VARIABLE_0_100          VARCHAR_ASCII_BOUND_EXACT  thrpt   10    2.065 ±  0.539     96.691 ±   13.081  ops/s
BenchmarkBinaryColumnReader.read    VARIABLE_0_100              CHAR_ASCII_BOUND_HALF  thrpt   10    2.401 ±  0.134      5.792 ±    1.579  ops/s
BenchmarkBinaryColumnReader.read    VARIABLE_0_100  CHAR_BOUND_HALF_PADDING_SOMETIMES  thrpt   10    2.607 ±  0.113      7.328 ±    2.045  ops/s
BenchmarkBinaryColumnReader.read   VARIABLE_0_1000                          UNBOUNDED  thrpt   10  669.902 ± 85.391  72508.841 ± 1083.144  ops/s
BenchmarkBinaryColumnReader.read   VARIABLE_0_1000          VARCHAR_ASCII_BOUND_EXACT  thrpt   10  587.317 ± 21.659  66455.490 ± 1082.769  ops/s
BenchmarkBinaryColumnReader.read   VARIABLE_0_1000              CHAR_ASCII_BOUND_HALF  thrpt   10  505.577 ± 69.876   1758.385 ±   37.769  ops/s
BenchmarkBinaryColumnReader.read   VARIABLE_0_1000  CHAR_BOUND_HALF_PADDING_SOMETIMES  thrpt   10  472.345 ± 60.199   1683.888 ±   45.331  ops/s
BenchmarkBinaryColumnReader.read          FIXED_10                          UNBOUNDED  thrpt   10    4.192 ±  0.806    252.310 ±    7.160  ops/s
BenchmarkBinaryColumnReader.read          FIXED_10          VARCHAR_ASCII_BOUND_EXACT  thrpt   10    3.741 ±  0.761    241.414 ±    5.647  ops/s
BenchmarkBinaryColumnReader.read          FIXED_10              CHAR_ASCII_BOUND_HALF  thrpt   10    3.386 ±  0.582     13.649 ±    0.220  ops/s
BenchmarkBinaryColumnReader.read          FIXED_10  CHAR_BOUND_HALF_PADDING_SOMETIMES  thrpt   10    4.587 ±  0.424     14.169 ±    0.302  ops/s
BenchmarkBinaryColumnReader.read         FIXED_100                          UNBOUNDED  thrpt   10  423.869 ± 26.744  31731.563 ±  842.399  ops/s
BenchmarkBinaryColumnReader.read         FIXED_100          VARCHAR_ASCII_BOUND_EXACT  thrpt   10  468.546 ± 13.797  27931.904 ±  438.979  ops/s
BenchmarkBinaryColumnReader.read         FIXED_100              CHAR_ASCII_BOUND_HALF  thrpt   10  348.080 ± 54.947    872.921 ±   63.038  ops/s
BenchmarkBinaryColumnReader.read         FIXED_100  CHAR_BOUND_HALF_PADDING_SOMETIMES  thrpt   10  337.808 ± 55.891    840.805 ±   74.542  ops/s
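
For readers unfamiliar with the encoding: a DELTA_LENGTH_BYTE_ARRAY page stores all value lengths first (DELTA_BINARY_PACKED), followed by the concatenated value bytes. The sketch below is not the code from this PR; it only illustrates, under assumptions, why such a layout can be decoded in bulk. It assumes the lengths have already been decoded into an `int[]`, and the class and method names (`DeltaLengthByteArraySketch`, `toOffsets`, `copyValues`, `valueAt`) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only; names are hypothetical and this is not Trino's decoder.
class DeltaLengthByteArraySketch {
    // Turn per-value lengths into cumulative offsets.
    // Value i occupies bytes offsets[i] (inclusive) to offsets[i + 1] (exclusive).
    static int[] toOffsets(int[] lengths) {
        int[] offsets = new int[lengths.length + 1];
        for (int i = 0; i < lengths.length; i++) {
            if (lengths[i] < 0) {
                throw new IllegalArgumentException("negative length at index " + i);
            }
            offsets[i + 1] = Math.addExact(offsets[i], lengths[i]);
        }
        return offsets;
    }

    // Copy the bytes of all values with a single bulk copy instead of one copy per value.
    // dataStart is the position in the page where the concatenated value bytes begin.
    static byte[] copyValues(byte[] page, int dataStart, int[] offsets) {
        int total = offsets[offsets.length - 1];
        if (dataStart < 0 || dataStart + total > page.length) {
            throw new IllegalArgumentException("page is too short for the declared lengths");
        }
        return Arrays.copyOfRange(page, dataStart, dataStart + total);
    }

    // Materialize a single value as a String (assumes UTF-8 data).
    static String valueAt(byte[] values, int[] offsets, int i) {
        return new String(values, offsets[i], offsets[i + 1] - offsets[i], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int[] lengths = {5, 0, 7}; // assumed to be already decoded from the length section
        byte[] page = "helloparquet".getBytes(StandardCharsets.UTF_8);
        int[] offsets = toOffsets(lengths);
        byte[] values = copyValues(page, 0, offsets);
        for (int i = 0; i < lengths.length; i++) {
            System.out.println(valueAt(values, offsets, i));
        }
    }
}
```

The offsets array plus a single contiguous byte buffer is a common in-memory shape for variable-width columns, so a decoder along these lines can avoid allocating an object per value.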

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Hudi, Delta, Iceberg
* Improve performance of reading string data types from Parquet files. ({issue}`15897`)

@cla-bot cla-bot bot added the cla-signed label Jan 30, 2023
@raunaqmorarka raunaqmorarka requested review from skrzypo987, sopel39 and martint and removed request for skrzypo987 January 30, 2023 11:44

Co-authored-by: Krzysztof Skrzypczynski <krzysztof.skrzypczynski@starburstdata.com>
@raunaqmorarka raunaqmorarka merged commit 4c90738 into trinodb:master Feb 1, 2023
@raunaqmorarka raunaqmorarka deleted the pqr-v2-string branch February 1, 2023 03:08
@github-actions github-actions bot added this to the 407 milestone Feb 1, 2023