Optimize DELTA_BYTE_ARRAY decoder in parquet reader #15923

Merged 3 commits on Feb 2, 2023

@raunaqmorarka (Member) commented Feb 1, 2023

Description

Optimize the DELTA_BYTE_ARRAY decoder in the Parquet reader for the BINARY and FIXED_LEN_BYTE_ARRAY Parquet types.
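
For readers unfamiliar with the encoding: in the Parquet format, DELTA_BYTE_ARRAY stores each value as the number of leading bytes it shares with the previous value, plus the remaining suffix bytes. The following is a minimal sketch of that reconstruction step only, not the code from this PR. It assumes the prefix lengths and suffixes have already been decoded from their own encodings, and the class and method names (`DeltaByteArraySketch`, `decode`) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of DELTA_BYTE_ARRAY value reconstruction.
// Not the Trino implementation; inputs are assumed to be pre-decoded.
public class DeltaByteArraySketch
{
    // prefixLengths[i]: number of leading bytes value i shares with value i - 1
    // suffixes[i]: the remaining bytes of value i
    static byte[][] decode(int[] prefixLengths, byte[][] suffixes)
    {
        if (prefixLengths.length != suffixes.length) {
            throw new IllegalArgumentException("prefixLengths and suffixes must have the same length");
        }
        byte[][] values = new byte[suffixes.length][];
        byte[] previous = new byte[0];
        for (int i = 0; i < suffixes.length; i++) {
            int prefixLength = prefixLengths[i];
            if (prefixLength < 0 || prefixLength > previous.length) {
                throw new IllegalArgumentException("Invalid prefix length at position " + i);
            }
            byte[] suffix = suffixes[i];
            byte[] value = new byte[prefixLength + suffix.length];
            // Copy the shared prefix from the previous value, then append the suffix
            System.arraycopy(previous, 0, value, 0, prefixLength);
            System.arraycopy(suffix, 0, value, prefixLength, suffix.length);
            values[i] = value;
            previous = value;
        }
        return values;
    }

    public static void main(String[] args)
    {
        int[] prefixLengths = {0, 4, 3};
        byte[][] suffixes = {
                "apple".getBytes(StandardCharsets.UTF_8),
                "y".getBytes(StandardCharsets.UTF_8),
                "roved".getBytes(StandardCharsets.UTF_8)};
        for (byte[] value : decode(prefixLengths, suffixes)) {
            System.out.println(new String(value, StandardCharsets.UTF_8));
        }
        // prints apple, apply, approved (one per line)
    }
}
```

Because every value depends on the one before it, the loop is inherently sequential, and the cost is dominated by how cheaply the prefix and suffix bytes can be copied into the output. A production reader would likely write into one shared output buffer with an offsets array instead of allocating a `byte[]` per value as this sketch does.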

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Hudi, Delta, Iceberg
* Improve performance of reading string data types from parquet files. ({issue}`15923`)

@sopel39 (Member) left a comment

Please also wait for @skrzypo987's approval.

raunaqmorarka and others added 3 commits February 2, 2023 11:47
Benchmark                         (positionLength)                             (type)   Mode  Cnt     Before            After             Units
BenchmarkBinaryColumnReader.read    VARIABLE_0_100                          UNBOUNDED  thrpt   10     6.772 ±  0.282    63.404 ±   1.339  ops/s
BenchmarkBinaryColumnReader.read    VARIABLE_0_100          VARCHAR_ASCII_BOUND_EXACT  thrpt   10     6.197 ±  1.045    61.093 ±   0.756  ops/s
BenchmarkBinaryColumnReader.read    VARIABLE_0_100              CHAR_ASCII_BOUND_HALF  thrpt   10     6.581 ±  0.559    20.286 ±   3.499  ops/s
BenchmarkBinaryColumnReader.read    VARIABLE_0_100  CHAR_BOUND_HALF_PADDING_SOMETIMES  thrpt   10     7.107 ±  0.129    20.926 ±   1.483  ops/s
BenchmarkBinaryColumnReader.read   VARIABLE_0_1000                          UNBOUNDED  thrpt   10  1764.530 ± 65.214  8311.994 ± 417.798  ops/s
BenchmarkBinaryColumnReader.read   VARIABLE_0_1000          VARCHAR_ASCII_BOUND_EXACT  thrpt   10  1615.245 ± 70.177  7364.618 ± 220.423  ops/s
BenchmarkBinaryColumnReader.read   VARIABLE_0_1000              CHAR_ASCII_BOUND_HALF  thrpt   10  1601.118 ± 46.370  3460.392 ± 114.178  ops/s
BenchmarkBinaryColumnReader.read   VARIABLE_0_1000  CHAR_BOUND_HALF_PADDING_SOMETIMES  thrpt   10  1315.271 ± 74.272  3522.025 ± 137.662  ops/s
BenchmarkBinaryColumnReader.read          FIXED_10                          UNBOUNDED  thrpt   10    12.629 ±  0.554   138.457 ±   4.436  ops/s
BenchmarkBinaryColumnReader.read          FIXED_10          VARCHAR_ASCII_BOUND_EXACT  thrpt   10     9.612 ±  3.404   125.494 ±   3.186  ops/s
BenchmarkBinaryColumnReader.read          FIXED_10              CHAR_ASCII_BOUND_HALF  thrpt   10    10.112 ±  0.340    39.157 ±   5.328  ops/s
BenchmarkBinaryColumnReader.read          FIXED_10  CHAR_BOUND_HALF_PADDING_SOMETIMES  thrpt   10    10.293 ±  0.774    37.144 ±   6.790  ops/s
BenchmarkBinaryColumnReader.read         FIXED_100                          UNBOUNDED  thrpt   10   806.429 ± 12.979  5250.974 ±  96.183  ops/s
BenchmarkBinaryColumnReader.read         FIXED_100          VARCHAR_ASCII_BOUND_EXACT  thrpt   10   798.022 ± 12.012  5383.613 ± 251.305  ops/s
BenchmarkBinaryColumnReader.read         FIXED_100              CHAR_ASCII_BOUND_HALF  thrpt   10   644.532 ± 28.102  2031.493 ± 106.947  ops/s
BenchmarkBinaryColumnReader.read         FIXED_100  CHAR_BOUND_HALF_PADDING_SOMETIMES  thrpt   10   676.331 ± 39.974  2014.215 ± 134.481  ops/s

Co-authored-by: Krzysztof Skrzypczynski <krzysztof.skrzypczynski@starburstdata.com>
Benchmark                               (byteArrayLength)   Mode  Cnt  Before           After           Units
BenchmarkShortDecimalColumnReader.read                  1  thrpt   10  19.552 ± 2.7966  109.689 ± 3.551  ops/s
BenchmarkShortDecimalColumnReader.read                  2  thrpt   10  19.388 ± 0.7022   77.033 ± 4.054  ops/s
BenchmarkShortDecimalColumnReader.read                  3  thrpt   10  16.082 ± 1.4900   59.217 ± 3.118  ops/s
BenchmarkShortDecimalColumnReader.read                  4  thrpt   10  20.012 ± 1.3366   73.665 ± 3.047  ops/s
BenchmarkShortDecimalColumnReader.read                  5  thrpt   10  16.827 ± 2.2422   55.817 ± 5.022  ops/s
BenchmarkShortDecimalColumnReader.read                  6  thrpt   10  20.799 ± 0.1855   66.127 ± 1.533  ops/s
BenchmarkShortDecimalColumnReader.read                  7  thrpt   10  16.956 ± 1.0444   54.469 ± 3.195  ops/s
BenchmarkShortDecimalColumnReader.read                  8  thrpt   10  14.576 ± 2.4777   48.632 ± 1.844  ops/s

Benchmark                               Mode  Cnt  Before          After           Units
BenchmarkLongDecimalColumnReader.read  thrpt   20  19.669 ± 1.816  37.417 ± 0.829  ops/s

Benchmark                        Mode  Cnt   Before           After           Units
BenchmarkUuidColumnReader.read  thrpt   20   22.372 ± 1.186   79.362 ± 5.013  ops/s
@raunaqmorarka raunaqmorarka merged commit 504fab7 into trinodb:master Feb 2, 2023
@raunaqmorarka raunaqmorarka deleted the pqr-delta-byte-array branch February 2, 2023 10:50
@github-actions github-actions bot added this to the 407 milestone Feb 2, 2023