Group small columns together in parquet files #17404

raunaqmorarka · 2023-05-09T10:17:17Z

Description

Modified the Parquet writer to store columns in order of their size inside row groups,
so that the reader can fetch small columns in fewer filesystem requests.
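
To illustrate the idea, here is a minimal sketch of ordering buffered column chunks by size before they are written into a row group. The class `ColumnOrderSketch`, the type `ColumnChunk`, and the method `orderBySize` are hypothetical names for illustration and are not the classes changed in this PR.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; these names are illustrative and not the Trino parquet writer's classes.
final class ColumnOrderSketch
{
    // A buffered column chunk: its column name and serialized size in bytes.
    static final class ColumnChunk
    {
        final String name;
        final long sizeInBytes;

        ColumnChunk(String name, long sizeInBytes)
        {
            this.name = name;
            this.sizeInBytes = sizeInBytes;
        }
    }

    // Returns a copy of the chunks ordered from smallest to largest.
    // List.sort is stable, so chunks of equal size keep their original relative order.
    static List<ColumnChunk> orderBySize(List<ColumnChunk> chunks)
    {
        List<ColumnChunk> ordered = new ArrayList<>(chunks);
        ordered.sort(Comparator.comparingLong((ColumnChunk chunk) -> chunk.sizeInBytes));
        return ordered;
    }
}
```

With chunks `payload` (5,000,000 bytes), `id` (4,000), `flag` (100) and `ts` (4,000) in that order, `orderBySize` returns `flag`, `id`, `ts`, `payload`, so the three small chunks end up next to each other.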

Additional context and related issues

Based on similar logic in the ORC writer at

trino/lib/trino-orc/src/main/java/io/trino/orc/OrcWriter.java

Line 420 in a13e451

Collections.sort(dataStreams);

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Improve the layout of data in parquet files produced by the optimized parquet writer for faster reads. ({issue}`17404`)

# Hudi, Iceberg, Delta
* Improve the layout of data in parquet files for faster reads. ({issue}`17404`)


sopel39

Do you think it might drop performance in some scenarios?

E.g. it seems optimally it should group columns that are mostly used together

raunaqmorarka · 2023-05-09T18:03:38Z

> Do you think it might drop performance in some scenarios?
>
> E.g. it seems optimally it should group columns that are mostly used together

This change matters only for columns smaller than parquet.max-buffer-size (default 8MB). Since only the smaller columns are affected, the penalty for a bad ordering decision shouldn't be high, and the current heuristic is the best we can do without knowing usage patterns. I'm also relying on this having been a successful optimization (or at least one that drew no complaints) in ORC.
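
A rough way to see the effect is to model a reader that merges touching column chunk ranges into one request as long as the merged span stays within the buffer limit. This is a simplified sketch under that assumed merge rule, not the actual reader code; `ReadCoalescingSketch`, `Range`, `layout` and `countRequests` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified model of read coalescing; not Trino's actual reader code.
final class ReadCoalescingSketch
{
    // Byte range of one column chunk inside a row group.
    static final class Range
    {
        final long offset;
        final long length;

        Range(long offset, long length)
        {
            this.offset = offset;
            this.length = length;
        }

        long end()
        {
            return offset + length;
        }
    }

    // Lays chunks out back to back in the given order and returns their ranges.
    static List<Range> layout(long[] sizes)
    {
        List<Range> ranges = new ArrayList<>();
        long offset = 0;
        for (long size : sizes) {
            ranges.add(new Range(offset, size));
            offset += size;
        }
        return ranges;
    }

    // Counts the requests needed to read the wanted ranges. Assumed rule: a range is merged
    // into the current request only if it starts exactly where the request ends and the
    // merged span does not exceed maxBufferSize.
    static int countRequests(List<Range> wanted, long maxBufferSize)
    {
        List<Range> sorted = new ArrayList<>(wanted);
        sorted.sort(Comparator.comparingLong((Range range) -> range.offset));
        int requests = 0;
        long start = -1;
        long end = -1;
        for (Range range : sorted) {
            boolean contiguous = requests > 0 && range.offset == end;
            if (contiguous && range.end() - start <= maxBufferSize) {
                end = range.end();
            }
            else {
                requests++;
                start = range.offset;
                end = range.end();
            }
        }
        return requests;
    }
}
```

Take chunks of 1KB, 100MB, 2KB and 1KB in schema order, a query that reads only the three small ones, and an 8MB limit. In this model `countRequests` gives 2 for the schema-order layout and 1 once the chunks are laid out by size (1KB, 1KB, 2KB, 100MB). Columns larger than the limit are never merged, so their position does not change the count.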

cla-bot bot added the cla-signed label May 9, 2023

raunaqmorarka requested review from sopel39 and alexjo2144 May 9, 2023 10:17

Group small columns together in parquet files

3072e00


raunaqmorarka force-pushed the pqw-reorder branch from e968526 to 3072e00 Compare May 9, 2023 10:30

sopel39 approved these changes May 9, 2023


github-actions bot added the tests:hive label May 9, 2023

raunaqmorarka merged commit 20740eb into trinodb:master May 9, 2023

raunaqmorarka deleted the pqw-reorder branch May 9, 2023 18:15

raunaqmorarka mentioned this pull request May 9, 2023

Release notes for 417 #17339

Closed

github-actions bot added this to the 417 milestone May 9, 2023

colebow mentioned this pull request May 10, 2023

Add Trino 417 release notes #17447

Merged

raunaqmorarka added the performance label May 10, 2023

raunaqmorarka mentioned this pull request May 18, 2023

Add Trino 418 release notes #17540

Merged

raunaqmorarka mentioned this pull request Jun 20, 2023

Revert optimization to reorder columns in parquet writer #17978

Merged
