Group small columns together in parquet files #17404

Merged: 1 commit, May 9, 2023

raunaqmorarka
Member

@raunaqmorarka raunaqmorarka commented May 9, 2023

Description

Modified the Parquet writer to store columns in order of their size within each row group,
so that the reader can fetch small columns in fewer filesystem requests.
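
The ordering step can be illustrated with a minimal sketch. This is not the code from this PR: `ColumnChunk` and `orderBySize` are hypothetical names, and the sketch assumes the writer has already buffered every column chunk of a row group and knows its encoded size.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ColumnOrderingSketch {
    // Hypothetical stand-in for a buffered column chunk: a name and its encoded size in bytes.
    record ColumnChunk(String name, long sizeInBytes) {}

    // Returns a new list with the smallest chunks first. The sort is stable,
    // so chunks of equal size keep their original (schema) order.
    static List<ColumnChunk> orderBySize(List<ColumnChunk> chunks) {
        List<ColumnChunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingLong(ColumnChunk::sizeInBytes));
        return sorted;
    }

    public static void main(String[] args) {
        // Illustrative sizes only; they are not measurements.
        List<ColumnChunk> rowGroup = List.of(
                new ColumnChunk("comment", 40_000_000L),
                new ColumnChunk("id", 2_000_000L),
                new ColumnChunk("payload", 90_000_000L),
                new ColumnChunk("flag", 100_000L));
        // Chunks would be written to the file in this order: flag, id, comment, payload.
        for (ColumnChunk chunk : orderBySize(rowGroup)) {
            System.out.println(chunk.name() + " " + chunk.sizeInBytes());
        }
    }
}
```

With this ordering, the small `flag` and `id` chunks end up next to each other at the start of the row group instead of being separated by larger chunks.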

Additional context and related issues

Based on similar logic in the ORC writer:

Collections.sort(dataStreams);

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Improve the layout of data in parquet files produced by the optimized parquet writer for faster reads. ({issue}`17404`)

# Hudi, Iceberg, Delta
* Improve the layout of data in parquet files for faster reads. ({issue}`17404`)

@cla-bot cla-bot bot added the cla-signed label May 9, 2023
Member

@sopel39 sopel39 left a comment

Do you think it might drop performance in some scenarios?

E.g. it seems optimally it should group columns that are mostly used together

@raunaqmorarka
Member Author

> Do you think it might drop performance in some scenarios?
>
> E.g. it seems optimally it should group columns that are mostly used together

This change matters only for columns smaller than `parquet.max-buffer-size` (default 8MB). Since only the smaller columns are affected, the penalty for a bad decision shouldn't be high, and the current heuristic is the best we can do without knowing usage patterns. I'm also relying on this having been a successful optimization (or at least one that drew no complaints) in ORC.
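
To make the effect on request counts concrete, here is a rough sketch under a deliberately simplified reader model. It is not the actual reader: `countReads` and `ColumnChunk` are hypothetical, and the model assumes that requested chunks are merged into a single read only when they are directly adjacent in the file and the merged read fits within the buffer limit. A real reader may also merge across small gaps and split large ranges.

```java
import java.util.List;
import java.util.Set;

public class ReadCoalescingSketch {
    // Hypothetical stand-in for a column chunk: a name and its size in bytes.
    record ColumnChunk(String name, long sizeInBytes) {}

    // Counts the reads needed to fetch the wanted columns from chunks laid out back to back.
    // Assumed model: a wanted chunk joins the current read only if it starts exactly where
    // that read ends and the combined size stays within maxBufferSize.
    // A chunk larger than maxBufferSize is counted as one read here for simplicity.
    static int countReads(List<ColumnChunk> layout, Set<String> wanted, long maxBufferSize) {
        int reads = 0;
        long currentEnd = -1;  // file offset where the current read ends
        long currentSize = 0;  // bytes covered by the current read
        long offset = 0;       // file offset of the chunk being visited
        for (ColumnChunk chunk : layout) {
            if (wanted.contains(chunk.name())) {
                boolean adjacent = reads > 0 && offset == currentEnd;
                if (adjacent && currentSize + chunk.sizeInBytes() <= maxBufferSize) {
                    currentSize += chunk.sizeInBytes();
                } else {
                    reads++;
                    currentSize = chunk.sizeInBytes();
                }
                currentEnd = offset + chunk.sizeInBytes();
            }
            offset += chunk.sizeInBytes();
        }
        return reads;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        ColumnChunk id = new ColumnChunk("id", 1 * mb);
        ColumnChunk payload = new ColumnChunk("payload", 90 * mb);
        ColumnChunk flag = new ColumnChunk("flag", 100 * 1024L);
        ColumnChunk comment = new ColumnChunk("comment", 40 * mb);
        ColumnChunk ts = new ColumnChunk("ts", 2 * mb);
        Set<String> wanted = Set.of("id", "flag", "ts");

        List<ColumnChunk> schemaOrder = List.of(id, payload, flag, comment, ts);
        List<ColumnChunk> sizeOrder = List.of(flag, id, ts, comment, payload);

        System.out.println(countReads(schemaOrder, wanted, 8 * mb)); // prints 3
        System.out.println(countReads(sizeOrder, wanted, 8 * mb));   // prints 1
    }
}
```

In this illustrative layout the three small columns cost three reads in schema order and one read in size order. The sketch also shows the limit of the heuristic raised in the review: it helps only when the requested columns are small enough to be merged within the buffer limit.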
