Optimizations in parquet file page materialization #5582

Merged: 8 commits merged on Jun 17, 2024

Conversation

malhotrashivam
Contributor

@malhotrashivam malhotrashivam commented Jun 6, 2024

In the original code, for non-primitive types such as non-dictionary-encoded Strings, LocalDate, etc., we read the bytes from the parquet file and convert them to binary data, and then later convert the binary data to the appropriate type, such as String. This causes an extra memory allocation and an extra copy for each value. After this PR, we go directly from parquet file -> String in one step, which improves performance by up to 10% and significantly reduces memory utilization.
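
To illustrate the difference between the two paths, here is a minimal sketch. It is not the code from this PR: the class `DirectStringDecode`, its method names, and the `encode` helper are hypothetical, and it assumes a heap-backed `ByteBuffer` holding values laid out as a 4-byte little-endian length followed by UTF-8 bytes (the layout Parquet PLAIN encoding uses for BYTE_ARRAY).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names come from the PR.
public class DirectStringDecode {

    // Two-step path: copy each value into an intermediate byte[] (standing in
    // for a binary wrapper object), then convert that copy to a String.
    public static String[] decodeViaIntermediate(ByteBuffer page, int numValues) {
        ByteBuffer buf = page.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        byte[][] intermediates = new byte[numValues][];
        for (int i = 0; i < numValues; i++) {
            int len = buf.getInt();
            intermediates[i] = new byte[len]; // extra allocation per value
            buf.get(intermediates[i]);        // extra copy per value
        }
        String[] result = new String[numValues];
        for (int i = 0; i < numValues; i++) {
            result[i] = new String(intermediates[i], StandardCharsets.UTF_8);
        }
        return result;
    }

    // One-step path: build each String straight from the page's backing array.
    // Assumes page.hasArray() is true (heap buffer).
    public static String[] decodeDirect(ByteBuffer page, int numValues) {
        ByteBuffer buf = page.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        byte[] backing = buf.array();
        int base = buf.arrayOffset();
        String[] result = new String[numValues];
        for (int i = 0; i < numValues; i++) {
            int len = buf.getInt();
            int pos = buf.position();
            result[i] = new String(backing, base + pos, len, StandardCharsets.UTF_8);
            buf.position(pos + len); // skip past the bytes just decoded
        }
        return result;
    }

    // Test helper: writes length-prefixed UTF-8 values into a heap buffer.
    public static ByteBuffer encode(String... values) {
        byte[][] encoded = new byte[values.length][];
        int size = 0;
        for (int i = 0; i < values.length; i++) {
            encoded[i] = values[i].getBytes(StandardCharsets.UTF_8);
            size += 4 + encoded[i].length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
        for (byte[] e : encoded) {
            buf.putInt(e.length);
            buf.put(e);
        }
        buf.flip();
        return buf;
    }
}
```

Both methods return the same values; the direct version just skips the per-value intermediate array.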

The same optimization has also been applied to a few primitive types: char, short, and byte. For these types, we originally did parquet data -> int and then int -> short/char/byte. Now we go directly from parquet file -> short/char/byte.
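
The primitive case can be sketched the same way. Again, this is an illustration under assumptions rather than the PR's code: `DirectShortDecode` and its methods are hypothetical names, and the page is assumed to hold 4-byte little-endian INT32 values, which is how Parquet physically stores the narrower integer types.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; none of these names come from the PR.
public class DirectShortDecode {

    // Two-step path: materialize an int[] first, then narrow it to short[].
    public static short[] decodeViaIntArray(ByteBuffer page, int numValues) {
        ByteBuffer buf = page.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int[] ints = new int[numValues]; // intermediate array, 4 bytes per value
        for (int i = 0; i < numValues; i++) {
            ints[i] = buf.getInt();
        }
        short[] result = new short[numValues];
        for (int i = 0; i < numValues; i++) {
            result[i] = (short) ints[i]; // second pass over the data
        }
        return result;
    }

    // One-step path: narrow each value as it is read, with no int[] in between.
    public static short[] decodeDirect(ByteBuffer page, int numValues) {
        ByteBuffer buf = page.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        short[] result = new short[numValues];
        for (int i = 0; i < numValues; i++) {
            result[i] = (short) buf.getInt();
        }
        return result;
    }

    // Test helper: writes the given ints as little-endian INT32 values.
    public static ByteBuffer encodeInts(int... values) {
        ByteBuffer buf = ByteBuffer.allocate(4 * values.length).order(ByteOrder.LITTLE_ENDIAN);
        for (int v : values) {
            buf.putInt(v);
        }
        buf.flip();
        return buf;
    }
}
```

The same shape applies to char and byte by changing the cast and the result array type.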

@malhotrashivam malhotrashivam added the parquet (Related to the Parquet integration), NoDocumentationNeeded, and ReleaseNotesNeeded (Release notes are needed) labels Jun 6, 2024
@malhotrashivam malhotrashivam added this to the 3. May 2024 milestone Jun 6, 2024
@malhotrashivam malhotrashivam self-assigned this Jun 6, 2024
@malhotrashivam malhotrashivam changed the title from "Optimized parquet page materialization" to "Optimizations in parquet file reading" Jun 6, 2024
@malhotrashivam malhotrashivam changed the title from "Optimizations in parquet file reading" to "Optimizations in parquet file page materialization" Jun 10, 2024
chipkent
chipkent previously approved these changes Jun 12, 2024
Member

@chipkent chipkent left a comment

Python LGTM

Member

@rcaudy rcaudy left a comment

Very nice change set. Meaningfully reduces allocation. Want to consider if there's a way to make BigDecimal better.


@Override
@NotNull
public final Class<LocalDate> getNativeType() {
Member

This makes me wonder if we should just have a ToObjectPage with a type field. Then again, here we can use a singleton, which is nice. Maybe ignore me.
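
For readers weighing the two options in this comment, a rough sketch of the trade-off follows. All names here (`NativeTypeSketch`, `ToPageLike`, `ToLocalDatePageLike`, `ToObjectPageLike`) are hypothetical stand-ins and not the classes in this PR; the only thing taken from the diff is the `getNativeType()` signature.

```java
import java.time.LocalDate;

// Hypothetical illustration only; none of these types come from the PR.
public class NativeTypeSketch {

    public interface ToPageLike<T> {
        Class<T> getNativeType();
    }

    // Option 1: one class per type. The type is fixed at compile time,
    // so a single shared instance is enough.
    public static final class ToLocalDatePageLike implements ToPageLike<LocalDate> {
        public static final ToLocalDatePageLike INSTANCE = new ToLocalDatePageLike();

        private ToLocalDatePageLike() {}

        @Override
        public Class<LocalDate> getNativeType() {
            return LocalDate.class;
        }
    }

    // Option 2: one generic class with a type field. Fewer classes,
    // but each use site constructs (or caches) its own instance.
    public static final class ToObjectPageLike<T> implements ToPageLike<T> {
        private final Class<T> nativeType;

        public ToObjectPageLike(Class<T> nativeType) {
            this.nativeType = nativeType;
        }

        @Override
        public Class<T> getNativeType() {
            return nativeType;
        }
    }
}
```

The per-type class costs a little boilerplate but keeps the singleton; the generic class removes the boilerplate at the price of carrying the type as state.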

Member

@rcaudy rcaudy left a comment

Reviewed. I'm happy to merge these changes, or hold out for BigDecimal, etc.

Member

@rcaudy rcaudy left a comment

Let's merge this, and make a separate PR for BigDecimal.

@malhotrashivam malhotrashivam merged commit 643cc9a into deephaven:main Jun 17, 2024
15 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jun 17, 2024
Labels
NoDocumentationNeeded parquet Related to the Parquet integration ReleaseNotesNeeded Release notes are needed

3 participants