Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable optimized Parquet writer by default in Delta Lake #12757

Merged
merged 1 commit into from
Jun 21, 2022

Conversation

ebyhr
Copy link
Member

@ebyhr ebyhr commented Jun 9, 2022

Description

Enable optimized Parquet writer by default in Delta Lake

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Delta Lake
* Enable optimized Parquet writer by default. ({issue}`12757`)

@cla-bot cla-bot bot added the cla-signed label Jun 9, 2022
@ebyhr
Copy link
Member Author

ebyhr commented Jun 9, 2022

Looking CI failure.

Error:  Failures: 
Error:    TestDeltaLakeCreateTableStatisticsLegacyWriter>AbstractTestDeltaLakeCreateTableStatistics.testTimestampMilliRecords:378 expected [Optional[5536338739703808]] but found [Optional.empty]

@ebyhr ebyhr force-pushed the ebi/delta-native-parquet-reader branch from 1a0659e to 52a9a4d Compare June 9, 2022 08:43
@ebyhr ebyhr requested review from raunaqmorarka and findepi June 10, 2022 05:58
Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code changes look ok.
Do the Databricks product tests run automatically ?
Could you also add some rough benchmark results confirming improvement with new writer ?

@findepi
Copy link
Member

findepi commented Jun 10, 2022

Could you also add some rough benchmark results confirming improvement with new writer ?

New writer is an improvement regardless of perf numbers, because it avoids JNI for compression, and thus avoids OOM due to GCLocker.

Do the Databricks product tests run automatically ?

They still don't exist, @findinpath @alexjo2144 are working on this.
@ebyhr we need to verify this internally for now.

@findepi findepi requested a review from alexjo2144 June 10, 2022 07:18
@homar
Copy link
Member

homar commented Jun 13, 2022

Is there a reason to leave the possibility to use the old writer ?

@findepi
Copy link
Member

findepi commented Jun 13, 2022

Is there a reason to leave the possibility to use the old writer ?

just in case. we can remove it later.

@ebyhr ebyhr force-pushed the ebi/delta-native-parquet-reader branch from 52a9a4d to 4d31bb6 Compare June 21, 2022 00:04
@ebyhr ebyhr merged commit e5c5ff6 into trinodb:master Jun 21, 2022
@ebyhr ebyhr deleted the ebi/delta-native-parquet-reader branch June 21, 2022 01:58
@ebyhr ebyhr mentioned this pull request Jun 21, 2022
@github-actions github-actions bot added this to the 387 milestone Jun 21, 2022
@findepi
Copy link
Member

findepi commented Jun 22, 2022

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants