feat: buffered parquet writer #1263
Conversation
Codecov Report
@@            Coverage Diff             @@
##           develop    #1263      +/-  ##
===========================================
- Coverage    85.88%   85.63%   -0.25%
===========================================
  Files          500      500
  Lines        75697    75667      -30
===========================================
- Hits         65013    64799     -214
- Misses       10684    10868     +184
It's impressive!
Looking good on my side
* wip: use
* rebase develop
* chore: fix typos
* feat: replace export parquet writer with buffered writer
* fix: some cr comments
* feat: add sst_write_buffer_size config item to configure how many bytes to buffer before flushing to underlying storage (see the config sketch below)
* chore: rebase onto develop
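As a reference for the new `sst_write_buffer_size` knob, here is a minimal sketch of how such a config item could be modelled on the Rust side. Only the field name comes from this PR; the struct name, default value, and serde wiring are assumptions for illustration, not GreptimeDB's actual config code.

```rust
use serde::Deserialize;

/// Assumed default of 8 MiB, chosen only for this sketch.
const DEFAULT_SST_WRITE_BUFFER_SIZE: usize = 8 * 1024 * 1024;

/// Hypothetical config section carrying the new knob.
#[derive(Debug, Deserialize)]
#[serde(default)]
struct SstWriteConfig {
    /// Number of bytes to buffer in memory before flushing to the underlying storage.
    sst_write_buffer_size: usize,
}

impl Default for SstWriteConfig {
    fn default() -> Self {
        Self {
            sst_write_buffer_size: DEFAULT_SST_WRITE_BUFFER_SIZE,
        }
    }
}
```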
I hereby agree to the terms of the GreptimeDB CLA
What's changed and what's your intention?
This PR proposes a buffered parquet writer to avoid building the whole parquet file in memory and flushing it to the underlying storage in one shot. That behavior is caused by the gap between `ArrowWriter`, which requires a `std::io::Write`, and opendal's `Writer`, which implements only `futures::AsyncWrite`. This PR helps reduce memory consumption during flush/compaction.
In an experiment that writes 1,000,000,000 rows into a parquet file (target file size 1.2GB), the original code uses up to 1.5GB of RAM, while `BufferedWriter` uses at most 270MB.

The original `ParquetWriter` used to export table data is also replaced by `BufferedWriter`. This change writes table content to a single file instead of exporting it into several fragments as before, and it significantly improves export performance, cutting the elapsed time to nearly 60% of the original. Additionally, it smooths the memory footprint, as demonstrated below.

Known issue
This solution requires opendal's backend to support the `Writer` API for continuously appending data to the end of an object. This API is supported by the fs and s3 backends, but not yet by Aliyun OSS (which I'm working on).

Checklist
Refer to a related PR or issue link (optional)