
feat: buffered parquet writer #1263

Merged

Conversation

@v0y4g3r (Contributor) commented on Mar 27, 2023

I hereby agree to the terms of the GreptimeDB CLA

What's changed and what's your intention?

This PR proposes a buffered parquet writer to avoid building the whole parquet file in memory and flushing it to the underlying storage in one shot. That behavior is caused by the gap between ArrowWriter, which requires a std::io::Write, and opendal's Writer, which only implements futures::AsyncWrite.

This PR helps reduce memory consumption during flush/compaction.
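To illustrate the idea (a rough sketch with made-up names such as `BufferedSink`, not the PR's actual `stream_writer.rs` code): a type implementing `std::io::Write` accumulates the bytes ArrowWriter emits and hands them to the underlying storage whenever the buffer grows past a threshold, so only a bounded slice of the file is held in memory at any time.

```rust
use std::io::{self, Write};

/// Illustrative buffered sink: collects the bytes ArrowWriter emits and
/// flushes them to storage in chunks instead of keeping the whole file.
struct BufferedSink {
    buffer: Vec<u8>,
    threshold: usize,
}

impl BufferedSink {
    fn new(threshold: usize) -> Self {
        Self { buffer: Vec::new(), threshold }
    }

    /// Hand the buffered bytes to the underlying storage writer. In the real
    /// implementation this is where the async opendal writer would be driven.
    fn flush_to_storage(&mut self) -> io::Result<()> {
        // e.g. append `self.buffer` to the object in storage (omitted here)
        self.buffer.clear();
        Ok(())
    }
}

impl Write for BufferedSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.buffer.extend_from_slice(buf);
        // Flush once the buffered data crosses the threshold, bounding memory use.
        if self.buffer.len() >= self.threshold {
            self.flush_to_storage()?;
        }
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        self.flush_to_storage()
    }
}
```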

In an experiment that writes 1,000,000,000 rows into a parquet file (target file size 1.2 GB), the original code uses up to 1.5 GB of RAM, while BufferedWriter uses at most 270 MB.

[Figure: memory usage of the original writer vs. BufferedWriter during the experiment]

The original ParquetWriter used to export table data is also replaced by BufferedWriter. With this change, table content is written to a single file instead of being exported into several fragments, which significantly improves export performance (time cost is reduced to nearly 60% of the original). It also smooths the memory footprint, as demonstrated below.

[Figure: memory footprint of table export before and after the change]

Known issue

This solution requires that opendal's backend support the Writer API for continuously appending data to the end of an object. This API is supported by the fs and s3 backends, but not yet by Aliyun OSS (which I'm working on).
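For illustration only, here is how the sketch above could be driven by ArrowWriter from the parquet crate (`write_batches` and the schema are hypothetical, not taken from this PR): each row group ArrowWriter encodes passes through the buffer and is flushed to storage in chunks, which is why the backend must support appending to the same object.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn write_batches(sink: BufferedSink) -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("ts", DataType::Int64, false)]));
    let column: ArrayRef = Arc::new(Int64Array::from(vec![1i64, 2, 3]));
    let batch = RecordBatch::try_new(schema.clone(), vec![column])?;

    // ArrowWriter only needs a std::io::Write, so the buffered sink is enough.
    let mut writer = ArrowWriter::try_new(sink, schema, None)?;
    writer.write(&batch)?;
    // close() finalizes the parquet footer; the remaining buffered bytes are
    // appended to the same object by the storage backend.
    writer.close()?;
    Ok(())
}
```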

Checklist

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.

Refer to a related PR or issue link (optional)

codecov bot commented on Mar 27, 2023

Codecov Report

Merging #1263 (459185b) into develop (a2b262e) will decrease coverage by 0.25%.
The diff coverage is 81.65%.

❗ Current head 459185b differs from pull request most recent head c5a2f8a. Consider uploading reports for the commit c5a2f8a to get more accurate results

@@             Coverage Diff             @@
##           develop    #1263      +/-   ##
===========================================
- Coverage    85.88%   85.63%   -0.25%     
===========================================
  Files          500      500              
  Lines        75697    75667      -30     
===========================================
- Hits         65013    64799     -214     
- Misses       10684    10868     +184     

@v0y4g3r force-pushed the feat/streaming-parquet-writer branch from 7349924 to 2898e88 on March 30, 2023 08:32
@killme2008 (Contributor) commented:

It's impressive!

Review threads (resolved) on: src/datanode/src/error.rs, src/storage/src/sst/stream_writer.rs, src/storage/src/sst/parquet.rs
@waynexia (Member) left a comment:

Looking good on my side.

@v0y4g3r force-pushed the feat/streaming-parquet-writer branch from ad76340 to c5a2f8a on March 31, 2023 15:52
@killme2008 killme2008 merged commit 0253136 into GreptimeTeam:develop Apr 1, 2023
paomian pushed a commit to paomian/greptimedb that referenced this pull request Oct 19, 2023
* wip: use

* rebase develop

* chore: fix typos

* feat: replace export parquet writer with buffered writer

* fix: some cr comments

* feat: add sst_write_buffer_size config item to config how many bytes to buffer before flush to underlying storage

* chore: rebase onto develop
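The sst_write_buffer_size item mentioned in the commit list above maps naturally onto the flush threshold in the earlier sketch. A hypothetical wiring (the struct name and the 8 MiB default are made up for illustration, not GreptimeDB's actual config types) could look like:

```rust
// Hypothetical config wiring for the write buffer size; illustrative only.
struct SstConfig {
    /// How many bytes to buffer before flushing to the underlying storage.
    sst_write_buffer_size: usize,
}

impl Default for SstConfig {
    fn default() -> Self {
        Self {
            sst_write_buffer_size: 8 * 1024 * 1024, // 8 MiB, illustrative value
        }
    }
}

fn make_sink(config: &SstConfig) -> BufferedSink {
    // The configured size becomes the flush threshold of the buffered sink.
    BufferedSink::new(config.sst_write_buffer_size)
}
```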