
feat: buffered parquet writer #1263

Merged

Conversation

@v0y4g3r (Contributor) commented on Mar 27, 2023

I hereby agree to the terms of the GreptimeDB CLA

What's changed and what's your intention?

This PR proposes a buffered parquet writer to avoid building the whole parquet file in memory and flushing it to the underlying storage in one shot. That behavior is caused by the gap between ArrowWriter, which requires a std::io::Write, and opendal's Writer, which only implements futures::AsyncWrite.

This PR helps reduce memory consumption during flush/compaction.
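To illustrate the idea (a rough sketch with made-up names such as `BufferedSink`, not the PR's actual `stream_writer.rs` code): a type implementing `std::io::Write` accumulates the bytes ArrowWriter emits and hands them to the underlying storage whenever the buffer grows past a threshold, so only a bounded slice of the file is held in memory at any time.

```rust
use std::io::{self, Write};

/// Illustrative buffered sink: collects the bytes ArrowWriter emits and
/// flushes them to storage in chunks instead of keeping the whole file.
struct BufferedSink {
    buffer: Vec<u8>,
    threshold: usize,
}

impl BufferedSink {
    fn new(threshold: usize) -> Self {
        Self { buffer: Vec::new(), threshold }
    }

    /// Hand the buffered bytes to the underlying storage writer. In the real
    /// implementation this is where the async opendal writer would be driven.
    fn flush_to_storage(&mut self) -> io::Result<()> {
        // e.g. append `self.buffer` to the object in storage (omitted here)
        self.buffer.clear();
        Ok(())
    }
}

impl Write for BufferedSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.buffer.extend_from_slice(buf);
        // Flush once the buffered data crosses the threshold, bounding memory use.
        if self.buffer.len() >= self.threshold {
            self.flush_to_storage()?;
        }
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        self.flush_to_storage()
    }
}
```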

In an experiment that writes 1,000,000,000 rows into a parquet file (target file size 1.2 GB), the original code uses up to 1.5 GB of RAM, while BufferedWriter uses at most 270 MB.

[Figure: memory usage of the original writer vs. BufferedWriter during the experiment]

The original ParquetWriter used to export table data is also replaced by BufferedWriter. With this change, table content is written to a single file instead of being exported into several fragments, which significantly improves export performance (time cost is reduced to nearly 60% of the original). It also smooths the memory footprint, as demonstrated below.

[Figure: memory footprint of table export before and after the change]

Known issue

This solution requires that opendal's backend support the Writer API for continuously appending data to the end of an object. This API is supported by the fs and s3 backends, but not yet by Aliyun OSS (which I'm working on).
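For illustration only, here is how the sketch above could be driven by ArrowWriter from the parquet crate (`write_batches` and the schema are hypothetical, not taken from this PR): each row group ArrowWriter encodes passes through the buffer and is flushed to storage in chunks, which is why the backend must support appending to the same object.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn write_batches(sink: BufferedSink) -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("ts", DataType::Int64, false)]));
    let column: ArrayRef = Arc::new(Int64Array::from(vec![1i64, 2, 3]));
    let batch = RecordBatch::try_new(schema.clone(), vec![column])?;

    // ArrowWriter only needs a std::io::Write, so the buffered sink is enough.
    let mut writer = ArrowWriter::try_new(sink, schema, None)?;
    writer.write(&batch)?;
    // close() finalizes the parquet footer; the remaining buffered bytes are
    // appended to the same object by the storage backend.
    writer.close()?;
    Ok(())
}
```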

Checklist

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.

Refer to a related PR or issue link (optional)

codecov bot commented on Mar 27, 2023

Codecov Report

Merging #1263 (459185b) into develop (a2b262e) will decrease coverage by 0.25%.
The diff coverage is 81.65%.

❗ Current head 459185b differs from pull request most recent head c5a2f8a. Consider uploading reports for the commit c5a2f8a to get more accurate results

@@             Coverage Diff             @@
##           develop    #1263      +/-   ##
===========================================
- Coverage    85.88%   85.63%   -0.25%     
===========================================
  Files          500      500              
  Lines        75697    75667      -30     
===========================================
- Hits         65013    64799     -214     
- Misses       10684    10868     +184     

@v0y4g3r force-pushed the feat/streaming-parquet-writer branch from 7349924 to 2898e88 on March 30, 2023 08:32
@killme2008 (Contributor) commented:

It's impressive!

Review threads (resolved) on: src/datanode/src/error.rs, src/storage/src/sst/stream_writer.rs, src/storage/src/sst/parquet.rs
@waynexia (Member) left a comment:

Looking good on my side.

@v0y4g3r force-pushed the feat/streaming-parquet-writer branch from ad76340 to c5a2f8a on March 31, 2023 15:52
@killme2008 killme2008 merged commit 0253136 into GreptimeTeam:develop Apr 1, 2023
paomian pushed a commit to paomian/greptimedb that referenced this pull request Oct 19, 2023
* wip: use

* rebase develop

* chore: fix typos

* feat: replace export parquet writer with buffered writer

* fix: some cr comments

* feat: add sst_write_buffer_size config item to config how many bytes to buffer before flush to underlying storage

* chore: rebase onto develop
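The sst_write_buffer_size item mentioned in the commit list above maps naturally onto the flush threshold in the earlier sketch. A hypothetical wiring (the struct name and the 8 MiB default are made up for illustration, not GreptimeDB's actual config types) could look like:

```rust
// Hypothetical config wiring for the write buffer size; illustrative only.
struct SstConfig {
    /// How many bytes to buffer before flushing to the underlying storage.
    sst_write_buffer_size: usize,
}

impl Default for SstConfig {
    fn default() -> Self {
        Self {
            sst_write_buffer_size: 8 * 1024 * 1024, // 8 MiB, illustrative value
        }
    }
}

fn make_sink(config: &SstConfig) -> BufferedSink {
    // The configured size becomes the flush threshold of the buffered sink.
    BufferedSink::new(config.sst_write_buffer_size)
}
```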