
Bin-pack Writes Operation into multiple parquet files, and parallelize writing WriteTasks #444

Merged · 15 commits · Mar 28, 2024

Conversation

@kevinjqliu (Contributor) commented Feb 19, 2024

This PR bin-packs write operations into multiple parquet files when necessary, for both append and overwrite.

Bin-packing is controlled by the write.target-file-size-bytes property (WRITE_TARGET_FILE_SIZE_BYTES), which defaults to 512 MB (WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT). Because the target is measured against the size of the Arrow data in memory, the resulting parquet files may be smaller than the target due to Parquet's compression and encodings.
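
For orientation, here is a minimal sketch (not taken from the PR) of how the property might be set through the pyiceberg API. The catalog name, table identifier, the 256 MB value, and the assumption that create_table accepts an Arrow schema directly are all illustrative:

```python
# Hypothetical end-to-end example: create a table with an explicit target file
# size and append an Arrow table; a large append is then bin-packed into
# multiple parquet files.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

df = pa.table({"id": list(range(1_000_000))})  # stand-in for a large Arrow table

catalog = load_catalog("default")  # assumes a catalog named "default" is configured
table = catalog.create_table(
    "demo.events",
    schema=df.schema,
    properties={"write.target-file-size-bytes": str(256 * 1024 * 1024)},  # 256 MB
)
table.append(df)  # appends larger than the target are split into multiple files
```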

This PR also adds the ability to parallelize writing WriteTasks (#428). To enable parallelism, set the PYICEBERG_MAX_WORKERS environment variable (docs), or in a Jupyter notebook:

%env PYICEBERG_MAX_WORKERS=8
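
Outside a notebook, the same can be done from plain Python. A small sketch, with the caveat that the variable should be set before the first write, since the worker pool is typically created once and then reused:

```python
# Set the worker count before triggering any writes so the thread pool is
# created with this value.
import os

os.environ["PYICEBERG_MAX_WORKERS"] = "8"
```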

Results

Reading a 6.7 GB, 10-million-row Arrow file and writing it with pyiceberg:

-rw-r--r--@ 1 kevinliu  staff   6.7G Feb 24 23:40 table_10000000.arrow

Without parallelized writes (using pyiceberg 0.6)

took 40.948 seconds
1 file
-rw-r--r--  1 kevinliu  wheel   1.6G Mar 28 11:36 00000-0-f49ef6e2-cdb4-40cd-9e5d-38caa00c6b08.parquet

With parallelized writes (this branch)

env: PYICEBERG_MAX_WORKERS=8
took 16.738 seconds
14 files
-rw-r--r--@ 1 kevinliu  wheel    74M Mar 28 11:15 00000-0-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    74M Mar 28 11:15 00000-1-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel   126M Mar 28 11:15 00000-10-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    75M Mar 28 11:15 00000-11-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    99M Mar 28 11:15 00000-12-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    33M Mar 28 11:15 00000-13-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    74M Mar 28 11:15 00000-2-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel   114M Mar 28 11:15 00000-3-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    74M Mar 28 11:15 00000-4-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    74M Mar 28 11:15 00000-5-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    93M Mar 28 11:15 00000-6-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    75M Mar 28 11:15 00000-7-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    75M Mar 28 11:15 00000-8-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    99M Mar 28 11:15 00000-9-efb86411-91a0-4397-929e-7b646b6dc008.parquet

Here's a Jupyter notebook of the test:
https://colab.research.google.com/drive/1oLYXPNisjQ0W3cwUe3e8w1KNVgizwRB_?usp=sharing

@kevinjqliu kevinjqliu mentioned this pull request Feb 19, 2024
Two resolved review threads on pyiceberg/io/pyarrow.py (outdated).

Inline review thread on pyiceberg/io/pyarrow.py, on this snippet:
)
return iter([data_file])
splits = tbl.to_batches()
target_weight = 2 << 27 # 256 MB

Contributor:

In Java we have the write.target-file-size-bytes configuration. In this case, we're looking at the size in memory, not the file size. Converting between the two is tricky, since Parquet has excellent encodings that reduce the size on disk. We might want to check the heuristic on the Java side. On the other hand, we also don't want to blow up memory when decoding a Parquet file.

Contributor Author:

I think this is how it's done in Java; see #428 (comment).

Writes happen row by row, and every 1000 rows the file size is checked against the desired size.
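
For comparison only, a rough Python sketch of that Java-style rolling writer (this is not what this PR implements): write in small batches and roll over to a new file once the bytes on disk exceed the target. The 1000-row check interval, the file naming, and the on-disk size check are illustrative, and the size check is approximate because the writer may buffer:

```python
import os

import pyarrow as pa
import pyarrow.parquet as pq

TARGET_FILE_SIZE_BYTES = 512 * 1024 * 1024
CHECK_INTERVAL_ROWS = 1000


def rolling_write(table: pa.Table, prefix: str) -> list[str]:
    """Write `table` to one or more parquet files, rolling over near the target size."""
    paths: list[str] = []
    writer = None
    for batch in table.to_batches(max_chunksize=CHECK_INTERVAL_ROWS):
        if writer is None:
            paths.append(f"{prefix}-{len(paths)}.parquet")
            writer = pq.ParquetWriter(paths[-1], table.schema)
        writer.write_batch(batch)  # writes a row group
        if os.path.getsize(paths[-1]) >= TARGET_FILE_SIZE_BYTES:
            writer.close()
            writer = None
    if writer is not None:
        writer.close()
    return paths
```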

Contributor Author:

And the write.target-file-size-bytes configuration is just a target to aim for, not the absolute size of the resulting file.

Based on this comment, it seems that even in Spark the resulting parquet files can be smaller than the target file size.

Contributor Author:

For now, I propose we reuse the write.target-file-size-bytes option and default to 512 MB of Arrow size in memory.
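
A minimal sketch of that idea (not the exact code in this PR): greedily group the table's record batches into bins whose combined in-memory size stays under the target, then write one parquet file per bin. The function name and the locally defined default constant are illustrative:

```python
from typing import Iterator

import pyarrow as pa

TARGET_FILE_SIZE_BYTES = 512 * 1024 * 1024  # the 512 MB default discussed above


def bin_pack_batches(
    table: pa.Table, target_size: int = TARGET_FILE_SIZE_BYTES
) -> Iterator[list[pa.RecordBatch]]:
    """Yield groups of RecordBatches whose total in-memory size is roughly <= target_size."""
    current: list[pa.RecordBatch] = []
    current_size = 0
    for batch in table.to_batches():
        if current and current_size + batch.nbytes > target_size:
            yield current
            current, current_size = [], 0
        current.append(batch)
        current_size += batch.nbytes
    if current:
        yield current
```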

Contributor Author:

Here's a test run where we bin-packed 685.46 MB of Arrow memory into 256 MB chunks. We ended up with three parquet files of roughly 80 MB each.

@Fokko (Contributor) left a comment:

This is looking great @kevinjqliu! 🙌 Let me know when it is out of draft, and I'll give it a spin.

@kevinjqliu kevinjqliu force-pushed the kevinjqliu/bin-pack-write branch 2 times, most recently from bfbc6c4 to f632c22 Compare February 23, 2024 02:57
@kevinjqliu kevinjqliu marked this pull request as ready for review February 23, 2024 20:28
@kevinjqliu (Contributor Author):

@Fokko the PR is ready for review. Please give it a try; I've linked an example notebook in the PR description.

I've also noticed that writing one RecordBatch at a time seems less ideal than converting to a Table and then writing the Table. Still exploring the different options here.
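
To make the comparison concrete, here is a sketch of the two options in plain pyarrow (names and paths are illustrative). One likely difference is that writing batch by batch tends to produce one row group per batch, while writing a combined Table lets the writer choose row group sizes:

```python
import pyarrow as pa
import pyarrow.parquet as pq


def write_batch_at_a_time(batches: list[pa.RecordBatch], path: str) -> None:
    """Each write_batch call typically becomes its own (possibly small) row group."""
    with pq.ParquetWriter(path, batches[0].schema) as writer:
        for batch in batches:
            writer.write_batch(batch)


def write_as_single_table(batches: list[pa.RecordBatch], path: str) -> None:
    """Combine the bin into one Table and let write_table pick row group sizes."""
    pq.write_table(pa.Table.from_batches(batches), path)
```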

@kevinjqliu kevinjqliu changed the title [WIP] Bin Pack Writes Bin Pack Writes Operation into multiple parquet files Feb 23, 2024
@kevinjqliu kevinjqliu changed the title Bin Pack Writes Operation into multiple parquet files Bin-pack Writes Operation into multiple parquet files Feb 23, 2024
@kevinjqliu kevinjqliu changed the title Bin-pack Writes Operation into multiple parquet files Bin-pack Writes Operation into multiple parquet files, and parallelize writing WriteTasks Feb 23, 2024
@kevinjqliu (Contributor Author):

We rely on Table.to_batches() to produce smaller RecordBatches from the Table object, which we then bin-pack. Depending on how the table was constructed, .to_batches() might produce a different number of RecordBatches.

Updated the PR to use table-size and row-size heuristics to chunk the Table object instead.
Thanks to @bigluck for testing it out.
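
A sketch of that heuristic (illustrative, not necessarily the exact code that landed): estimate the average row width from Table.nbytes / Table.num_rows, derive how many rows fit in the target size, and slice the Table into chunks of that many rows:

```python
import pyarrow as pa


def chunk_table(table: pa.Table, target_size: int) -> list[pa.Table]:
    """Split `table` into slices of roughly `target_size` bytes of Arrow memory."""
    if table.num_rows == 0 or table.nbytes <= target_size:
        return [table]
    avg_row_bytes = table.nbytes / table.num_rows
    rows_per_chunk = max(1, int(target_size // avg_row_bytes))
    return [
        table.slice(offset, rows_per_chunk)
        for offset in range(0, table.num_rows, rows_per_chunk)
    ]
```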

@HonahX (Contributor) left a comment:

Thanks @kevinjqliu for working on this and @bigluck for testing! This looks great! I left some minor comments. Could you please rebase this PR onto main? Thanks!

Review threads on pyiceberg/table/__init__.py, tests/integration/test_writes.py, and pyiceberg/io/pyarrow.py (outdated, resolved).
@kevinjqliu (Contributor Author):

Thanks for the review @HonahX. I've rebased off main and addressed your comments.

I also added more tests after figuring out the fix for #482.

@kevinjqliu kevinjqliu force-pushed the kevinjqliu/bin-pack-write branch from 7123b9f to 0b41791 Compare March 7, 2024 01:23
@kevinjqliu kevinjqliu requested a review from HonahX March 7, 2024 02:06
@HonahX (Contributor) left a comment:

LGTM! Thanks for adding additional tests.

@kevinjqliu kevinjqliu force-pushed the kevinjqliu/bin-pack-write branch from 0b41791 to 5ceb80d Compare March 8, 2024 16:25
@kevinjqliu kevinjqliu force-pushed the kevinjqliu/bin-pack-write branch from 5ceb80d to d80054d Compare March 9, 2024 16:49
@Fokko (Contributor) left a comment:

Looking good @kevinjqliu.

Although we never know what the actual target size will be (check out this blog), I think it is good to chunk, since we also don't want very efficiently encoded files that blow up memory when decoded. Thanks for working on this!

Could you add docs describing the properties in a separate PR?

@Fokko (Contributor) commented Mar 28, 2024:

@kevinjqliu can you fix the merge conflict? I'll merge right after that.

@kevinjqliu (Contributor Author):

@Fokko resolved the merge conflict. Please take another look. I've also updated the description to show the results of parallelized writes.

@Fokko (Contributor) commented Mar 28, 2024:

@kevinjqliu thanks for adding the examples. I think in general we want slightly bigger files.

A simple heuristic I can think of is to put an upper bound on the number of files, equal to the number of threads. This way we still get decent parallelization but avoid creating many small files (and the overhead of opening new files). We can do this in a separate PR.
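
Sketching that follow-up heuristic (not implemented in this PR): cap the number of output files at the worker count, so each worker writes at most one, larger, file. Names and numbers are illustrative:

```python
import math


def plan_files(total_bytes: int, target_size: int, max_workers: int) -> tuple[int, int]:
    """Return (number of files, approximate bytes per file) under the capped-file heuristic."""
    num_files = max(1, min(math.ceil(total_bytes / target_size), max_workers))
    return num_files, math.ceil(total_bytes / num_files)


# With the numbers from this PR: ~6.7 GB of Arrow data, a 512 MB target, and
# PYICEBERG_MAX_WORKERS=8 would give 8 files of roughly 860 MB each instead of
# 14 smaller ones.
print(plan_files(int(6.7 * 1024**3), 512 * 1024**2, 8))
```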

@Fokko Fokko merged commit 6aeb126 into apache:main Mar 28, 2024
7 checks passed
@kevinjqliu kevinjqliu deleted the kevinjqliu/bin-pack-write branch March 28, 2024 20:09
@kevinjqliu kevinjqliu mentioned this pull request May 14, 2024