Bin-pack Writes Operation into multiple parquet files, and parallelize writing `WriteTask`s #444

kevinjqliu · 2024-02-19T18:25:56Z

This PR bin-packs write operations into multiple parquet files when necessary, for both append and overwrite.

Bin-packing is determined by the write.target-file-size-bytes config (WRITE_TARGET_FILE_SIZE_BYTES) and defaults to 512 MB (WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT). Bin-packing is determined based on the size of the arrow dataframe in memory, the resulting parquet files might be smaller than the target size due to parquet's compression.

This PR also adds the ability to parallelize writing WriteTasks for #428. To use parallelism, set env PYICEBERG_MAX_WORKERS (docs).
Or in jupyter notebook,

%env PYICEBERG_MAX_WORKERS=8

Results

Reading a 6.7GB, 10M row arrow file and writing with pyiceberg

-rw-r--r--@ 1 kevinliu  staff   6.7G Feb 24 23:40 table_10000000.arrow

Without parallelized writes (using pyiceberg 0.6)

took 40.948 seconds
1 file
-rw-r--r--  1 kevinliu  wheel   1.6G Mar 28 11:36 00000-0-f49ef6e2-cdb4-40cd-9e5d-38caa00c6b08.parquet

With parallelize writes (this branch)

env: PYICEBERG_MAX_WORKERS=8
took 16.738 seconds
14 files
-rw-r--r--@ 1 kevinliu  wheel    74M Mar 28 11:15 00000-0-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    74M Mar 28 11:15 00000-1-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel   126M Mar 28 11:15 00000-10-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    75M Mar 28 11:15 00000-11-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    99M Mar 28 11:15 00000-12-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    33M Mar 28 11:15 00000-13-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    74M Mar 28 11:15 00000-2-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel   114M Mar 28 11:15 00000-3-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    74M Mar 28 11:15 00000-4-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    74M Mar 28 11:15 00000-5-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    93M Mar 28 11:15 00000-6-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    75M Mar 28 11:15 00000-7-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    75M Mar 28 11:15 00000-8-efb86411-91a0-4397-929e-7b646b6dc008.parquet
-rw-r--r--@ 1 kevinliu  wheel    99M Mar 28 11:15 00000-9-efb86411-91a0-4397-929e-7b646b6dc008.parquet

Here's a Jupyter notebook of the test:
https://colab.research.google.com/drive/1oLYXPNisjQ0W3cwUe3e8w1KNVgizwRB_?usp=sharing

pyiceberg/io/pyarrow.py

Fokko · 2024-02-22T10:34:10Z

pyiceberg/io/pyarrow.py

-    )
-    return iter([data_file])
+    splits = tbl.to_batches()
+    target_weight = 2 << 27  # 256 MB


In Java we have the write.target-file-size-bytes configuration. In this case, we're looking at the size in memory, and not the file size. Converting this is very tricky since Parquet has some excellent encodings to reduce the size on disk. We might want to check the heuristic on the Java side. On the other end, we also don't want to explode the memory when decoding a Parquet file

I think this is done in Java like so #428 (comment)

Write is done row by row and on every 1000 rows, the file size is checked against the desired size.

and the write.target-file-size-bytes configuration is just a heuristic to achieve, not the absolute size of the result file.

Based on this comment, it seems that even in Spark result parquet files can be smaller than the target file size.

For now, I propose we reuse the write.target-file-size-bytes option and default to 512MB of arrow size in memory.

Here's a test run when we bin-packed 685.46 MB of arrow memory into 256MB chunks. We ended up with 3 ~80MB parquet files.

Fokko

This is looking great @kevinjqliu! 🙌 Let me know when it is out of draft, and I'll give it a spin.

kevinjqliu · 2024-02-23T20:32:24Z

@Fokko PRs ready for review. Please give it a try. I've linked an example notebook in the PR description.

I've also noticed that writing one RecordBatch at a time seems to be less ideal than converting to Table and then writing the Table. Still exploring the different options here

kevinjqliu · 2024-02-26T01:00:06Z

We rely on Table.to_batches() to produce smaller RecordBatchs from Table object which we then use to bin-pack. Depending on how the table was constructed, .to_batches() might produce different amount of RecordBatchs.

Updated the PR to use table size and row size heuristics to chunk the Table object.
Thanks to @bigluck for testing it out.

HonahX

Thanks @kevinjqliu for working on this and @bigluck for testing! This looks great! I left some minor comments. Could you please rebase this PR to main? Thanks!

pyiceberg/table/__init__.py

tests/integration/test_writes.py

pyiceberg/io/pyarrow.py

kevinjqliu · 2024-03-06T23:39:28Z

thanks for the review @HonahX. I've rebased off main and addressed your comments.

I also added more tests after figuring out the fix for #482

HonahX

LGTM! Thanks for adding additional tests.

Fokko

Looking good @kevinjqliu

Although we never know what the actual target size will be (check out this blog), I think it is good to chunk since we also don't want to have very efficient encoded files that blow up the memory. Thanks for working on this!

Could you do the docs exposing describing the properties in a separate PR?

Fokko · 2024-03-28T07:48:50Z

@kevinjqliu can you fix the merge conflict? I'll merge right after that

kevinjqliu · 2024-03-28T18:44:51Z

@Fokko resolved merge conflict. Please take another look. I've also updated the description to show the results of parallelized writes

Fokko · 2024-03-28T18:47:44Z

@kevinjqliu Thanks for adding the examples. I think in general we want to have slightly bigger files.

A simple heuristic I can think of is that we put an upper bound on the number of files, equal to the number of threads. This way we still get decent parallelization, but avoid creating many small files (and avoid the overhead of opening new files). We can do this in a separate PR.

kevinjqliu mentioned this pull request Feb 19, 2024

Parallel Table.append #428

Closed

Fokko reviewed Feb 22, 2024

View reviewed changes

pyiceberg/io/pyarrow.py Outdated Show resolved Hide resolved

Fokko reviewed Feb 22, 2024

View reviewed changes

pyiceberg/io/pyarrow.py Outdated Show resolved Hide resolved

Fokko reviewed Feb 22, 2024

View reviewed changes

kevinjqliu force-pushed the kevinjqliu/bin-pack-write branch 2 times, most recently from bfbc6c4 to f632c22 Compare February 23, 2024 02:57

kevinjqliu marked this pull request as ready for review February 23, 2024 20:28

kevinjqliu changed the title ~~[WIP] Bin Pack Writes~~ Bin Pack Writes Operation into multiple parquet files Feb 23, 2024

kevinjqliu changed the title ~~Bin Pack Writes Operation into multiple parquet files~~ Bin-pack Writes Operation into multiple parquet files Feb 23, 2024

kevinjqliu changed the title ~~Bin-pack Writes Operation into multiple parquet files~~ Bin-pack Writes Operation into multiple parquet files, and parallelize writing WriteTasks Feb 23, 2024

kevinjqliu mentioned this pull request Feb 29, 2024

Spark <> Iceberg bug integration test #482

Closed

kevinjqliu requested a review from Fokko March 1, 2024 03:46

HonahX reviewed Mar 6, 2024

View reviewed changes

pyiceberg/table/__init__.py Outdated Show resolved Hide resolved

tests/integration/test_writes.py Outdated Show resolved Hide resolved

pyiceberg/io/pyarrow.py Outdated Show resolved Hide resolved

kevinjqliu mentioned this pull request Mar 6, 2024

Disable Spark Catalog caching for integration tests #501

Merged

kevinjqliu force-pushed the kevinjqliu/bin-pack-write branch from ad54850 to 7123b9f Compare March 6, 2024 22:16

kevinjqliu force-pushed the kevinjqliu/bin-pack-write branch from 7123b9f to 0b41791 Compare March 7, 2024 01:23

kevinjqliu requested a review from HonahX March 7, 2024 02:06

HonahX approved these changes Mar 7, 2024

View reviewed changes

kevinjqliu force-pushed the kevinjqliu/bin-pack-write branch from 0b41791 to 5ceb80d Compare March 8, 2024 16:25

kevinjqliu and others added 7 commits March 9, 2024 08:47

bin pack write

5a7c8f9

add write target file size config

bae32e2

test

2730a8f

add test for multiple data files

ef64c92

parquet writer write once

9cb9649

parallelize write tasks

3f284b2

refactor

6462d06

kevinjqliu added 7 commits March 9, 2024 08:47

chunk correctly using to_batches

fd1efe0

change variable names

7ccfdb2

get rid of assert

1ee3a55

configure PackingIterator

f92de1a

add more tests

0047fd8

rewrite set_properties

c6cb8de

set int property

d80054d

kevinjqliu force-pushed the kevinjqliu/bin-pack-write branch from 5ceb80d to d80054d Compare March 9, 2024 16:49

Fokko approved these changes Mar 28, 2024

View reviewed changes

Merge branch 'main' into kevinjqliu/bin-pack-write

8cd7160

Fokko merged commit 6aeb126 into apache:main Mar 28, 2024
7 checks passed

kevinjqliu deleted the kevinjqliu/bin-pack-write branch March 28, 2024 20:09

Fokko mentioned this pull request May 13, 2024

Faster ingestion from Parquet #346

Closed

kevinjqliu mentioned this pull request May 14, 2024

PyIceberg Near-Term Roadmap #736

Open

39 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bin-pack Writes Operation into multiple parquet files, and parallelize writing `WriteTask`s #444

Bin-pack Writes Operation into multiple parquet files, and parallelize writing `WriteTask`s #444

kevinjqliu commented Feb 19, 2024 •

edited

Loading

Fokko Feb 22, 2024

kevinjqliu Feb 22, 2024

kevinjqliu Feb 22, 2024

kevinjqliu Feb 22, 2024

kevinjqliu Feb 22, 2024

Fokko left a comment

kevinjqliu commented Feb 23, 2024

kevinjqliu commented Feb 26, 2024

HonahX left a comment

kevinjqliu commented Mar 6, 2024

HonahX left a comment

Fokko left a comment

Fokko commented Mar 28, 2024

kevinjqliu commented Mar 28, 2024

Fokko commented Mar 28, 2024

Bin-pack Writes Operation into multiple parquet files, and parallelize writing WriteTasks #444

Bin-pack Writes Operation into multiple parquet files, and parallelize writing WriteTasks #444

Conversation

kevinjqliu commented Feb 19, 2024 • edited Loading

Results

Fokko Feb 22, 2024

Choose a reason for hiding this comment

kevinjqliu Feb 22, 2024

Choose a reason for hiding this comment

kevinjqliu Feb 22, 2024

Choose a reason for hiding this comment

kevinjqliu Feb 22, 2024

Choose a reason for hiding this comment

kevinjqliu Feb 22, 2024

Choose a reason for hiding this comment

Fokko left a comment

Choose a reason for hiding this comment

kevinjqliu commented Feb 23, 2024

kevinjqliu commented Feb 26, 2024

HonahX left a comment

Choose a reason for hiding this comment

kevinjqliu commented Mar 6, 2024

HonahX left a comment

Choose a reason for hiding this comment

Fokko left a comment

Choose a reason for hiding this comment

Fokko commented Mar 28, 2024

kevinjqliu commented Mar 28, 2024

Fokko commented Mar 28, 2024

Bin-pack Writes Operation into multiple parquet files, and parallelize writing `WriteTask`s #444

Bin-pack Writes Operation into multiple parquet files, and parallelize writing `WriteTask`s #444

kevinjqliu commented Feb 19, 2024 •

edited

Loading