Faster ingestion from Parquet #346
Comments
Hello! As of right now, the PyIceberg write API requires the input to be read as PyArrow tables and can only write to unpartitioned tables. Partitioned write support is currently being worked on in #208. With that said, I wonder if there's a way to create Iceberg tables without rewriting the data files. Technically, Iceberg is a metadata layer on top of the actual data files, and that metadata is created from the data files' statistics (see iceberg-python/pyiceberg/table/__init__.py, lines 2432 to 2465 in a4856bc).
It would be super fast to just read the input files, collect metadata, and create the necessary metadata files.
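To make the "collect metadata" step concrete, here is a minimal sketch of reading only the Parquet footers and merging row-group statistics into file-level values. This is not the PyIceberg code referenced above. `ColumnSummary`, `merge_row_group_stats`, and `read_row_group_stats` are hypothetical names used for illustration, and the sketch assumes the files were written with column statistics enabled. The only PyArrow calls used are `pyarrow.parquet.read_metadata` and the metadata objects it returns.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple

# (null_count, min, max) for one column chunk; any entry may be None.
ChunkStats = Tuple[Optional[int], Any, Any]


@dataclass
class ColumnSummary:
    """File-level statistics for one column, merged across row groups."""
    null_count: int = 0
    lower: Optional[Any] = None
    upper: Optional[Any] = None

    def update(self, null_count: Optional[int], lower: Any, upper: Any) -> None:
        # A missing null count is treated as 0 here, which undercounts;
        # a real implementation would have to track "unknown" separately.
        self.null_count += null_count or 0
        if lower is not None:
            self.lower = lower if self.lower is None else min(self.lower, lower)
        if upper is not None:
            self.upper = upper if self.upper is None else max(self.upper, upper)


def merge_row_group_stats(
    row_groups: Iterable[Dict[str, ChunkStats]],
) -> Dict[str, ColumnSummary]:
    """Fold per-row-group stats into one summary per column."""
    merged: Dict[str, ColumnSummary] = {}
    for row_group in row_groups:
        for name, (nulls, lower, upper) in row_group.items():
            merged.setdefault(name, ColumnSummary()).update(nulls, lower, upper)
    return merged


def read_row_group_stats(path: str) -> Iterator[Dict[str, ChunkStats]]:
    """Yield stats per row group from the Parquet footer; no data pages are decoded."""
    import pyarrow.parquet as pq  # imported lazily so the merge logic has no dependency

    meta = pq.read_metadata(path)
    for i in range(meta.num_row_groups):
        row_group = meta.row_group(i)
        stats: Dict[str, ChunkStats] = {}
        for j in range(row_group.num_columns):
            column = row_group.column(j)
            s = column.statistics  # None if the writer stored no statistics
            if s is None:
                continue
            lower, upper = (s.min, s.max) if s.has_min_max else (None, None)
            nulls = s.null_count if s.has_null_count else None
            stats[column.path_in_schema] = (nulls, lower, upper)
        yield stats
```

Calling `merge_row_group_stats(read_row_group_stats(path))` gives per-column bounds and null counts for one file. Turning those into Iceberg manifest entries is a separate step that this sketch does not cover.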
Another avenue to look into. On the Spark side of Iceberg, there's a
And it supports table partitioning!
Fixed in #444
Question
I'm using the new PyIceberg write functionality. I wonder if there is any way to make it faster in my scenario:
I have around 1 TiB of Parquet files (zstd level 3 compression) that I want to ingest into Iceberg.
Table sizes are roughly power-law distributed: the largest table is 25% of the total size, and there are about 100 tables.
Since Iceberg wants to repartition the data, I don't see a way to have it use my Parquet files without rewriting them.
Is it possible to use multiple cores for writing the Parquet files? I don't think that's something PyArrow supports natively, but it might be possible to run multiple PyArrow writers?
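As a rough sketch of the "multiple PyArrow writers" idea, the code below runs one writer per process and balances the tables across processes with a greedy largest-first assignment. This is plain PyArrow plus the standard library, not a PyIceberg feature. `plan_workers`, `rewrite_files`, and `run` are hypothetical names, and the sketch assumes each source file fits in memory and can be rewritten independently. It only rewrites Parquet files and does not commit anything to an Iceberg table.

```python
import heapq
from concurrent.futures import ProcessPoolExecutor
from typing import Dict, List, Tuple


def plan_workers(table_sizes: Dict[str, int], n_workers: int) -> List[List[str]]:
    """Assign tables to workers, largest first, always to the least-loaded worker."""
    heap = [(0, w) for w in range(n_workers)]  # (assigned bytes, worker index)
    buckets: List[List[str]] = [[] for _ in range(n_workers)]
    for name, size in sorted(table_sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, worker = heapq.heappop(heap)
        buckets[worker].append(name)
        heapq.heappush(heap, (load + size, worker))
    return buckets


def rewrite_files(jobs: List[Tuple[str, str]]) -> int:
    """Worker body: read each (source, destination) pair and rewrite it with zstd level 3."""
    import pyarrow.parquet as pq  # imported in the worker process

    for src, dst in jobs:
        table = pq.read_table(src)
        pq.write_table(table, dst, compression="zstd", compression_level=3)
    return len(jobs)


def run(
    table_files: Dict[str, List[Tuple[str, str]]],
    table_sizes: Dict[str, int],
    n_workers: int,
) -> int:
    """Rewrite all files using n_workers processes; returns the number of files written."""
    buckets = plan_workers(table_sizes, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        futures = [
            pool.submit(rewrite_files, [job for t in bucket for job in table_files[t]])
            for bucket in buckets
        ]
        return sum(f.result() for f in futures)
```

With the size distribution described above, assigning whole tables caps the speedup at about 4x, because the worker holding the largest table still has 25% of the data. Passing per-file sizes to `plan_workers` instead of per-table sizes would avoid that limit.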