write.target-file-size-bytes isn't respected when writing data #8729
Comments
The issue title and parts of the description refer to a property with a different name. Check out https://iceberg.apache.org/docs/latest/spark-writes/#controlling-file-sizes and https://iceberg.apache.org/docs/latest/spark-writes/#writing-distribution-modes. The file size is bounded by the Spark task size: if the task size exceeds write.target-file-size-bytes, the writer rolls over to a new file. However, if the task is smaller than the target, a larger file won't be written; the file will be at most the task size. And when it gets written to disk, since Parquet is highly compressible, it'll be even smaller. There's a write.distribution-mode table property which controls how the data is distributed across the Spark tasks performing the writes. @RussellSpitzer and @aokolnychyi also have more expertise in this area, so please correct me if I'm wrong about anything!
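As a concrete illustration, here is a minimal PySpark sketch of setting these two table properties; `my_catalog.db.events` is a placeholder table name (not from this issue) and the session is assumed to already have an Iceberg catalog configured:

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg catalog is already configured for the session;
# "my_catalog.db.events" is a placeholder table name.
spark = SparkSession.builder.appName("iceberg-file-size-demo").getOrCreate()

# write.target-file-size-bytes is an upper bound per data file (512 MB here);
# write.distribution-mode controls how rows are distributed across write tasks.
spark.sql("""
    ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
        'write.target-file-size-bytes' = '536870912',
        'write.distribution-mode' = 'hash'
    )
""")
```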
Wrong name in the title, sorry; I copied it and didn't look, it's corrected now. I am using the latest version of Iceberg, 1.3.1. So this setting is not really a "file size", it's more like a "task size"? @amogh-jahagirdar
I just tried bumping the value of this setting up 10x (5368709120), and the file sizes are still around 100 MB.
I would not say that, since "task size" is more of a Spark concept. The docs I linked earlier put it concisely: "When writing data to Iceberg with Spark, it’s important to note that Spark cannot write a file larger than a Spark task and a file cannot span an Iceberg partition boundary. This means although Iceberg will always roll over a file when it grows to write.target-file-size-bytes, but unless the Spark task is large enough that will not happen." The property controls rolling over to a new file when the file is about to exceed the target size, so it does respect the target file size. If the Spark task is not large enough, you simply won't see files reach write.target-file-size-bytes. To influence the Spark task size, see the write.distribution-mode property in the docs (and if you're using 1.3.1, the default will be hash-based).
Right, bumping it up won't magically make the files bigger; it depends on the task size, which is determined by Spark (the point above).
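If it helps, here is a quick way to double-check what the table is actually configured with, reusing the `spark` session and placeholder table name from the sketch above:

```python
# Shows the table's effective properties, including write.target-file-size-bytes
# and write.distribution-mode if they've been set explicitly.
spark.sql("SHOW TBLPROPERTIES my_catalog.db.events").show(truncate=False)
```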
Also, what's your configured value for spark.sql.adaptive.advisoryPartitionSizeInBytes?
I also had the same problem. I added a couple of Spark properties and then the file size increased from around 50 MB to around 250 MB.
@amogh-jahagirdar the value of spark.sql.adaptive.advisoryPartitionSizeInBytes is the default, 64 MB.
If the answers to 1) and 2) are both yes, then the size of the files (uncompressed) should be less than 64 MB, and compressed they should be even smaller, maybe around 5 MB. So how do they end up around 100 MB? Does Iceberg do extra coalescing of tasks such that they grow larger than spark.sql.adaptive.advisoryPartitionSizeInBytes?
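For reference, a one-liner to confirm what the running session is actually using (again reusing the `spark` session from the earlier sketch; the "64m" fallback reflects my understanding of the Spark default):

```python
# Prints the session's advisory partition size; Spark falls back to its
# built-in default if the key was never set explicitly.
print(spark.conf.get("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m"))
```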
I increased
Everything Amogh said is correct: the write target file size is the maximum a writer will produce, not the minimum. The amount of data written to a file depends on the amount of data in the Spark task. This is controlled by the advisory partition size if you are using hash or range distributions; if you are not using any write distribution, it is just equal to the size of the Spark tasks. As for your questions:
BUT these only apply if a shuffle happens before the write, which only happens if the write distribution mode is hash or range. Iceberg has no additional coalescing rules.
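Putting the points above together, a sketch of a configuration that should yield larger files; the names and values are placeholders, not a verified recipe from this thread:

```python
from pyspark.sql import SparkSession

# Placeholder catalog/table names; requires the Iceberg runtime on the classpath.
spark = SparkSession.builder.appName("iceberg-bigger-files").getOrCreate()

# 1) Use a hash (or range) distribution so the write is preceded by a shuffle;
#    only then does the advisory partition size govern data per write task.
spark.sql("""
    ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
        'write.distribution-mode' = 'hash'
    )
""")

# 2) Let each shuffle partition (and hence each write task) carry more data.
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "1g")

# 3) write.target-file-size-bytes remains only an upper bound: the writer rolls
#    over when a file reaches it, but a file never holds more than its task's data.
```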
Thanks @RussellSpitzer, you helped me with this in Slack. I understand it now. I think the docs should add some extra explanation about this, though.
@paulpaul1076 It would be really great if you could add some explanation of what you've understood here, as it might benefit others as well.
Apache Iceberg version
None
Query engine
None
Please describe the bug 🐞
I have a job that reads ORC files and writes them to an Iceberg table. The files that are getting created are around 100 MB, not 512 MB, which is the default value of write.target-file-size-bytes. I tried setting write.target-file-size-bytes to 512 MB manually too, but the files are still around 100 MB.
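Roughly, the job has the shape sketched below; the paths and table name are placeholders, and the `target-file-size-bytes` per-write option is my understanding of how a table's write.target-file-size-bytes can be overridden for a single write, so treat it as an assumption:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-to-iceberg").getOrCreate()

# Placeholder input path and Iceberg table name.
df = spark.read.orc("/path/to/orc/input")

# Per-write override of the table's write.target-file-size-bytes (512 MB);
# the resulting files are still capped by the data available to each task.
(df.writeTo("my_catalog.db.events")
   .option("target-file-size-bytes", str(512 * 1024 * 1024))
   .append())
```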