
write.target-file-size-bytes isn't respected when writing data #8729

Closed
paulpaul1076 opened this issue Oct 6, 2023 · 12 comments

Comments


paulpaul1076 commented Oct 6, 2023

Apache Iceberg version

None

Query engine

None

Please describe the bug 🐞

I have a job that reads ORC files and writes them to an Iceberg table. The files that are getting created are around 100MB, not 512MB, which is the default value of write.target-file-size-bytes. I tried setting write.target-file-size-bytes to 512MB manually as well, but the files are still around 100MB.

  val df = spark.read.orc("s3://hdp-temp/arch/csv3_2023")
  df.writeTo("db.batch_iceberg_test3")
    .tableProperty("write.target-file-size-bytes", "536870912")
    .createOrReplace()
[screenshot attached showing the resulting file sizes]
amogh-jahagirdar (Contributor) commented Oct 6, 2023

The issue title and parts of the description refer to write.parquet.page-size-bytes, but what you are describing is the file size, and your code also refers to the file size. Did you perhaps mean write.target-file-size-bytes instead? I'm assuming that's the case, since your question is about the file size.

Check out https://iceberg.apache.org/docs/latest/spark-writes/#controlling-file-sizes and https://iceberg.apache.org/docs/latest/spark-writes/#writing-distribution-modes

The file size will be bounded by the Spark task size; if the task size exceeds write.target-file-size-bytes, the writer will roll over to a new file. However, if the task size is smaller, a larger file won't be written; the file will be at most the task size. When this gets written to disk, since Parquet is highly compressible, it'll be even smaller.

There's a write.distribution-mode table property which controls how the data is distributed across the Spark tasks performing the writes.
Prior to 1.2.0 the default was none, which required explicitly ordering by partition; for tables created after 1.2.0 the default is hash, which shuffles the data via hash prior to writing. This change was made to alleviate the small-files problem, so the Iceberg version you are using will also be helpful info.
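
As a rough sketch (reusing the table name from your snippet, so adjust as needed), both properties can be set explicitly like this:

  // Sketch only: explicitly set the target file size and the distribution
  // mode on the table from the snippet above.
  spark.sql("""
    ALTER TABLE db.batch_iceberg_test3 SET TBLPROPERTIES (
      'write.target-file-size-bytes' = '536870912',
      'write.distribution-mode' = 'hash'
    )
  """)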

@RussellSpitzer @aokolnychyi would also have more expertise in this area, so please correct me if I'm wrong about anything!

paulpaul1076 (Author) commented Oct 6, 2023

Wrong name in the title, sorry; I copied it and didn't look. It's corrected now. I am using the latest version of Iceberg, 1.3.1.

So this setting is not really "file size", it's more like "task size"? @amogh-jahagirdar

paulpaul1076 changed the title from "write.parquet.page-size-bytes isn't respected when writing data" to "write.parquet.target-file-size-bytess isn't respected when writing data" on Oct 6, 2023
paulpaul1076 changed the title from "write.parquet.target-file-size-bytess isn't respected when writing data" to "write.target-file-size-bytes isn't respected when writing data" on Oct 6, 2023
paulpaul1076 (Author) commented:

I just tried bumping this setting up by a factor of 10 (5368709120), and the file sizes are still around 100MB.

amogh-jahagirdar (Contributor) commented Oct 6, 2023

So this setting is not really "file size", it's more like "task size"?

I would not say that, since task size is more related to Spark. The docs I linked earlier put it concisely: "When writing data to Iceberg with Spark, it's important to note that Spark cannot write a file larger than a Spark task and a file cannot span an Iceberg partition boundary. This means that although Iceberg will always roll over a file when it grows to write.target-file-size-bytes, unless the Spark task is large enough that will not happen."

The property controls rolling over to a new file when the file is about to exceed the target size, so it does respect the target file size. If the Spark task is not large enough, you won't see files hit write.target-file-size-bytes. To influence the Spark task size, see the write.distribution-mode properties in the docs (and if you're using 1.3.1, the default will be the hash-based mode).

I just tried bumping this setting up by a factor of 10 (5368709120), and the file sizes are still around 100MB.

Right, bumping it up won't magically make the files bigger; it depends on the task size, which is determined by Spark (the point above).

amogh-jahagirdar (Contributor) commented:

Also, what's your configured value for spark.sql.adaptive.advisoryPartitionSizeInBytes? That will also influence the Spark task size in your case (by default, the value is 64MB).
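
For example (the value here is just illustrative), you can check and raise it at runtime:

  // Illustrative only: check the advisory partition size and raise it so
  // that shuffle partitions (and therefore the written files) can be larger.
  spark.conf.get("spark.sql.adaptive.advisoryPartitionSizeInBytes")
  spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "512m")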

gzagarwal commented:

I also had the same problem. I added a couple of Spark properties and the file size increased from around 50 MB to around 250 MB.
As Amogh pointed out, try increasing spark.sql.adaptive.advisoryPartitionSizeInBytes from 64MB to 512MB, and set write.distribution-mode to hash.
I also have AQE enabled.
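
Roughly, the combination described above would look something like this (paths and values mirror the earlier snippet and are only examples):

  // Example sketch combining the suggestions above: a larger advisory
  // partition size plus hash distribution and a 512MB target file size.
  spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "512m")

  spark.read.orc("s3://hdp-temp/arch/csv3_2023")
    .writeTo("db.batch_iceberg_test3")
    .tableProperty("write.target-file-size-bytes", "536870912")
    .tableProperty("write.distribution-mode", "hash")
    .createOrReplace()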

paulpaul1076 (Author) commented Oct 6, 2023

@amogh-jahagirdar the value of spark.sql.adaptive.advisoryPartitionSizeInBytes is the default 64MB.

  1. Do I understand correctly that the size of the uncompressed data in each of my Spark tasks is supposed to be around 64MB?
  2. Also, do I understand correctly that we can't write one file from multiple tasks in Spark?

If the answers to 1) and 2) are both yes, then the files (uncompressed) should be less than 64MB, and once compressed they should be even smaller, maybe around 5MB. But how do they end up being 100MB? Does Iceberg do extra coalescing of tasks such that they grow larger than spark.sql.adaptive.advisoryPartitionSizeInBytes?

paulpaul1076 (Author) commented Oct 6, 2023

I increased spark.sql.adaptive.advisoryPartitionSizeInBytes and the files are still around 100MB in size.

RussellSpitzer (Member) commented:

Everything Amogh said is correct: the write target file size is the maximum a writer will produce, not the minimum. The amount of data written to a file depends on the amount of data in the Spark task. This is controlled by the advisory partition size if you are using hash or range distributions; if you are not using any write distribution, it is just equal to the size of the Spark tasks.

As for your questions,

  1. It's whatever the shuffle engine thinks the size of the Spark serialized rows is.
  2. Yes

BUT these only apply if a shuffle is happening before the write, which only happens if the write distribution mode is hash or range. Iceberg has no additional coalescing rules.
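
If you want to double-check what your table is actually using, something like this should list the explicitly set table properties (table name taken from the earlier snippet):

  // Sketch: list the table properties to see whether write.distribution-mode
  // (and write.target-file-size-bytes) have been set explicitly.
  spark.sql("SHOW TBLPROPERTIES db.batch_iceberg_test3").show(truncate = false)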

paulpaul1076 (Author) commented:

Thanks @RussellSpitzer, you helped me with this in Slack. I understand it now. I think the docs should add some extra explanation about this, though.


atifiu commented Dec 26, 2023

@paulpaul1076 It would be really great if you could add some explanation of what you have understood regarding this, as it might benefit others as well.
