[HUDI-5338] Adjust coalesce behavior within NONE sort mode for bulk insert #7396
@yihua we also need to
- Review other impls that actually use the parallelism hint (like GlobalSortPartitioner) and make sure that we use max(config, input_parallelism)
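As a rough illustration of the max(config, input_parallelism) suggestion above, here is a minimal sketch. The class and method names are hypothetical and are not part of Hudi; treating a configured value of 0 as "not set" is also an assumption made only for this example.

```java
/**
 * Hypothetical helper (not a Hudi API) illustrating the reviewer's
 * suggestion: never pick a parallelism lower than that of the input.
 */
public final class ParallelismHintSketch {
    private ParallelismHintSketch() {
    }

    /**
     * @param configuredParallelism value from configuration; 0 is treated
     *                              as "not set" (an assumption of this sketch)
     * @param inputParallelism      number of partitions of the incoming data
     * @return the larger of the two values
     */
    public static int resolve(int configuredParallelism, int inputParallelism) {
        if (configuredParallelism < 0) {
            throw new IllegalArgumentException("configuredParallelism must be >= 0");
        }
        if (inputParallelism <= 0) {
            throw new IllegalArgumentException("inputParallelism must be > 0");
        }
        return Math.max(configuredParallelism, inputParallelism);
    }
}
```

With this rule a low configured value cannot reduce the parallelism below that of the input, while a higher configured value still takes effect.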
...hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/NonSortPartitioner.java
@yihua @alexeykudinkin Could you review this? See #7372 (comment); I think they are somewhat related. Moreover, bulk insert is also used in clustering, where its parallelism is determined by the clustering output files in one group.
I created a ticket for the follow-up: HUDI-5360. Regarding the |
Thanks for raising this. I'll check the PR.
Thanks for pointing this out. Yes, I also noticed this. I revised the PR so that the NONE sort mode can still respect the specified number of output partitions for clustering.
…nsert (#7396) This PR adjusts NONE sort mode for bulk insert so that, by default, coalesce is not applied, matching the default parquet write behavior. The NONE sort mode still applies coalesce for clustering as the clustering operation relies on the bulk insert and the specified number of output Spark partitions to write a specific number of files.
Change Logs
Before this change, the NONE sort mode for bulk insert coalesced the input records or rows based on the shuffle parallelism of bulk insert (hoodie.bulkinsert.shuffle.parallelism) to reduce the parallelism. This could affect write latency if the cluster workers are not fully utilized due to reduced parallelism.
This PR adjusts NONE sort mode for bulk insert so that, by default, coalesce is not applied, matching the default parquet write behavior. The NONE sort mode still applies coalesce for clustering as the clustering operation relies on the bulk insert and the specified number of output Spark partitions to write a specific number of files.
New tests are added for the behavior change.
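The behavior change can be sketched as a small partition-count model. This is not the actual NonSortPartitioner code: the class, the method names, and the enforceNumOutputPartitions flag are assumptions made for illustration. The model relies only on the fact that a Spark coalesce without shuffle can reduce the number of partitions but cannot increase it.

```java
/**
 * Illustrative model (not Hudi code) of how many Spark partitions the
 * NONE sort mode would write, before and after this PR.
 */
public final class NoneSortModeSketch {
    private NoneSortModeSketch() {
    }

    /** A non-shuffling coalesce never increases the partition count. */
    static int coalesce(int inputPartitions, int targetPartitions) {
        return Math.min(inputPartitions, targetPartitions);
    }

    /**
     * @param inputPartitions            partitions of the incoming records
     * @param targetPartitions           requested number of output partitions
     * @param enforceNumOutputPartitions hypothetical flag: true for clustering,
     *                                   false for a regular bulk insert
     */
    public static int outputPartitions(int inputPartitions,
                                       int targetPartitions,
                                       boolean enforceNumOutputPartitions) {
        if (inputPartitions <= 0 || targetPartitions <= 0) {
            throw new IllegalArgumentException("partition counts must be > 0");
        }
        if (enforceNumOutputPartitions) {
            // Clustering: still coalesce so the expected number of files is written.
            return coalesce(inputPartitions, targetPartitions);
        }
        // Regular bulk insert after this PR: keep the input partitioning as-is.
        return inputPartitions;
    }
}
```

Under this model, an input of 200 partitions with a target of 50 is written as 200 partitions for a regular bulk insert, and as 50 partitions when the number of output partitions is enforced for clustering.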
Impact
Removing coalesce from the NONE sort mode for bulk insert reduces write latency when the input parallelism is higher than the bulk insert shuffle parallelism, since previously the lower shuffle parallelism could leave the cluster workers not fully utilized.
For clustering, there is no behavior change, i.e., coalesce still happens in NONE sort mode for bulk insert in clustering.
Risk level
low
Documentation Update
HUDI-5339 for updating docs regarding the behavior change in NONE sort mode for bulk insert.
Contributor's checklist