-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle #28846
Conversation
Test build #124149 has finished for PR 28846 at commit
|
Test build #124152 has finished for PR 28846 at commit
|
retest this please |
Test build #124159 has finished for PR 28846 at commit
|
Can you elaborate it more? How does this optimization help to plan broadcast join?
The AQE needs to wait for the stage to finish, so that it knows the size and can change SMJ to BHJ. How can we avoid unnecessary I/O after the stage is finished? |
Use an example to elaborate it. This query The adaptivePlan in current master:
In above adaptivePlan, AQE produces two The adaptivePlan in this change:
In this change, AQE only materializes one exchange and then optimizes SortMergeJoin as BroadcastHashJoin. After that, we don't need to produce another exchange. |
How do you achieve it? Do you hold off the execution of one query stage, and wait until another query stage completes? |
This change does not create and materialize all query stages of the join in a batch. It creates and materializes one stage first and then re-optimize the join. So once it makes the join as broadcast join, it won't create the unnecessary exchange. |
This is the confusing part. Creating a stage is not enough, we must wait for it to complete, then we can know the size and optimize the join to broadcast join. |
I updated previous comment. "It creates and materializes one stage first..." You can see the query plan in previous comment, it optimizes the join to broadcast join. |
Do you mean to trigger the materialization or wait for it to complete? |
It needs to wait for it to complete, this is how AQE does. As you said, we need to know the size of exchange. |
The question is: This means you hold off the other stage, right? Shouldn't it cause any regressions? If this is eventually a SMJ instead of a BHJ, one of the stages will be delayed. |
Does triggering all stages of a join, mean they are running at the same time actually? I think it means they are put into scheduler. When the stages are put to running depends on resources provision. If first stage uses all resources, I think later stage still needs to held off? It is also related to one question I have, the speed-up of AQE is gained by triggering all stages (not holding off other stage as you said) together, or optimizing join from SMJ to BHJ (if we only consider join case)? I may misunderstand, but before having AQE in SparkSQL, I think we don't trigger all stages like that too, right? |
That's true, but that's an assumption. It's also possible that these 2 jobs indeed run together.
In the benchmark, the default parallelism takes all the CPU cores. I think the most perf gain should be from shuffle partition coalescing and SMJ -> BHJ. cc @JkSelf That said, by design AQE triggers all independent stages at the same time, to maximize the parallelism. And it's helpful if the resource is sufficient (or auto-scaling). I don't think we should change this design. |
And as @maryannxue said, you may trigger the large side first and it doesn't make sense to hold off. Ideally we should trigger both sides and cancel the large side if the small side completes very quickly. It will be great if you can explore the cancelation approach. |
@cloud-fan Thanks for clarifying. The idea sounds worth exploring as we can avoid local shuffle under current parallelism design of independent stages in AQE. I will explore the possibility. |
In our previous 3TB TPC-DS benchmark, the perf improvement is mainly benefit from the coalescing shuffle partitions and SMJ -> BHJ two features. The result is here for reference. |
Hi, @viirya . Could you rebase this PR to the |
Test build #127207 has finished for PR 28846 at commit
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
This patch changes the current way of creating query stages in AQE. Instead of creating query stages in batch, incrementally creating query stage can bring the optimization in earlier. It could avoid unnecessary local shuffle.
Why are the changes needed?
The current way of creating query stage in AQE is in batch. For example, the children of a sort merge join will be materialized as query stages in a batch. Then AQE brings the optimization in and optimize sort merge join to broadcast join. Except for the broadcasted exchange, we don't need do any exchange on another side of join but we already materialized the exchange. Currently AQE wraps the materialized exchange with local reader, but it still brings unnecessary I/O. We can avoid unnecessary local shuffle by incrementally creating query stage.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests.