
Implement LimitPushDown for ExecutionPlan #9815

Closed · wants to merge 6 commits

Conversation

Lordworms (Contributor)

Which issue does this PR close?

Closes #9792

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Mar 27, 2024
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Mar 28, 2024
if let Some(global_limit) = plan.as_any().downcast_ref::<GlobalLimitExec>() {
    let input = global_limit.input().as_any();
    if let Some(_) = input.downcast_ref::<CoalescePartitionsExec>() {
        return Ok(Transformed::yes(swap_with_coalesce_partition(global_limit)));
Lordworms (Contributor Author)

Will add more rules then; currently this only targets the repartition.slt test.

Lordworms (Contributor Author) commented Mar 28, 2024

I got stuck in a plan like this:
[screenshot: query plan]
The panic error lies in:
[screenshot: panic location]
The reason it panics is that the output partition count of the LocalLimitExec is 3. However, the LocalLimitExec has passed through a UnionExec. I think the output partition count could be adjusted to 1? Or, in this case, should we not do LimitPushdown at all? Could you please give me an answer? @alamb @mustafasrepo

mustafasrepo (Contributor)

The output partition count of UnionExec is the sum of the partition counts of its inputs. Hence we cannot rely on the output partitioning being 1 after a UnionExec.
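
For illustration, the partition arithmetic that trips up the rule can be sketched as follows (a standalone toy example, not DataFusion source; the helper name is made up):

// Toy sketch: a union concatenates its children's partitions, so its
// output partition count is the sum of the inputs' counts.
fn union_output_partition_count(child_partition_counts: &[usize]) -> usize {
    child_partition_counts.iter().sum()
}

fn main() {
    // Three single-partition inputs yield three output partitions, so a
    // GlobalLimitExec placed directly above them would fail its
    // single-input-partition check, matching the panic described above.
    assert_eq!(union_output_partition_count(&[1, 1, 1]), 3);
}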

Comment on lines 124 to 125
CoalescePartitionsExec
--GlobalLimitExec: skip=0, fetch=5
mustafasrepo (Contributor) commented Mar 28, 2024

I think this change is not correct. We cannot push GlobalLimitExec down through CoalescePartitionsExec. However, we can convert the following pattern

GlobalLimitExec: skip=0, fetch=5
--CoalescePartitionsExec

into

GlobalLimitExec: skip=0, fetch=5
--CoalescePartitionsExec
----LocalLimitExec: skip=0, fetch=5

If skip is larger than 0, the LocalLimitExec should still have skip=0, where its fetch is skip + the global limit's fetch.
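
A rough sketch of that rewrite using the standard DataFusion plan types (the helper function and its surrounding rule plumbing are illustrative assumptions, not the code in this PR):

use std::sync::Arc;

use datafusion::physical_plan::coalesce_partitions::CoalescePartitionsExec;
use datafusion::physical_plan::limit::{GlobalLimitExec, LocalLimitExec};
use datafusion::physical_plan::ExecutionPlan;

// Hypothetical helper: rewrite `GlobalLimitExec -> CoalescePartitionsExec`
// into `GlobalLimitExec -> CoalescePartitionsExec -> LocalLimitExec`.
fn push_local_limit_below_coalesce(
    global_limit: &GlobalLimitExec,
) -> Option<Arc<dyn ExecutionPlan>> {
    let coalesce = global_limit
        .input()
        .as_any()
        .downcast_ref::<CoalescePartitionsExec>()?;
    // Each partition only ever needs skip + fetch rows; the GlobalLimitExec
    // that stays on top still applies the skip after the partitions are merged.
    let per_partition_fetch = global_limit.skip() + global_limit.fetch()?;
    let local: Arc<dyn ExecutionPlan> = Arc::new(LocalLimitExec::new(
        coalesce.input().clone(),
        per_partition_fetch,
    ));
    let new_coalesce: Arc<dyn ExecutionPlan> =
        Arc::new(CoalescePartitionsExec::new(local));
    let new_plan: Arc<dyn ExecutionPlan> = Arc::new(GlobalLimitExec::new(
        new_coalesce,
        global_limit.skip(),
        global_limit.fetch(),
    ));
    Some(new_plan)
}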

Lordworms (Contributor Author)

Got it, so the CoalescePartitionsExec will also be a pushdown terminator and we just add a new LocalLimitExec below it.

Lordworms (Contributor Author)

I don't really understand the reason for adding an extra LocalLimitExec; I think we could simply add a global fetch when we hit this pattern.

mustafasrepo (Contributor) commented Apr 1, 2024

I agree, that would be better. We can have fetch support in CoalescePartitionsExec.
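
As a sketch, fetch support on the operator could look roughly like this (a self-contained stand-in, since CoalescePartitionsExec had no such API at the time; all names here are illustrative):

use std::sync::Arc;

/// Stand-in trait so the sketch compiles on its own.
pub trait Plan {}

/// Simplified model of a coalesce-partitions operator with an optional fetch.
pub struct CoalescePartitions {
    pub input: Arc<dyn Plan>,
    /// Maximum number of rows to emit from the merged stream; None = unlimited.
    pub fetch: Option<usize>,
}

impl CoalescePartitions {
    pub fn new(input: Arc<dyn Plan>) -> Self {
        Self { input, fetch: None }
    }

    /// Builder used by a limit-pushdown rule: the merged output stream would
    /// stop polling its inputs once `fetch` rows have been produced.
    pub fn with_fetch(mut self, fetch: Option<usize>) -> Self {
        self.fetch = fetch;
        self
    }
}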

@@ -83,6 +83,9 @@ impl CoalesceBatchesExec {
input.execution_mode(), // Execution Mode
)
}
pub fn set_target_batch_size(&mut self, siz: usize) {
self.target_batch_size = siz;
Contributor

Instead of overwriting target_batch_size, we can add fetch: Option<usize>. CoalesceBatchesExec can then also emit when it reaches this count, in addition to when it reaches target_batch_size.
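
In other words, the batching loop would treat either threshold as a reason to emit. A simplified, self-contained sketch of that decision, over plain row counts rather than the real CoalesceBatchesStream and Arrow RecordBatches:

/// Minimal model of the emit decision for a fetch-aware CoalesceBatches-like operator.
pub struct Coalescer {
    pub target_batch_size: usize,
    /// Limit pushed down from a Limit operator above; None means no limit.
    pub fetch: Option<usize>,
    pub buffered_rows: usize,
    pub total_rows_emitted: usize,
}

impl Coalescer {
    /// Emit the buffered rows when the batch is large enough, or when emitting
    /// them is already sufficient to satisfy the pushed-down fetch count.
    pub fn should_emit(&self) -> bool {
        let hit_batch_size = self.buffered_rows >= self.target_batch_size;
        let hit_fetch = self
            .fetch
            .map(|f| self.total_rows_emitted + self.buffered_rows >= f)
            .unwrap_or(false);
        hit_batch_size || hit_fetch
    }
}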

Lordworms (Contributor Author)

sure

_config: &ConfigOptions,
) -> Result<Arc<dyn ExecutionPlan>> {
// if this node is not a global limit, then directly return
if !is_global_limit(&plan) {
Contributor

Can we extend the rule to push down GlobalLimits which are not at the top of the plan?

Lordworms (Contributor Author)

Definitely, this is just a draft to test against the existing .slt test.

@@ -83,6 +83,9 @@ impl CoalesceBatchesExec {
input.execution_mode(), // Execution Mode
)
}
pub fn set_target_batch_size(&mut self, siz: usize) {
self.target_batch_size = siz;
Contributor

I think if the fetch count is larger than target_batch_size, this will introduce some incorrect behavior.

Lordworms (Contributor Author)

I'll set it to the max() of the two.

impl LimitPushdown {}
fn new_global_limit_with_input() {}
// try to push down the current limit, based on the child node
fn push_down_limit(
berkaysynnada (Contributor) commented Mar 28, 2024

I believe we can also set a fetch count for CoalesceBatchesExec without changing the plan order. A global fetch count could be carried across the subtree until it faces a breaking plan, but I don't know whether that would bring more capability. Can there be plans which cannot swap with a limit but also do not break the required fetch count?
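
The "carry the fetch across the subtree" idea could look roughly like the following walk, which stops at operators that invalidate the count (a conceptual sketch; the node model and the breaks_fetch classification are made up for illustration):

/// Toy plan node; a real rule would work on Arc<dyn ExecutionPlan>.
struct Node {
    name: &'static str,
    children: Vec<Node>,
    fetch: Option<usize>,
}

/// True for operators below which a pushed-down fetch is no longer valid,
/// because they may need more input rows than they output.
fn breaks_fetch(node: &Node) -> bool {
    matches!(node.name, "FilterExec" | "HashJoinExec" | "AggregateExec")
}

/// Record the fetch on every node it still bounds, then stop descending
/// once a limit-breaking operator is reached.
fn push_fetch(node: &mut Node, fetch: usize) {
    node.fetch = Some(node.fetch.map_or(fetch, |f| f.min(fetch)));
    if breaks_fetch(node) {
        return;
    }
    for child in &mut node.children {
        push_fetch(child, fetch);
    }
}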

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules substrait labels Apr 1, 2024
@github-actions github-actions bot removed sql SQL Planner logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules substrait labels Apr 1, 2024
@Lordworms Lordworms marked this pull request as ready for review April 2, 2024 01:03
alamb (Contributor) left a comment

Thanks @Lordworms -- I took a quick look at this PR

I am probably missing something obvious but I don't understand the need for the pushdown pass in the physical optimizer.

If the use case is to get a limit closer to StreamingTableExec, then maybe we can push the fetch to the CoalesceBatchesExec rather than the StreamingTableExec?

It seems to me that a limit in the StreamingTableExec can likely be implemented more efficiently, and would already be handled by the existing limit pushdown in the LogicalPlan.

Maybe @berkaysynnada or @mustafasrepo have some more context

@@ -167,6 +167,10 @@ impl ExecutionPlan for GlobalLimitExec {

// GlobalLimitExec requires a single input partition
if 1 != self.input.output_partitioning().partition_count() {
println!(
Contributor

I think this println should be removed

@@ -123,7 +123,7 @@ Limit: skip=0, fetch=5
physical_plan
GlobalLimitExec: skip=0, fetch=5
--CoalescePartitionsExec
----CoalesceBatchesExec: target_batch_size=8192
----CoalesceBatchesExec: target_batch_size=8192 fetch= 5
------FilterExec: c3@2 > 0
--------RepartitionExec: partitioning=RoundRobinBatch(3), input_partitions=1
----------StreamingTableExec: partition_sizes=1, projection=[c1, c2, c3], infinite_source=true
Contributor

Shouldn't we also push the limit to the StreamingTableExec as well?

@@ -216,7 +238,8 @@ impl CoalesceBatchesStream {
match input_batch {
Poll::Ready(x) => match x {
Some(Ok(batch)) => {
if batch.num_rows() >= self.target_batch_size
if (batch.num_rows() >= self.target_batch_size
Contributor

I think we need some tests of this new logic added to repartition exec

Lordworms (Contributor Author)

> Thanks @Lordworms -- I took a quick look at this PR
>
> I am probably missing something obvious but I don't understand the need for the pushdown pass in the physical optimizer.
>
> If the use case is to get a limit closer to StreamingTableExec, then maybe we can push the fetch to the CoalesceBatchesExec rather than the StreamingTableExec?

Actually it has been pushed to CoalesceBatchesExec, and then carried down to StreamingTableExec.

> It seems to me that a limit in the StreamingTableExec can likely be implemented more efficiently, and would already be handled by the existing limit pushdown in the LogicalPlan.

I agree that I should add this logic in the LogicalPlan phase; I will move this one back to draft and work on #9873 first.

> Maybe @berkaysynnada or @mustafasrepo have some more context

Thanks for your review, and sorry for my immature design. I'll complete it later.

@Lordworms Lordworms marked this pull request as draft April 4, 2024 01:10
berkaysynnada (Contributor)

> Thanks @Lordworms -- I took a quick look at this PR
>
> I am probably missing something obvious but I don't understand the need for the pushdown pass in the physical optimizer.
>
> If the use case is to get a limit closer to StreamingTableExec, then maybe we can push the fetch to the CoalesceBatchesExec rather than the StreamingTableExec?
>
> It seems to me that a limit in the StreamingTableExec can likely be implemented more efficiently, and would already be handled by the existing limit pushdown in the LogicalPlan.
>
> Maybe @berkaysynnada or @mustafasrepo have some more context

Thanks @alamb for the feedback. @Lordworms's strategy is actually intuitive and reasonable, but maybe we need another way to solve the problem.

If I summarize #9792: the problem is that when a Limit exists above CoalesceBatches, CoalesceBatches waits until all rows are collected, including rows that are possibly never used after the Limit. Therefore, we need CoalesceBatches to be aware of the Limit's fetch count, and once that many rows are collected, it should be able to return them without waiting for more.
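
That "return without waiting for more" behavior could be sketched as a small piece of state around the buffered rows (simplified; the real operator streams Arrow RecordBatches and would also have to slice the final batch):

/// Simplified buffer state for a fetch-aware coalescing stream.
pub struct FetchAwareBuffer {
    /// Limit pushed down from the Limit above; None means unlimited.
    pub fetch: Option<usize>,
    pub rows_emitted: usize,
}

impl FetchAwareBuffer {
    /// How many of `incoming_rows` may still be emitted. None means the
    /// pushed-down limit is already satisfied, so the stream can finish
    /// early instead of draining the rest of its input.
    pub fn rows_to_take(&self, incoming_rows: usize) -> Option<usize> {
        match self.fetch {
            None => Some(incoming_rows),
            Some(fetch) if self.rows_emitted >= fetch => None,
            Some(fetch) => Some(incoming_rows.min(fetch - self.rows_emitted)),
        }
    }

    pub fn record_emitted(&mut self, rows: usize) {
        self.rows_emitted += rows;
    }
}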

alamb (Contributor) commented Apr 4, 2024

Edit: moved conversation to #9792 (comment)

alamb (Contributor) commented Apr 4, 2024

> Maybe @berkaysynnada or @mustafasrepo have some more context
> Thanks for your review, and sorry for my immature design. I'll complete it later.

No worries at all -- we are all sorting this out together, @Lordworms. Thank you for helping push it along.

github-actions bot commented Jun 4, 2024

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Jun 4, 2024
@github-actions github-actions bot closed this Jun 11, 2024
Labels: core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), Stale (PR has not had any activity for some time)
Successfully merging this pull request may close these issues: Adding Fetch Support to CoalesceBatchesExec
4 participants