reduce clones of LogicalPlan in planner #7775

doki23 · 2023-10-09T09:42:13Z

Rationale for this change

To reduce clones of the logical plan. This PR may be related to #5637.
Clones of the input plan will be reduced further once #4628 is closed.

What changes are included in this PR?

Speeds up the planner, but makes some tests slower than before because they now need a few more clones.

Are these changes tested?

yes.

Are there any user-facing changes?

no.

doki23 · 2023-10-09T13:04:06Z

I'm marking this PR as a draft because it seems we need more thread stack space: tpcds_physical_q54 hits a thread stack overflow.

crepererum

Looking through the changes, I wonder if we should wrap LogicalPlan in an Arc, similar to the physical version (even though it's not dyn-dispatch). That would save stack space and make cloning very cheap. I think this should also be done for all "child" plans in LogicalPlan that are currently Boxed.

alamb · 2023-10-10T14:14:23Z

One of the tensions is that if we wrapped the plan in Arc it is harder to match on it as I understand

match plan {
  LogicalPlan::Scan(..) => {..}
  LogicalPlan::Project(..) => {..}
  ...
}

crepererum · 2023-10-10T16:30:12Z

One of the tensions is that if we wrapped the plan in Arc it is harder to match on it as I understand

Depends on how you want to match. You can use match plan.as_ref() {...}, but then you need to manually clone all struct members if you want to construct a new LogicalPlan. However, I would argue that the logical and physical plans should not contain copies of any large data structures that are NOT wrapped in an Arc in the first place.
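To make the trade-off concrete, here is a minimal sketch of matching through an Arc. The `Plan` enum, its variants, and both helper functions are hypothetical stand-ins invented for illustration; they are not DataFusion's actual LogicalPlan API.

```rust
use std::sync::Arc;

// Toy stand-in for a logical plan; variants and fields are hypothetical.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { table: String },
    Project { columns: Vec<String>, input: Arc<Plan> },
}

// Matching through the Arc only needs a borrow of the inner enum.
fn describe(plan: &Arc<Plan>) -> String {
    match plan.as_ref() {
        Plan::Scan { table } => format!("scan {}", table),
        Plan::Project { columns, input } => {
            format!("project [{}] <- {}", columns.join(", "), describe(input))
        }
    }
}

// Building a new node from a borrowed one means cloning its members by hand.
// Cloning the child is only a reference-count bump; `columns` is a real copy.
fn with_extra_column(plan: &Arc<Plan>, extra: &str) -> Arc<Plan> {
    match plan.as_ref() {
        Plan::Project { columns, input } => {
            let mut columns = columns.clone();
            columns.push(extra.to_string());
            Arc::new(Plan::Project { columns, input: Arc::clone(input) })
        }
        // Any other node is returned as-is by sharing the same allocation.
        _ => Arc::clone(plan),
    }
}
```

The manual cloning shows up in `with_extra_column`: the match arms only see references, so each field of the new node has to be cloned or rebuilt explicitly.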

doki23 · 2023-10-11T11:56:12Z

Some functions borrow a plan and return a new plan, for example:

pub fn optimize(&self, plan: &LogicalPlan) -> Result<LogicalPlan>

If we wrap the plan in an Arc, we still cannot avoid some clones.
How about Arc<RefCell<LogicalPlan>>? We can return a new plan by changing the original plan by plan.borrow_mut().

crepererum · 2023-10-11T12:06:23Z

How about Arc<RefCell<LogicalPlan>>? We can return a new plan by changing the original plan by plan.borrow_mut().

I think RefCell and interior mutability make code VERY hard to understand and debug. Rust has very clear API types that let you pass by value, by reference, or by mutable reference, and they all mean different things. Hacking around that for such a fundamental type is IMHO a no-go. I think optimize should be:

pub fn optimize(&self, plan: Arc<LogicalPlan>) -> Result<Arc<LogicalPlan>>

If the plan stays the same, you can just pass through the Arc. If not, you can create a new one. All child nodes can easily be cloned (since they are Arced, at least after the change), and all other metadata that is attached to nodes should either be Arced as well or should be cheap to clone.
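A minimal sketch of the pass-through idea, assuming a toy `Plan` enum and a made-up rule that drops filters whose predicate is trivially true. Everything here is hypothetical, and the Result wrapper from the proposed signature is omitted to keep the example short.

```rust
use std::sync::Arc;

// Toy plan type; a hypothetical stand-in for LogicalPlan.
#[allow(dead_code)]
#[derive(Debug)]
enum Plan {
    Scan { table: String },
    Filter { always_true: bool, input: Arc<Plan> },
}

// Hypothetical rule: remove filters whose predicate is trivially true.
fn optimize(plan: Arc<Plan>) -> Arc<Plan> {
    // `None` means "nothing changed below or at this node".
    let rewritten = match plan.as_ref() {
        Plan::Scan { .. } => None,
        Plan::Filter { always_true, input } => {
            // Cloning the child Arc is a reference-count bump, not a deep copy.
            let new_input = optimize(Arc::clone(input));
            if *always_true {
                // The filter disappears; its (optimized) input replaces it.
                Some(new_input)
            } else if Arc::ptr_eq(&new_input, input) {
                // The child came back untouched, so this node is unchanged too.
                None
            } else {
                // The child changed: allocate a new parent around it.
                Some(Arc::new(Plan::Filter { always_true: false, input: new_input }))
            }
        }
    };
    // Unchanged plans are passed through as the very same Arc.
    rewritten.unwrap_or(plan)
}
```

With this shape, a caller can use `Arc::ptr_eq` on the input and output to tell whether the rule changed anything, without comparing plans structurally.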

alamb · 2023-10-11T21:01:51Z

I think optimize should be:

pub fn optimize(&self, plan: Arc<LogicalPlan>) -> Result<Arc<LogicalPlan>>

I agree that sounds like a more sensible plan.

alamb · 2023-10-11T21:37:09Z

fyi @sadboy, @schulte-lukas and @wolfram-s

sadboy · 2023-12-14T16:38:47Z

Hi, just saw this thread. FWIW we (SDF) recently changed all our internal use of LogicalPlan to Arc<LogicalPlan> (for reference, this was the change we made on the DataFusion side: sdf-labs#40), and saw no performance impact whatsoever on our semantic analysis workloads. My guess is that LogicalPlan::clone() is already an O(1) operation (in the size of the logical plan, because of the Arc pointer to parent plans), so saving that constant factor was negligible in the grand scheme of things.

What did turn out to have a huge perf impact on our workloads, was the asymptotic behavior of the logical plan constructors. Specifically, many methods in LogicalPlanBuilder, e.g. project and join, perform input sanitization which is (at least) O(n) in the size of the parent plan(s), and as a result using LogicalPlanBuilder to construct logical plans takes O(n^2) time in the size of the input query.

Anyway, tl;dr is that

Choice of LogicalPlan vs Arc<LogicalPlan> has minimal perf impact, and thus can be made based solely on API ergonomic considerations
Logical plan constructors do have large impact on overall perf, and thus should be careful in the operations they perform
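The quadratic behavior can be illustrated with a toy model, sketched here under stated assumptions: a plan is a linear chain of nodes, and a "checked" constructor walks the entire input before adding a node. `Node`, `validate`, and `build_chain` are hypothetical names, not DataFusion code.

```rust
use std::sync::Arc;

// Toy plan: a linear chain of nodes. Hypothetical, not DataFusion's types.
struct Node {
    input: Option<Arc<Node>>,
}

// Stand-in for an O(n) check: walk the whole input plan and
// return how many nodes were visited.
fn validate(input: &Option<Arc<Node>>) -> usize {
    let mut visited = 0;
    let mut cur = input.as_ref();
    while let Some(node) = cur {
        visited += 1;
        cur = node.input.as_ref();
    }
    visited
}

// Build a chain of `n` nodes and return the total number of node visits
// spent on validation along the way.
fn build_chain(n: usize, checked: bool) -> usize {
    let mut work = 0;
    let mut plan: Option<Arc<Node>> = None;
    for _ in 0..n {
        if checked {
            // Each constructor call re-walks everything built so far.
            work += validate(&plan);
        }
        plan = Some(Arc::new(Node { input: plan }));
    }
    work
}
```

In this model, building n nodes with validation costs 0 + 1 + ... + (n - 1) = n(n - 1)/2 visits, so `build_chain(100, true)` does 4950 visits while `build_chain(100, false)` does none.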

alamb · 2023-12-14T17:51:36Z

What did turn out to have a huge perf impact on our workloads, was the asymptotic behavior of the logical plan constructors. Specifically, many methods in LogicalPlanBuilder, e.g. project and join, perform input sanitization which is (at least) O(n) in the size of the parent plan(s), and as a result using LogicalPlanBuilder to construct logical plans takes O(n^2) time in the size of the input query.

Thank you @sadboy, this is great feedback. I wonder if we could / should make "don't error check" type constructors for this kind of optimization.

Perhaps something like

impl ProjectionExec { 
  // Creates a new projection exec without any error checking. Use this only
  // if you know the correct arguments
  pub fn try_new_unchecked(
    expr: Vec<(Arc<dyn PhysicalExpr>, String)>,
    input: Arc<dyn ExecutionPlan>
  ) -> Result<ProjectionExec, DataFusionError> {
    ...
  }
}
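As a rough sketch of how a checked and an unchecked constructor might sit side by side, here is a toy `Projection` type. The struct, its fields, and both constructor names are assumptions made for illustration and do not reflect DataFusion's real ProjectionExec.

```rust
// Hypothetical toy type; not DataFusion's ProjectionExec.
#[derive(Debug)]
struct Projection {
    columns: Vec<String>,
    input_schema: Vec<String>,
}

impl Projection {
    // Checked constructor: verifies that every projected column exists
    // in the input schema before building the node.
    fn try_new(columns: Vec<String>, input_schema: Vec<String>) -> Result<Self, String> {
        for c in &columns {
            if !input_schema.contains(c) {
                return Err(format!("unknown column: {}", c));
            }
        }
        Ok(Self::new_unchecked(columns, input_schema))
    }

    // Unchecked constructor: the caller guarantees the arguments are
    // well-formed, so no validation runs in release builds.
    fn new_unchecked(columns: Vec<String>, input_schema: Vec<String>) -> Self {
        // Sanity check that is compiled out when debug assertions are off.
        debug_assert!(columns.iter().all(|c| input_schema.contains(c)));
        Self { columns, input_schema }
    }
}
```

Having the checked constructor delegate to the unchecked one keeps the two paths from drifting apart: user-facing code calls `try_new`, while internal rewrites that already hold a valid node can call `new_unchecked`.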

sadboy · 2023-12-14T22:45:46Z

we could / should make "don't error check" type constructors for this kind of optimization

As a quick and simple solution, that's what I would recommend, yes.

More fundamentally, I think the contention arises from the de-facto "dual
use" nature of DataFusion's LogicalPlan/Expr API, which serves two very
different use cases:

The DataFrame users, who use the DataFusion API to programmatically compose
their queries. In this kind of scenario, you do want to perform eager
input validation and fail early.
The "analysis" users, who use LogicalPlan as an IR. DataFusion's own
internal use of LogicalPlan would mostly fall into this category as
well, including the SQL compiler, the analyzer/optimizers, logical plan
serializers, etc. In this kind of scenario, input sanitization in the plan
node constructors is wasteful as arguments are already known to be
well-formed.

As things currently stand, the constructor methods in LogicalPlanBuilder
are largely geared to serve the former use case, but widely shared in the
code paths of the latter. Not an issue when the input queries are small, but
would definitely cause scalability issues when processing large queries
(like we've been experiencing). Having unchecked "dumb" constructors would
be greatly beneficial for perf here, but only if used consistently through
the whole codebase (not exactly a trivial concern, as in general simply
having "unchecked" in the method name would discourage people from using
them).

Ideally, however, I believe these two use cases are different enough that it
would be beneficial to actually separate them statically at the type level.
i.e. have a separate type hierarchy for "DataFramePlan"/"DataFrameExpr",
parallel to LogicalPlan/Expr. The former would be exclusively used to
capture "end user" input through the DataFrame API, while the latter would
be used exclusively as an internal IR, never directly exposed to the end
user. There would need to be an explicit conversion between the two, but
that would mostly be mechanical and trivial (and where the input sanitization
could take place). The benefit is then you entirely eliminate concerns such
as "should I call the checked or unchecked version of the constructor" when
dealing with logical plan/exprs. In addition, you could "fine tune" each
type hierarchy to better fit its purpose. For example, just off the top of
my head:

Expression variants such as Expr::Wildcard can be removed, as they don't serve
any purpose in an IR. The effect is that you can greatly cut down the
"invalid states" in the analyzer/optimizer pipeline, making it much easier
to perform tree transformations
Expr::Column can be changed to index-based rather than name-based (i.e.
Column { index: usize }), so you get guaranteed O(1) column resolution
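One way to picture the split is sketched below, under heavy assumptions: `UserExpr` and `IrExpr` are invented toy types, `lower` is an invented conversion function, and schemas are plain slices of column names. The point is only that wildcard expansion and name resolution happen once, at the boundary.

```rust
// User-facing expression: name-based and may contain a wildcard. Hypothetical.
#[derive(Debug, Clone, PartialEq)]
enum UserExpr {
    Column(String),
    Wildcard,
}

// Internal IR expression: index-based, with no wildcard variant at all.
#[derive(Debug, Clone, PartialEq)]
enum IrExpr {
    Column { index: usize },
}

// Explicit conversion between the two hierarchies. All validation happens
// here, once, so IR consumers never have to re-check names or wildcards.
fn lower(exprs: &[UserExpr], schema: &[&str]) -> Result<Vec<IrExpr>, String> {
    let mut out = Vec::new();
    for e in exprs {
        match e {
            // A wildcard expands to every column of the input schema.
            UserExpr::Wildcard => {
                out.extend((0..schema.len()).map(|index| IrExpr::Column { index }));
            }
            // A named column is resolved to its position, or rejected.
            UserExpr::Column(name) => {
                let index = schema
                    .iter()
                    .position(|c| *c == name.as_str())
                    .ok_or_else(|| format!("unknown column: {}", name))?;
                out.push(IrExpr::Column { index });
            }
        }
    }
    Ok(out)
}
```

After `lower`, the IR cannot represent an unresolved name or a wildcard, so later passes over `IrExpr` have fewer invalid states to handle, and looking up a column is a direct index into the schema.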

Anyway, that's just my $0.02 🙂 (and as I just realized, probably way off topic for
this thread too, lol)

alamb · 2023-12-15T12:10:26Z

Anyway, that's just my $0.02 🙂 (and as I just realized, probably way off topic for
this thread too, lol)

I think it is a great discussion to have -- I filed #8556 to get it out of this thread (on a closed ticket) into a new issue for hopefully wider discussions

doki23 added 3 commits October 7, 2023 16:25

take the ownership of LogicalPlan in the planner

7359372

fix test

2176ad7

make physical planner takes ownership of logicalplans

5d116b7

github-actions bot added the optimizer (Optimizer rules), core (Core DataFusion crate), and substrait labels Oct 9, 2023

clippy

f046065

github-actions bot added the sql (SQL Planner) label Oct 9, 2023

doki23 marked this pull request as draft October 9, 2023 13:04

crepererum reviewed Oct 10, 2023

doki23 closed this Dec 14, 2023

alamb mentioned this pull request Dec 15, 2023

Discuss: Speeding up LogicalPlan manipulation by skipping validation #8556

Open

peter-toth mentioned this pull request Dec 15, 2023

Refactor TreeNode recursions #7942

Closed
