Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reduce clones of LogicalPlan in planner #7775

Closed
wants to merge 4 commits into from
Closed

reduce clones of LogicalPlan in planner #7775

wants to merge 4 commits into from

Conversation

doki23
Copy link
Contributor

@doki23 doki23 commented Oct 9, 2023

Rationale for this change

To reduce clone of the logical plan. This pr may have some relation with #5637
And the clone of input plan will be reduced after #4628 closed.

What changes are included in this PR?

Speedup the planner but make some tests slower than before because of some more clones.

Are these changes tested?

yes.

Are there any user-facing changes?

no.

@github-actions github-actions bot added optimizer Optimizer rules core Core DataFusion crate substrait labels Oct 9, 2023
@github-actions github-actions bot added the sql SQL Planner label Oct 9, 2023
@doki23
Copy link
Contributor Author

doki23 commented Oct 9, 2023

I draft this pr because it seems that we need more thread stack space -- tpcds_physical_q54 meet the problem of thread stack overflow.

@doki23 doki23 marked this pull request as draft October 9, 2023 13:04
Copy link
Contributor

@crepererum crepererum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking through the changes, I wonder if we should wrap LogicalPlan into an Arc similar to the physical version (even though it's not dyn-dispatch). That would safe stack space and makes cloning very cheap. I think this should also be done for all "child" plans in LogicalPlan that are currently Boxed.

@alamb
Copy link
Contributor

alamb commented Oct 10, 2023

One of the tensions is that if we wrapped the plan in Arc it is harder to match on it as I understand

match plan {
  LogicalPlan::Scan(..) => {..}
  LogicalPlan::Project(..) => {..}
  ...
}

@crepererum
Copy link
Contributor

One of the tensions is that if we wrapped the plan in Arc it is harder to match on it as I understand

Depends on how you want to match. You can use match plan.as_ref() {...} but then you need to manually clone all struct members if you want to construct a new LogicalPlan. However I would argue that the logical and phys. plan should not contain copies of any large data structures that are NOT wrapped into an Arc in the first place.

@doki23
Copy link
Contributor Author

doki23 commented Oct 11, 2023

There are some functions taking borrow of plan and return a new plan like:

pub fn optimize(&self, plan: &LogicalPlan) -> Result<LogicalPlan>

If we wrap plan with Arc, we still cannot avoid some clones.
How about Arc<RefCell<LogicalPlan>>? We can return a new plan by changing the original plan by plan.borrow_mut().

@crepererum
Copy link
Contributor

How about Arc<RefCell<LogicalPlan>>? We can return a new plan by changing the original plan by plan.borrow_mut().

I think RefCell and interior mutability make code VERY hard to understand and debug. Rust has very clear API types that lets you pass by value, pass by ref or pass by mutable ref and they all mean different things. Hacking around it for such a fundamental type is IMHO a no-go. I think optimize should be:

pub fn optimize(&self, plan: Arc<LogicalPlan>) -> Result<Arc<LogicalPlan>>

If the plan stays the same, you can just pass through the Arc. If not, you can create a new one. All child nodes can easily be cloned (since they are Arced, at least after the change), and all other metadata that is attached to nodes should either be Arced as well or should be cheap to clone.

@alamb
Copy link
Contributor

alamb commented Oct 11, 2023

I think optimize should be:

pub fn optimize(&self, plan: Arc<LogicalPlan>) -> Result<Arc<LogicalPlan>>

I agree that sounds like a more sensible plan.

@alamb
Copy link
Contributor

alamb commented Oct 11, 2023

fyi @sadboy, @schulte-lukas and @wolfram-s

@doki23 doki23 closed this Dec 14, 2023
@sadboy
Copy link
Contributor

sadboy commented Dec 14, 2023

Hi, just saw this thread. FWIW we (SDF) recently changed all our internal use of LogicalPlan to Arc<LogicalPlan> (for reference, this was the change we made on the Datafusion side: sdf-labs#40), and saw no performance impact whatsoever on our semantic analysis workloads. My guess is that LogicalPlan::clone() is already a O(1) operation (in the size of the logical plan, because of the Arc pointer to parent plans), so saving that constant factor was negligible in the grand scheme of things.

What did turn out to have a huge perf impact on our workloads, was the asymptotic behavior of the logical plan constructors. Specifically, many methods in LogicalPlanBuilder, e.g. project and join, perform input sanitization which is (at least) O(n) in the size of the parent plan(s), and as a result using LogicalPlanBuilder to construct logical plans takes O(n^2) time in the size of the input query.

Anyway, tl;dr is that

  1. Choice of LogicalPlan vs Arc<LogicalPlan> has minimal perf impact, and thus can be made based solely on API ergonomic considerations
  2. Logical plan constructors do have large impact on overall perf, and thus should be careful in the operations they perform

@alamb
Copy link
Contributor

alamb commented Dec 14, 2023

What did turn out to have a huge perf impact on our workloads, was the asymptotic behavior of the logical plan constructors. Specifically, many methods in LogicalPlanBuilder, e.g. project and join, perform input sanitization which is (at least) O(n) in the size of the parent plan(s), and as a result using LogicalPlanBuilder to construct logical plans takes O(n^2) time in the size of the input query.

Thank you @sadboy this is great feedback. I wonder if we could / should make "don't error check" type constructors for this kind of optimization

Perhaps something like

impl ProjectionExec { 
  // Creates a new projection exec without any error checking. Use this only
  // if you know the correct arguments
  pub fn try_new_unchecked(
    expr: Vec<(Arc<dyn PhysicalExpr>, String)>,
    input: Arc<dyn ExecutionPlan>
  ) -> Result<ProjectionExec, DataFusionError> {
    ...
  }
}

@sadboy
Copy link
Contributor

sadboy commented Dec 14, 2023

we could / should make "don't error check" type constructors for this kind of optimization

As a quick and simple solution, that's what I would recommend, yes.

More fundamentally, I think the contention arises from the de-facto "dual
use" nature of Datafusion's LogicalPlan/Expr API, which serves two very
different use cases:

  1. The DataFrame users, who use DataFusion API to programmtically compose
    their queries. In this kind of scenario, you do want to perform eager
    input validation and fail early.
  2. The "analysis" users, who use LogicalPlan as an IR. Datafusion's own
    internal use of LogicalPlan would mostly fall into this category as
    well, including the SQL compiler, the analyzer/optimizers, logical plan
    serializers, etc. In this kind of scenario, input sanitation on the plan
    node constructors is wasteful as arguments are already known to be
    well-formed.

As things currently stand, the constructor methods in LogicalPlanBuilder
are largely geared to serve the former use case, but widely shared in the
code paths of the latter. Not an issue when the input queries are small, but
would definitely cause scalability issues when processing large queries
(like we've been experiencing). Having unchecked "dumb" constructors would
be greatly beneficial for perf here, but only if used consistently through
the whole codebase (not exactly a trivial concern, as in general simply
having "unchecked" in the method name would discourage people from using
them).

Ideally, however, I believe these two use cases are different enough that it
would be beneficial to actually separate them statically at the type level.
i.e. have a separate type hierarchy for "DataFramePlan"/"DataFrameExpr",
parallel to LogicalPlan/Expr. The former would be exclusively used to
capture "end user" input through the DataFrame API, while the latter would
be used exclusively as an internal IR, never directly exposed to the end
user. There would need to be an explicit conversion between the two, but
that would mostly be mechanical and trivial (and where the input sanitation
could take place). The benefit is then you entirely eliminate concerns such
as "should I call the checked or unchecked version of the constructor" when
dealing with logical plan/exprs. In addition, you could "fine tune" each
type hierarchy to better fit its purpose. For example, just off the top of
my head:

  • Plan types such as Expr::Wildcard can be removed, as they don't serve
    any purpose in an IR. The effect is that you can greatly cut down the
    "invalid states" in the analyzer/optimizer pipeline, making it much easier
    to perform tree transformations
  • Expr::Column can be changed to index-based rather than name-based (i.e.
    Column { index: usize }), so you get guaranteed O(1) column resolution

Anyway, that's just my $0.02 🙂 (and as I just realized, probably way off topic for
this thread too, lol)

@alamb
Copy link
Contributor

alamb commented Dec 15, 2023

Anyway, that's just my $0.02 🙂 (and as I just realized, probably way off topic for
this thread too, lol)

I think it is a great discussion to have -- I filed #8556 to get it out of this thread (on a closed ticket) into a new issue for hopefully wider discussions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate optimizer Optimizer rules sql SQL Planner substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants