-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: logical op constructor+builder boundary #3684
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3684 +/- ##
==========================================
- Coverage 77.82% 77.79% -0.03%
==========================================
Files 728 732 +4
Lines 89919 90457 +538
==========================================
+ Hits 69975 70368 +393
- Misses 19944 20089 +145
|
Note to reviewers: Join has been pretty broken and I don't want to change that behavior in this PR. This is going to be directly followed by PR to fix various join issues, so don't worry too much about any bugs with join you find in this PR. |
CodSpeed Performance ReportMerging #3684 will degrade performances by 34.92%Comparing Summary
Benchmarks breakdown
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall looks good. I do think a builder pattern may be better suited for Join
in the long term though. Right now we have 8 arguments to the constructor.
Something like this I think would be a bit more intuitive
JoinBuilder::new(left, right)
.left_on(left_on)
.right_on(right_on)
// could also add a `.on(join_keys)` instead of needing to always supply left and right keys.
.join_type(join_type)
.join_suffix(suffix) // optional
.join_prefix(prefix) // optional
.rename_right_columns(true) // optional
.keep_join_keys(true) // optional
.finish()
@@ -188,11 +188,19 @@ impl LogicalPlanBuilder { | |||
} | |||
|
|||
pub fn select(&self, to_select: Vec<ExprRef>) -> DaftResult<Self> { | |||
let expr_resolver = ExprResolver::builder().allow_actor_pool_udf(true).build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, why does each (or some) method create their own expr_resolver? Maybe the builder could hold a single expr_resolver for its schema, then resolve_single
takes a single arg and closes over self.schema().
But there may be more to this I am not familiar with yet, just curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each logical op may resolve expressions differently. For example, Agg expects expressions to be aggregation expressions, whereas Project expects no aggregation expressions. Take a look at the parameters to the ExprResolver builder!
Some general comments as I familiarize myself with this.
Agreed, rewrites/transforms f on the logical domain L so f(L) -> L shouldn't create unresolved expressions (afaik) – once resolved to L you don't leave L.
Do you anticipate every logical operator having both a try_new and a builder? The builders may need to perform checks incrementally as well as on the final .build(). |
@universalmind303 @rchowell I'm open to having individual builders for logical ops, but I'm not sure yet how that would fit into our current structure. We already have a LogicalPlanBuilder, which is the interface that our external APIs use to construct logical plans. It would not make much sense to create a builder abstraction for each op but have, say, our dataframe API still create joins by passing in all the arguments to We should probably think about this on a case-by-case basis. Most ops are pretty basic and do not require builders. Moreover, I'm not super happy about this current LogicalPlanBuilder abstraction either, it tries to work for both dataframe and sql but is becoming a little unwieldy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a bit of cleanup along with this PR:
- moved resolve_expr to inside builder because we don't want anything other than the builder to use it anymore
- removed various
.context(CreationSnafu)?
because conversion to logical_plan::Result is actually done automatically for DaftResult if using?
, so this is not needed
The problem
Plan ops are created for various reasons through our code - from our dataframe or sql interfaces, to optimization rules, to even op constructors themselves which can sometimes create other ones. All of these cases generally go through the same new/try_new constructor for each op, which tries to accommodate all of these use cases. This creates complexity, adds unnecessary compute to planning time, and also conflates user input errors with Daft internal errors.
For example, I don't expect any optimization rules to create unresolved expressions, expression resolution should only be done for the builder. Another example is the Join op, where inputs such as join_prefix and join_suffix are only applicable for renaming columns, which should also only happen via the builder. We recently added another initializer to some ops for that reason, but it bypasses the validation that is typically done and is not standardized across ops.
My solution
Every op should provide a
try_new
constructor which contain explicit checks for all the requirements about the op's state (one example would be that all expression columns exist in the schema), but otherwise should simply put those values into the struct without any modification and return it.LogicalPlan::with_new_children
will just calltry_new
.try_new
. E.g. aJoin::rename_right_columns
to rename the right side columns that conflict with the left side, called to update the right side schema before callingtry_new
.