refactor: logical op constructor+builder boundary #3684

Merged
kevinzwang merged 7 commits into main from kevin/logical-plan-builder-refactor
Jan 16, 2025

Conversation

kevinzwang
Member

@kevinzwang kevinzwang commented Jan 14, 2025

The problem

Plan ops are created for various reasons throughout our code: from our dataframe and SQL interfaces, from optimization rules, and even from op constructors themselves, which sometimes create other ops. All of these cases generally go through the same new/try_new constructor for each op, which tries to accommodate every use case. This adds complexity and unnecessary compute at planning time, and it conflates user input errors with Daft internal errors.

For example, I don't expect any optimization rule to create unresolved expressions; expression resolution should only be done for the builder. Another example is the Join op, where inputs such as join_prefix and join_suffix only apply to renaming columns, which should also happen only via the builder. We recently added another initializer to some ops for that reason, but it bypasses the usual validation and is not standardized across ops.

My solution

Every op should provide a try_new constructor which contains explicit checks for all the requirements on the op's state (for example, that all expression columns exist in the schema), but otherwise simply puts those values into the struct without any modification and returns it.

  • Functions such as LogicalPlan::with_new_children will just call try_new.
  • Other constructors/helpers may exist that explicitly provide additional functionality and ultimately call try_new. E.g. a Join::rename_right_columns to rename the right side columns that conflict with the left side, called to update the right side schema before calling try_new.
  • User input normalization, such as expression resolution, should be handled by the logical plan builder. After the logical plan op has been constructed, everything should be in a valid state from there on.
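The boundary described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: `Schema`, `Expr`, `PlanError`, `with_new_input` and `build_filter` are hypothetical stand-ins invented for the example, not Daft's actual types or functions. The point is that `try_new` only validates and stores, while the builder-side function is the only place where unresolved user input is normalized.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for a plan schema: just a set of column names.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub columns: HashSet<String>,
}

/// Hypothetical expression type. `Unresolved` models raw user input
/// that only the builder is expected to handle.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Unresolved(String),
    Column(String),
}

#[derive(Debug, PartialEq)]
pub enum PlanError {
    /// The user passed something invalid to the builder.
    UserInput(String),
    /// An op was about to be constructed in an invalid state (internal bug).
    InvalidState(String),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Filter {
    pub input_schema: Arc<Schema>,
    pub predicate: Expr,
}

impl Filter {
    /// Checks the op's requirements, then stores the values unchanged.
    pub fn try_new(input_schema: Arc<Schema>, predicate: Expr) -> Result<Self, PlanError> {
        match &predicate {
            Expr::Column(name) if input_schema.columns.contains(name) => {}
            Expr::Column(name) => {
                return Err(PlanError::InvalidState(format!("column {name} not in schema")))
            }
            Expr::Unresolved(raw) => {
                return Err(PlanError::InvalidState(format!("unresolved expression {raw}")))
            }
        }
        Ok(Self { input_schema, predicate })
    }

    /// Rebuilding the op on a new input goes through the same checks.
    pub fn with_new_input(&self, input_schema: Arc<Schema>) -> Result<Self, PlanError> {
        Self::try_new(input_schema, self.predicate.clone())
    }
}

/// Builder-side normalization: resolve user input, then call `try_new`.
pub fn build_filter(input_schema: Arc<Schema>, user_expr: Expr) -> Result<Filter, PlanError> {
    let resolved = match user_expr {
        Expr::Unresolved(name) if input_schema.columns.contains(&name) => Expr::Column(name),
        Expr::Unresolved(name) => {
            return Err(PlanError::UserInput(format!("unknown column {name}")))
        }
        already_resolved => already_resolved,
    };
    Filter::try_new(input_schema, resolved)
}
```

In this sketch, a bad column name typed by a user surfaces as `PlanError::UserInput`, whereas an optimization rule that hands `try_new` an unresolved expression gets `PlanError::InvalidState`, which keeps the two kinds of error apart.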

@github-actions github-actions bot added the chore label Jan 14, 2025
@kevinzwang kevinzwang changed the title chore: refactor logical op constructor+builder boundary refactor: logical op constructor+builder boundary Jan 15, 2025
@github-actions github-actions bot added refactor and removed chore labels Jan 15, 2025

codecov bot commented Jan 15, 2025

Codecov Report

Attention: Patch coverage is 98.34025% with 4 lines in your changes missing coverage. Please review.

Project coverage is 77.79%. Comparing base (feab49a) to head (9304075).
Report is 11 commits behind head on main.

Files with missing lines                  Patch %   Lines
src/daft-dsl/src/expr/mod.rs              60.00%    2 Missing ⚠️
src/daft-logical-plan/src/ops/filter.rs   85.71%    1 Missing ⚠️
src/daft-logical-plan/src/ops/join.rs     97.36%    1 Missing ⚠️

@@            Coverage Diff             @@
##             main    #3684      +/-   ##
==========================================
- Coverage   77.82%   77.79%   -0.03%     
==========================================
  Files         728      732       +4     
  Lines       89919    90457     +538     
==========================================
+ Hits        69975    70368     +393     
- Misses      19944    20089     +145     
Files with missing lines                                Coverage Δ
src/daft-dsl/src/lib.rs                                 100.00% <ø> (ø)
src/daft-dsl/src/python.rs                              91.07% <ø> (-0.03%) ⬇️
src/daft-logical-plan/src/builder/mod.rs                92.71% <100.00%> (ø)
src/daft-logical-plan/src/builder/resolve_expr.rs       89.20% <100.00%> (ø)
src/daft-logical-plan/src/builder/tests.rs              100.00% <ø> (ø)
src/daft-logical-plan/src/display.rs                    98.06% <100.00%> (ø)
src/daft-logical-plan/src/lib.rs                        100.00% <100.00%> (ø)
src/daft-logical-plan/src/logical_plan.rs               74.27% <100.00%> (+0.63%) ⬆️
...rc/daft-logical-plan/src/ops/actor_pool_project.rs   36.73% <100.00%> (-5.86%) ⬇️
src/daft-logical-plan/src/ops/agg.rs                    62.50% <100.00%> (-5.20%) ⬇️
... and 15 more

... and 40 files with indirect coverage changes

@kevinzwang
Member Author

Note to reviewers: Join has been pretty broken and I don't want to change that behavior in this PR. This will be directly followed by a PR to fix various join issues, so don't worry too much about any join bugs you find in this PR.

codspeed-hq bot commented Jan 15, 2025

CodSpeed Performance Report

Merging #3684 will degrade performances by 34.92%

Comparing kevin/logical-plan-builder-refactor (9304075) with main (f97902a)

Summary

⚡ 4 improvements
❌ 1 regressions
✅ 22 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark                                   main       kevin/logical-plan-builder-refactor   Change
test_count[1 Small File]                    3.7 ms     3.3 ms                                +11.75%
test_iter_rows_first_row[100 Small Files]   214.8 ms   186.2 ms                              +15.4%
test_show[100 Small Files]                  15.7 ms    24.1 ms                               -34.92%
test_tpch[1-in-memory-native-2]             106.6 ms   96.4 ms                               +10.6%
test_tpch_sql[1-in-memory-native-2]         226.9 ms   204.7 ms                              +10.87%
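
A note on reading the Change column: the figures look like they are computed as main / branch - 1, not as the relative increase in runtime. That formula is inferred from the numbers in the table and is an assumption, not something taken from CodSpeed's documentation. The small helper below (a hypothetical name, for illustration only) reproduces the reported values to within the rounding of the displayed timings.

```rust
/// Relative change in percent, assuming the report uses `main / head - 1`.
/// Positive means the branch is faster than main; negative means slower.
pub fn relative_change_pct(main_ms: f64, head_ms: f64) -> f64 {
    (main_ms / head_ms - 1.0) * 100.0
}

/// Same comparison expressed as the relative increase in runtime.
pub fn runtime_increase_pct(main_ms: f64, head_ms: f64) -> f64 {
    (head_ms / main_ms - 1.0) * 100.0
}
```

Under this reading, the test_show regression from 15.7 ms to 24.1 ms is roughly a 35% drop in speed, which corresponds to a runtime that is roughly 54% longer.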

Contributor

@universalmind303 universalmind303 left a comment

Overall looks good. I do think a builder pattern may be better suited for Join in the long term, though. Right now we have 8 arguments to the constructor.

Something like this I think would be a bit more intuitive

JoinBuilder::new(left, right)
  .left_on(left_on)
  .right_on(right_on)
  // could also add a `.on(join_keys)` instead of needing to always supply left and right keys. 
  .join_type(join_type)
  .join_suffix(suffix) // optional
  .join_prefix(prefix) // optional
  .rename_right_columns(true) // optional
  .keep_join_keys(true)  // optional
  .finish()
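
For readers who want to see what such a builder might look like behind that call chain, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Join` struct, its fields, the use of plain strings for the inputs, and the `String` error are hypothetical and much simpler than Daft's real types, and the `rename_right_columns` option is left out for brevity.

```rust
/// Hypothetical stand-ins for illustration; not Daft's actual types.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub enum JoinType {
    #[default]
    Inner,
    Left,
}

#[derive(Debug, Clone, PartialEq, Default)]
pub struct Join {
    pub left: String,
    pub right: String,
    pub left_on: Vec<String>,
    pub right_on: Vec<String>,
    pub join_type: JoinType,
    pub join_suffix: Option<String>,
    pub join_prefix: Option<String>,
    pub keep_join_keys: bool,
}

#[derive(Debug, Clone)]
pub struct JoinBuilder {
    join: Join,
}

impl JoinBuilder {
    /// Only the two inputs are required; everything else has a default.
    pub fn new(left: &str, right: &str) -> Self {
        let join = Join {
            left: left.to_string(),
            right: right.to_string(),
            ..Join::default()
        };
        Self { join }
    }

    pub fn left_on(mut self, keys: Vec<String>) -> Self {
        self.join.left_on = keys;
        self
    }

    pub fn right_on(mut self, keys: Vec<String>) -> Self {
        self.join.right_on = keys;
        self
    }

    /// Shorthand for joins where both sides use the same key names.
    pub fn on(mut self, keys: Vec<String>) -> Self {
        self.join.left_on = keys.clone();
        self.join.right_on = keys;
        self
    }

    pub fn join_type(mut self, join_type: JoinType) -> Self {
        self.join.join_type = join_type;
        self
    }

    pub fn join_suffix(mut self, suffix: &str) -> Self {
        self.join.join_suffix = Some(suffix.to_string());
        self
    }

    pub fn join_prefix(mut self, prefix: &str) -> Self {
        self.join.join_prefix = Some(prefix.to_string());
        self
    }

    pub fn keep_join_keys(mut self, keep: bool) -> Self {
        self.join.keep_join_keys = keep;
        self
    }

    /// Validates the accumulated state once, at the end.
    pub fn finish(self) -> Result<Join, String> {
        let (l, r) = (self.join.left_on.len(), self.join.right_on.len());
        if l == 0 || l != r {
            return Err(format!("expected matching, non-empty join keys, got {l} and {r}"));
        }
        Ok(self.join)
    }
}
```

In a design like this, `finish` would be the natural place to delegate to the op's validating constructor, so the builder adds ergonomics without becoming a second source of validation rules.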

@@ -188,11 +188,19 @@ impl LogicalPlanBuilder {
    }

    pub fn select(&self, to_select: Vec<ExprRef>) -> DaftResult<Self> {
        let expr_resolver = ExprResolver::builder().allow_actor_pool_udf(true).build();
Contributor

Out of curiosity, why does each method (or at least some of them) create its own expr_resolver? Maybe the builder could hold a single expr_resolver for its schema, so that resolve_single takes a single arg and closes over self.schema().

But there may be more to this I am not familiar with yet, just curious.

Member Author

Each logical op may resolve expressions differently. For example, Agg expects expressions to be aggregation expressions, whereas Project expects no aggregation expressions. Take a look at the parameters to the ExprResolver builder!
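
To make the point concrete, here is a heavily simplified sketch of a resolver whose behavior depends on a per-op setting. It is not Daft's `ExprResolver`: the `in_agg_context` field, the two-variant `Expr` enum, and the signature of `resolve_single` are assumptions chosen for the example. It only shows why one shared resolver would not fit both Agg and Project.

```rust
/// Hypothetical, minimal expression type for the example.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Sum(Box<Expr>),
}

impl Expr {
    fn is_agg(&self) -> bool {
        matches!(self, Expr::Sum(_))
    }

    fn column_name(&self) -> &str {
        match self {
            Expr::Column(name) => name.as_str(),
            Expr::Sum(inner) => inner.column_name(),
        }
    }
}

/// Hypothetical resolver configured per op.
#[derive(Debug, Clone, Copy, Default)]
pub struct ExprResolver {
    /// Assumed flag: true for an Agg-like op, false for a Project-like op.
    pub in_agg_context: bool,
}

impl ExprResolver {
    pub fn resolve_single(&self, expr: Expr, schema: &[&str]) -> Result<Expr, String> {
        // Every resolver checks that the referenced column exists.
        let name = expr.column_name();
        if !schema.contains(&name) {
            return Err(format!("column {name} not found in schema"));
        }
        // The aggregation rule is what differs between ops.
        match (self.in_agg_context, expr.is_agg()) {
            (true, false) => Err("expected an aggregation expression".to_string()),
            (false, true) => Err("aggregation expressions are not allowed here".to_string()),
            _ => Ok(expr),
        }
    }
}
```

The same expression, such as a sum over a column, is accepted by the resolver configured for aggregation and rejected by the default one, which is why each builder method constructs the resolver it needs.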

@rchowell
Contributor

Some general comments as I familiarize myself with this.

expression resolution should only be done for the builder

Agreed. A rewrite/transform f on the logical domain L maps L back into L (f(L) -> L), so it shouldn't create unresolved expressions (afaik): once resolved to L, you don't leave L.

Every op should provide a try_new constructor which contains explicit checks for all the requirements about the op's state

Do you anticipate every logical operator having both a try_new and a builder? The builders may need to perform checks incrementally as well as on the final .build().

@kevinzwang
Copy link
Member Author

kevinzwang commented Jan 15, 2025

@universalmind303 @rchowell I'm open to having individual builders for logical ops, but I'm not sure yet how that would fit into our current structure.

We already have a LogicalPlanBuilder, which is the interface that our external APIs use to construct logical plans. It would not make much sense to create a builder abstraction for each op but have, say, our dataframe API still create joins by passing in all the arguments to builder.join(...) -- we would probably want to expose the op builder directly somehow.

We should probably think about this on a case-by-case basis. Most ops are pretty basic and do not require builders. Moreover, I'm not super happy about the current LogicalPlanBuilder abstraction either: it tries to work for both dataframe and SQL but is becoming a little unwieldy.

Member Author

@kevinzwang kevinzwang left a comment

Did a bit of cleanup along with this PR:

  1. Moved resolve_expr inside the builder, because we don't want anything other than the builder to use it anymore.
  2. Removed various .context(CreationSnafu)? calls. When using ?, a DaftResult error is already converted to logical_plan::Result automatically, so the explicit context is not needed.
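
The second cleanup relies on how the `?` operator works in Rust: when the function's error type implements `From` for the error being propagated, `?` performs the conversion itself. The sketch below shows the mechanism with std only. The type names mirror those mentioned above, but their definitions here are simplified assumptions, not the actual Daft or snafu-generated code.

```rust
/// Stand-in for the crate-wide error type.
#[derive(Debug, PartialEq)]
pub struct DaftError(pub String);

pub type DaftResult<T> = Result<T, DaftError>;

/// Stand-in for the logical-plan error type, with a variant that wraps a DaftError.
#[derive(Debug, PartialEq)]
pub enum PlanError {
    Creation { source: DaftError },
}

pub type PlanResult<T> = Result<T, PlanError>;

/// This impl is what lets `?` convert a DaftError into a PlanError implicitly.
impl From<DaftError> for PlanError {
    fn from(source: DaftError) -> Self {
        PlanError::Creation { source }
    }
}

/// Hypothetical validation step that reports errors as DaftError.
fn validate(n_cols: usize) -> DaftResult<usize> {
    if n_cols == 0 {
        return Err(DaftError("op needs at least one column".to_string()));
    }
    Ok(n_cols)
}

/// Returns the logical-plan result type; `?` calls `PlanError::from` on failure,
/// so no explicit `.context(...)` step is required.
pub fn make_op(n_cols: usize) -> PlanResult<usize> {
    let checked = validate(n_cols)?;
    Ok(checked)
}
```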

@kevinzwang kevinzwang merged commit 3720c2a into main Jan 16, 2025
42 of 43 checks passed
@kevinzwang kevinzwang deleted the kevin/logical-plan-builder-refactor branch January 16, 2025 19:45

3 participants