Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(optimizer): convert filter predicate to CNF to push through join #3623

Merged
merged 3 commits into from
Dec 19, 2024

Conversation

kevinzwang
Copy link
Member

By converting the predicate to Conjunctive Normal Form, we are able to split the predicate into expressions that are isolated to the schemas of an individual side of a join. Then, we can split by conjunctions and push those new expressions through the joins.

For example, consider the schemas:

left = [a: Utf8]
right = [b: Utf8]

If our plan was

Filter
|  by: (col(a) = "FRANCE" & col("b") = "GERMANY") | (col(a) = "GERMANY" & col("b") = "FRANCE")
|
Join
| \
|  Scan(right)
Scan(left)

That filter predicate in CNF is

(col(a) = "FRANCE" | col(a) = "GERMANY") 
  & (col(a) = "FRANCE" | col(b) = "FRANCE") 
  & (col(b) = "GERMANY" | col(a) = "GERMANY") 
  & (col(b) = "GERMANY" | col(b) = "FRANCE")

The first and last expressions in this conjunction only have columns from a single side of the join, so can be pushed down, resulting in this optimized plan:

Filter
|  by: (col(a) = "FRANCE" | col(b) = "FRANCE") & (col(b) = "GERMANY" | col(a) = "GERMANY") 
|
Join
| \
|  Scan(right)
|    pushdowns: Filter { (col(b) = "GERMANY" | col(b) = "FRANCE") }
Scan(left)
|. pushdowns: Filter { (col(a) = "FRANCE" | col(a) = "GERMANY") }

@@ -273,7 +273,8 @@ impl PushDownFilter {
let left_cols = HashSet::<_>::from_iter(child_join.left.schema().names());
let right_cols = HashSet::<_>::from_iter(child_join.right.schema().names());

for predicate in split_conjunction(&filter.predicate) {
// TODO: simplify predicates, since they may be expanded with `to_cnf`
for predicate in split_conjunction(&to_cnf(filter.predicate.clone())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this only applicable to predicates? I'm wondering if this should be part of the SimplifyExprs optimizer pass?

Copy link
Member Author

@kevinzwang kevinzwang Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be a general simplification yes. Just put the comment here because this part specifically expands the predicate

Copy link

codspeed-hq bot commented Dec 19, 2024

CodSpeed Performance Report

Merging #3623 will degrade performances by 33.13%

Comparing kevin/better-filter-join (7febe13) with main (07f6b2c)

Summary

⚡ 2 improvements
❌ 1 regressions
✅ 24 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main kevin/better-filter-join Change
test_iter_rows_first_row[100 Small Files] 244 ms 207.4 ms +17.66%
test_show[100 Small Files] 15.9 ms 23.8 ms -33.13%
test_tpch_sql[1-in-memory-native-7] 1,183.3 ms 778.9 ms +51.91%

Copy link

codecov bot commented Dec 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.55%. Comparing base (07f6b2c) to head (7febe13).
Report is 1 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3623      +/-   ##
==========================================
- Coverage   77.84%   77.55%   -0.29%     
==========================================
  Files         719      719              
  Lines       88280    88958     +678     
==========================================
+ Hits        68720    68992     +272     
- Misses      19560    19966     +406     
Files with missing lines Coverage Δ
src/daft-algebra/src/boolean.rs 100.00% <100.00%> (ø)
...al-plan/src/optimization/rules/push_down_filter.rs 97.32% <100.00%> (+1.68%) ⬆️

... and 20 files with indirect coverage changes

@andrewgazelka
Copy link
Contributor

andrewgazelka commented Dec 19, 2024

are we worried about expression explosion? like if we have 10 expressions on LHS an 10 on RHS I'd imagine we would sometimes get 100 = $O(n^2)$ expressions.

src/daft-algebra/src/boolean.rs Show resolved Hide resolved
use crate::boolean::{to_cnf, to_dnf};

#[test]
fn dnf_simple() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have more tests with negations?

assert_eq!(expected, to_dnf(expr));
}

#[test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in future I suppose could do some rust proptest stuff here to verify they are correct but not needed for this issue imo

@kevinzwang
Copy link
Member Author

are we worried about expression explosion? like if we have 10 expressions on LHS an 10 on RHS I'd imagine we would sometimes get 100 = O ( n 2 ) expressions.

I do want to fix this at some point but most filter predicates should not balloon to significantly large sizes. Will work on more expression simplification stuff in the future to address this

@kevinzwang kevinzwang enabled auto-merge (squash) December 19, 2024 21:45
@kevinzwang kevinzwang merged commit 28d3fa7 into main Dec 19, 2024
40 of 41 checks passed
@kevinzwang kevinzwang deleted the kevin/better-filter-join branch December 19, 2024 22:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants