perf: Track accumulated selectivity in logical plan to improve probe side decisions #3734
Conversation
CodSpeed Performance Report
Merging #3734 will degrade performances by 18.74%.
Benchmarks breakdown
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3734 +/- ##
==========================================
+ Coverage 77.58% 77.65% +0.06%
==========================================
Files 729 731 +2
Lines 92010 92329 +319
==========================================
+ Hits 71389 71697 +308
- Misses 20621 20632 +11
// String contains
Expr::ScalarFunction(ScalarFunction { udf, .. }) if udf.name() == "contains" => 0.1,
@colin-ho Btw I almost forgot that I had put this in here while I was prototyping. We can leave it in or remove it - I don't think it makes too much of a difference either way.
All good
- #[derive(Clone, Debug, PartialEq, Eq, Deserialize, Serialize)]
+ #[derive(Clone, Debug, PartialEq, Deserialize, Serialize)]
f64 does not impl Eq because of NaNs, but it seems like we didn't need Eq for ApproxStats anyway ¯\_(ツ)_/¯
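As a quick standalone illustration of why `f64` can only be `PartialEq`: `Eq` requires reflexivity (every value equals itself), and IEEE 754 NaN breaks exactly that.

```rust
// Eq requires reflexivity (x == x for every x), but IEEE 754 floats
// violate it: NaN compares unequal to everything, including itself.
fn main() {
    let nan = f64::NAN;
    assert!(nan != nan); // reflexivity fails, so f64 cannot implement Eq
    assert_eq!(nan.partial_cmp(&nan), None); // and there is no total order
}
```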
ScanState::Tasks(scan_tasks) => {
    let mut approx_stats = ApproxStats::empty();
    let mut prefiltered_num_rows = 0.0;
    for st in scan_tasks.iter() {
        if let Some(num_rows) = st.num_rows() {
            approx_stats.num_rows += num_rows;
            prefiltered_num_rows += num_rows as f64
                / st.pushdowns().estimated_selectivity(st.schema().as_ref());
        } else if let Some(approx_num_rows) = st.approx_num_rows(None) {
            approx_stats.num_rows += approx_num_rows as usize;
            prefiltered_num_rows += approx_num_rows
                / st.pushdowns().estimated_selectivity(st.schema().as_ref());
        }
        approx_stats.size_bytes +=
            st.estimate_in_memory_size_bytes(None).unwrap_or(0);
    }
    approx_stats.acc_selectivity =
        approx_stats.num_rows as f64 / prefiltered_num_rows;
    approx_stats
}
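The accumulation can be sketched in isolation (the `(rows, selectivity)` pairs below are hypothetical stand-ins for scan tasks, not the actual `ScanTask` API): dividing each task's post-pushdown row count by its pushdown selectivity recovers an estimate of the pre-filter count, and the ratio of the two totals is the accumulated selectivity.

```rust
// Sketch: each (post_pushdown_rows, pushdown_selectivity) pair stands in
// for a scan task. rows / selectivity estimates the pre-filter row count;
// the ratio of the totals is the accumulated selectivity.
fn accumulated_selectivity(tasks: &[(usize, f64)]) -> f64 {
    let mut num_rows = 0usize;
    let mut prefiltered_num_rows = 0.0f64;
    for &(rows, selectivity) in tasks {
        num_rows += rows;
        prefiltered_num_rows += rows as f64 / selectivity;
    }
    num_rows as f64 / prefiltered_num_rows
}

fn main() {
    // Two tasks that each started from ~100 rows: 140 of ~200 rows survive.
    let sel = accumulated_selectivity(&[(50, 0.5), (90, 0.9)]);
    assert!((sel - 0.7).abs() < 1e-9);
}
```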
I believe `physical_scan_info` has the pushdowns, and they should be the same as the pushdowns in all the scan tasks. Given this, is `physical_scan_info.pushdowns.estimated_selectivity()` essentially the same as `acc_selectivity`, meaning we don't need to do the `prefiltered_num_rows` calculations?
Sounds good, changed this to `physical_scan_info.pushdowns.estimated_selectivity(self.output_schema.as_ref())`.
src/daft-logical-plan/src/ops/agg.rs (Outdated)
@@ -58,13 +58,17 @@ impl Aggregate {
      ApproxStats {
          num_rows: 1,
          size_bytes: est_bytes_per_row,
+         acc_selectivity: input_stats.approx_stats.acc_selectivity
+             / input_stats.approx_stats.num_rows as f64,
Wonder what the behaviour is if `input_stats.approx_stats.num_rows` is 0. I guess this will become NaN? Which might be dangerous if it is propagated somewhere that casts it to usize. Let's add some safety checks for divide by 0?
Sounds good, added checks. Tempted to add `unlikely` hints to the zero `num_rows` case, but it doesn't seem like we do this in the rest of the code base, so I'm holding off on it.
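The hazard the review points at can be shown in isolation (a minimal sketch, not the actual `ApproxStats` code): when both operands are zero the division produces NaN, and Rust's saturating float-to-int casts then silently turn NaN into 0 (and +inf into `usize::MAX`).

```rust
// Minimal sketch of the divide-by-zero hazard discussed above (not the
// actual ApproxStats code). 0.0 / 0.0 in f64 is NaN, and `as usize` on
// NaN saturates to 0 silently rather than panicking.
fn main() {
    let acc_selectivity = 0.0_f64; // an empty input can drive this to 0 too
    let num_rows = 0_usize;
    let scaled = acc_selectivity / num_rows as f64; // 0.0 / 0.0
    assert!(scaled.is_nan());
    assert_eq!(scaled as usize, 0); // NaN saturates to 0 on cast
    assert_eq!((1.0_f64 / 0.0) as usize, usize::MAX); // +inf saturates up
}
```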
Today, the cardinality of a join is estimated by taking the max of the cardinality of the two sides because we assume primary key + foreign key joins (pk-fk joins).
This is fine in the case where there is no filtering on the left or right side of the join. However, when there is filtering, we overestimate the cardinality of the join.
What's happening is that filtering one side of a pk-fk join should reduce the cardinality of the join by the same proportion. However, in the case where the unfiltered side has a larger cardinality, we currently take its cardinality as the cardinality of the join, meaning that the cardinality of the join is not scaled down by the proportion of the filtered side.
This PR fixes this by tracking the accumulated selectivity of all logical operations. When it comes time to estimate the cardinality of a join, we scale down the estimated cardinality of the join by the accumulated selectivity of the unchosen side (we don't need to scale down by the selectivity of the chosen side because the cardinality has already been scaled down prior to the join).
For a more concrete example, consider an inner join where the left side is a scan of 10 rows followed by a filter with selectivity 0.5, and the right side is a scan of 10000 rows followed by a filter with selectivity 0.9.
For this inner join, the left side has a cardinality of 5 (10 * 0.5) and the right side has a cardinality of 9000 (10000 * 0.9). Assuming a pk-fk join, and ignoring filters, we expect the cardinality of the join to be 10000 (taking the max of the two sides, with the larger side assumed to be the foreign key side).
However, if we take the filters into account, 50% of the primary keys are filtered out and 90% of the foreign keys are filtered out. Assuming that the filters are independent, we expect the cardinality of the join to be 10000 * 0.5 * 0.9 = 4500. To make this estimate, we take the larger cardinality of the two sides (9000) and scale it down by the selectivity of the smaller side (0.5) which gives us the expected cardinality of the join (4500).
This is an approximation, and it might well be the case that the selectivities are not independent. This could be handled in the future by using histograms, or we could also penalize multiplication of the selectivity by some tuned penalty.
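The heuristic above can be sketched as follows (function and parameter names are hypothetical, not the PR's actual API):

```rust
// Hypothetical sketch of the pk-fk join cardinality heuristic described
// above (names are made up, not the PR's actual API). The larger side is
// assumed to be the foreign-key side; its cardinality is scaled by the
// accumulated selectivity of the *other* (unchosen) side, since the chosen
// side's cardinality already reflects its own filters.
fn estimate_pk_fk_join_cardinality(
    left_rows: f64,
    left_acc_selectivity: f64,
    right_rows: f64,
    right_acc_selectivity: f64,
) -> f64 {
    if left_rows >= right_rows {
        left_rows * right_acc_selectivity
    } else {
        right_rows * left_acc_selectivity
    }
}

fn main() {
    // The example from the description: 5 vs 9000 rows after filters of
    // selectivity 0.5 and 0.9 respectively.
    let est = estimate_pk_fk_join_cardinality(5.0, 0.5, 9000.0, 0.9);
    assert_eq!(est, 4500.0); // 9000 * 0.5
}
```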
Results with TPCH
These changes speed up TPC-H queries 8, 9, and 10, which were suffering from incorrect probe side decisions.
Q5 saw some regressions because the cardinalities for two sides were similar enough that these changes caused the probe side to flip incorrectly. However, we currently have an incorrect join order for Q5, so once join reordering is turned on this may no longer be an issue.
Other queries were not affected.