Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: Improve stats for join side determination #3655

Merged
merged 9 commits into from
Jan 10, 2025

Conversation

colin-ho
Copy link
Contributor

@colin-ho colin-ho commented Jan 9, 2025

This PR updates swordfish join side determination logic to compare num rows instead of upper bound size bytes.

Details:

  • Use a fixed num_rows and size_bytes in the ApproxStats instead of lower / upper bounds.
  • Instead of having a fixed 20% selectivity for filters, take into account the complexity of the filter expressions into the selectivity. E.g. ANDS will be more selective than ORS, and IS_NULL will generally be less selective than comparisons or equalities. (This is useful because all of our joins have null filter pushdowns, but it tends to be the case that the side with the more complex filters will be the better side for the hash table, and having a fixed 20% selectivity will miss out on this)

Results on TPCH SF10:

Query Original (ms) Latest (ms) Change (%)
Q1 493.39 500.53 +1.45%
Q2 158.53 149.09 -5.95%
Q3 499.76 490.63 -1.83%
Q4 2527.00 269.22 -89.34%
Q5 757.68 721.30 -4.80%
Q6 151.95 156.64 +3.09%
Q7 471.33 458.81 -2.66%
Q8 568.80 1519.60 +167.16%
Q9 3644.70 3572.20 -1.99%
Q10 752.60 722.10 -4.05%
Q11 238.79 223.87 -6.25%
Q12 2676.30 320.82 -88.01%
Q13 979.72 962.65 -1.74%
Q14 510.17 504.23 -1.16%
Q15 480.39 468.77 -2.42%
Q16 183.70 188.36 +2.54%
Q17 392.50 375.32 -4.38%
Q18 7706.50 855.50 -88.90%
Q19 955.58 977.44 +2.29%
Q20 458.89 1191.30 +159.61%
Q21 10236.90 9616.3 -6.06%
Q22 2188.00 186.51 -91.47%

Total time:

  • Before: 36.03s
  • After: 24.36s

Notes:

  • Q8 and Q20 now have regressions. This is because cardinality estimation for joins is not accurate (it assumes primary key / foreign key join), leading to wrong probe side decisions for subsequent joins.

@github-actions github-actions bot added the feat label Jan 9, 2025
Copy link

codspeed-hq bot commented Jan 9, 2025

CodSpeed Performance Report

Merging #3655 will degrade performances by 34.36%

Comparing colin/use-num-rows-for-join-side (6aa9444) with main (bb85070)

Summary

❌ 1 regressions
✅ 26 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main colin/use-num-rows-for-join-side Change
test_show[100 Small Files] 16 ms 24.3 ms -34.36%

Copy link

codecov bot commented Jan 9, 2025

Codecov Report

Attention: Patch coverage is 79.54545% with 36 lines in your changes missing coverage. Please review.

Project coverage is 78.06%. Comparing base (bb85070) to head (6aa9444).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-physical-plan/src/plan.rs 65.78% 13 Missing ⚠️
src/daft-local-execution/src/pipeline.rs 56.52% 10 Missing ⚠️
src/daft-dsl/src/expr/mod.rs 71.87% 9 Missing ⚠️
src/daft-logical-plan/src/stats.rs 75.00% 2 Missing ⚠️
src/common/scan-info/src/pushdowns.rs 80.00% 1 Missing ⚠️
...daft-physical-plan/src/physical_planner/planner.rs 0.00% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #3655   +/-   ##
=======================================
  Coverage   78.06%   78.06%           
=======================================
  Files         728      728           
  Lines       90049    89967   -82     
=======================================
- Hits        70297    70237   -60     
+ Misses      19752    19730   -22     
Files with missing lines Coverage Δ
src/daft-dsl/src/lib.rs 100.00% <ø> (ø)
src/daft-logical-plan/src/ops/agg.rs 67.69% <100.00%> (-6.06%) ⬇️
src/daft-logical-plan/src/ops/distinct.rs 96.29% <100.00%> (-0.14%) ⬇️
src/daft-logical-plan/src/ops/explode.rs 80.32% <100.00%> (-0.32%) ⬇️
src/daft-logical-plan/src/ops/filter.rs 84.61% <100.00%> (-0.39%) ⬇️
src/daft-logical-plan/src/ops/join.rs 91.53% <100.00%> (-0.07%) ⬇️
src/daft-logical-plan/src/ops/limit.rs 96.42% <100.00%> (-1.35%) ⬇️
src/daft-logical-plan/src/ops/source.rs 72.41% <100.00%> (-3.35%) ⬇️
src/daft-logical-plan/src/ops/unpivot.rs 78.26% <100.00%> (-1.13%) ⬇️
src/daft-physical-plan/src/ops/filter.rs 68.75% <100.00%> (+31.25%) ⬆️
... and 9 more

... and 3 files with indirect coverage changes

@colin-ho colin-ho marked this pull request as ready for review January 9, 2025 19:18
@colin-ho colin-ho requested review from samster25 and desmondcheongzx and removed request for samster25 January 9, 2025 19:18
Copy link
Contributor

@desmondcheongzx desmondcheongzx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, nice results :) To fix the join size estimations, number of distinct values (NDV) stats might come in handy. If ndv(l) = num_rows(l) for the smaller side, then we know it's a primary key, and if not it's not a primary key join.

One more nit: this should be perf rather than feat imo

src/daft-logical-plan/src/ops/limit.rs Outdated Show resolved Hide resolved
src/daft-local-execution/src/pipeline.rs Outdated Show resolved Hide resolved
@colin-ho colin-ho changed the title feat: Improve stats for join side determination perf: Improve stats for join side determination Jan 10, 2025
@github-actions github-actions bot added perf and removed feat labels Jan 10, 2025
@colin-ho colin-ho merged commit c932ec9 into main Jan 10, 2025
42 of 43 checks passed
@colin-ho colin-ho deleted the colin/use-num-rows-for-join-side branch January 10, 2025 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants