perf: Improve stats for join side determination #3655
CodSpeed Performance Report
Merging #3655 will degrade performances by 34.36%
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3655   +/-   ##
=======================================
  Coverage   78.06%   78.06%
=======================================
  Files         728      728
  Lines       90049    89967    -82
=======================================
- Hits        70297    70237    -60
+ Misses      19752    19730    -22
Again, nice results :) To fix the join size estimations, number of distinct values (NDV) stats might come in handy. If ndv(l) = num_rows(l) for the smaller side, then we know it's a primary key, and if not it's not a primary key join.
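A minimal sketch of the NDV check suggested above, for illustration only. `ColumnStats`, `looks_like_primary_key`, and `estimate_join_rows` are hypothetical names, not existing types or functions in this codebase, and the non-primary-key fallback is the textbook `|L| * |R| / max(ndv)` estimate, which is an assumption rather than something proposed in this thread.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    # Hypothetical stats container for a join key column.
    num_rows: int
    ndv: int  # estimated number of distinct values in the join key


def looks_like_primary_key(stats: ColumnStats) -> bool:
    # If every row has a distinct key, the column behaves like a primary key.
    return stats.num_rows > 0 and stats.ndv >= stats.num_rows


def estimate_join_rows(left: ColumnStats, right: ColumnStats) -> int:
    smaller, larger = (left, right) if left.num_rows <= right.num_rows else (right, left)
    if looks_like_primary_key(smaller):
        # Primary key join: each row on the larger side matches at most one row.
        return larger.num_rows
    # Assumed fallback (textbook estimate), not taken from this PR.
    return left.num_rows * right.num_rows // max(left.ndv, right.ndv, 1)
```

With these assumptions, a 100-row side with 100 distinct keys joined against 1000 rows is estimated at 1000 output rows, while the same join with only 50 distinct keys on each side is estimated at 2000.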
One more nit: this should be `perf` rather than `feat` imo
This PR updates swordfish join side determination logic to compare num rows instead of upper bound size bytes.
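As a rough illustration of the change (a sketch, not the actual swordfish code), the comparison could look like the following. The `ApproxStats` fields `num_rows` and `size_bytes` are named in this description; `JoinSide`, `choose_build_side`, and the tie-breaking rule are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class JoinSide(Enum):
    LEFT = "left"
    RIGHT = "right"


@dataclass
class ApproxStats:
    # Simplified stand-in for the stats struct; only the two fields named above.
    num_rows: int
    size_bytes: int


def choose_build_side(left: ApproxStats, right: ApproxStats) -> JoinSide:
    # Build the hash table on the side with fewer estimated rows.
    # Ties go to the left side here; that choice is arbitrary in this sketch.
    return JoinSide.LEFT if left.num_rows <= right.num_rows else JoinSide.RIGHT
```

In this sketch, a side with few but wide rows is still picked for the hash table, which a comparison on upper-bound size in bytes could get wrong.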
Details:
- `num_rows` and `size_bytes` in the `ApproxStats` instead of lower / upper bounds.
- `ANDS` will be more selective than `ORS`, and `IS_NULL` will generally be less selective than comparisons or equalities. (This is useful because all of our joins have null filter pushdowns, but it tends to be the case that the side with the more complex filters will be the better side for the hash table, and a fixed 20% selectivity would miss this.)

Results on TPCH SF10:
Total time:
Notes:
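For reference, the selectivity heuristic from the Details list can be sketched as a recursive estimate over a filter expression. This is an illustration under stated assumptions, not the code in this PR: the expression classes are hypothetical, the constants 0.2 and 0.9 are placeholder values (0.2 echoes the previous fixed 20%), and combining `AND` by multiplication and `OR` by inclusion-exclusion assumes the predicates are independent.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Compare:
    column: str  # stands for a comparison or equality on this column


@dataclass
class IsNull:
    column: str  # stands for a null-check filter on this column


@dataclass
class And:
    left: "Predicate"
    right: "Predicate"


@dataclass
class Or:
    left: "Predicate"
    right: "Predicate"


Predicate = Union[Compare, IsNull, And, Or]

# Assumed fractions of rows that pass each leaf predicate (placeholders).
COMPARE_SELECTIVITY = 0.2
IS_NULL_SELECTIVITY = 0.9  # treated as less selective, per the description


def estimate_selectivity(pred: Predicate) -> float:
    if isinstance(pred, Compare):
        return COMPARE_SELECTIVITY
    if isinstance(pred, IsNull):
        return IS_NULL_SELECTIVITY
    if isinstance(pred, And):
        # Both sides must pass, so AND keeps fewer rows than either side.
        return estimate_selectivity(pred.left) * estimate_selectivity(pred.right)
    if isinstance(pred, Or):
        # Either side may pass, so OR keeps at least as many rows as either side.
        s1, s2 = estimate_selectivity(pred.left), estimate_selectivity(pred.right)
        return s1 + s2 - s1 * s2
    raise TypeError(f"unsupported predicate: {pred!r}")


def estimate_num_rows(num_rows: int, pred: Predicate) -> int:
    return round(num_rows * estimate_selectivity(pred))
```

Under these placeholder values, a side filtered by a null check and a comparison is estimated at far fewer rows than a side filtered by a null check alone, so the side with the more complex filter is favored for the hash table.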