feat: Add IEJoin algorithm for non-equi joins and support Full non-equi joins #18365
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #18365 +/- ##
==========================================
+ Coverage 79.82% 79.91% +0.08%
==========================================
Files 1501 1505 +4
Lines 201832 202674 +842
Branches 2869 2873 +4
==========================================
+ Hits 161110 161963 +853
+ Misses 40176 40163 -13
- Partials 546 548 +2
@adamreeve you need to change the title of this so that "add" is capitalized; then the "pull request labeler" check will pass. They require that because the version change summary scrapes the PR titles, so it looks bad to have some bullet points in lower case and some capitalized.
Thanks a lot @adamreeve. I will pick this up. What I really want to add first is the parallel groups, and then I need to think about the interface for a bit. Ideally, I'd like this to work with expressions in the default join.
OK great, thanks Ritchie. Feel free to reach out on Discord or here if I can help with anything further. Yeah, if this could work with expressions in the default join method, that would be a lot nicer than the approach I've demonstrated here.
CodSpeed Performance Report: Merging #18365 will not alter performance.
Thanks a lot @adamreeve. This turned out great with full non-equi join support.
Relevant issues:
This adds a new `IEJoin` join type to the Rust library, based on the paper by Khayyat et al. and the DuckDB article on range joins. It handles joins that use two inequality expressions.
How to represent this in the user-facing Python API is still an open question, so I'm opening this as a draft PR to hopefully start some discussion on that.
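To make "joins that use two inequality expressions" concrete, here is a minimal reference sketch in plain Python. It is a naive nested-loop join, not the IEJoin algorithm itself and not code from this PR; it only shows which row pairs such a join is expected to return. The function name `naive_inequality_join`, the `(left_key, op, right_key)` condition tuples, and the sample rows are all assumptions made for illustration.

```python
import operator

# The four comparison operators an inequality condition may use.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def naive_inequality_join(left, right, cond1, cond2):
    """Return every (left_row, right_row) pair satisfying both conditions.

    Each condition is a hypothetical (left_key, op, right_key) tuple.
    This is an O(n * m) reference, useful only for checking results.
    """
    l1, op1, r1 = cond1
    l2, op2, r2 = cond2
    f1, f2 = OPS[op1], OPS[op2]
    return [
        (l, r)
        for l in left
        for r in right
        if f1(l[l1], r[r1]) and f2(l[l2], r[r2])
    ]


# Illustrative data: intervals on the left, point values on the right.
intervals = [
    {"id": 1, "start": 0, "end": 10},
    {"id": 2, "start": 5, "end": 8},
]
points = [{"rid": "a", "t": 3}, {"rid": "b", "t": 6}, {"rid": "c", "t": 10}]

# Match points that fall in [start, end): start <= t AND end > t.
pairs = naive_inequality_join(
    intervals, points, ("start", "<=", "t"), ("end", ">", "t")
)
matched = [(l["id"], r["rid"]) for l, r in pairs]
print(matched)
```

A quadratic reference like this is mainly useful as a correctness oracle when testing a faster implementation against small inputs.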
I've initially exposed the IEJoin type fairly directly with a new `inequality_join` method that expects two inequality expressions in the form of `lhs_expression op rhs_expression`, where `op` is one of `<`, `<=`, `>` or `>=`. This might be a bit too complicated and error-prone though, so I've also added a `join_between` method as an example higher-level method that matches a value in one table with a range/interval from another table. We'd probably also want to add at least a couple of other methods to cover the subtypes listed in #10068: one method for matching on overlapping ranges and another method to match on values with some tolerance, maybe named `join_range` and `band_join` for example.
I'm not sure whether the current `inequality_join` method should be exposed as-is. It can express more generic join conditions, but it might be best to keep this private and aim to eventually add a more flexible API that can handle arbitrary join expressions, e.g. something like what's suggested in #4207 or #10068 (comment).
There are also some additional features and improvements that could be added but I think should be separate PRs:
`by` parameters of `join_asof`
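As a side note on the higher-level joins discussed above (value in an interval, overlapping ranges, and matching with a tolerance), each one can be written as a pair of `lhs op rhs` inequalities, which is why a two-inequality join can serve all three. The sketch below spells out those predicates in plain Python. The function names are hypothetical and are not part of this PR or of the library, and the choice of inclusive versus exclusive bounds is an assumption.

```python
def between(value, lower, upper):
    """Value-in-interval match, assuming a half-open interval [lower, upper)."""
    # Two inequalities: lower <= value AND upper > value.
    return lower <= value and upper > value


def ranges_overlap(a_start, a_end, b_start, b_end):
    """Overlap test for two half-open ranges [a_start, a_end) and [b_start, b_end)."""
    # Two inequalities: a_start < b_end AND a_end > b_start.
    return a_start < b_end and a_end > b_start


def within_band(x, y, tolerance):
    """Tolerance (band) match: |x - y| <= tolerance, with inclusive bounds assumed."""
    # Two inequalities with expressions on the left-hand side:
    # x - tolerance <= y AND x + tolerance >= y.
    return x - tolerance <= y and x + tolerance >= y
```

The band case is the one that needs expressions rather than bare columns on one side of each inequality.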