
feat: Add IEJoin algorithm for non-equi joins and support Full non-equi joins #18365

Merged
merged 39 commits into from
Sep 7, 2024


adamreeve
Contributor

Relevant issues:

This adds a new IEJoin join type to the Rust library, based on the paper by Khayyat et al. and the DuckDB article on range joins. It handles joins that use two inequality expressions.

How to represent this in the user-facing Python API is still an open question, so I'm opening this as a draft PR to hopefully start some discussion on that.

I've initially exposed the IEJoin type fairly directly with a new inequality_join method that expects two inequality expressions of the form lhs_expression op rhs_expression, where op is one of <, <=, > or >=. This might be too complicated and error-prone, though, so I've also added a join_between method as an example of a higher-level method that matches a value in one table with a range/interval from another table. We'd probably also want to add at least a couple of other methods to cover the subtypes listed in #10068: one for matching on overlapping ranges and another for matching on values within some tolerance, maybe named join_range and band_join, for example.
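To make the intended semantics concrete, here is a minimal pure-Python sketch of what a join on two inequality conditions returns, and how a "between" join reduces to it. This is a naive nested-loop reference, not the IEJoin algorithm and not the Polars API: the function names inequality_join_reference and join_between_reference, the (left_column, op, right_column) tuple form, and the half-open interval convention are all assumptions made for illustration.

```python
import operator
from typing import Callable

# Hypothetical helpers for illustration only; none of this is Polars API.
OPS: dict[str, Callable] = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def inequality_join_reference(left, right, cond1, cond2):
    """Keep every (left, right) row pair that satisfies both conditions.

    Each condition is a (left_column, op, right_column) tuple with op in OPS.
    This is an O(n * m) nested loop, used only to define the expected result.
    """
    out = []
    for l_row in left:
        for r_row in right:
            if all(OPS[op](l_row[lc], r_row[rc]) for lc, op, rc in (cond1, cond2)):
                out.append({**l_row, **r_row})
    return out


def join_between_reference(left, right, value, lower, upper):
    """Match left[value] against [right[lower], right[upper]).

    The half-open interval is an assumption; other bound choices are possible.
    """
    return inequality_join_reference(
        left, right, (value, ">=", lower), (value, "<", upper)
    )


events = [{"event": "a", "t": 1}, {"event": "b", "t": 5}, {"event": "c", "t": 9}]
windows = [
    {"window": "w1", "start": 0, "end": 5},
    {"window": "w2", "start": 5, "end": 8},
]

matched = join_between_reference(events, windows, "t", "start", "end")
print([(m["event"], m["window"]) for m in matched])
# → [('a', 'w1'), ('b', 'w2')]
```

Event c falls outside both windows, so an inner join drops it. A sort-based algorithm such as IEJoin aims to produce the same pairs without comparing every row against every other row.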

I'm not sure whether the current inequality_join method should be exposed as-is. It can express more generic join conditions, but it might be best to keep it private and aim to eventually add a more flexible API that can handle arbitrary join expressions, e.g. something like what's suggested in #4207 or #10068 (comment).

There are also some additional features and improvements that could be added, but I think these should be separate PRs:

  • Implement multi-threading by breaking up each side of the join into blocks (section 5 from Khayyat et al.)
  • Allow additional equality join conditions like the by parameters of join_asof
  • Support streaming mode

@adamreeve adamreeve changed the title Add IEJoin algorithm for non-equi joins feat: add IEJoin algorithm for non-equi joins Aug 26, 2024

codecov bot commented Aug 26, 2024

Codecov Report

Attention: Patch coverage is 94.63869% with 46 lines in your changes missing coverage. Please review.

Project coverage is 79.91%. Comparing base (9a256a3) to head (c54322c).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/polars-ops/src/frame/join/iejoin/mod.rs | 93.30% | 18 Missing ⚠️ |
| crates/polars-plan/src/plans/conversion/join.rs | 93.54% | 16 Missing ⚠️ |
| py-polars/polars/dataframe/frame.py | 50.00% | 2 Missing and 1 partial ⚠️ |
| py-polars/polars/lazyframe/frame.py | 57.14% | 2 Missing and 1 partial ⚠️ |
| crates/polars-ops/src/frame/join/mod.rs | 90.00% | 2 Missing ⚠️ |
| crates/polars-utils/src/binary_search.rs | 94.87% | 2 Missing ⚠️ |
| crates/polars-lazy/src/frame/mod.rs | 97.14% | 1 Missing ⚠️ |
| ...rates/polars-python/src/lazyframe/visitor/nodes.rs | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18365      +/-   ##
==========================================
+ Coverage   79.82%   79.91%   +0.08%     
==========================================
  Files        1501     1505       +4     
  Lines      201832   202674     +842     
  Branches     2869     2873       +4     
==========================================
+ Hits       161110   161963     +853     
+ Misses      40176    40163      -13     
- Partials      546      548       +2     


@deanm0000
Collaborator

@adamreeve you need to change the title of this so that "add" is capitalized and then "pull request labeler" will pass. They do that because the version change summary scrapes the PR titles so it looks bad to have some bullet points that are lower case and some that are capitalized.

@adamreeve adamreeve changed the title feat: add IEJoin algorithm for non-equi joins feat: Add IEJoin algorithm for non-equi joins Aug 28, 2024
@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars and removed title needs formatting labels Aug 28, 2024
@ritchie46
Member

Thanks a lot @adamreeve. I will pick this up. What I really want to add first is the parallel groups, and then I need to think about the interface for a bit. Ideally, I'd like this to work with expressions in the default join.

@adamreeve
Contributor Author

OK great, thanks Ritchie. Feel free to reach out on Discord or here if I can help with anything further. Yeah, if this could work with expressions in the default join method, that would be a lot nicer than the approach I've demonstrated here.

@c-peters c-peters assigned c-peters and ritchie46 and unassigned c-peters Sep 3, 2024
@c-peters c-peters added the accepted Ready for implementation label Sep 3, 2024

codspeed-hq bot commented Sep 4, 2024

CodSpeed Performance Report

Merging #18365 will not alter performance

Comparing adamreeve:iejoin (c54322c) with main (106e239)

Summary

✅ 37 untouched benchmarks

@ritchie46 ritchie46 changed the title feat: Add IEJoin algorithm for non-equi joins feat: Add IEJoin algorithm for non-equi joins and support Full non-ecqui joins Sep 6, 2024
@ritchie46 ritchie46 added the highlight Highlight this PR in the changelog label Sep 7, 2024
@ritchie46
Member

Thanks a lot @adamreeve. This turned out great with full non-equi join support.

@ritchie46 ritchie46 merged commit 80769d2 into pola-rs:main Sep 7, 2024
24 of 27 checks passed
@adamreeve adamreeve deleted the iejoin branch September 9, 2024 00:08
@ritchie46 ritchie46 changed the title feat: Add IEJoin algorithm for non-equi joins and support Full non-ecqui joins feat: Add IEJoin algorithm for non-equi joins and support Full non-equi joins Sep 9, 2024
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature highlight Highlight this PR in the changelog python Related to Python Polars rust Related to Rust Polars
4 participants