-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bk/add leftjoin! #2843
Bk/add leftjoin! #2843
Conversation
Ah - and there are some minor clean-ups in the legacy join codes, but they are just tidying-up things. |
Please hold up with reviewing the PR. I am now investigating whether we should disallow duplicates in right table only if they would affect the join (i.e. non-matching duplicates would be allowed). I will comment when I am done. |
OK - I have changed the implementation to only require:
(so this means that if in and now things can be faster (in general roughly on par with other joins):
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I don't have major comments about the implementation.
Regarding the requirement that matches in df2
must be unique, I think we should be prepared to allow choosing a different behavior in the future via an argument. One could wish to retain the first or the last match, or even to duplicate rows in df1
(which would imply resizing it). But AFAICT this wouldn't be a problem, right?
src/DataFrames.jl
Outdated
@@ -132,6 +133,7 @@ include("abstractdataframe/reshape.jl") | |||
|
|||
include("join/composer.jl") | |||
include("join/core.jl") | |||
include("join/leftjoin!.jl") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use a more general name like "inplace"? We may want to implement rightjoin!
at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only was afraid that rightjoin!
would be a bit confusing for the users (note the row order ), but potentially we can add it in the future (the algorithm will be the same). The point is that rightjoin
has a more complex logic regarding column order and creation, see:
julia> df1 = DataFrame(id=1, x=2)
1×2 DataFrame
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 2
julia> df2 = DataFrame(id=1, y=3)
1×2 DataFrame
Row │ id y
│ Int64 Int64
─────┼──────────────
1 │ 1 3
julia> rightjoin(df1, df2, on=:id)
1×3 DataFrame
Row │ id x y
│ Int64 Int64? Int64
─────┼──────────────────────
1 │ 1 2 3
and a comment from a docstring:
When merging
on
categorical columns that differ in the ordering of their levels, the ordering of the left data frame
takes precedence over the ordering of the right data frame.
Also I do not think many people will want rightjoin!
.
src/join/leftjoin!.jl
Outdated
on::Union{<:OnType, AbstractVector} = Symbol[], makeunique::Bool=false, | ||
source::Union{Nothing, Symbol, AbstractString}=nothing, | ||
matchmissing::Symbol=:error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect indentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/join/leftjoin!.jl
Outdated
matchmissing=:error) | ||
|
||
Perform a left join of two data frame objects by updating the `df1` with the | ||
joined columns from `df2`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd keep this for clarity:
joined columns from `df2`. | |
joined columns from `df2`. | |
A left join includes all rows from `df1`. | |
Rows and columns from `df1` are left untouched. Each row in `df1` | |
must have at most one match in `df2` based on `on` columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have written a new intro do docstring.
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
We can add it in the future.
This could be done relatively easily I think (I would have to check if it would not require sorting but hopefully not)
This would cause three problems:
|
This should be good for another round of reviews. Also do you think that we should add an option to keep row order in CC @pdeffebach |
Ah - this is not that easy as we have to handle column renaming still but it should not be super hard to add. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge this as-is more features can be added later.
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Thank you! |
Fixes #2259.
Requires #2794 before being finished but I share it to discuss API (especially the kwargs allowed and requirement of right key uniqueness).
I have not implemented all possible optimizations yet (I will decide to either add them now or leave for adding them later). The consequence of lacking optimizations is the following performance comparison:
(and as usual - we will have the worst case when right data frame is very tall and left is very short)