Bk/add leftjoin! #2843

Merged: bkamins merged 11 commits into main from bk/add_leftjoin! on Sep 6, 2021

bkamins (Member) commented Aug 26, 2021

Fixes #2259.

This requires #2794 to be merged before it can be finished, but I am sharing it now to discuss the API (especially the allowed kwargs and the requirement that right keys be unique).

I have not implemented all possible optimizations yet (I will decide whether to add them now or leave them for later). Because these optimizations are missing, performance currently compares as follows:

julia> using DataFrames, Random, BenchmarkTools

julia> df1 = DataFrame(id=1:10^7, x=1:10^7);

julia> df1x = DataFrame(id=shuffle(1:10^7), x=1:10^7);

julia> df2 = DataFrame(id=1:10^7, y=1:10^7);

julia> @btime leftjoin(copy($df1), $df2, on=:id);
  297.162 ms (326 allocations: 544.81 MiB)

julia> @btime leftjoin!(copy($df1), $df2, on=:id);
  1.782 s (184 allocations: 858.72 MiB)

julia> @btime leftjoin(copy($df1x), $df2, on=:id);
  939.200 ms (358 allocations: 621.10 MiB)

julia> @btime leftjoin!(copy($df1x), $df2, on=:id);
  2.020 s (183 allocations: 858.72 MiB)

(As usual, the worst case is when the right data frame is very tall and the left one is very short.)

bkamins added this to the 1.3 milestone on Aug 26, 2021

bkamins (Member Author) commented Aug 26, 2021

Ah - and there are some minor clean-ups in the legacy join code, but they are just tidying things up.

bkamins (Member Author) commented Aug 26, 2021

Please hold off on reviewing the PR. I am now investigating whether we should disallow duplicates in the right table only if they would affect the join (i.e. duplicates that match no row in the left table would be allowed). I will comment when I am done.

bkamins (Member Author) commented Aug 26, 2021

OK - I have changed the implementation to only require:

The rows in on columns of df2 that match rows in df1 must be unique.

(This means that rows in df2 that do not match any row in df1 do not have to be unique.)

With this change the join is faster (in general roughly on par with the other joins):

julia> using DataFrames, Random, BenchmarkTools

julia> df1 = DataFrame(id=1:10^7, x=1:10^7);

julia> df1x = DataFrame(id=shuffle(1:10^7), x=1:10^7);

julia> df2 = DataFrame(id=1:10^7, y=1:10^7);

julia> @btime leftjoin(copy($df1), $df2, on=:id);
  291.562 ms (321 allocations: 544.81 MiB)

julia> @btime leftjoin!(copy($df1), $df2, on=:id);
  266.190 ms (189 allocations: 467.31 MiB)

julia> @btime leftjoin(copy($df1x), $df2, on=:id);
  903.370 ms (353 allocations: 621.10 MiB)

julia> @btime leftjoin!(copy($df1x), $df2, on=:id);
  1.018 s (222 allocations: 543.61 MiB)
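
The relaxed uniqueness rule described above can be illustrated with a small sketch. It assumes DataFrames.jl 1.3 or later (the milestone this PR targets) and uses made-up toy data; the variable names are illustrative only.

```julia
using DataFrames

# Toy data (assumed for illustration, not taken from the PR).
df1 = DataFrame(id=[1, 2, 3], x=[10, 20, 30])

# id 9 is duplicated in df2 but matches no row of df1,
# so under the relaxed rule it should not block the join.
df2 = DataFrame(id=[1, 3, 9, 9], y=[100, 300, 900, 901])

leftjoin!(df1, df2, on=:id)
# df1 keeps its three rows in their original order and gains column y;
# the row with id 2 has no match, so its y value is missing.

# A duplicate that does match a row of the left table is expected to be rejected.
df_dup = DataFrame(id=[1, 1], y=[100, 101])
dup_rejected = try
    leftjoin!(DataFrame(id=[1, 2], x=[10, 20]), df_dup, on=:id)
    false
catch
    true
end
```

Here `dup_rejected` records whether the second call threw; the exact exception type is deliberately not relied upon in this sketch.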

nalimilan (Member) left a comment

Looks good, I don't have major comments about the implementation.

Regarding the requirement that matches in df2 must be unique, I think we should be prepared to allow choosing a different behavior in the future via an argument. One could wish to retain the first or the last match, or even to duplicate rows in df1 (which would imply resizing it). But AFAICT this wouldn't be a problem, right?

@@ -132,6 +133,7 @@ include("abstractdataframe/reshape.jl")

include("join/composer.jl")
include("join/core.jl")
include("join/leftjoin!.jl")
Member

Maybe use a more general name like "inplace"? We may want to implement rightjoin! at some point.

Member Author

ok

Member Author

I was only afraid that rightjoin! would be a bit confusing for users (note the row order), but we can potentially add it in the future (the algorithm would be the same). The point is that rightjoin has more complex logic regarding column order and creation; see:

julia> df1 = DataFrame(id=1, x=2)
1×2 DataFrame       
 Row │ id     x     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      2 

julia> df2 = DataFrame(id=1, y=3)
1×2 DataFrame       
 Row │ id     y     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      3 

julia> rightjoin(df1, df2, on=:id)
1×3 DataFrame
 Row │ id     x       y     
     │ Int64  Int64?  Int64 
─────┼──────────────────────
   1 │     1       2      3

and a comment from a docstring:

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame
takes precedence over the ordering of the right data frame.

Also I do not think many people will want rightjoin!.

Comment on lines 90 to 92
on::Union{<:OnType, AbstractVector} = Symbol[], makeunique::Bool=false,
source::Union{Nothing, Symbol, AbstractString}=nothing,
matchmissing::Symbol=:error)
Member

Incorrect indentation.

Member Author

fixed

matchmissing=:error)

Perform a left join of two data frame objects by updating the `df1` with the
joined columns from `df2`.
nalimilan (Member) commented Aug 31, 2021

I'd keep this for clarity:

Suggested change
- joined columns from `df2`.
+ joined columns from `df2`.
+ A left join includes all rows from `df1`.
+ Rows and columns from `df1` are left untouched. Each row in `df1`
+ must have at most one match in `df2` based on `on` columns.
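
The semantics stated in this docstring text (all rows of `df1` are kept, existing rows and columns are untouched, at most one match per row) can be sketched as follows. This assumes DataFrames.jl 1.3 or later and invented toy data; it is an illustration, not an excerpt from the PR's tests.

```julia
using DataFrames

# Toy data (assumed for illustration). Note that df1 is not sorted by id.
df1 = DataFrame(id=[3, 1, 2], x=["c", "a", "b"])
df2 = DataFrame(id=[1, 3], y=[1.0, 3.0])

# Update df1 in place with the matching columns from df2.
leftjoin!(df1, df2, on=:id)

# After the call:
# - df1 still has its three rows in the original order (ids 3, 1, 2),
# - columns id and x are unchanged,
# - a new column y holds 3.0, 1.0 and missing (id 2 has no match in df2).
```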

Member Author

I have written a new intro to the docstring.

bkamins (Member Author) commented Sep 1, 2021

> Regarding the requirement that matches in df2 must be unique, I think we should be prepared to allow choosing a different behavior in the future via an argument.

We can add it in the future.

> One could wish to retain the first or the last match

This could be done relatively easily I think (I would have to check if it would not require sorting, but hopefully not).

> or even to duplicate rows in df1 (which would imply resizing it). But AFAICT this wouldn't be a problem, right?

This would cause three problems:

  1. for DataFrame resizing columns is costly (this is not super bad but still)
  2. it would require a completely different algorithm to be efficient (now we use the fact that only one slot is occupied per row in df1 which is efficient)
  3. for SubDataFrame as df1 it would not work (as clearly SubDataFrame cannot be resized) - of course we can check for this.
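
Point 2 above (one occupied slot per row of df1) can be sketched in plain Julia. The function name `map_left_to_right` and its arguments are hypothetical and chosen for illustration; this is not the code from the PR, only a minimal model of the idea, assuming the matching (left row, right row) index pairs have already been computed.

```julia
# Hypothetical helper, not the PR's implementation.
# Given matched index pairs, build one slot per left row that stores the
# index of its single matching right row, or missing if there is none.
function map_left_to_right(nleft::Integer,
                           left_ixs::AbstractVector{<:Integer},
                           right_ixs::AbstractVector{<:Integer})
    slots = Vector{Union{Missing, Int}}(missing, nleft)
    for (li, ri) in zip(left_ixs, right_ixs)
        # A slot that is already filled means a second match for this left row,
        # which an in-place join cannot represent without adding rows.
        if !ismissing(slots[li])
            throw(ArgumentError("left row $li matches more than one right row"))
        end
        slots[li] = ri
    end
    return slots
end

# Left rows 1 and 3 match right rows 2 and 1; left row 2 has no match.
slots = map_left_to_right(3, [1, 3], [2, 1])

# Two matches for left row 1 are rejected.
dup_rejected = try
    map_left_to_right(2, [1, 1], [1, 2])
    false
catch e
    e isa ArgumentError
end
```

Because the slot vector has exactly as many entries as df1 has rows, no column of df1 ever needs to be resized, which is why duplicating rows would call for a different algorithm.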

bkamins (Member Author) commented Sep 1, 2021

This should be good for another round of reviews. Also do you think that we should add an option to keep row order in leftjoin in a similar way to what we do now in leftjoin! (essentially we would do leftjoin! on a copy of df1 so it is easy to add)?

CC @pdeffebach

bkamins (Member Author) commented Sep 1, 2021

> essentially we would do leftjoin! on a copy of df1 so it is easy to add

Ah - this is not that easy as we have to handle column renaming still but it should not be super hard to add.
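
The idea of an order-preserving left join built on a copy can be sketched as below. The helper `leftjoin_keep_order` is hypothetical (it is not part of DataFrames.jl), it assumes DataFrames.jl 1.3 or later, and it deliberately ignores the column renaming mentioned above, so it is only a rough illustration.

```julia
using DataFrames

# Hypothetical helper, not a DataFrames.jl function.
# It joins into a copy, so df1 itself is not modified and the
# row order of df1 is kept in the result. Column renaming is not handled.
function leftjoin_keep_order(df1::AbstractDataFrame, df2::AbstractDataFrame; on)
    out = copy(df1)
    leftjoin!(out, df2; on=on)
    return out
end

# Toy data (assumed for illustration).
df1 = DataFrame(id=[3, 1, 2], x=[30, 10, 20])
df2 = DataFrame(id=[1, 2, 3], y=[1, 2, 3])

ordered = leftjoin_keep_order(df1, df2; on=:id)
```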

nalimilan (Member) left a comment

Let's merge this as-is; more features can be added later.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
bkamins merged commit 8dcccb4 into main on Sep 6, 2021
bkamins deleted the bk/add_leftjoin! branch on September 6, 2021 21:15
bkamins (Member Author) commented Sep 6, 2021

Thank you!

Linked issue: add a leftjoin! (or match! or merge! or whatever it should be called)
2 participants