Bk/add leftjoin! #2843

bkamins · 2021-08-26T08:40:09Z

This requires #2794 before it can be finished, but I am sharing it now to discuss the API (especially the allowed kwargs and the requirement that right keys be unique).

I have not implemented all possible optimizations yet (I will decide whether to add them now or leave them for later). Because of the missing optimizations, the performance comparison currently looks like this:

julia> using DataFrames, Random, BenchmarkTools

julia> df1 = DataFrame(id=1:10^7, x=1:10^7);

julia> df1x = DataFrame(id=shuffle(1:10^7), x=1:10^7);

julia> df2 = DataFrame(id=1:10^7, y=1:10^7);

julia> @btime leftjoin(copy($df1), $df2, on=:id);
  297.162 ms (326 allocations: 544.81 MiB)

julia> @btime leftjoin!(copy($df1), $df2, on=:id);
  1.782 s (184 allocations: 858.72 MiB)

julia> @btime leftjoin(copy($df1x), $df2, on=:id);
  939.200 ms (358 allocations: 621.10 MiB)

julia> @btime leftjoin!(copy($df1x), $df2, on=:id);
  2.020 s (183 allocations: 858.72 MiB)

(and as usual - we will have the worst case when right data frame is very tall and left is very short)

bkamins · 2021-08-26T08:41:04Z

Ah - and there are some minor clean-ups in the legacy join code, but they are just tidying up.

bkamins · 2021-08-26T17:29:01Z

Please hold off on reviewing the PR. I am now investigating whether we should disallow duplicates in the right table only if they would affect the join (i.e. non-matching duplicates would be allowed). I will comment when I am done.

bkamins · 2021-08-26T21:56:34Z

OK - I have changed the implementation to only require:

The rows in on columns of df2 that match rows in df1 must be unique.

(so this means that if in df2 we have rows that are not matching df1 they do not have to be unique)
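A minimal sketch of what this relaxed requirement means in practice, assuming a DataFrames.jl release that includes `leftjoin!` (this PR targets the 1.3 milestone). The data frames below are made up for illustration, and the sketch only records that an exception is thrown, not its exact type:

```julia
using DataFrames

df1 = DataFrame(id=[1, 2, 3], x=[10, 20, 30])

# id == 4 appears twice in df2_ok, but it matches no row of df1,
# so under the relaxed rule the duplicate is allowed.
df2_ok = DataFrame(id=[1, 2, 4, 4], y=[100, 200, 400, 401])
leftjoin!(df1, df2_ok, on=:id)   # df1 gains column y; id == 3 gets missing

# id == 1 appears twice in df2_bad and does match a row of df1,
# so the in-place join is expected to be rejected.
df2_bad = DataFrame(id=[1, 1, 2], z=[1, 2, 3])
threw = try
    leftjoin!(df1, df2_bad, on=:id)
    false
catch
    true
end
```

Here `threw` should end up `true`, while the first call succeeds despite the duplicated `id == 4`.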

and now things can be faster (in general roughly on par with other joins):

julia> using DataFrames, Random, BenchmarkTools

julia> df1 = DataFrame(id=1:10^7, x=1:10^7);

julia> df1x = DataFrame(id=shuffle(1:10^7), x=1:10^7);

julia> df2 = DataFrame(id=1:10^7, y=1:10^7);

julia> @btime leftjoin(copy($df1), $df2, on=:id);
  291.562 ms (321 allocations: 544.81 MiB)

julia> @btime leftjoin!(copy($df1), $df2, on=:id);
  266.190 ms (189 allocations: 467.31 MiB)

julia> @btime leftjoin(copy($df1x), $df2, on=:id);
  903.370 ms (353 allocations: 621.10 MiB)

julia> @btime leftjoin!(copy($df1x), $df2, on=:id);
  1.018 s (222 allocations: 543.61 MiB)

nalimilan

Looks good, I don't have major comments about the implementation.

Regarding the requirement that matches in df2 must be unique, I think we should be prepared to allow choosing a different behavior in the future via an argument. One could wish to retain the first or the last match, or even to duplicate rows in df1 (which would imply resizing it). But AFAICT this wouldn't be a problem, right?

nalimilan · 2021-08-28T17:42:09Z

src/DataFrames.jl

@@ -132,6 +133,7 @@ include("abstractdataframe/reshape.jl")

 include("join/composer.jl")
 include("join/core.jl")
+include("join/leftjoin!.jl")


Maybe use a more general name like "inplace"? We may want to implement rightjoin! at some point.

I was only afraid that rightjoin! would be a bit confusing for users (note the row order), but potentially we can add it in the future (the algorithm would be the same). The point is that rightjoin has more complex logic regarding column order and creation, see:

julia> df1 = DataFrame(id=1, x=2)
1×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df2 = DataFrame(id=1, y=3)
1×2 DataFrame
 Row │ id     y
     │ Int64  Int64
─────┼──────────────
   1 │     1      3

julia> rightjoin(df1, df2, on=:id)
1×3 DataFrame
 Row │ id     x       y
     │ Int64  Int64?  Int64
─────┼──────────────────────
   1 │     1       2      3

and a comment from a docstring:

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame
takes precedence over the ordering of the right data frame.

Also I do not think many people will want rightjoin!.

nalimilan · 2021-08-28T17:48:01Z

src/join/leftjoin!.jl

+         on::Union{<:OnType, AbstractVector} = Symbol[], makeunique::Bool=false,
+         source::Union{Nothing, Symbol, AbstractString}=nothing,
+         matchmissing::Symbol=:error)


Incorrect indentation.

src/join/leftjoin!.jl

test/join.jl

nalimilan · 2021-08-31T16:32:13Z

src/join/leftjoin!.jl

+              matchmissing=:error)
+
+Perform a left join of two data frame objects by updating the `df1` with the
+joined columns from `df2`.


I'd keep this for clarity:

Suggested change

-joined columns from `df2`.
+joined columns from `df2`.
+A left join includes all rows from `df1`.
+Rows and columns from `df1` are left untouched. Each row in `df1`
+must have at most one match in `df2` based on `on` columns.

I have written a new intro to the docstring.

src/join/leftjoin!.jl

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2021-09-01T08:44:40Z

Regarding the requirement that matches in df2 must be unique, I think we should be prepared to allow choosing a different behavior in the future via an argument.

We can add it in the future.

One could wish to retain the first or the last match

This could be done relatively easily I think (I would have to check if it would not require sorting but hopefully not)

or even to duplicate rows in df1 (which would imply resizing it). But AFAICT this wouldn't be a problem, right?

This would cause three problems:

1. for a DataFrame, resizing columns is costly (this is not super bad, but still)
2. it would require a completely different algorithm to be efficient (now we use the fact that only one slot is occupied per row in df1, which is efficient)
3. for a SubDataFrame as df1 it would not work (as clearly a SubDataFrame cannot be resized) - of course we can check for this.
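On the SubDataFrame point, a small sketch may help. It assumes `leftjoin!` accepts a view created with `:` as the column selector (the commit list of this PR mentions a SubDataFrame test) and that the new column is added through the view to its parent. The names `parent_df` and `sdf` are made up for illustration:

```julia
using DataFrames

parent_df = DataFrame(id=1:4, x=11:14)
df2 = DataFrame(id=[1, 2], y=["a", "b"])

# A view over the first two rows that keeps all columns (`:` selector).
sdf = view(parent_df, 1:2, :)

# The view cannot be resized, but with at most one match per row
# no resizing is needed: the join only adds column `y`.
leftjoin!(sdf, df2, on=:id)
```

With unique matches the number of rows never changes, which is why the current design can work on a view, whereas duplicating rows of `df1` could not.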

bkamins · 2021-09-01T08:52:25Z

This should be good for another round of reviews. Also, do you think that we should add an option to keep row order in leftjoin, similar to what leftjoin! does now (essentially we would do leftjoin! on a copy of df1, so it is easy to add)?

CC @pdeffebach

bkamins · 2021-09-01T13:56:04Z

essentially we would do leftjoin! on a copy of df1 so it is easy to add

Ah - this is not that easy as we have to handle column renaming still but it should not be super hard to add.
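The row-order idea can be sketched as follows, using made-up data and ignoring the column-renaming issue mentioned above. This only illustrates calling `leftjoin!` on a copy; it is not a proposed implementation of the option:

```julia
using DataFrames

df1 = DataFrame(id=[3, 1, 2], x=["c", "a", "b"])
df2 = DataFrame(id=[1, 2, 3], y=[10, 20, 30])

# Join in place on a copy: df1 itself is not modified,
# and the rows of the result stay in the order of df1.
ordered = leftjoin!(copy(df1), df2, on=:id)
```

Because `leftjoin!` leaves the rows of its first argument untouched, `ordered.id` keeps the order `3, 1, 2` from `df1`.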

nalimilan

Let's merge this as-is; more features can be added later.

src/join/inplace.jl

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2021-09-06T21:15:55Z

Thank you!

bkamins added 2 commits August 26, 2021 10:23

add leftjoin!

e7251b4

add performance TODO

f615290

bkamins added feature joins labels Aug 26, 2021

bkamins added this to the 1.3 milestone Aug 26, 2021

bkamins requested review from nalimilan and pdeffebach August 26, 2021 08:40

bkamins added 4 commits August 26, 2021 23:56

change leftjoin! implementation

bc55a2c

code layout fix

68e428c

add leftjoin! to docstrings

1384a64

Merge branch 'main' into bk/add_leftjoin!

f2ec8eb

nalimilan reviewed Aug 31, 2021

bkamins commented Sep 1, 2021

bkamins and others added 4 commits September 1, 2021 09:59

Apply suggestions from code review

f0690e3

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

updates after code review

4781e1a

Merge branch 'main' into bk/add_leftjoin!

09e4342

test SubDataFrame support of leftjoin!

f8724c9

nalimilan approved these changes Sep 6, 2021

Update src/join/inplace.jl

ffb3237

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins merged commit 8dcccb4 into main Sep 6, 2021

bkamins deleted the bk/add_leftjoin! branch September 6, 2021 21:15
