Bk/add leftjoin! #2843
Changes from 6 commits
e7251b4
f615290
bc55a2c
68e428c
1384a64
f2ec8eb
f0690e3
4781e1a
09e4342
f8724c9
ffb3237
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -109,6 +109,7 @@ antijoin | |
crossjoin | ||
innerjoin | ||
leftjoin | ||
leftjoin! | ||
outerjoin | ||
rightjoin | ||
semijoin | ||
|
Original file line number | Diff line number | Diff line change | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,175 @@ | ||||||||||||||||||||
""" | ||||||||||||||||||||
leftjoin!(df1, df2; on, makeunique=false, source=nothing, | ||||||||||||||||||||
matchmissing=:error) | ||||||||||||||||||||
|
||||||||||||||||||||
Perform a left join of two data frame objects by updating the `df1` with the | ||||||||||||||||||||
joined columns from `df2`. | ||||||||||||||||||||
I'd keep this for clarity:
Suggested change
I have written a new intro to the docstring. |
||||||||||||||||||||
|
||||||||||||||||||||
The rows in `on` columns of `df2` that match rows in `df1` must be unique. | ||||||||||||||||||||
Can/should we guarantee that rows won't be reordered?
Suggested change
Yes, we guarantee this. Also, the 'uniqueness' restriction makes it possible to perform this operation fast because:
I have updated the description in your suggestion above (as it was overlapping). |
||||||||||||||||||||
|
||||||||||||||||||||
# Arguments | ||||||||||||||||||||
- `df1`, `df2`: the `AbstractDataFrame`s to be joined | ||||||||||||||||||||
|
||||||||||||||||||||
# Keyword Arguments | ||||||||||||||||||||
- `on` : A column name to join `df1` and `df2` on. If the columns on which | ||||||||||||||||||||
`df1` and `df2` will be joined have different names, then a `left=>right` | ||||||||||||||||||||
pair can be passed. It is also allowed to perform a join on multiple columns, | ||||||||||||||||||||
in which case a vector of column names or column name pairs can be passed | ||||||||||||||||||||
(mixing names and pairs is allowed). | ||||||||||||||||||||
- `makeunique` : if `false` (the default), an error will be raised | ||||||||||||||||||||
if duplicate names are found in columns not joined on; | ||||||||||||||||||||
if `true`, duplicate names will be suffixed with `_i` | ||||||||||||||||||||
(`i` starting at 1 for the first duplicate). | ||||||||||||||||||||
- `source` : Default: `nothing`. If a `Symbol` or string, adds indicator | ||||||||||||||||||||
column with the given name, for whether a row appeared in only `df1` (`"left_only"`) | ||||||||||||||||||||
or in both (`"both"`). If the name is already in use, | ||||||||||||||||||||
the column name will be modified if `makeunique=true`. | ||||||||||||||||||||
- `matchmissing` : if equal to `:error` throw an error if `missing` is present | ||||||||||||||||||||
in `on` columns; if equal to `:equal` then `missing` is allowed and missings are | ||||||||||||||||||||
matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns; | ||||||||||||||||||||
`isequal` is used for comparisons of rows for equality | ||||||||||||||||||||
|
||||||||||||||||||||
The columns added to `df1` from `df2` will support missing values. | ||||||||||||||||||||
|
||||||||||||||||||||
It is not allowed to join on columns that contain `NaN` or `-0.0` in the real or | ||||||||||||||||||||
imaginary part of the number. If you need to perform a join on such values, use | ||||||||||||||||||||
CategoricalArrays.jl and transform a column containing such values into a | ||||||||||||||||||||
`CategoricalVector`. | ||||||||||||||||||||
|
||||||||||||||||||||
See also: [`leftjoin`](@ref). | ||||||||||||||||||||
|
||||||||||||||||||||
# Examples | ||||||||||||||||||||
```jldoctest | ||||||||||||||||||||
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"]) | ||||||||||||||||||||
3×2 DataFrame | ||||||||||||||||||||
Row │ ID Name | ||||||||||||||||||||
│ Int64 String | ||||||||||||||||||||
─────┼────────────────── | ||||||||||||||||||||
1 │ 1 John Doe | ||||||||||||||||||||
2 │ 2 Jane Doe | ||||||||||||||||||||
3 │ 3 Joe Blogs | ||||||||||||||||||||
|
||||||||||||||||||||
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"]) | ||||||||||||||||||||
3×2 DataFrame | ||||||||||||||||||||
Row │ ID Job | ||||||||||||||||||||
│ Int64 String | ||||||||||||||||||||
─────┼─────────────── | ||||||||||||||||||||
1 │ 1 Lawyer | ||||||||||||||||||||
2 │ 2 Doctor | ||||||||||||||||||||
3 │ 4 Farmer | ||||||||||||||||||||
|
||||||||||||||||||||
julia> leftjoin!(name, job, on = :ID) | ||||||||||||||||||||
3×3 DataFrame | ||||||||||||||||||||
Row │ ID Name Job | ||||||||||||||||||||
│ Int64 String String? | ||||||||||||||||||||
─────┼─────────────────────────── | ||||||||||||||||||||
1 │ 1 John Doe Lawyer | ||||||||||||||||||||
2 │ 2 Jane Doe Doctor | ||||||||||||||||||||
3 │ 3 Joe Blogs missing | ||||||||||||||||||||
|
||||||||||||||||||||
julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"]) | ||||||||||||||||||||
3×2 DataFrame | ||||||||||||||||||||
Row │ identifier Job | ||||||||||||||||||||
│ Int64 String | ||||||||||||||||||||
─────┼──────────────────── | ||||||||||||||||||||
1 │ 1 Lawyer | ||||||||||||||||||||
2 │ 2 Doctor | ||||||||||||||||||||
3 │ 4 Farmer | ||||||||||||||||||||
|
||||||||||||||||||||
julia> leftjoin!(name, job2, on = :ID => :identifier, makeunique=true, source=:source) | ||||||||||||||||||||
3×5 DataFrame | ||||||||||||||||||||
Row │ ID Name Job Job_1 source | ||||||||||||||||||||
│ Int64 String String? String? String | ||||||||||||||||||||
─────┼─────────────────────────────────────────────── | ||||||||||||||||||||
1 │ 1 John Doe Lawyer Lawyer both | ||||||||||||||||||||
2 │ 2 Jane Doe Doctor Doctor both | ||||||||||||||||||||
3 │ 3 Joe Blogs missing missing left_only | ||||||||||||||||||||
``` | ||||||||||||||||||||
""" | ||||||||||||||||||||
function leftjoin!(df1::AbstractDataFrame, df2::AbstractDataFrame; | ||||||||||||||||||||
on::Union{<:OnType, AbstractVector} = Symbol[], makeunique::Bool=false, | ||||||||||||||||||||
source::Union{Nothing, Symbol, AbstractString}=nothing, | ||||||||||||||||||||
matchmissing::Symbol=:error) | ||||||||||||||||||||
Incorrect indentation.
Fixed. |
||||||||||||||||||||
|
||||||||||||||||||||
# TODO: add a check if df1 allows adding columns if it is a SubDataFrame | ||||||||||||||||||||
# after https://github.com/JuliaData/DataFrames.jl/pull/2794 is merged | ||||||||||||||||||||
|
||||||||||||||||||||
_check_consistency(df1) | ||||||||||||||||||||
_check_consistency(df2) | ||||||||||||||||||||
|
||||||||||||||||||||
if on == [] | ||||||||||||||||||||
throw(ArgumentError("Missing join argument 'on'.")) | ||||||||||||||||||||
end | ||||||||||||||||||||
|
||||||||||||||||||||
joiner = DataFrameJoiner(df1, df2, on, matchmissing, :left) | ||||||||||||||||||||
|
||||||||||||||||||||
right_noon_names = names(joiner.dfr, Not(joiner.right_on)) | ||||||||||||||||||||
if !(makeunique || isempty(intersect(right_noon_names, names(df1)))) | ||||||||||||||||||||
throw(ArgumentError("left data frame has duplicate column names with " * | ||||||||||||||||||||
"right data frame. Pass makeunique=true to " * | ||||||||||||||||||||
"make it unique using a suffix automatically.")) | ||||||||||||||||||||
|
||||||||||||||||||||
end | ||||||||||||||||||||
|
||||||||||||||||||||
left_ixs_inner, right_ixs_inner = find_inner_rows(joiner) | ||||||||||||||||||||
|
||||||||||||||||||||
right_ixs = _map_leftjoin_ixs(nrow(df1), left_ixs_inner, right_ixs_inner) | ||||||||||||||||||||
|
||||||||||||||||||||
# TODO: consider adding threading support in the future | ||||||||||||||||||||
for colname in right_noon_names | ||||||||||||||||||||
rcol = joiner.dfr[!, colname] # note that it does not have to be df2 | ||||||||||||||||||||
|
||||||||||||||||||||
rcol_joined = compose_joined_rcol!(rcol, similar_missing(rcol, nrow(df1)), | ||||||||||||||||||||
right_ixs) | ||||||||||||||||||||
insertcols!(df1, colname => rcol_joined, makeunique=makeunique, copycols=false) | ||||||||||||||||||||
end | ||||||||||||||||||||
|
||||||||||||||||||||
if source !== nothing | ||||||||||||||||||||
pool = ["left_only", "right_only", "both"] | ||||||||||||||||||||
invpool = Dict{String, UInt32}("left_only" => 1, | ||||||||||||||||||||
"right_only" => 2, | ||||||||||||||||||||
"both" => 3) | ||||||||||||||||||||
indicatorcol = PooledArray(PooledArrays.RefArray(UInt32.(2 .* (right_ixs .> 0) .+ 1)), | ||||||||||||||||||||
invpool, pool) | ||||||||||||||||||||
|
||||||||||||||||||||
unique_indicator = source | ||||||||||||||||||||
if makeunique | ||||||||||||||||||||
try_idx = 0 | ||||||||||||||||||||
while hasproperty(df1, unique_indicator) | ||||||||||||||||||||
try_idx += 1 | ||||||||||||||||||||
unique_indicator = Symbol(source, "_", try_idx) | ||||||||||||||||||||
end | ||||||||||||||||||||
end | ||||||||||||||||||||
|
||||||||||||||||||||
if hasproperty(df1, unique_indicator) | ||||||||||||||||||||
throw(ArgumentError("updated left data frame already has column " * | ||||||||||||||||||||
|
||||||||||||||||||||
":$unique_indicator. Pass makeunique=true to " * | ||||||||||||||||||||
"make it unique using a suffix automatically.")) | ||||||||||||||||||||
end | ||||||||||||||||||||
df1[!, unique_indicator] = indicatorcol | ||||||||||||||||||||
end | ||||||||||||||||||||
return df1 | ||||||||||||||||||||
end | ||||||||||||||||||||
|
||||||||||||||||||||
function _map_leftjoin_ixs(out_len::Int, | ||||||||||||||||||||
left_ixs_inner::Vector{Int}, | ||||||||||||||||||||
right_ixs_inner::Vector{Int}) | ||||||||||||||||||||
right_ixs = zeros(Int, out_len) | ||||||||||||||||||||
@inbounds for (li, ri) in zip(left_ixs_inner, right_ixs_inner) | ||||||||||||||||||||
if right_ixs[li] > 0 | ||||||||||||||||||||
throw(ArgumentError("duplicate rows found in right table")) | ||||||||||||||||||||
end | ||||||||||||||||||||
right_ixs[li] = ri | ||||||||||||||||||||
end | ||||||||||||||||||||
return right_ixs | ||||||||||||||||||||
end | ||||||||||||||||||||
|
||||||||||||||||||||
function compose_joined_rcol!(rcol::AbstractVector, | ||||||||||||||||||||
rcol_joined::AbstractVector, | ||||||||||||||||||||
right_ixs::Vector{Int}) | ||||||||||||||||||||
|
||||||||||||||||||||
@assert length(rcol_joined) == length(right_ixs) | ||||||||||||||||||||
@inbounds for (i, idx) in enumerate(right_ixs) | ||||||||||||||||||||
if idx > 0 | ||||||||||||||||||||
rcol_joined[i] = rcol[idx] | ||||||||||||||||||||
end | ||||||||||||||||||||
end | ||||||||||||||||||||
return rcol_joined | ||||||||||||||||||||
end |
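The index-mapping step in the diff can be illustrated on plain vectors, without DataFrames.jl. The sketch below is not the PR's code: `map_leftjoin_ixs` and `compose_joined` are hypothetical stand-ins that mirror `_map_leftjoin_ixs` and `compose_joined_rcol!`, and the matched row pairs are written out by hand instead of coming from `find_inner_rows`. It also shows how the expression `2 .* (right_ixs .> 0) .+ 1` selects `"left_only"` or `"both"` from the indicator pool.

```julia
# Hypothetical stand-in for `_map_leftjoin_ixs`: for every row of the left
# table, record the matching row of the right table (0 means "no match").
function map_leftjoin_ixs(out_len::Int, left_ixs_inner::Vector{Int},
                          right_ixs_inner::Vector{Int})
    right_ixs = zeros(Int, out_len)
    for (li, ri) in zip(left_ixs_inner, right_ixs_inner)
        # A second match for the same left row means the right keys are not unique.
        right_ixs[li] > 0 &&
            throw(ArgumentError("duplicate rows found in right table"))
        right_ixs[li] = ri
    end
    return right_ixs
end

# Hypothetical stand-in for `compose_joined_rcol!`: build the new column,
# leaving `missing` wherever the left row has no match.
function compose_joined(rcol::AbstractVector, right_ixs::Vector{Int})
    out = Vector{Union{Missing, eltype(rcol)}}(missing, length(right_ixs))
    for (i, idx) in enumerate(right_ixs)
        idx > 0 && (out[i] = rcol[idx])
    end
    return out
end

# Left IDs [1, 2, 3] and right IDs [1, 2, 4], as in the docstring example:
# the inner matches are (left row 1, right row 1) and (left row 2, right row 2).
right_ixs = map_leftjoin_ixs(3, [1, 2], [1, 2])
job = compose_joined(["Lawyer", "Doctor", "Farmer"], right_ixs)

# Indicator codes as computed in the diff: 1 selects "left_only", 3 selects "both".
pool = ["left_only", "right_only", "both"]
codes = 2 .* (right_ixs .> 0) .+ 1
source = pool[codes]
```

Because each left row receives at most one right index, the output has exactly as many rows as the left table, and they stay in their original order.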
Maybe use a more general name like "inplace"? We may want to implement `rightjoin!` at some point.

ok

I was only afraid that `rightjoin!` would be a bit confusing for users (note the row order), but potentially we can add it in the future (the algorithm will be the same). The point is that `rightjoin` has more complex logic regarding column order and creation, see: and a comment from a docstring:

Also, I do not think many people will want `rightjoin!`.
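The row-order point raised in the review can be checked with a short usage sketch. It assumes a DataFrames.jl release that includes `leftjoin!` as proposed in this PR; the data frames are made up for illustration, and the error type for duplicate keys is taken from the `ArgumentError` thrown in the diff.

```julia
using DataFrames

# The left table is deliberately not sorted by ID.
name = DataFrame(ID = [3, 1, 2], Name = ["Joe Blogs", "John Doe", "Jane Doe"])
job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])

leftjoin!(name, job, on = :ID)

# The rows of `name` keep their original order; only a column is added.
ids_after = name.ID
jobs_after = name.Job

# Duplicate keys in the right table that match a left row are rejected.
name2 = DataFrame(ID = [1, 2], Name = ["John Doe", "Jane Doe"])
dup = DataFrame(ID = [1, 1], Job = ["Lawyer", "Doctor"])
dup_rejected = try
    leftjoin!(name2, dup, on = :ID)
    false
catch err
    err isa ArgumentError
end
```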