[BREAKING] add matchmissing kwarg to joins #2504

Merged
merged 10 commits into from
Nov 2, 2020

bkamins
Member

@bkamins bkamins commented Oct 29, 2020

Fixes #2499.

This is the core implementation and update of docstrings.

I will add tests later, as it first requires #2503 to be merged and rebased against.

Member

@nalimilan nalimilan left a comment

Looks good. Though I think we should do something about NaN (which dplyr also handles via the equivalent of matchmissing). Maybe just throw an error if there are NaNs for now and see whether people complain.

src/abstractdataframe/join.jl Outdated Show resolved Hide resolved
@bkamins
Member Author

bkamins commented Oct 29, 2020

> Maybe just throw an error if there are NaNs for now and see whether people complain.

OK. Doing joins on floats is in general not advisable though:

julia> outerjoin(DataFrame(x=0.0,a=1), DataFrame(x=-0.0,b=2), on=:x)
2×3 DataFrame
│ Row │ x       │ a       │ b       │
│     │ Float64 │ Int64?  │ Int64?  │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 0.0     │ 1       │ missing │
│ 2   │ -0.0    │ missing │ 2       │

(which means in particular that I am OK to error on NaN)

@nalimilan
Member

Yeah -0.0 is also a tricky case. We could also throw an error if we find -0.0.

@bkamins
Member Author

bkamins commented Oct 29, 2020

@StefanKarpinski - do you have an opinion on how NaN and -0.0 should be handled in joins when they are present in the columns you join on? (Currently we accept them and use isequal to match, just like Dicts.)

As you can see, @nalimilan is wondering whether we should simply throw an error on them. In general I understand his point, but I am not 100% sure what would be best. Given your experience designing isequal in Base, what do you think would be best?
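
For readers following along, here is a minimal Base-only sketch of the isequal semantics referred to above. It does not use DataFrames; the variable name `d` and the Dict contents are made up for illustration.

```julia
# Base-only illustration of how == and isequal differ on NaN and signed zeros.
@show NaN == NaN            # false: IEEE comparison
@show isequal(NaN, NaN)     # true
@show 0.0 == -0.0           # true: IEEE comparison
@show isequal(0.0, -0.0)    # false

# Dict keys are looked up with isequal (and hash), so the same rules apply.
d = Dict(0.0 => "positive zero", NaN => "not a number")
@show haskey(d, -0.0)       # false: -0.0 is a different key than 0.0
@show d[NaN]                # "not a number": NaN finds itself as a key
```

Under these rules a NaN key matches another NaN key, while 0.0 and -0.0 never match, which is consistent with the outerjoin output shown earlier in the thread.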

@kescobo
Contributor

kescobo commented Oct 30, 2020

> Doing joins on floats is in general not advisable though

💯 If people are doing joins on floats, something has already gone seriously wrong. But kudos for trying to get it right. How are other things that are == but not === handled? E.g.

a = [1,2]
b = [1,2]
c = [1,3]

df1 = DataFrame(x = [a,b,c], y = rand(3))
df2 = DataFrame(x = [a], z = [rand()])
julia> leftjoin(df1,df2, on=:x)
3×3 DataFrame
│ Row │ x      │ y        │ z        │
│     │ Array… │ Float64  │ Float64? │
├─────┼────────┼──────────┼──────────┤
│ 1   │ [1, 2] │ 0.213345 │ 0.614537 │
│ 2   │ [1, 2] │ 0.70363  │ 0.614537 │
│ 3   │ [1, 3] │ 0.181217 │ missing  │

So for consistency, seems like -0.0 and 0.0 should be treated the same, no?

@kescobo
Contributor

kescobo commented Oct 30, 2020

Oh, but the original question was about NaN - that one is weird since it's === but not ==... That I have no idea about
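
A small Base-only sketch of the three notions of equality being compared here, reusing the `a` and `b` arrays from the example above. It is only meant to illustrate the question, not how the join code is implemented.

```julia
# Two distinct arrays with the same contents.
a = [1, 2]
b = [1, 2]

@show a == b              # true: same elements
@show a === b             # false: different mutable objects
@show isequal(a, b)       # true: isequal compares arrays element by element
@show hash(a) == hash(b)  # true: equal arrays hash alike, so they match as keys

# NaN goes the other way round.
@show NaN === NaN         # true: identical bit pattern
@show NaN == NaN          # false: IEEE comparison
```

So the leftjoin result above follows from isequal on arrays, whereas 0.0 and -0.0 are a case where == and isequal disagree.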

@tbeason
Contributor

tbeason commented Oct 30, 2020

I think NaN could more or less follow whatever decision was made about missing, right? Why treat them differently with regard to matching?

I am indifferent about 0.0 and -0.0 being special cased.

@bkamins
Member Author

bkamins commented Oct 31, 2020

I have thought about it, and I think I would leave floating point columns as they are now. One normally should not join on such columns anyway, as it is error-prone in general. Instead I propose to add a comment in the docstrings warning about doing joins on floating point columns.

EDIT: but if you strongly feel we should always error on NaN, then I can leave it as is. In that case I would only add a comment that -0.0 and 0.0 are considered not equal. One case where joining on NaN or -0.0 might make sense is when you want to add a column that gives a descriptive mapping of the on columns (e.g. NaN is mapped to an "error" value in the column added via the join).

@bkamins bkamins changed the title from "add matchmissing kwarg to joins" to "[BREAKING] add matchmissing kwarg to joins" Nov 1, 2020
@bkamins bkamins added the breaking The proposed change is breaking. label Nov 1, 2020
@bkamins bkamins added this to the 1.0 milestone Nov 1, 2020
@nalimilan
Member

Since we can't start throwing an error later without it being a breaking change, I'd rather throw an error on NaNs now (the current state of the PR), and see later whether we should add an argument to disable it (or even disable it altogether). It would be safer to do the same for -0.0.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Member Author

bkamins commented Nov 1, 2020

OK - I will disallow -0.0 also then and throw an informative error message.
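
As a rough sketch of what such a check could look like: the helper name `check_join_keys` and the error messages below are hypothetical and are not taken from the code merged in this PR. It assumes the check is applied to each key column before matching.

```julia
# Hypothetical helper (not the PR's implementation): reject NaN and -0.0
# in a column that is about to be used as a join key.
function check_join_keys(col::AbstractVector)
    for x in col
        x isa Real || continue   # skips missing and non-numeric values
        if isnan(x)
            throw(ArgumentError("NaN is not allowed in a join key column"))
        elseif iszero(x) && signbit(x)
            throw(ArgumentError("-0.0 is not allowed in a join key column"))
        end
    end
    return nothing
end

check_join_keys([1.0, 2.5, 0.0])    # passes: positive zero is fine
check_join_keys([1, 2, missing])    # passes: missing is not this check's concern
```

Calling it on `[NaN]` or `[-0.0]` throws an ArgumentError. This sketch only looks at real numbers; complex values and the interaction with matchmissing are left out.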

@bkamins
Member Author

bkamins commented Nov 1, 2020

I have updated the implementation and added appropriate tests. I think that the solution I propose (using CategoricalVector as a wrapper) is clean enough, i.e. if someone really wants to make such a join, there is a way to do it.

The PR should be ready for a review.

src/abstractdataframe/join.jl Outdated Show resolved Hide resolved
@test_throws ArgumentError antijoin(name_w_special, job, on=:ID)
end

for special in [missing, 0.0]
Member

0.0 works even without matchmissing=:equal, right?

Member Author

Right, but I wanted to avoid adding a second set of identical tests. I think it is OK, as :equal has no influence on how 0.0 is handled.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins bkamins merged commit a156ef6 into JuliaData:master Nov 2, 2020
@bkamins bkamins deleted the stricter_joins branch November 2, 2020 07:19
@bkamins
Member Author

bkamins commented Nov 2, 2020

Thank you!

Labels
breaking The proposed change is breaking.
Development

Successfully merging this pull request may close these issues.

Explicitly handling missingness in join columns
4 participants