Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implementation proposal of column renaming #2313

Merged
merged 12 commits into from
Aug 9, 2020

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Jul 6, 2020

Fixes #1333.

I have implemented the code of the logic. It would be good to validate it. Thank you for anyone agreeing to do this.

Now the key question is about the API. What keyword argument names we should provide and what forms of passed values should they accept (just note that we need to handle renaming of more than 2 data frames - that is why internally it will be a vector, but maybe other special syntaxes should be allowed in the simpler forms).

As usual CC: @nalimilan + @oxinabox + @pdeffebach 😄.

@bkamins bkamins added feature non-breaking The proposed change is not breaking labels Jul 6, 2020
@bkamins bkamins added this to the 1.0 milestone Jul 6, 2020
@oxinabox
Copy link
Contributor

oxinabox commented Jul 6, 2020

What is the advantage of accepting a string rather a helper function for prefixwith(x)= y-> Symbol(x, y) ?

If we are accepting strings in this way shall we also accept Symbol s?

@oxinabox
Copy link
Contributor

oxinabox commented Jul 6, 2020

Are leftjoin and rightjoin still pending?

@bkamins
Copy link
Member Author

bkamins commented Jul 6, 2020

@oxinabox - these are exactly the things I wanted to discuss 😄.

What is the advantage of accepting a string rather a helper function for prefixwith(x)= y-> Symbol(x, y) ?

It is just shorter to write "_left" han suffixwith("_left"). Also in this case we have to export prefixwith and suffixwith (but maybe this is not that bad?).

If we are accepting strings in this way shall we also accept Symbol s?

We could, but do you think anyone would want to specify the suffix as a Symbol?

Are leftjoin and rightjoin still pending?

They can be easily added, but they are not fully general (we need to handle joining more than two data frames anyway).
Maybe for two-argument joins we should have leftjoin and rightjoin and for more than two argument joins some other kwarg.

I felt that having one kwarg was cleanest to start with, but this is exactly the point I was not sure what would be best.

@oxinabox
Copy link
Contributor

oxinabox commented Jul 6, 2020

If we are accepting strings in this way shall we also accept Symbol s?
We could, but do you think anyone would want to specify the suffix as a Symbol?

I think so, because people have a lot of habit of working with Symbols when thinking about DataFrames.
And basically everywhere we accept strings we accept symbols.
So consistancy with that is good.

@nalimilan
Copy link
Member

Sounds good. I agree it's nice to allow passing a string to suffix for simplicity (like other implementations to). Then we can discuss providing renaming functions like suffix_with separately.

I would also allow passing a pair when there are only two inputs (like on), and support symbols for consistency as @oxinabox noted.

@bkamins
Copy link
Member Author

bkamins commented Jul 30, 2020

I have fully updated the PR. For now I decided not to support the cases of passing more than 2 data frames (we can do it later), as for now for 2 data frames we can use Pair interface only like with on, and later decide how to handle more than 2 data frames case (which is rare anyway I think and we need to decide how we extend on for this case also as currently it is not fully defined).

This PR is relatively important because it turned out that I need to rewrite the indicator kwarg logic to make it possible to provide column renaming functionality. In the end what this PR proposes is faster and cleaner. This in turn forced me to understand how different joins order rows, which is by far non obvious. I have added the descriptions of the rules applied in the docstrings (and I think these rules are relevant as they are contracts that we should check against when redesigning joins to make them faster - I have added tests that check if ordering follows the rules in the most important cases). Note that in particular rightjoin uses a by far non-obvious ordering IMO.

A review would be welcome (if you find my English unpolished - which is likely - just please suggest corrections directly). Thank you!

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documenting the row order is nice, but you noted somewhere that performance improvements could require changing it. Do you think that's likely? Should we mention that it might change in the future?

src/abstractdataframe/join.jl Outdated Show resolved Hide resolved
src/abstractdataframe/join.jl Outdated Show resolved Hide resolved
src/abstractdataframe/join.jl Outdated Show resolved Hide resolved
src/abstractdataframe/join.jl Outdated Show resolved Hide resolved
src/abstractdataframe/join.jl Show resolved Hide resolved
src/abstractdataframe/join.jl Show resolved Hide resolved
src/abstractdataframe/join.jl Outdated Show resolved Hide resolved
test/join.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Aug 4, 2020

but you noted somewhere that performance improvements could require changing it. Do you think that's likely? Should we mention that it might change in the future?

This note was after this PR was last updated. I would say that it is almost sure we will change the order (especially that e.g. what rightjoin now does is not make sense from the user perspective). I will change everywhere the docstring to stating that row order is undefined.

bkamins and others added 2 commits August 4, 2020 14:26
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Aug 4, 2020

For now I defensively say that the row order is undefined everywhere, but it is likely that for some of the joins we will be able to guarantee something in the future. In particular note that the key things that might influence row order in the future are:

  • threading (if we split one of the data frames into chunks per thread then the row order might be affected unless we restore is in post-processing)
  • imbalanced row count of joined tables (this might call for using different join algoirthms that might produce different row orders)

Comment on lines 366 to 367
joined, left_indicator, right_indicator = compose_joined_table(joiner, kind,
update_row_maps!(joiner.dfl_on, joiner.dfr_on, group_rows(joiner.dfr_on),
true, false, true, false)...,
makeunique, left_rename, right_rename, nothing)
inner_row_maps..., makeunique, left_rename, right_rename, nothing)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then how about doing this?

joined, left_indicator, right_indicator =
    compose_joined_table(joiner, kind, inner_row_maps...,
                         makeunique, left_rename, right_rename, nothing)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is better indeed. fixed

src/abstractdataframe/join.jl Outdated Show resolved Hide resolved
src/abstractdataframe/join.jl Outdated Show resolved Hide resolved
bkamins and others added 2 commits August 6, 2020 13:29
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Aug 7, 2020

resolved conflicts with "drop deprecations" PR.

@bkamins bkamins merged commit abf5111 into JuliaData:master Aug 9, 2020
@bkamins bkamins deleted the col_rename_join branch August 9, 2020 14:00
@bkamins
Copy link
Member Author

bkamins commented Aug 9, 2020

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature non-breaking The proposed change is not breaking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handling of duplicate column names by join
3 participants