implementation proposal of column renaming #2313

bkamins · 2020-07-06T17:57:24Z

I have implemented the code for the logic. It would be good to validate it. Thank you to anyone willing to do this.

Now the key question is about the API: what keyword argument names should we provide, and what forms of passed values should they accept? (Note that we need to handle renaming for more than 2 data frames, which is why internally it will be a vector, but maybe other special syntaxes should be allowed in the simpler forms.)

As usual CC: @nalimilan + @oxinabox + @pdeffebach 😄.

oxinabox · 2020-07-06T18:02:22Z

What is the advantage of accepting a string rather than a helper function such as prefixwith(x) = y -> Symbol(x, y)?

If we are accepting strings in this way, shall we also accept Symbols?

oxinabox · 2020-07-06T18:05:43Z

Are leftjoin and rightjoin still pending?

bkamins · 2020-07-06T20:34:00Z

@oxinabox - these are exactly the things I wanted to discuss 😄.

> What is the advantage of accepting a string rather than a helper function such as prefixwith(x) = y -> Symbol(x, y)?

It is just shorter to write "_left" than suffixwith("_left"). Also, with helper functions we would have to export prefixwith and suffixwith (but maybe this is not that bad?).
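To make the trade-off concrete, here is a minimal sketch of what such helpers could look like. The names prefixwith and suffixwith come from this thread; they are hypothetical and are not claimed to be part of the DataFrames.jl API. The sketch relies only on Base, where Symbol(args...) concatenates the printed forms of its arguments.

```julia
# Hypothetical helpers discussed in this thread; not an existing DataFrames.jl API.
# Each returns a closure that maps an original column name to a new Symbol.
prefixwith(prefix) = name -> Symbol(prefix, name)
suffixwith(suffix) = name -> Symbol(name, suffix)

add_left = suffixwith("_left")
add_left(:id)                 # :id_left
prefixwith(:right_)("id")     # :right_id
```

With helpers like these the caller writes suffixwith("_left") where a plain "_left" would otherwise do, which is the extra verbosity mentioned above.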

> If we are accepting strings in this way, shall we also accept Symbols?

We could, but do you think anyone would want to specify the suffix as a Symbol?

> Are leftjoin and rightjoin still pending?

They can be easily added, but they are not fully general (we need to handle joining more than two data frames anyway).
Maybe for two-argument joins we should have leftjoin and rightjoin and for more than two argument joins some other kwarg.

I felt that having one kwarg was cleanest to start with, but this is exactly the point where I was not sure what would be best.

oxinabox · 2020-07-06T21:00:20Z

> > If we are accepting strings in this way, shall we also accept Symbols?
>
> We could, but do you think anyone would want to specify the suffix as a Symbol?

I think so, because people are very used to working with Symbols when thinking about DataFrames.
And basically everywhere we accept strings we also accept symbols.
So consistency with that is good.

nalimilan · 2020-07-07T10:00:10Z

Sounds good. I agree it's nice to allow passing a string as a suffix for simplicity (like other implementations do). Then we can discuss providing renaming functions like suffix_with separately.

I would also allow passing a pair when there are only two inputs (like on), and support symbols for consistency as @oxinabox noted.
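One way to picture the input forms proposed here (a string, a Symbol, or a function per data frame, wrapped in a pair for two inputs) is a small normalization step. The following is only a sketch under that reading; as_renamer and as_renamers are hypothetical names invented for illustration and do not describe how the PR implements it.

```julia
# Hypothetical normalization of a renaming spec; an illustration only.
# A function is used as-is; a string or Symbol is treated as a suffix.
as_renamer(f::Function) = f
as_renamer(suffix::Union{AbstractString, Symbol}) = name -> Symbol(name, suffix)

# For two inputs, a Pair carries one spec per data frame (left => right).
as_renamers(spec::Pair) = (as_renamer(first(spec)), as_renamer(last(spec)))

left_rename, right_rename = as_renamers("_left" => :_right)
left_rename(:x)     # :x_left
right_rename(:x)    # :x_right
```

Strings and Symbols share a single method here, so supporting both costs nothing extra, which fits the consistency argument made above.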

bkamins · 2020-07-30T22:51:16Z

I have fully updated the PR. For now I decided not to support passing more than 2 data frames (we can do it later). For 2 data frames we can use only the Pair interface, as with on, and decide later how to handle more than 2 data frames. I think that case is rare anyway, and we also need to decide how to extend on for it, as currently it is not fully defined.

This PR is relatively important because it turned out that I needed to rewrite the indicator kwarg logic to make column renaming possible. In the end what this PR proposes is faster and cleaner. This in turn forced me to understand how the different joins order rows, which is far from obvious. I have added descriptions of the rules applied to the docstrings (I think these rules are relevant, as they are contracts that we should check against when redesigning joins to make them faster; I have added tests that check that the ordering follows the rules in the most important cases). Note in particular that rightjoin uses a far from obvious ordering, IMO.

A review would be welcome (if you find my English unpolished, which is likely, please just suggest corrections directly). Thank you!

nalimilan

Documenting the row order is nice, but you noted somewhere that performance improvements could require changing it. Do you think that's likely? Should we mention that it might change in the future?

src/abstractdataframe/join.jl

test/join.jl

bkamins · 2020-08-04T12:23:44Z

> but you noted somewhere that performance improvements could require changing it. Do you think that's likely? Should we mention that it might change in the future?

That note was written after this PR was last updated. I would say it is almost certain that we will change the order (especially since, e.g., what rightjoin does now makes no sense from the user's perspective). I will change the docstrings everywhere to state that row order is undefined.

bkamins · 2020-08-04T12:33:30Z

For now I defensively say that the row order is undefined everywhere, but it is likely that for some of the joins we will be able to guarantee something in the future. In particular note that the key things that might influence row order in the future are:

- threading (if we split one of the data frames into chunks per thread, then the row order might be affected unless we restore it in post-processing)
- imbalanced row counts of the joined tables (this might call for different join algorithms that might produce different row orders)
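The threading point can be illustrated with a small sketch that uses plain vectors of keys instead of data frames. It is not the join code from this PR: match_chunk and chunked_join_rows are hypothetical names, and collecting the chunks in reverse is just a deterministic stand-in for threads finishing out of order.

```julia
# Illustration only; hypothetical helpers, not the DataFrames.jl join code.
# Collect matching (left_row, right_row) pairs for one chunk of left rows.
function match_chunk(left_keys, right_index, rows)
    pairs = Tuple{Int, Int}[]
    for i in rows, j in get(right_index, left_keys[i], Int[])
        push!(pairs, (i, j))
    end
    return pairs
end

function chunked_join_rows(left_keys, right_keys; nchunks=2)
    # Map each right key to the rows where it occurs.
    right_index = Dict{eltype(right_keys), Vector{Int}}()
    for (j, k) in enumerate(right_keys)
        push!(get!(right_index, k, Int[]), j)
    end
    n = length(left_keys)
    chunksize = max(1, cld(n, nchunks))
    chunks = [i:min(i + chunksize - 1, n) for i in 1:chunksize:n]
    tasks = [Threads.@spawn match_chunk(left_keys, right_index, c) for c in chunks]
    # Stand-in for out-of-order completion: gather the chunks in reverse.
    unordered = reduce(vcat, [fetch(t) for t in reverse(tasks)];
                       init=Tuple{Int, Int}[])
    # Post-processing: sorting by (left_row, right_row) restores a fixed order.
    return unordered, sort(unordered)
end

unordered, restored = chunked_join_rows([1, 2, 3, 2], [2, 3, 2])
```

The final sort is the post-processing cost of guaranteeing an order, which is one reason to leave row order undefined for now.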

nalimilan · 2020-08-06T10:15:01Z

src/abstractdataframe/join.jl

        joined, left_indicator, right_indicator = compose_joined_table(joiner, kind,
-            update_row_maps!(joiner.dfl_on, joiner.dfr_on, group_rows(joiner.dfr_on),
-                             true, false, true, false)...,
-            makeunique, left_rename, right_rename, nothing)
+            inner_row_maps..., makeunique, left_rename, right_rename, nothing)


And then how about doing this?

joined, left_indicator, right_indicator = compose_joined_table(joiner, kind, inner_row_maps..., makeunique, left_rename, right_rename, nothing)

it is better indeed. fixed

bkamins · 2020-08-07T17:37:14Z

resolved conflicts with "drop deprecations" PR.

bkamins · 2020-08-09T14:00:20Z

Thank you!

implementation proposal of column renaming

186affc

bkamins added the feature and non-breaking labels Jul 6, 2020

bkamins added this to the 1.0 milestone Jul 6, 2020

bkamins added 3 commits July 30, 2020 16:44

finalize design of rename

09603f4

improve indicator

e8c3a0c

improve tests and documentation

598efbf

nalimilan reviewed Aug 4, 2020

bkamins and others added 2 commits August 4, 2020 14:26

Apply suggestions from code review

cb708cf

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

update docstrings

f2a3234

bkamins added 2 commits August 4, 2020 14:50

clean up code and add more tests

48dd8cb

one more docstring fix

ffb375d

nalimilan reviewed Aug 6, 2020

bkamins and others added 2 commits August 6, 2020 13:29

Apply suggestions from code review

83c10f5

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

changes after code review

76e6944

nalimilan reviewed Aug 6, 2020

src/abstractdataframe/join.jl (outdated)

fix .;

4f8690e

nalimilan approved these changes Aug 6, 2020

Merge branch 'master' into col_rename_join

61b2d74

bkamins merged commit abf5111 into JuliaData:master Aug 9, 2020

bkamins deleted the col_rename_join branch August 9, 2020 14:00

nalimilan mentioned this pull request Aug 30, 2020

add a kwarg to reuse the input column names in cols => fun in select/transform/combine #2396

Closed

JuliaRegistrator mentioned this pull request Nov 15, 2020

New version: DataFrames v0.22.0 JuliaRegistries/General#24650

Merged
