Handling of duplicate column names by join #1333
As a side note, base R and pandas may produce duplicate column names by adding a suffix, whereas dplyr is careful enough to avoid them.
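A quick pandas sketch of how suffixing can itself collide with a pre-existing column. The outcome is version-dependent: older pandas silently produced duplicate labels, while newer versions refuse such merges, so the example prints whichever happens rather than asserting one:

```python
import pandas as pd

# Both frames share the non-key column "val"; the left frame also
# already contains "val_x", the very name the default suffix creates.
left = pd.DataFrame({"id": [1], "val": [10], "val_x": [99]})
right = pd.DataFrame({"id": [1], "val": [20]})

try:
    merged = left.merge(right, on="id")  # default suffixes ("_x", "_y")
    # Older pandas: two columns are now both named "val_x".
    print("duplicate labels:", merged.columns.duplicated().any())
except Exception as err:
    # Newer pandas refuses suffixes that would create duplicates.
    print("merge rejected:", type(err).__name__)
```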
How annoying would it be to simply throw an error for duplicate column names, forcing the user to handle them manually? We could also prefix duplicate column names with the name of the table they came from, e.g. joining …
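A minimal sketch of the error-throwing option, written in pandas terms for illustration; the helper name strict_merge is hypothetical and not an API of DataFrames or pandas:

```python
import pandas as pd

def strict_merge(left, right, on):
    # Refuse to join when non-key column names clash, forcing the
    # caller to rename columns explicitly beforehand.
    clash = (set(left.columns) & set(right.columns)) - {on}
    if clash:
        raise ValueError(f"duplicate column names in join: {sorted(clash)}")
    return left.merge(right, on=on)

a = pd.DataFrame({"id": [1], "val": [1]})
b = pd.DataFrame({"id": [1], "val": [2]})
# strict_merge(a, b, "id")  # raises ValueError because "val" clashes
```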
Now we have … Here is a snippet that could be considered to fix the names by suffixing if …
CC @oxinabox - this is directly relevant to what you have asked. Do you have a preference for the API?
Note that in some joins we allow more than 2 data frames.
When working on this, the following case should be tested against, as I think it is potentially confusing: …

Also, a general question: if we rename columns with suffixes, should …
I think we should go for …

Any comments on this?
That sounds right to me.

Having to give a vector of one suffix per joined table seems like it is going to get annoying in the cases where you are just joining 2 things.

Suffix only: the choice of suffix only seems off to me. Further, depending on what your table names are, a prefix or a suffix may be more natural. I suggest instead we should accept a (vector of) functions, like … This also means we can use …
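The accept-a-function idea could look roughly like the sketch below, written in pandas terms for illustration; join_with_renamers and its arguments are hypothetical names, not an existing API. Each side supplies a renaming function applied to clashing columns before the join, so callers can choose prefixes, suffixes, or anything else:

```python
import pandas as pd

def join_with_renamers(left, right, on, lrename, rrename):
    # Find non-key columns present in both tables, rename them on
    # each side with the caller-supplied function, then join.
    clash = (set(left.columns) & set(right.columns)) - {on}
    left = left.rename(columns={c: lrename(c) for c in clash})
    right = right.rename(columns={c: rrename(c) for c in clash})
    return left.merge(right, on=on)

a = pd.DataFrame({"k": [1], "v": [1]})
b = pd.DataFrame({"k": [1], "v": [2]})
out = join_with_renamers(a, b, "k", lambda c: "a_" + c, lambda c: c + "_b")
print(list(out.columns))  # ['k', 'a_v', 'v_b']
```

A prefix on one side and a suffix on the other, as above, is exactly the flexibility a single suffix argument cannot express.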
I don't know what a technical column is.
Technical columns: currently it is only the column specified by … Regarding …
Yes, I think symmetry with …
Following the discussion in #1308 (comment), here is the status of what other packages do: pandas join has lsuffix and rsuffix arguments, pandas merge has a suffixes argument, and dplyr has suffix.

I guess we should add something similar. Any comments?
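For reference, the two pandas variants mentioned above look like this:

```python
import pandas as pd

a = pd.DataFrame({"k": [1, 2], "v": [1.0, 2.0]})
b = pd.DataFrame({"k": [1, 2], "v": [3.0, 4.0]})

# merge takes one suffixes pair covering left and right.
m = a.merge(b, on="k", suffixes=("_a", "_b"))
print(list(m.columns))  # ['k', 'v_a', 'v_b']

# join (index-based) takes separate lsuffix/rsuffix arguments,
# applied to every overlapping column name.
j = a.join(b, lsuffix="_a", rsuffix="_b")
print(list(j.columns))  # ['k_a', 'v_a', 'k_b', 'v_b']
```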