BUG: New feature allowing merging on combination of columns and index levels drops levels of index #20452

Dr-Irv · 2018-03-22T15:22:21Z

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.23.0.dev0+657.g01882ba5b'

In [3]: df1 =  pd.DataFrame({'v1' : range(12)}, index=pd.MultiIndex.from_product([list('abc'),list('xy'),[1,2]], names=['abc','xy','num']))
   ...: df1
   ...:
Out[3]:
            v1
abc xy num
a   x  1     0
       2     1
    y  1     2
       2     3
b   x  1     4
       2     5
    y  1     6
       2     7
c   x  1     8
       2     9
    y  1    10
       2    11

In [4]: df2 = pd.DataFrame({'v2': [100*i for i in range(1,7)]}, index=pd.MultiIndex.from_product([list('abc'), list('xy')],names=['abc','xy']))

In [5]: df2
Out[5]:
         v2
abc xy
a   x   100
    y   200
b   x   300
    y   400
c   x   500
    y   600

In [6]: df1.merge(df2, on=['abc','xy'])  # 'num' disappears
Out[6]:
        v1   v2
abc xy
a   x    0  100
    x    1  100
    y    2  200
    y    3  200
b   x    4  300
    x    5  300
    y    6  400
    y    7  400
c   x    8  500
    x    9  500
    y   10  600
    y   11  600

In [7]: df1.reset_index().merge(df2, on=['abc','xy']) # This preserves 'num'
Out[7]:
   abc xy  num  v1   v2
0    a  x    1   0  100
1    a  x    2   1  100
2    a  y    1   2  200
3    a  y    2   3  200
4    b  x    1   4  300
5    b  x    2   5  300
6    b  y    1   6  400
7    b  y    2   7  400
8    c  x    1   8  500
9    c  x    2   9  500
10   c  y    1  10  600
11   c  y    2  11  600

In [8]: df1.merge(df2, on='xy')  # 'abc' and 'num' disappear
Out[8]:
    v1   v2
xy
x    0  100
x    0  300
x    0  500
x    1  100
x    1  300
x    1  500
x    4  100
x    4  300
x    4  500
x    5  100
x    5  300
x    5  500
x    8  100
x    8  300
x    8  500
x    9  100
x    9  300
x    9  500
y    2  200
y    2  400
y    2  600
y    3  200
y    3  400
y    3  600
y    6  200
y    6  400
y    6  600
y    7  200
y    7  400
y    7  600
y   10  200
y   10  400
y   10  600
y   11  200
y   11  400
y   11  600

Problem description

It seems that the new feature implemented in #17484 that allows merging on a combination of columns and index levels can drop index levels, which is really non-intuitive. In the first example, the index level named "num" gets dropped, while in the last example, both "abc" and "xy" are dropped.

If this is the desired behavior, then it needs to be carefully documented.

N.B. There is also an error in the docs of merging.rst that says this feature was introduced in v.0.22, but it will be introduced in v0.23

I'm guessing @jmmease will need to look at this.

Expected Output

In [6]: df1.merge(df2, on=['abc','xy'])
Out[6]:
            v1   v2
abc xy num
a   x  1     0  100
       2     1  100
    y  1     2  200
       2     3  200
b   x  1     4  300
       2     5  300
    y  1     6  400
       2     7  400
c   x  1     8  500
       2     9  500
    y  1    10  600
       2    11  600

In [8]: df1.merge(df2, on='xy')
Out[8]:
   abc_x  num  v1 abc_y   v2
xy
x      a    1   0     a  100
x      a    1   0     b  300
x      a    1   0     c  500
x      a    2   1     a  100
x      a    2   1     b  300
x      a    2   1     c  500
x      b    1   4     a  100
x      b    1   4     b  300
x      b    1   4     c  500
x      b    2   5     a  100
x      b    2   5     b  300
x      b    2   5     c  500
x      c    1   8     a  100
x      c    1   8     b  300
x      c    1   8     c  500
x      c    2   9     a  100
x      c    2   9     b  300
x      c    2   9     c  500
y      a    1   2     a  200
y      a    1   2     b  400
y      a    1   2     c  600
y      a    2   3     a  200
y      a    2   3     b  400
y      a    2   3     c  600
y      b    1   6     a  200
y      b    1   6     b  400
y      b    1   6     c  600
y      b    2   7     a  200
y      b    2   7     b  400
y      b    2   7     c  600
y      c    1  10     a  200
y      c    1  10     b  400
y      c    1  10     c  600
y      c    2  11     a  200
y      c    2  11     b  400
y      c    2  11     c  600

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0.dev0+657.g01882ba5b
pytest: 3.4.0
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.25.1
numpy: 1.14.1
scipy: 1.0.0
pyarrow: 0.8.0
xarray: None
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: 1.5.1
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.0
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: 0.1.3
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jonmmease · 2018-03-22T23:17:58Z

This is the expected behavior.

Rationale

When index levels are not included as on params to merge, the merge operation drops all index levels (this is true of 0.23 and earlier). For example, if we reset abc into a column in both df1 and df2, and then merge on the common abc column, the xy and num index levels are dropped in the result.

In [1]: df1.reset_index('abc').merge(df2.reset_index('abc'), on='abc')
Out[1]: 
   abc  v1   v2
0    a   0  100
1    a   0  200
2    a   1  100
3    a   1  200
4    a   2  100
5    a   2  200
6    a   3  100
7    a   3  200
8    b   4  300
9    b   4  400
10   b   5  300
11   b   5  400
12   b   6  300
13   b   6  400
14   b   7  300
15   b   7  400
16   c   8  500
17   c   8  600
18   c   9  500
19   c   9  600
20   c  10  500
21   c  10  600
22   c  11  500
23   c  11  600

Now (in 0.23), when index levels are included as on params to merge, the merge operation preserves these index levels (and only these levels) in the output. For example:

In[2]: df1.merge(df2, on='abc')
Out[2]: 
     v1   v2
abc         
a     0  100
a     0  200
a     1  100
a     1  200
a     2  100
a     2  200
a     3  100
a     3  200
b     4  300
b     4  400
b     5  300
b     5  400
b     6  300
b     6  400
b     7  300
b     7  400
c     8  500
c     8  600
c     9  500
c     9  600
c    10  500
c    10  600
c    11  500
c    11  600

Preserving all index levels (even those not referenced as on params) would be a reasonable thing to do, but doing so consistently would require a change to the legacy behavior of merge.

Documentation

In terms of documentation, there is a note in the sphinx docs for Merging on a combination of columns and index levels that says:

Note: When DataFrames are merged on a string that matches an index level in both frames, the index level is preserved as an index level in the resulting DataFrame.

I think this is accurate, but it could be more explicit regarding the fate of the remaining index levels.

Thanks for pointing out the errant New in version version in the documentation! I guess this one was missed when 0.22 turned into 0.23.

Care to contribute some changes to the documentation that would have helped clarify the expected behavior for you?

cc @TomAugspurger @jorisvandenbossche @jreback

jreback · 2018-03-23T10:25:49Z

I agree here with @jmmease here. merge is an SQL style merge that happens to allow level merging. What to do with a non-merged level is to drop it. I suppose you could add it as an additional level in the result, but this would not be normally very useful / intuitive. We might consider warning if levels are dropped though.

Dr-Irv · 2018-03-23T14:32:42Z

@jreback I think this is about whether levels of the row axis MultiIndex are thought of as columns or not. To me, I think of a MultiIndex as just another set of columns that is used as a primary key. Using reset_index or set_index, I can make levels into columns and vice-versa. And in SQL, if I have a set of columns, and do a merge, all of the columns are preserved. My view relates to the ideas presented in #8162.

So would we consider an option to merge that preserves the index columns not directly involved in the join?

What @jmmease is saying is that when you do a merge with a MultiIndex, levels not included in the merge keys are dropped. I'll attempt to make this clear in the docs.

Dr-Irv · 2018-03-23T17:25:04Z

So with 0.23 development, here is an interesting behavior for join.

In [7]: df1.join(df2, on=['abc','xy'], how='inner')
Out[7]:
         v1   v2
    num
a x 1     0  100
    2     1  100
  y 1     2  200
    2     3  200
b x 1     4  300
    2     5  300
  y 1     6  400
    2     7  400
c x 1     8  500
    2     9  500
  y 1    10  600
    2    11  600

In [8]: df1.join(df2, on='xy', how='inner')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
#### Traceback omitted

ValueError: len(left_on) must equal the number of levels in the index of "right"

In [9]: df2.join(df1, on=['abc','xy'], how='inner')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
### Traceback omitted

ValueError: len(left_on) must equal the number of levels in the index of "right"

So the first join (which does not work in v0.22) gives the "correct" result, except that the names of the index have been lost. Isn't that a bug?

The second join and third join both fail. The third one is just a reversal of which DataFrame is on the left versus right.

So explaining all of this behavior in the docs is a bit of a challenge.

Dr-Irv · 2018-03-23T21:56:49Z

I have a fix for the bug I reported in the previous comment about join dropping the names

jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels Mar 23, 2018

Dr-Irv mentioned this issue Mar 23, 2018

Additional DOC and BUG fix related to merging with mix of columns and… #20475

Merged

4 tasks

jreback added this to the 0.24.0 milestone Dec 4, 2018

jreback closed this as completed in #20475 Dec 4, 2018

mroeschke mentioned this issue Jan 13, 2019

Join on multi-index drops the index names when specifying "on" and "how=outer" #22820

Closed

Dr-Irv mentioned this issue Aug 1, 2019

API: Meta-issue for making consistent API's to refer to column names and index names #27652

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: New feature allowing merging on combination of columns and index levels drops levels of index #20452

BUG: New feature allowing merging on combination of columns and index levels drops levels of index #20452

Dr-Irv commented Mar 22, 2018 •

edited

Loading

INSTALLED VERSIONS

jonmmease commented Mar 22, 2018

jreback commented Mar 23, 2018

Dr-Irv commented Mar 23, 2018 •

edited

Loading

Dr-Irv commented Mar 23, 2018

Dr-Irv commented Mar 23, 2018

BUG: New feature allowing merging on combination of columns and index levels drops levels of index #20452

BUG: New feature allowing merging on combination of columns and index levels drops levels of index #20452

Comments

Dr-Irv commented Mar 22, 2018 • edited Loading

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

jonmmease commented Mar 22, 2018

Rationale

Documentation

jreback commented Mar 23, 2018

Dr-Irv commented Mar 23, 2018 • edited Loading

Dr-Irv commented Mar 23, 2018

Dr-Irv commented Mar 23, 2018

Dr-Irv commented Mar 22, 2018 •

edited

Loading

Output of `pd.show_versions()`

Dr-Irv commented Mar 23, 2018 •

edited

Loading