sort_index behavior differs for the same DataFrame? #9212

tlmaloney · 2015-01-08T09:28:20Z

I feel like I've encountered a bug. In the following scenario, the first sort_index call behaves as expected, but the second does not. Does someone know what the difference is here?

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.15.2'

In [3]: tuples = [(' foo', 'bar'), ('foo', 'bar'), (' foo ()', 'bar')]

In [4]: cols = pd.MultiIndex.from_tuples(tuples)

In [5]: df = pd.DataFrame(index=cols, data={'baz': [0, 1, 2]})

In [6]: df
Out[6]: 
             baz
 foo    bar    0
foo     bar    1
 foo () bar    2

In [7]: df.sort_index()
Out[7]: 
             baz
 foo    bar    0
 foo () bar    2
foo     bar    1

In [8]: tuples = [(' foo', 'bar'), ('foo', 'bar')]

In [9]: cols = pd.MultiIndex.from_tuples(tuples)

In [10]: df = pd.DataFrame(index=cols, data={'baz': [0, 1]})

In [11]: df
Out[11]: 
          baz
 foo bar    0
foo  bar    1

In [12]: df.ix[(' foo ()', 'bar'), 'baz'] = 2

In [13]: df
Out[13]: 
             baz
 foo    bar    0
foo     bar    1
 foo () bar    2

In [14]: df.sort_index()
Out[14]: 
             baz
 foo    bar    0
foo     bar    1
 foo () bar    2

The text was updated successfully, but these errors were encountered:

tlmaloney · 2015-01-08T09:37:48Z

FWIW:

~
tmaloney@aal-lpc-2-03$  locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

jreback · 2015-01-08T12:02:29Z

under the hood this uses something like this: http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html

this is kind=quicksort by default, which does not preserve the original order for ties, while mergesort does. It is not possible to pass thru the option for kind in sort_index, you can specify it for sort though.

So i'll mark this an enhancement if you'd like to put forth a PR.

tlmaloney · 2015-01-08T13:36:22Z

I'll give a crack at it, thanks.

tlmaloney · 2015-01-08T14:40:31Z

I do see that kind is a variable that can be passed into sort_index, see docs here. Although it says that it does not apply to a MultiIndex.

I'm not sure kind is the reason for the issue. I was able to get the behavior I want, although through a roundabout way, still using quicksort:

In [1]: import pandas as pd

In [2]: tuples = [(' foo', 'bar'), ('foo', 'bar')]

In [3]: cols = pd.MultiIndex.from_tuples(tuples)

In [4]: df = pd.DataFrame(index=cols, data={'baz': [0, 1]})

In [5]: df.ix[(' foo ()', 'bar'), 'baz'] = 2

In [6]: df
Out[6]: 
             baz
 foo    bar    0
foo     bar    1
 foo () bar    2

In [7]: df.reset_index()
Out[7]: 
   level_0 level_1  baz
0      foo     bar    0
1      foo     bar    1
2   foo ()     bar    2

In [8]: df.reset_index().sort(columns=['level_0', 'level_1'])
Out[8]: 
   level_0 level_1  baz
0      foo     bar    0
2   foo ()     bar    2
1      foo     bar    1

In [9]: df.reset_index().sort(columns=['level_0', 'level_1']).set_index(['level_0', 'level_1'])
Out[9]: 
                 baz
level_0 level_1     
 foo    bar        0
 foo () bar        2
foo     bar        1

tlmaloney · 2015-01-09T01:17:07Z

This is also odd behavior. Adding a row doesn't simply append it to the bottom, it also switches the labels in the original two rows of the index.

In [1]: import pandas as pd

In [2]: tuples = [('foo', 'bar'), (' foo', 'bar')]

In [3]: cols = pd.MultiIndex.from_tuples(tuples)

In [4]: df = pd.DataFrame(index=cols, data={'baz': [0, 1]})

In [5]: df
Out[5]: 
          baz
foo  bar    0
 foo bar    1

In [6]: df.index
Out[6]: 
MultiIndex(levels=[[u' foo', u'foo'], [u'bar']],
           labels=[[1, 0], [0, 0]])

In [7]: df.ix[(' foo ()', 'bar'), 'baz'] = 2

In [8]: df
Out[8]: 
             baz
 foo    bar    1
foo     bar    0
 foo () bar    2

In [9]: df.index
Out[9]: 
MultiIndex(levels=[[u' foo', u'foo', u' foo ()'], [u'bar']],
           labels=[[0, 1, 2], [0, 0, 0]])

behzadnouri · 2015-01-10T15:43:11Z

This is a bug, and the problem is when the frame is modified, df.index.lexsort_depth becomes invalid, and that breaks .sort_index:

>>> pd.__version__
'0.15.2-68-ge2b014c'
>>> df = pd.DataFrame([['b', 'c', 1]],
...                   columns=['1st', '2nd', '3rd']).set_index(['1st', '2nd'])
>>> df
         3rd
1st 2nd     
b   c      1
>>> df.ix[('a', 'b'), '3rd'] = 0
>>> df
         3rd
1st 2nd     
b   c      1
a   b      0
>>> df.index.lexsort_depth  #  <<< this is the BUG
2
>>> df.sort_index()
         3rd
1st 2nd     
b   c      1
a   b      0

jreback · 2015-01-10T16:24:36Z

hmm, looks like sortorder needs to be invalidated. (however, then we de-facto have a new index at that point, so may need to copy the index in this case).

tlmaloney · 2015-01-11T23:39:46Z

It looks like this has been an issue at least going back to v0.13.1:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.13.1'

In [3]: df = pd.DataFrame([['b', 'c', 1]], columns=['1st', '2nd', '3rd']).set_index(['1st', '2nd'])

In [4]: df.ix[('a', 'b'), '3rd'] = 0

In [5]: df.index.is_lexsorted()
Out[5]: True

In [6]: df
Out[6]: 
         3rd
1st 2nd     
b   c      1
a   b      0

In v0.12.0, setting values was not working with .ix[...] =, but using set_value to set the new row, sort_index does work as expected.

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.12.0'

In [3]: df = pd.DataFrame([['b', 'c', 1]], columns=['1st', '2nd', '3rd']).set_index(['1st', '2nd'])

In [4]: df.ix[('a', 'b'), '3rd'] = 0
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-4-b91d80d789ee> in <module>()
----> 1 df.ix[('a', 'b'), '3rd'] = 0

/home/tmaloney/vedev/pandas-test-02/lib/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/indexing.pyc in __setitem__(self, key, value)
     82                 raise IndexingError('only tuples of length <= %d supported',
     83                                     self.ndim)
---> 84             indexer = self._convert_tuple(key)
     85         else:
     86             indexer = self._convert_to_indexer(key)

/home/tmaloney/vedev/pandas-test-02/lib/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/indexing.pyc in _convert_tuple(self, key)
     94         keyidx = []
     95         for i, k in enumerate(key):
---> 96             idx = self._convert_to_indexer(k, axis=i)
     97             keyidx.append(idx)
     98         return tuple(keyidx)

/home/tmaloney/vedev/pandas-test-02/lib/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/indexing.pyc in _convert_to_indexer(self, obj, axis)
    608                 mask = check == -1
    609                 if mask.any():
--> 610                     raise KeyError('%s not in index' % objarr[mask])
    611 
    612                 return indexer

KeyError: "['a'] not in index"

In [5]: df.set_value(('a', 'b'), '3rd', 0)
Out[5]: 
     3rd
b c    1
a b    0

In [6]: df = df.set_value(('a', 'b'), '3rd', 0)

In [7]: df
Out[7]: 
     3rd
b c    1
a b    0

In [8]: df.sort_index()
Out[8]: 
     3rd
a b    0
b c    1

tlmaloney · 2015-01-12T01:01:25Z

Looks like GH4039 discussion is relevant. @jreback Should this issue be tagged as Bug instead of Enhancement?

jreback · 2015-01-12T01:06:11Z

The main issue I have with this is that you cannot simply resort the index after an append. It will be completely non-performant. So you can simply try to invalidate sortorder which will recompute the sort depth when it is called for.

tlmaloney · 2015-01-12T01:10:25Z

Agreed. How do you invalidate the sortorder? I don't know what that means.

jreback · 2015-01-12T01:18:31Z

sorry...misspoke, you need to invalid the cache on lexsort_depth
you can call mi._reset_cache()

tlmaloney · 2015-01-12T01:47:44Z

I'm not sure that works.

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.15.2'

In [3]: df = pd.DataFrame([['b', 'c', 1]], columns=['1st', '2nd', '3rd']).set_index(['1st', '2nd'])

In [4]: df.ix[('a', 'b'), '3rd'] = 0

In [5]: df.index.is_lexsorted()
Out[5]: True

In [6]: df.index._cache
Out[6]: 
{'_engine': <pandas.index.ObjectEngine at 0x351c3f8>,
 'is_unique': True,
 'lexsort_depth': 2}

In [7]: df.index._res
df.index._reset_cache     df.index._reset_identity  

In [7]: df.index._reset_cache()

In [8]: df.index.is_lexsorted()
Out[8]: True

In [9]: df
Out[9]: 
         3rd
1st 2nd     
b   c      1
a   b      0

In [10]: df.index._cache
Out[10]: {'lexsort_depth': 2}

In [11]: df.index._reset_cache()

In [12]: df.index._cache
Out[12]: {}

In [13]: df
Out[13]: 
         3rd
1st 2nd     
b   c      1
a   b      0

In [14]: df.sort_index()
Out[14]: 
         3rd
1st 2nd     
b   c      1
a   b      0

behzadnouri · 2015-01-16T17:37:53Z

so this does not depend on cache, but also occurs on a freshly generated index:

In [24]: mi = MultiIndex(levels=[['b', 'a'], ['b', 'a']],
   ....:                 labels=[[0, 1], [0, 1]])

In [25]: df = DataFrame([[0], [1]], index=mi)

In [26]: df
Out[26]:
     0
b b  0
a a  1

In [27]: df.sort_index()
Out[27]:
     0
b b  0
a a  1

lexsort_depth is only based on labels ( not levels ), so here the labels are in fact lexically sorted:

In [29]: df.index.labels
Out[29]: FrozenList([[0, 1], [0, 1]])

In [30]: df.index.lexsort_depth
Out[30]: 2

it is worth inspecting that if anywhere else in the code base breaks if labels are lexically sorted, but not the levels. (i.e. if they make the assumption that labels are always sorted)

sort_index also only looks at the labels; and since they are lexically sorted it stops there.

if other places in the code also make the assumption that the labels are sorted then this should be fixed in MultiIndex.__new__; otherwise sort_index should also check levels in addition to labels.

computationally, former path, is not very cheap, so it is worth confirming first that other places in the code depend on levels being sorted and break otherwise.

tlmaloney · 2015-01-18T16:46:12Z

@behzadnouri Thanks for your looking more into this.

@jreback

I have a couple of comments. (FWIW, I've been using pandas since 0.9.1 but haven't had a need to dig in until now since it has really just worked. I hope to one day make a contribution myself. My mental model may be out of date with the fast pace of development.)

This issue might have raised a couple of distinct issues, which may be worth breaking apart into separate issues.
I'm trying to get the correct mental model of the MultiIndex. From here on out I'm thinking in terms of a MultiIndex in the df.index sense and not the df.columns sense. Essentially a MultiIndex is made up of labels (the rows, in my mind). A MultiIndex has a number of levels (the columns, in my mind). So when I see something like this I get confused:

In [5]: mi = pd.MultiIndex(levels=[['a', 'b'], ['c', 'd'], ['e', 'f']], labels=[[0, 1], [0, 1], [0, 1]])

In [6]: mi
Out[6]: 
MultiIndex(levels=[[u'a', u'b'], [u'c', u'd'], [u'e', u'f']],
           labels=[[0, 1], [0, 1], [0, 1]])

In [7]: print mi

a  c  e
b  d  f

Because I think of ('a', 'c', 'e') and ('b', 'd', 'f') as the labels. I understand the above __repr__ (line 6) is an internal representation. Am I wrong to think there's two labels, and not three? Perhaps this is a naming bug on MultiIndex.labels.

jreback · 2015-01-18T20:23:13Z

@tlmaloney If you'd like to create a separate issue for a distince issue/bug, pls do so, keeping in mind that they should have reproducible examples. can always xref back to here if needed. Generally having 1 issue per 'thing' is a good idea.

Using another example with differnt label lenghts to clarify your mental model

In [16]: mi = pd.MultiIndex(levels=[['a', 'b'], ['c', 'd'], ['e', 'f']], labels=[[0, 1], [0, 1], [0, 1]])

In [18]: mi.values
Out[18]: array([('a', 'c', 'e'), ('b', 'd', 'f')], dtype=object)

In [20]: mi2 = pd.MultiIndex(levels=[['a', 'b'], ['c', 'd'], ['e', 'f']], labels=[[0, 1, 1], [0, 1, 1], [0, 0, 1]])

In [22]: mi2.values
Out[22]: array([('a', 'c', 'e'), ('b', 'd', 'e'), ('b', 'd', 'f')], dtype=object)

So you see that the labels define how long the combinations are, while the length of the labels/levels themselves are the number of levels in the MI. The labels are an indexer INTO the levels array. This is conceptually what a Categorical is (and is actually implemented like this, kind of an unrolled Categorical).

Note that this has nothing to do with df.index/df.columns, they BOTH could be MultiIndexes or Index. The result of mi2.values are the column labels (or index labels), each TUPLE is a label (though its shown in a sparsified way and not a tuple).

tlmaloney · 2015-01-18T22:35:27Z

@jreback That's really interesting, I now understand what's going on a lot better, thanks. There is some cognitive dissonance in me with these two definitions:

Index.labels as the indexer into the levels array
label as the key value in an Index

Do you also see how it could be a bit confusing? I think there is a naming bug, but since naming is hard and is unrelated to the original issue, if you agree I can create a separate issue and xref this one.

jreback · 2015-01-18T22:41:28Z

.labels only exists on MultiIndex. I agree these are somewhat confusing, but these are internal references (e.g. .levels/.labels).

We did change these for Categorical, e.g. .categories and .codes are the .levels and .labels. However for back-compat I don't think we are likely to change them for MultiIndex. (and even if we did, I think .levels is very natural, so I suppose labels->codes IS possible).

These are came original from R, fyi.

jreback · 2015-01-18T22:44:38Z

you can see somewhat of a related discussion in #3268

tlmaloney · 2015-01-18T23:33:16Z

The last comment by @behzadnouri gets at the heart of the problem I think. I'm not sure why the sorting methods look at .labels instead of .levels. Referencing .labels also seems to be an issue with sortlevel:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.15.2-103-gfda5012'

In [3]: tuples = [('a', 'x'), ('c', 'x')]

In [4]: idx = pd.MultiIndex.from_tuples(tuples)

In [5]: df = pd.DataFrame(index=idx, data={'baz': [0, 1]})

In [6]: df
Out[6]: 
     baz
a x    0
c x    1

In [7]: df.index
Out[7]: 
MultiIndex(levels=[[u'a', u'c'], [u'x']],
           labels=[[0, 1], [0, 0]])

In [8]: df.ix[('b', 'x'), 'baz'] = 2

In [9]: df
Out[9]: 
     baz
a x    0
c x    1
b x    2

In [10]: df.index
Out[10]: 
MultiIndex(levels=[[u'a', u'c', u'b'], [u'x']],
           labels=[[0, 1, 2], [0, 0, 0]])

In [11]: df.sort_index()
Out[11]: 
     baz
a x    0
c x    1
b x    2

In [12]: df.ix[('a', 'y'), 'baz'] = 3

In [13]: df.index
Out[13]: 
MultiIndex(levels=[[u'a', u'c', u'b'], [u'x', u'y']],
           labels=[[0, 1, 2, 0], [0, 0, 0, 1]])

In [14]: df
Out[14]: 
     baz
a x    0
c x    1
b x    2
a y    3

In [15]: df.sort_index()
Out[15]: 
     baz
a x    0
  y    3
b x    2
c x    1

In [16]: df
Out[16]: 
     baz
a x    0
c x    1
b x    2
a y    3

In [17]: df.sortlevel(0)
Out[17]: 
     baz
a x    0
  y    3
c x    1
b x    2

tlmaloney · 2015-01-19T18:15:21Z

Similar discussion in #8017.

tlmaloney · 2015-01-19T18:54:38Z

@jreback @behzadnouri

The below highlights different behavior between the MultiIndex append method and setting with enlargement on a DataFrame with .ix[...] =. Compare idx3 and df.index after line 13. Do you think this is getting closer to the problem?

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.15.2-103-gfda5012'

In [3]: tuples1 = [('a', 'x'), ('c', 'x')]

In [4]: tuples2 = [('b', 'x')]

In [5]: idx1 = pd.MultiIndex.from_tuples(tuples1)

In [6]: idx2 = pd.MultiIndex.from_tuples(tuples2)

In [7]: idx3 = idx1.append(idx2)

In [8]: idx1
Out[8]: 
MultiIndex(levels=[[u'a', u'c'], [u'x']],
           labels=[[0, 1], [0, 0]])

In [9]: idx2
Out[9]: 
MultiIndex(levels=[[u'b'], [u'x']],
           labels=[[0], [0]])

In [10]: idx3
Out[10]: 
MultiIndex(levels=[[u'a', u'b', u'c'], [u'x']],
           labels=[[0, 2, 1], [0, 0, 0]])

In [11]: df = pd.DataFrame(index=idx1, data={'baz': [0, 1]})

In [12]: df
Out[12]: 
     baz
a x    0
c x    1

In [13]: df.ix[('b', 'x'), 'baz'] = 2

In [14]: df
Out[14]: 
     baz
a x    0
c x    1
b x    2

In [15]: df.index
Out[15]: 
MultiIndex(levels=[[u'a', u'c', u'b'], [u'x']],
           labels=[[0, 1, 2], [0, 0, 0]])

In [16]: idx3.values
Out[16]: array([('a', 'x'), ('c', 'x'), ('b', 'x')], dtype=object)

In [17]: idx3.sortlevel(0)
Out[17]: 
(MultiIndex(levels=[[u'a', u'b', u'c'], [u'x']],
            labels=[[0, 1, 2], [0, 0, 0]]), array([0, 2, 1]))

In [18]: df.index.sortlevel(0)
Out[18]: 
(MultiIndex(levels=[[u'a', u'c', u'b'], [u'x']],
            labels=[[0, 1, 2], [0, 0, 0]]), array([0, 1, 2]))

tlmaloney · 2015-01-19T22:01:45Z

Quick question: what does "associated factor" mean here?

8one6 · 2015-06-05T13:27:31Z

What's the status on this one now? Is anyone actively working on a fix of the underlying issue? And, separately, is there any workaround for when "you just need to get the darn dataframe sorted" while we wait for/work on a more permanent fix for sorting?

jreback · 2015-06-05T13:46:53Z

@8one6 well its an active issue with no pr, so no-one is working on it. you are welcome to. .sortlevel() works just fine.

8one6 · 2015-06-05T13:51:51Z

Got it. So in a pinch, I can just sortlevel on each level (starting from innermost) and that'll get the job done?

jreback · 2015-06-05T13:53:42Z

no just sortlevel, you don't need to do anything special. this is a quite degenerate case.

jreback · 2016-06-12T16:50:05Z

duplicate of #13431

jreback added Enhancement API Design and removed API Design labels Jan 8, 2015

jreback added this to the 0.16.0 milestone Jan 8, 2015

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jreback modified the milestones: Next Major Release, 0.16.0 Mar 6, 2015

jreback mentioned this issue Jul 22, 2015

Confusion around sortedness with MultiIndex #10651

Closed

jreback added Bug MultiIndex and removed Enhancement labels Jul 22, 2015

zym1010 mentioned this issue Jun 12, 2016

BUG: different behaviors of sort_index() and sort_index(level=0) #13431

Closed

jreback added the Duplicate Report Duplicate issue or pull request label Jun 12, 2016

jreback closed this as completed Jun 12, 2016

jorisvandenbossche modified the milestones: No action, Next Major Release Jul 20, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sort_index behavior differs for the same DataFrame? #9212

sort_index behavior differs for the same DataFrame? #9212

tlmaloney commented Jan 8, 2015

tlmaloney commented Jan 8, 2015

jreback commented Jan 8, 2015

tlmaloney commented Jan 8, 2015

tlmaloney commented Jan 8, 2015

tlmaloney commented Jan 9, 2015

behzadnouri commented Jan 10, 2015

jreback commented Jan 10, 2015

tlmaloney commented Jan 11, 2015

tlmaloney commented Jan 12, 2015

jreback commented Jan 12, 2015

tlmaloney commented Jan 12, 2015

jreback commented Jan 12, 2015

tlmaloney commented Jan 12, 2015

behzadnouri commented Jan 16, 2015

tlmaloney commented Jan 18, 2015

jreback commented Jan 18, 2015

tlmaloney commented Jan 18, 2015

jreback commented Jan 18, 2015

jreback commented Jan 18, 2015

tlmaloney commented Jan 18, 2015

tlmaloney commented Jan 19, 2015

tlmaloney commented Jan 19, 2015

tlmaloney commented Jan 19, 2015

8one6 commented Jun 5, 2015

jreback commented Jun 5, 2015

8one6 commented Jun 5, 2015

jreback commented Jun 5, 2015

jreback commented Jun 12, 2016

sort_index behavior differs for the same DataFrame? #9212

sort_index behavior differs for the same DataFrame? #9212

Comments

tlmaloney commented Jan 8, 2015

tlmaloney commented Jan 8, 2015

jreback commented Jan 8, 2015

tlmaloney commented Jan 8, 2015

tlmaloney commented Jan 8, 2015

tlmaloney commented Jan 9, 2015

behzadnouri commented Jan 10, 2015

jreback commented Jan 10, 2015

tlmaloney commented Jan 11, 2015

tlmaloney commented Jan 12, 2015

jreback commented Jan 12, 2015

tlmaloney commented Jan 12, 2015

jreback commented Jan 12, 2015

tlmaloney commented Jan 12, 2015

behzadnouri commented Jan 16, 2015

tlmaloney commented Jan 18, 2015

jreback commented Jan 18, 2015

tlmaloney commented Jan 18, 2015

jreback commented Jan 18, 2015

jreback commented Jan 18, 2015

tlmaloney commented Jan 18, 2015

tlmaloney commented Jan 19, 2015

tlmaloney commented Jan 19, 2015

tlmaloney commented Jan 19, 2015

8one6 commented Jun 5, 2015

jreback commented Jun 5, 2015

8one6 commented Jun 5, 2015

jreback commented Jun 5, 2015

jreback commented Jun 12, 2016