Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sort_index behavior differs for the same DataFrame? #9212

Closed
tlmaloney opened this issue Jan 8, 2015 · 28 comments
Closed

sort_index behavior differs for the same DataFrame? #9212

tlmaloney opened this issue Jan 8, 2015 · 28 comments
Labels

Comments

@tlmaloney
Copy link

I feel like I've encountered a bug. In the following scenario, the first sort_index call behaves as expected, but the second does not. Does someone know what the difference is here?

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.15.2'

In [3]: tuples = [(' foo', 'bar'), ('foo', 'bar'), (' foo ()', 'bar')]

In [4]: cols = pd.MultiIndex.from_tuples(tuples)

In [5]: df = pd.DataFrame(index=cols, data={'baz': [0, 1, 2]})

In [6]: df
Out[6]: 
             baz
 foo    bar    0
foo     bar    1
 foo () bar    2

In [7]: df.sort_index()
Out[7]: 
             baz
 foo    bar    0
 foo () bar    2
foo     bar    1

In [8]: tuples = [(' foo', 'bar'), ('foo', 'bar')]

In [9]: cols = pd.MultiIndex.from_tuples(tuples)

In [10]: df = pd.DataFrame(index=cols, data={'baz': [0, 1]})

In [11]: df
Out[11]: 
          baz
 foo bar    0
foo  bar    1

In [12]: df.ix[(' foo ()', 'bar'), 'baz'] = 2

In [13]: df
Out[13]: 
             baz
 foo    bar    0
foo     bar    1
 foo () bar    2

In [14]: df.sort_index()
Out[14]: 
             baz
 foo    bar    0
foo     bar    1
 foo () bar    2
@tlmaloney
Copy link
Author

FWIW:

~
tmaloney@aal-lpc-2-03$  locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

@jreback
Copy link
Contributor

jreback commented Jan 8, 2015

under the hood this uses something like this: http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html

this is kind=quicksort by default, which does not preserve the original order for ties, while mergesort does. It is not possible to pass thru the option for kind in sort_index, you can specify it for sort though.

So i'll mark this an enhancement if you'd like to put forth a PR.

@jreback jreback added this to the 0.16.0 milestone Jan 8, 2015
@tlmaloney
Copy link
Author

I'll give a crack at it, thanks.

@tlmaloney
Copy link
Author

I do see that kind is a variable that can be passed into sort_index, see docs here. Although it says that it does not apply to a MultiIndex.

I'm not sure kind is the reason for the issue. I was able to get the behavior I want, although through a roundabout way, still using quicksort:

In [1]: import pandas as pd

In [2]: tuples = [(' foo', 'bar'), ('foo', 'bar')]

In [3]: cols = pd.MultiIndex.from_tuples(tuples)

In [4]: df = pd.DataFrame(index=cols, data={'baz': [0, 1]})

In [5]: df.ix[(' foo ()', 'bar'), 'baz'] = 2

In [6]: df
Out[6]: 
             baz
 foo    bar    0
foo     bar    1
 foo () bar    2

In [7]: df.reset_index()
Out[7]: 
   level_0 level_1  baz
0      foo     bar    0
1      foo     bar    1
2   foo ()     bar    2

In [8]: df.reset_index().sort(columns=['level_0', 'level_1'])
Out[8]: 
   level_0 level_1  baz
0      foo     bar    0
2   foo ()     bar    2
1      foo     bar    1

In [9]: df.reset_index().sort(columns=['level_0', 'level_1']).set_index(['level_0', 'level_1'])
Out[9]: 
                 baz
level_0 level_1     
 foo    bar        0
 foo () bar        2
foo     bar        1

@tlmaloney
Copy link
Author

This is also odd behavior. Adding a row doesn't simply append it to the bottom, it also switches the labels in the original two rows of the index.

In [1]: import pandas as pd

In [2]: tuples = [('foo', 'bar'), (' foo', 'bar')]

In [3]: cols = pd.MultiIndex.from_tuples(tuples)

In [4]: df = pd.DataFrame(index=cols, data={'baz': [0, 1]})

In [5]: df
Out[5]: 
          baz
foo  bar    0
 foo bar    1

In [6]: df.index
Out[6]: 
MultiIndex(levels=[[u' foo', u'foo'], [u'bar']],
           labels=[[1, 0], [0, 0]])

In [7]: df.ix[(' foo ()', 'bar'), 'baz'] = 2

In [8]: df
Out[8]: 
             baz
 foo    bar    1
foo     bar    0
 foo () bar    2

In [9]: df.index
Out[9]: 
MultiIndex(levels=[[u' foo', u'foo', u' foo ()'], [u'bar']],
           labels=[[0, 1, 2], [0, 0, 0]])

@behzadnouri
Copy link
Contributor

This is a bug, and the problem is when the frame is modified, df.index.lexsort_depth becomes invalid, and that breaks .sort_index:

>>> pd.__version__
'0.15.2-68-ge2b014c'
>>> df = pd.DataFrame([['b', 'c', 1]],
...                   columns=['1st', '2nd', '3rd']).set_index(['1st', '2nd'])
>>> df
         3rd
1st 2nd     
b   c      1
>>> df.ix[('a', 'b'), '3rd'] = 0
>>> df
         3rd
1st 2nd     
b   c      1
a   b      0
>>> df.index.lexsort_depth  #  <<< this is the BUG
2
>>> df.sort_index()
         3rd
1st 2nd     
b   c      1
a   b      0

@jreback
Copy link
Contributor

jreback commented Jan 10, 2015

hmm, looks like sortorder needs to be invalidated. (however, then we de-facto have a new index at that point, so may need to copy the index in this case).

@tlmaloney
Copy link
Author

It looks like this has been an issue at least going back to v0.13.1:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.13.1'

In [3]: df = pd.DataFrame([['b', 'c', 1]], columns=['1st', '2nd', '3rd']).set_index(['1st', '2nd'])

In [4]: df.ix[('a', 'b'), '3rd'] = 0

In [5]: df.index.is_lexsorted()
Out[5]: True

In [6]: df
Out[6]: 
         3rd
1st 2nd     
b   c      1
a   b      0

In v0.12.0, setting values was not working with .ix[...] =, but using set_value to set the new row, sort_index does work as expected.

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.12.0'

In [3]: df = pd.DataFrame([['b', 'c', 1]], columns=['1st', '2nd', '3rd']).set_index(['1st', '2nd'])

In [4]: df.ix[('a', 'b'), '3rd'] = 0
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-4-b91d80d789ee> in <module>()
----> 1 df.ix[('a', 'b'), '3rd'] = 0

/home/tmaloney/vedev/pandas-test-02/lib/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/indexing.pyc in __setitem__(self, key, value)
     82                 raise IndexingError('only tuples of length <= %d supported',
     83                                     self.ndim)
---> 84             indexer = self._convert_tuple(key)
     85         else:
     86             indexer = self._convert_to_indexer(key)

/home/tmaloney/vedev/pandas-test-02/lib/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/indexing.pyc in _convert_tuple(self, key)
     94         keyidx = []
     95         for i, k in enumerate(key):
---> 96             idx = self._convert_to_indexer(k, axis=i)
     97             keyidx.append(idx)
     98         return tuple(keyidx)

/home/tmaloney/vedev/pandas-test-02/lib/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/indexing.pyc in _convert_to_indexer(self, obj, axis)
    608                 mask = check == -1
    609                 if mask.any():
--> 610                     raise KeyError('%s not in index' % objarr[mask])
    611 
    612                 return indexer

KeyError: "['a'] not in index"

In [5]: df.set_value(('a', 'b'), '3rd', 0)
Out[5]: 
     3rd
b c    1
a b    0

In [6]: df = df.set_value(('a', 'b'), '3rd', 0)

In [7]: df
Out[7]: 
     3rd
b c    1
a b    0

In [8]: df.sort_index()
Out[8]: 
     3rd
a b    0
b c    1

@tlmaloney
Copy link
Author

Looks like GH4039 discussion is relevant. @jreback Should this issue be tagged as Bug instead of Enhancement?

@jreback
Copy link
Contributor

jreback commented Jan 12, 2015

The main issue I have with this is that you cannot simply resort the index after an append. It will be completely non-performant. So you can simply try to invalidate sortorder which will recompute the sort depth when it is called for.

@tlmaloney
Copy link
Author

Agreed. How do you invalidate the sortorder? I don't know what that means.

@jreback
Copy link
Contributor

jreback commented Jan 12, 2015

sorry...misspoke, you need to invalid the cache on lexsort_depth
you can call mi._reset_cache()

@tlmaloney
Copy link
Author

I'm not sure that works.

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.15.2'

In [3]: df = pd.DataFrame([['b', 'c', 1]], columns=['1st', '2nd', '3rd']).set_index(['1st', '2nd'])

In [4]: df.ix[('a', 'b'), '3rd'] = 0

In [5]: df.index.is_lexsorted()
Out[5]: True

In [6]: df.index._cache
Out[6]: 
{'_engine': <pandas.index.ObjectEngine at 0x351c3f8>,
 'is_unique': True,
 'lexsort_depth': 2}

In [7]: df.index._res
df.index._reset_cache     df.index._reset_identity  

In [7]: df.index._reset_cache()

In [8]: df.index.is_lexsorted()
Out[8]: True

In [9]: df
Out[9]: 
         3rd
1st 2nd     
b   c      1
a   b      0

In [10]: df.index._cache
Out[10]: {'lexsort_depth': 2}

In [11]: df.index._reset_cache()

In [12]: df.index._cache
Out[12]: {}

In [13]: df
Out[13]: 
         3rd
1st 2nd     
b   c      1
a   b      0

In [14]: df.sort_index()
Out[14]: 
         3rd
1st 2nd     
b   c      1
a   b      0

@behzadnouri
Copy link
Contributor

so this does not depend on cache, but also occurs on a freshly generated index:

In [24]: mi = MultiIndex(levels=[['b', 'a'], ['b', 'a']],
   ....:                 labels=[[0, 1], [0, 1]])

In [25]: df = DataFrame([[0], [1]], index=mi)

In [26]: df
Out[26]:
     0
b b  0
a a  1

In [27]: df.sort_index()
Out[27]:
     0
b b  0
a a  1

lexsort_depth is only based on labels ( not levels ), so here the labels are in fact lexically sorted:

In [29]: df.index.labels
Out[29]: FrozenList([[0, 1], [0, 1]])

In [30]: df.index.lexsort_depth
Out[30]: 2

it is worth inspecting that if anywhere else in the code base breaks if labels are lexically sorted, but not the levels. (i.e. if they make the assumption that labels are always sorted)

sort_index also only looks at the labels; and since they are lexically sorted it stops there.

if other places in the code also make the assumption that the labels are sorted then this should be fixed in MultiIndex.__new__; otherwise sort_index should also check levels in addition to labels.

computationally, former path, is not very cheap, so it is worth confirming first that other places in the code depend on levels being sorted and break otherwise.

@tlmaloney
Copy link
Author

@behzadnouri Thanks for your looking more into this.

@jreback

I have a couple of comments. (FWIW, I've been using pandas since 0.9.1 but haven't had a need to dig in until now since it has really just worked. I hope to one day make a contribution myself. My mental model may be out of date with the fast pace of development.)

  1. This issue might have raised a couple of distinct issues, which may be worth breaking apart into separate issues.
  2. I'm trying to get the correct mental model of the MultiIndex. From here on out I'm thinking in terms of a MultiIndex in the df.index sense and not the df.columns sense. Essentially a MultiIndex is made up of labels (the rows, in my mind). A MultiIndex has a number of levels (the columns, in my mind). So when I see something like this I get confused:
In [5]: mi = pd.MultiIndex(levels=[['a', 'b'], ['c', 'd'], ['e', 'f']], labels=[[0, 1], [0, 1], [0, 1]])

In [6]: mi
Out[6]: 
MultiIndex(levels=[[u'a', u'b'], [u'c', u'd'], [u'e', u'f']],
           labels=[[0, 1], [0, 1], [0, 1]])

In [7]: print mi

a  c  e
b  d  f

Because I think of ('a', 'c', 'e') and ('b', 'd', 'f') as the labels. I understand the above __repr__ (line 6) is an internal representation. Am I wrong to think there's two labels, and not three? Perhaps this is a naming bug on MultiIndex.labels.

@jreback
Copy link
Contributor

jreback commented Jan 18, 2015

@tlmaloney If you'd like to create a separate issue for a distince issue/bug, pls do so, keeping in mind that they should have reproducible examples. can always xref back to here if needed. Generally having 1 issue per 'thing' is a good idea.

Using another example with differnt label lenghts to clarify your mental model

In [16]: mi = pd.MultiIndex(levels=[['a', 'b'], ['c', 'd'], ['e', 'f']], labels=[[0, 1], [0, 1], [0, 1]])

In [18]: mi.values
Out[18]: array([('a', 'c', 'e'), ('b', 'd', 'f')], dtype=object)

In [20]: mi2 = pd.MultiIndex(levels=[['a', 'b'], ['c', 'd'], ['e', 'f']], labels=[[0, 1, 1], [0, 1, 1], [0, 0, 1]])

In [22]: mi2.values
Out[22]: array([('a', 'c', 'e'), ('b', 'd', 'e'), ('b', 'd', 'f')], dtype=object)

So you see that the labels define how long the combinations are, while the length of the labels/levels themselves are the number of levels in the MI. The labels are an indexer INTO the levels array. This is conceptually what a Categorical is (and is actually implemented like this, kind of an unrolled Categorical).

Note that this has nothing to do with df.index/df.columns, they BOTH could be MultiIndexes or Index. The result of mi2.values are the column labels (or index labels), each TUPLE is a label (though its shown in a sparsified way and not a tuple).

@tlmaloney
Copy link
Author

@jreback That's really interesting, I now understand what's going on a lot better, thanks. There is some cognitive dissonance in me with these two definitions:

  1. Index.labels as the indexer into the levels array
  2. label as the key value in an Index

Do you also see how it could be a bit confusing? I think there is a naming bug, but since naming is hard and is unrelated to the original issue, if you agree I can create a separate issue and xref this one.

@jreback
Copy link
Contributor

jreback commented Jan 18, 2015

.labels only exists on MultiIndex. I agree these are somewhat confusing, but these are internal references (e.g. .levels/.labels).

We did change these for Categorical, e.g. .categories and .codes are the .levels and .labels. However for back-compat I don't think we are likely to change them for MultiIndex. (and even if we did, I think .levels is very natural, so I suppose labels->codes IS possible).

These are came original from R, fyi.

@jreback
Copy link
Contributor

jreback commented Jan 18, 2015

you can see somewhat of a related discussion in #3268

@tlmaloney
Copy link
Author

The last comment by @behzadnouri gets at the heart of the problem I think. I'm not sure why the sorting methods look at .labels instead of .levels. Referencing .labels also seems to be an issue with sortlevel:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.15.2-103-gfda5012'

In [3]: tuples = [('a', 'x'), ('c', 'x')]

In [4]: idx = pd.MultiIndex.from_tuples(tuples)

In [5]: df = pd.DataFrame(index=idx, data={'baz': [0, 1]})

In [6]: df
Out[6]: 
     baz
a x    0
c x    1

In [7]: df.index
Out[7]: 
MultiIndex(levels=[[u'a', u'c'], [u'x']],
           labels=[[0, 1], [0, 0]])

In [8]: df.ix[('b', 'x'), 'baz'] = 2

In [9]: df
Out[9]: 
     baz
a x    0
c x    1
b x    2

In [10]: df.index
Out[10]: 
MultiIndex(levels=[[u'a', u'c', u'b'], [u'x']],
           labels=[[0, 1, 2], [0, 0, 0]])

In [11]: df.sort_index()
Out[11]: 
     baz
a x    0
c x    1
b x    2

In [12]: df.ix[('a', 'y'), 'baz'] = 3

In [13]: df.index
Out[13]: 
MultiIndex(levels=[[u'a', u'c', u'b'], [u'x', u'y']],
           labels=[[0, 1, 2, 0], [0, 0, 0, 1]])

In [14]: df
Out[14]: 
     baz
a x    0
c x    1
b x    2
a y    3

In [15]: df.sort_index()
Out[15]: 
     baz
a x    0
  y    3
b x    2
c x    1

In [16]: df
Out[16]: 
     baz
a x    0
c x    1
b x    2
a y    3

In [17]: df.sortlevel(0)
Out[17]: 
     baz
a x    0
  y    3
c x    1
b x    2

@tlmaloney
Copy link
Author

Similar discussion in #8017.

@tlmaloney
Copy link
Author

@jreback @behzadnouri

The below highlights different behavior between the MultiIndex append method and setting with enlargement on a DataFrame with .ix[...] =. Compare idx3 and df.index after line 13. Do you think this is getting closer to the problem?

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.15.2-103-gfda5012'

In [3]: tuples1 = [('a', 'x'), ('c', 'x')]

In [4]: tuples2 = [('b', 'x')]

In [5]: idx1 = pd.MultiIndex.from_tuples(tuples1)

In [6]: idx2 = pd.MultiIndex.from_tuples(tuples2)

In [7]: idx3 = idx1.append(idx2)

In [8]: idx1
Out[8]: 
MultiIndex(levels=[[u'a', u'c'], [u'x']],
           labels=[[0, 1], [0, 0]])

In [9]: idx2
Out[9]: 
MultiIndex(levels=[[u'b'], [u'x']],
           labels=[[0], [0]])

In [10]: idx3
Out[10]: 
MultiIndex(levels=[[u'a', u'b', u'c'], [u'x']],
           labels=[[0, 2, 1], [0, 0, 0]])

In [11]: df = pd.DataFrame(index=idx1, data={'baz': [0, 1]})

In [12]: df
Out[12]: 
     baz
a x    0
c x    1

In [13]: df.ix[('b', 'x'), 'baz'] = 2

In [14]: df
Out[14]: 
     baz
a x    0
c x    1
b x    2

In [15]: df.index
Out[15]: 
MultiIndex(levels=[[u'a', u'c', u'b'], [u'x']],
           labels=[[0, 1, 2], [0, 0, 0]])

In [16]: idx3.values
Out[16]: array([('a', 'x'), ('c', 'x'), ('b', 'x')], dtype=object)

In [17]: idx3.sortlevel(0)
Out[17]: 
(MultiIndex(levels=[[u'a', u'b', u'c'], [u'x']],
            labels=[[0, 1, 2], [0, 0, 0]]), array([0, 2, 1]))

In [18]: df.index.sortlevel(0)
Out[18]: 
(MultiIndex(levels=[[u'a', u'c', u'b'], [u'x']],
            labels=[[0, 1, 2], [0, 0, 0]]), array([0, 1, 2]))

@tlmaloney
Copy link
Author

Quick question: what does "associated factor" mean here?

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jreback jreback modified the milestones: Next Major Release, 0.16.0 Mar 6, 2015
@8one6
Copy link

8one6 commented Jun 5, 2015

What's the status on this one now? Is anyone actively working on a fix of the underlying issue? And, separately, is there any workaround for when "you just need to get the darn dataframe sorted" while we wait for/work on a more permanent fix for sorting?

@jreback
Copy link
Contributor

jreback commented Jun 5, 2015

@8one6 well its an active issue with no pr, so no-one is working on it. you are welcome to. .sortlevel() works just fine.

@8one6
Copy link

8one6 commented Jun 5, 2015

Got it. So in a pinch, I can just sortlevel on each level (starting from innermost) and that'll get the job done?

@jreback
Copy link
Contributor

jreback commented Jun 5, 2015

no just sortlevel, you don't need to do anything special. this is a quite degenerate case.

@jreback
Copy link
Contributor

jreback commented Jun 12, 2016

duplicate of #13431

@jreback jreback closed this as completed Jun 12, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: No action, Next Major Release Jul 20, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants