API: select levels of a MultiIndex #10816

Open
jorisvandenbossche opened this issue Aug 13, 2015 · 8 comments
Labels
API Design Enhancement MultiIndex Needs Discussion Requires discussion from core team before further action

Comments

@jorisvandenbossche
Member

Say you have a multi-index:

In [34]: idx = pd.MultiIndex.from_product([['a', 'b', 'c'], [1, 2, 3], ['f', 'g']], names=['lev0', 'lev1', 'lev2'])

In [35]: df = pd.DataFrame(range(len(idx)), index=idx)

In [36]: df
Out[36]:
                 0
lev0 lev1 lev2
a    1    f      0
          g      1
     2    f      2
          g      3
     3    f      4
          g      5
b    1    f      6
          g      7
     2    f      8
          g      9
     3    f     10
          g     11
c    1    f     12
          g     13
     2    f     14
          g     15
     3    f     16
          g     17

and you want to select certain levels of the index (just as you select columns of a frame, I want to select levels of an index and get back an index containing only those levels).

At the moment, some possibilities:

In [37]: pd.MultiIndex.from_arrays([df.index.get_level_values(0), df.index.get_level_values(1)])
Out[37]:
MultiIndex(levels=[[u'a', u'b', u'c'], [1, 2, 3]],
           labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2]],
           names=[u'lev0', u'lev1'])

In [38]: idx.droplevel(-1)    # if you know the ones to drop
Out[38]:
MultiIndex(levels=[[u'a', u'b', u'c'], [1, 2, 3]],
           labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2]],
           names=[u'lev0', u'lev1'])

In [39]: df.reset_index().set_index(['lev0','lev1']).index
Out[39]:
MultiIndex(levels=[[u'a', u'b', u'c'], [1, 2, 3]],
           labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2]],
           names=[u'lev0', u'lev1'])

In [40]: df.reset_index(-1).index       # if you know the ones to drop
Out[40]:
MultiIndex(levels=[[u'a', u'b', u'c'], [1, 2, 3]],
           labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2]],
           names=[u'lev0', u'lev1'])

Am I missing an easy way to do this?

And if not, I think we should have a better way to do this.
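
One way to wrap the `droplevel` approach so that you name the levels to keep rather than the ones to drop is sketched below. `select_levels` is a hypothetical helper, not a pandas function; it assumes unique level names, and it returns the levels in their original order regardless of the order given in `keep`.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["a", "b", "c"], [1, 2, 3], ["f", "g"]],
    names=["lev0", "lev1", "lev2"],
)


def select_levels(index, keep):
    """Hypothetical helper: return `index` restricted to the level names in `keep`."""
    # Drop every level that was not asked for; droplevel accepts a list of names.
    drop = [name for name in index.names if name not in keep]
    return index.droplevel(drop)


sub = select_levels(idx, ["lev0", "lev1"])

# Duplicates are preserved (one entry per original row); .unique() gives the
# distinct (lev0, lev1) combinations that actually occur.
combos = sub.unique()
```

This is only a convenience around existing methods, so it does not answer the API question itself.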

Note: triggered by this SO question: http://stackoverflow.com/questions/31991388/combinations-of-multiindex-levels-which-occur-in-a-dataframe (but I had already encountered this multiple times)

@rockg
Contributor

rockg commented Aug 14, 2015

I swear that there was an old issue/PR that addressed this. If I recall correctly, it was proposed to have syntax like df.index['lev0'] or perhaps even df['lev0'], both of which I think are much better than get_level_values, which is too verbose. However, searches have come up empty. #10461 seems to discuss some variants.

@jorisvandenbossche
Member Author

There is the proposal that the index, if it has a name, should also be accessible as a column (e.g. df['lev0']), but that would then return a Series. See #8162; although I think that would be very nice, it is still something different.

@nehalecky
Contributor

+1

hsharrison added a commit to hsharrison/pandas that referenced this issue Feb 21, 2016
@jreback jreback added this to the Next Major Release milestone May 7, 2016
@jorisvandenbossche
Member Author

Does anybody have a good idea for a possible API here?

Some ideas (selecting a single level here as an example, but it should extend to a list of names as well):

  • df.index["level_name"] -> as the index's __getitem__ is already used to select values of the index, I don't think it would be good design to also let it select levels of the index (and it would conflict in the case of integer level names)
  • df.index.levels["level_name"] -> this is already used to select the actual 'level' (so the unique values, or the categories, to use that terminology) and not the full level values. So again, having it do something different with a string would not be good design (and it would conflict with integer level names)
  • df.index.get_levels("level_name") -> this I would like (and it doesn't exist yet), but it would not be consistent with what set_levels does (it would not be its opposite, which is what you would expect from the name), as set_levels sets the actual 'levels' and not the level values.
  • df.index.get_level_values(["level_name", ..]) -> expand the existing get_level_values to also accept a list of integers/names instead of only a scalar, in that case returning a MultiIndex instead of an Index. This is probably the easiest to add.
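
To make the 'levels' versus 'level values' distinction concrete, and to show roughly what the last option could look like, here is a minimal sketch. `get_level_values_multi` is a hypothetical standalone function written for illustration only; it is not an existing pandas API.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["a", "b", "c"], [1, 2, 3], ["f", "g"]],
    names=["lev0", "lev1", "lev2"],
)

# `levels` holds only the unique values of each level ...
unique_lev0 = idx.levels[0]                  # 3 entries
# ... while get_level_values returns one value per row of the index.
full_lev0 = idx.get_level_values("lev0")     # 18 entries


def get_level_values_multi(index, levels):
    """Hypothetical: scalar -> Index, list of names/positions -> MultiIndex."""
    if isinstance(levels, list):
        arrays = [index.get_level_values(lev) for lev in levels]
        # from_arrays picks up the level names from the Index objects.
        return pd.MultiIndex.from_arrays(arrays)
    return index.get_level_values(levels)


two = get_level_values_multi(idx, ["lev0", "lev1"])
one = get_level_values_multi(idx, "lev2")
```

Note that in this sketch a one-element list yields a one-level MultiIndex rather than a plain Index; a real implementation would have to decide how to handle that case.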

@jreback
Contributor

jreback commented Apr 11, 2017

why would you do this, rather than #8162?

@jorisvandenbossche
Member Author

jorisvandenbossche commented Apr 11, 2017

In my mind, this is something different, as I said above (#10816 (comment)). Of course that is something to discuss in #8162, but I think accessing an index level like a column should also return it as a column, whereas here I want to select specific levels and still get an index as the final result.
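
A small sketch of the two kinds of result being contrasted. Column-style access is imitated here with `reset_index()`, since `df['lev0']` on an index level is only a proposal in #8162; the column name `value` is made up for the example.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["a", "b", "c"], [1, 2, 3], ["f", "g"]],
    names=["lev0", "lev1", "lev2"],
)
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

# Column-like access: the level comes back as a Series (a column).
as_column = df.reset_index()["lev0"]

# What this issue asks for: a subset of the levels that is still an index.
as_index = df.index.droplevel("lev2")
```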

@jreback
Contributor

jreback commented Apr 11, 2017

My question is why would you want to add moar API? what is the usecase?

df.index.get_level_values(["level_name", ..]) this seems reasonable actually (though a single level and a list of a single level would be the same?, or would first be a Series, 2nd DataFrame)?

@jorisvandenbossche
Member Author

(though a single level and a list of a single level would be the same?, or would first be a Series, 2nd DataFrame)?

yes, that would be the same, since a MultiIndex with only one level just becomes a plain Index
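
As an illustration of that collapsing behaviour with an existing method: `droplevel` returns a plain Index once only one level remains. This only shows what `droplevel` does today; it says nothing about how a list-accepting `get_level_values` would be implemented.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["a", "b", "c"], [1, 2, 3], ["f", "g"]],
    names=["lev0", "lev1", "lev2"],
)

# Dropping all but one level leaves a plain Index, not a one-level MultiIndex.
single = idx.droplevel(["lev1", "lev2"])
print(type(single).__name__, single.name)
```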

I can't come up with a specific use case right now, but I remember wanting something like this from time to time. The SO question in the top post also gives a use case.
The specific trigger was that a colleague asked how to convert the values of a subset of the levels to tuples (like multi_index.values gives tuples, but then for a subset of the levels) (why that is needed is then another question).
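
For that tuples case, a workaround with existing methods might look like the sketch below, assuming the levels to discard are known by name.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["a", "b", "c"], [1, 2, 3], ["f", "g"]],
    names=["lev0", "lev1", "lev2"],
)

# Restrict to the first two levels, then materialise the result as tuples.
sub = idx.droplevel("lev2")
tuples = sub.to_list()               # one tuple per row, duplicates included
distinct = sub.unique().to_list()    # each (lev0, lev1) combination once
```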

@mroeschke mroeschke added Enhancement Needs Discussion Requires discussion from core team before further action labels Apr 18, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022