Partial Selection on MultiIndex: The need for empty slice support & dict indexing #4036

dragoljub · 2013-06-26T00:22:06Z

related #4036, #4116
from SO with 0.14.0 multi-index slicers: http://stackoverflow.com/questions/24126542/pandas-multi-index-slices-for-level-names/24126676#24126676

Here's the example from there:

In [11]: midx = pd.MultiIndex.from_product([list(range(3)),['a','b','c'],pd.date_range('20130101',periods=3)],names=['numbers','letters','dates'])

In [12]: midx.names.index('letters')
Out[12]: 1

In [13]: midx.names.index('dates')
Out[13]: 2

Here's a complete example

In [18]: df = DataFrame(np.random.randn(len(midx),1),index=midx)

In [19]: df
Out[19]: 
                                   0
numbers letters dates               
0       a       2013-01-01  0.261092
                2013-01-02 -1.267770
                2013-01-03  0.008230
        b       2013-01-01 -1.515866
                2013-01-02  0.351942
                2013-01-03 -0.245463
        c       2013-01-01 -0.253103
                2013-01-02 -0.385411
                2013-01-03 -1.740821
1       a       2013-01-01 -0.108325
                2013-01-02 -0.212350
                2013-01-03  0.021097
        b       2013-01-01 -1.922214
                2013-01-02 -1.769003
                2013-01-03 -0.594216
        c       2013-01-01 -0.419775
                2013-01-02  1.511700
                2013-01-03  0.994332
2       a       2013-01-01 -0.020299
                2013-01-02 -0.749474
                2013-01-03 -1.478558
        b       2013-01-01 -1.357671
                2013-01-02  0.161185
                2013-01-03 -0.658246
        c       2013-01-01 -0.564796
                2013-01-02 -0.333106
                2013-01-03 -2.814611

This is your dict of level names -> slices

In [20]: slicers = { 'numbers' : slice(0,1), 'dates' : slice('20130102','20130103') }

This creates an indexer that is empty (selects everything)

In [21]: indexer = [ slice(None) ] * len(df.index.levels)

Add in your slicers

In [22]: for n, idx in slicers.items():
              indexer[df.index.names.index(n)] = idx

And select (this has to be a tuple, but was a list to start as we had to modify it)

In [23]: df.loc[tuple(indexer),:]
Out[23]: 
                                   0
numbers letters dates               
0       a       2013-01-02 -1.267770
                2013-01-03  0.008230
        b       2013-01-02  0.351942
                2013-01-03 -0.245463
        c       2013-01-02 -0.385411
                2013-01-03 -1.740821
1       a       2013-01-02 -0.212350
                2013-01-03  0.021097
        b       2013-01-02 -1.769003
                2013-01-03 -0.594216
        c       2013-01-02  1.511700
                2013-01-03  0.994332

I use hierarchical indices regularly with pandas DataFrames and Series objects. It is invaluable to be able to partially select subsets of rows based on a set of arbitrary index values, and retain the index information for subsequent groupby operations etc.

I am looking for an elegant way to pass an ordered tuple (with possibly empty slices) or an arbitrary dict of {index_level_name:value,...} pairs to select rows matching the passed index:value pairs. Note: I am aware that with Boolean indexing on data columns and nested np.logical_and() statements you can construct such a Boolean select index. I'm looking for an elegant solution using indexes & levels to avoid repeatedly using df.reset_index and building Boolean arrays. Also, df.xs() does not work in every situation (see below) and does not exist for Series with MultiIndex.

To explain this lets create a DataFrame with 5 index levels:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: print pd.__version__
0.10.0

In [4]: # Generate Test DataFrame
   ...: NUM_ROWS = 100000
   ...: 

In [5]: NUM_COLS = 10

In [6]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]

In [7]: index_cols = col_names[:5]

In [8]: # Set DataFrame to have 5 level Hierarchical Index.
   ...: # The dtype does not matter try str or np.int64 same results.
   ...: # Sort the index!
   ...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
   ...: 

In [9]: df = df.set_index(index_cols).sort_index()

In [10]: df
Out[10]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 0, 0, 0, 0) to (4, 4, 4, 4, 4)
Data columns:
A5    100000  non-null values
A6    100000  non-null values
A7    100000  non-null values
A8    100000  non-null values
A9    100000  non-null values
dtypes: int64(5)

In [11]: df.index.names
Out[11]: ['A0', 'A1', 'A2', 'A3', 'A4']

Now index on every level and we get back the rows we want :) I love that I get back the complete index too because it may be useful later.

In [12]: df.ix[(0,1,2,3,4)]
Out[12]:                 A5  A6  A7  A8  A9
A0 A1 A2 A3 A4                    
0  1  2  3  4    1   3   1   1   3
            4    4   3   4   2   4
            4    0   2   3   1   3
      ...
            4    1   1   3   4   3
            4    0   1   2   4   1

Now if we index on the first 4 levels we get back something different, a data frame with the first 4 index levels dropped. It would be nice to have the option to keep all index levels even though they are repetitive (like above).

In [13]: df.ix[(0,1,2,3)]
Out[13]: <class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 0 to 4
Data columns:
A5    144  non-null values
A6    144  non-null values
A7    144  non-null values
A8    144  non-null values
A9    144  non-null values
dtypes: int64(5)

Now comes the tricky part. What if I only want to index on the first and last 2 index levels, and want everything from the 3rd level? Empty slicing is not supported.

In [14]: df.ix[(0,1,:,3,4)]
  File "<ipython-input-14-7e44e59fa5b1>", line 1
    df.ix[(0,1,:,3,4)]
               ^
SyntaxError: invalid syntax

In [15]: df.ix[(0,1,slice(None),3,4)]
---------------------------------------------------------------------------
IndexingError                             Traceback (most recent call last)
<ipython-input-15-0e4517ae7fc5> in <module>()
----> 1 df.ix[(0,1,slice(None),3,4)]

C:\Python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
     30                 pass
     31 
---> 32             return self._getitem_tuple(key)
     33         else:
     34             return self._getitem_axis(key, axis=0)

C:\Python27\lib\site-packages\pandas\core\indexing.pyc in _getitem_tuple(self, tup)
    212         for i, key in enumerate(tup):
    213             if i >= self.obj.ndim:
--> 214                 raise IndexingError('Too many indexers')
    215 
    216             if _is_null_slice(key):

IndexingError: Too many indexers

df.xs can somewhat help here but its useless for MultiIndex on a series. And it drops the indexed levels leaving you unsure to what fixed index levels you have drilled to. :(

In [16]: df.xs((0,2,3),level=df.index.names[::2])
Out[16]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 805 entries, (0, 0) to (4, 4)
Data columns:
A5    805  non-null values
A6    805  non-null values
A7    805  non-null values
A8    805  non-null values
A9    805  non-null values
dtypes: int64(5)

Interestingly df.xs() is not consistant, because you cannot explicitly index on every level giving it the list of all level names:

In [17]: df.xs((0,1,2,3,4), level=df.index.names)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-d0e373dfcd5f> in <module>()
----> 1 df.xs((0,1,2,3,4), level=df.index.names)

C:\Python27\lib\site-packages\pandas\core\frame.pyc in xs(self, key, axis, level, copy)
   2233         labels = self._get_axis(axis)
   2234         if level is not None:
-> 2235             loc, new_ax = labels.get_loc_level(key, level=level)
   2236 
   2237             if not copy and not isinstance(loc, slice):

C:\Python27\lib\site-packages\pandas\core\index.pyc in get_loc_level(self, key, level)
   2193 
   2194                 result = loc if result is None else result & loc
-> 2195             return result, _drop_levels(result, level)
   2196 
   2197         level = self._get_level_number(level)

C:\Python27\lib\site-packages\pandas\core\index.pyc in _drop_levels(indexer, levels)
   2177             levels = [self._get_level_number(i) for i in levels]
   2178             for i in sorted(levels, reverse=True):
-> 2179                 new_index = new_index.droplevel(i)
   2180             return new_index
   2181 

AttributeError: 'Int64Index' object has no attribute 'droplevel'

However df.xs without the level attribute on all index levels works as expected...

In [18]: df.xs((0,1,2,3,4))
Out[18]:                 A5  A6  A7  A8  A9
A0 A1 A2 A3 A4                    
0  1  2  3  4    1   3   1   1   3
            4    4   3   4   2   4
            4    0   2   3   1   3
           ...
            4    1   1   3   4   3
            4    0   1   2   4   1

Thoughts:
One (somewhat limiting) solution could be allowing df.ix[(0,1,3,:,4)] to take an empty slice for an index level and return the data frame indexed on only the the passed index levels that are known. Today this capability does not exist, although an ordered partial list of index levels works.

The next and more general approach could be to pass a dict of df.ix[{index_level:value}] pairs and return the rows where the specified index levels equal the passed values. Unspecified levels are not filtered down and we have the option to return all index levels.

The text was updated successfully, but these errors were encountered:

dragoljub · 2013-06-30T17:28:42Z

This is somewhat related to Issue #3057. Being able to select on arbitrary combinations of index levels and return arbitrary index levels.

hsharrison · 2014-02-13T23:43:59Z

@jreback

I did look through the new code and the docs so apologies if I missed this, but it would still be nice to be able to select by index names, as in @dragoljub 's suggestion df.ix[{index_level:value}].

Obviously this is a minor convenience compared to what you just merged, so thank you for that. But maybe this could be left open? Or should a separate issue be opened?

jreback · 2014-02-13T23:54:06Z

@hsharrison ahh...did even see that, so you want to pass a dict (rather than a nested tuple) bascically? (with keys of the level numbers or names)....

hmm...I don't think that would be that difficult....

can you post an example of using the partial slicing syntax (what I just merged) and your proposed
addition? so we can see side-by-side just what it looks like? thanks...

hsharrison · 2014-02-14T07:25:39Z

In [8]: levels = ['A', 'B', 'C']

In [9]: columns = ['n_foos', 'n_bars']

In [10]: index = MultiIndex.from_product([range(3) for _ in levels], names=levels)

In [12]: df = DataFrame(random_integers(0, 100, size=(len(index), len(columns))), index=index, columns=columns)

In [13]: df
Out[13]: 
       n_foos  n_bars
A B C                
0 0 0      35      91
    1      17      89
    2      20      80
  1 0      35      12
    1      14       0
    2      78      23
  2 0      61      93
    1      86      26
    2      78      76
1 0 0      25      93
    1      46      87
    2      14      21
  1 0      78      33
    1      59      54
    2      63       2
  2 0      40      99
    1      57      78
    2      99      22
2 0 0      48      78
    1      18      58
    2      12      87
  1 0      86      96
    1       9      47
    2       8      40
  2 0      93      32
    1      30      69
    2      74      15

[27 rows x 2 columns]

In [15]: df.loc[(1, slice(None), 2), :]
Out[15]: 
       n_foos  n_bars
A B C                
1 0 2      14      21
  1 2      63       2
  2 2      99      22

[3 rows x 2 columns]

And reference by name using a dict:

In [15]: df.loc[dict(A=1, C=2), :]
Out[15]: 
       n_foos  n_bars
A B C                
1 0 2      14      21
  1 2      63       2
  2 2      99      22

[3 rows x 2 columns]

Another possibility is to use keyword arguments in a function instead of a dict in a getitem. Not sure if xs is appropriate, but consider as an example df.xs(A=1, C=2).

(well, maybe not. The conciseness is nice but it can't refer to non-named levels)

SylvainCorlay · 2014-11-05T16:30:27Z

+1 on allowing empty slicing.

dragoljub · 2014-11-05T18:24:52Z

Hey cool to see this being revived! 👍

I also like how the curly brace dict notation looks. 😄 Althought the dict(A=1, C=2) seems more natural. Its interesting the parsing the dict constructor does to infer the string column name. However you will not be able to specify the index level with dict(0=3, 2=2), but you could do {0:2, 2:2} if you were so inclined. 😎

In [15]: df.loc[{'A':1, 'C':2}, :]
Out[15]: 
       n_foos  n_bars
A B C                
1 0 2      14      21
  1 2      63       2
  2 2      99      22

[3 rows x 2 columns]

shoyer · 2014-11-05T18:33:48Z

In xray (pandas for N-dim data) we have somewhat similar support for doing indexing with named axes (rather than levels) via the keyword arguments in a function pattern, e.g., ds.sel(time='2000-01-01', letter=slice('a', 'z')). This works pretty well as long as all the levels/dimensions are named (xray enforces this). I have also considered parsing dictionaries in loc as well, but haven't gotten around to implementing it yet.

jreback · 2014-11-06T00:45:14Z

so we can revisit this

someone want to post a short (but complete) usecase / example and proposed syntax (in a new issue)?

kianho · 2016-03-21T05:19:09Z

👍 on this feature, would find it very useful

toobaz · 2017-12-04T15:35:42Z

More related SO questions 1, 2

aurelije · 2020-10-13T09:52:39Z

So is it somehow possible to slice Series on part of multindex without loosing levels that I need for further computation? I have tried slice(None) and IndexSlice... The only thing that really helped was to box Series to DataFrame, somehow the same slicing in dataframe doesn't drop levels... Why do we have such inconsistency between Series and DataFrames?

toobaz · 2020-10-13T09:59:36Z

Why do we have such inconsistency between Series and DataFrames?

@aurelije can you provide a simple reproducible example of such inconsistency? Are you sure you did pass indexers for both axes when indexing the DataFrame?

aurelije · 2020-10-13T10:33:59Z

@toobaz famous Operation Research example:

plants_s = pd.Index(['seattle', 'san-diego'], name='plant')

markets_s = pd.Index(['new-york', 'chicago', 'topeka'], name='market')

unit_cost_from_plant_to_market = pd.Series(
...     [2.5, 1.7, 1.8, 2.5, 1.8, 1.4],
...     index=pd.MultiIndex.from_product([plants_s, markets_s]),
...     name='unit_cost'
... )

unit_cost_from_plant_to_market

plant      market  
seattle    new-york    2.5
           chicago     1.7
           topeka      1.8
san-diego  new-york    2.5
           chicago     1.8
           topeka      1.4
Name: unit_cost, dtype: float64

Now dropping one of the level:

unit_cost_from_plant_to_market.loc[slice(None), 'new-york'] # .loc[(slice(None), 'new-york')] gives the same result

plant
seattle      2.5
san-diego    2.5
Name: unit_cost, dtype: float64

unit_cost_from_plant_to_market.loc['seattle', slice(None)]

market
new-york    2.5
chicago     1.7
topeka      1.8
Name: unit_cost, dtype: float64

Same thing in Dataframe doesn't drop the level:

unit_cost_from_plant_to_market.to_frame().loc[('seattle', slice(None)), :]

                  unit_cost
plant   market             
seattle new-york        2.5
        chicago         1.7
        topeka          1.8

toobaz · 2020-10-13T11:10:41Z

@toobaz famous Operation Research example:

I think this is related to #12827 (comment) more than to this issue. Notice that the behavior you look for (not dropping the level) is what we (pandas) would like to get rid of, but the same result can be easily obtained by wrapping the label in a list:

In [13]: unit_cost_from_plant_to_market.loc[['seattle'], slice(None)]                          
Out[13]: 
plant    market  
seattle  new-york    2.5
         chicago     1.7
         topeka      1.8
Name: unit_cost, dtype: float64

aurelije · 2020-10-13T14:39:24Z

Thanks a lot @toobaz, I was not aware of that functionality

mroeschke · 2024-09-12T21:21:46Z

Looks like this feature request hasn't has much engagement over the years so closing. If there's renewed interest it would be best to have it in a a new issue

dragoljub mentioned this issue Aug 5, 2013

ENH: add expression evaluation functionality via eval #4164

Merged

64 tasks

ghost assigned jreback Dec 18, 2013

jreback mentioned this issue Dec 18, 2013

Allow level wildcard via slice(None) in df.ix[] with MultiIndex #2425

Closed

ghost mentioned this issue Jan 27, 2014

ENH: add per axis, per level indexing with tuples using loc #6134

Closed

jreback mentioned this issue Feb 8, 2014

ENH: per axis and per level indexing (orig GH6134) #6301

Merged

4 tasks

jreback closed this as completed in #6301 Feb 13, 2014

jreback reopened this Feb 13, 2014

jreback modified the milestones: 0.15.0, 0.14.0 Mar 11, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jreback mentioned this issue Oct 9, 2015

Cannot subclass MultiIndex #11267

Closed

jreback mentioned this issue Oct 19, 2015

Request for some kind of named arguments loc #11373

Closed

Dr-Irv mentioned this issue Dec 23, 2015

Slicing with no keys found #10695

Closed

max-sixty mentioned this issue May 20, 2016

Idea: use df.index/df.columns names to automatically choose axis along which to broadcast #13243

Closed

makmanalp mentioned this issue May 24, 2016

Allowing the index to be referenced by name, like a column #8162

Closed

3 tasks

jreback mentioned this issue Mar 18, 2018

ENH: Allow list of keys in Dataframe.xs for each level #20397

Closed

mroeschke removed the Ideas Long-Term Enhancement Discussions label Apr 4, 2020

mroeschke added the Enhancement label Apr 4, 2020

jreback removed their assignment Dec 23, 2020

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

gwerbin mentioned this issue May 31, 2024

ENH: Support for semi-join on index (subsetting a multi-index dataframe with a subset of the multi-index) #58873

Open

3 tasks

mroeschke closed this as completed Sep 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Partial Selection on MultiIndex: The need for empty slice support & dict indexing #4036

Partial Selection on MultiIndex: The need for empty slice support & dict indexing #4036

dragoljub commented Jun 26, 2013

dragoljub commented Jun 30, 2013

hsharrison commented Feb 13, 2014

jreback commented Feb 13, 2014

hsharrison commented Feb 14, 2014

SylvainCorlay commented Nov 5, 2014

dragoljub commented Nov 5, 2014

shoyer commented Nov 5, 2014

jreback commented Nov 6, 2014

kianho commented Mar 21, 2016

toobaz commented Dec 4, 2017

aurelije commented Oct 13, 2020

toobaz commented Oct 13, 2020

aurelije commented Oct 13, 2020 •

edited

Loading

toobaz commented Oct 13, 2020

aurelije commented Oct 13, 2020

mroeschke commented Sep 12, 2024

Partial Selection on MultiIndex: The need for empty slice support & dict indexing #4036

Partial Selection on MultiIndex: The need for empty slice support & dict indexing #4036

Comments

dragoljub commented Jun 26, 2013

dragoljub commented Jun 30, 2013

hsharrison commented Feb 13, 2014

jreback commented Feb 13, 2014

hsharrison commented Feb 14, 2014

SylvainCorlay commented Nov 5, 2014

dragoljub commented Nov 5, 2014

shoyer commented Nov 5, 2014

jreback commented Nov 6, 2014

kianho commented Mar 21, 2016

toobaz commented Dec 4, 2017

aurelije commented Oct 13, 2020

toobaz commented Oct 13, 2020

aurelije commented Oct 13, 2020 • edited Loading

toobaz commented Oct 13, 2020

aurelije commented Oct 13, 2020

mroeschke commented Sep 12, 2024

aurelije commented Oct 13, 2020 •

edited

Loading