Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partial Selection on MultiIndex: The need for empty slice support & dict indexing #4036

Closed
dragoljub opened this issue Jun 26, 2013 · 16 comments · Fixed by #6301
Closed

Partial Selection on MultiIndex: The need for empty slice support & dict indexing #4036

dragoljub opened this issue Jun 26, 2013 · 16 comments · Fixed by #6301
Labels
API Design Enhancement Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex

Comments

@dragoljub
Copy link

related #4036, #4116
from SO with 0.14.0 multi-index slicers: http://stackoverflow.com/questions/24126542/pandas-multi-index-slices-for-level-names/24126676#24126676

Here's the example from there:

In [11]: midx = pd.MultiIndex.from_product([list(range(3)),['a','b','c'],pd.date_range('20130101',periods=3)],names=['numbers','letters','dates'])

In [12]: midx.names.index('letters')
Out[12]: 1

In [13]: midx.names.index('dates')
Out[13]: 2

Here's a complete example

In [18]: df = DataFrame(np.random.randn(len(midx),1),index=midx)

In [19]: df
Out[19]: 
                                   0
numbers letters dates               
0       a       2013-01-01  0.261092
                2013-01-02 -1.267770
                2013-01-03  0.008230
        b       2013-01-01 -1.515866
                2013-01-02  0.351942
                2013-01-03 -0.245463
        c       2013-01-01 -0.253103
                2013-01-02 -0.385411
                2013-01-03 -1.740821
1       a       2013-01-01 -0.108325
                2013-01-02 -0.212350
                2013-01-03  0.021097
        b       2013-01-01 -1.922214
                2013-01-02 -1.769003
                2013-01-03 -0.594216
        c       2013-01-01 -0.419775
                2013-01-02  1.511700
                2013-01-03  0.994332
2       a       2013-01-01 -0.020299
                2013-01-02 -0.749474
                2013-01-03 -1.478558
        b       2013-01-01 -1.357671
                2013-01-02  0.161185
                2013-01-03 -0.658246
        c       2013-01-01 -0.564796
                2013-01-02 -0.333106
                2013-01-03 -2.814611

This is your dict of level names -> slices

In [20]: slicers = { 'numbers' : slice(0,1), 'dates' : slice('20130102','20130103') }

This creates an indexer that is empty (selects everything)

In [21]: indexer = [ slice(None) ] * len(df.index.levels)

Add in your slicers

In [22]: for n, idx in slicers.items():
              indexer[df.index.names.index(n)] = idx

And select (this has to be a tuple, but was a list to start as we had to modify it)

In [23]: df.loc[tuple(indexer),:]
Out[23]: 
                                   0
numbers letters dates               
0       a       2013-01-02 -1.267770
                2013-01-03  0.008230
        b       2013-01-02  0.351942
                2013-01-03 -0.245463
        c       2013-01-02 -0.385411
                2013-01-03 -1.740821
1       a       2013-01-02 -0.212350
                2013-01-03  0.021097
        b       2013-01-02 -1.769003
                2013-01-03 -0.594216
        c       2013-01-02  1.511700
                2013-01-03  0.994332

I use hierarchical indices regularly with pandas DataFrames and Series objects. It is invaluable to be able to partially select subsets of rows based on a set of arbitrary index values, and retain the index information for subsequent groupby operations etc.

I am looking for an elegant way to pass an ordered tuple (with possibly empty slices) or an arbitrary dict of {index_level_name:value,...} pairs to select rows matching the passed index:value pairs. Note: I am aware that with Boolean indexing on data columns and nested np.logical_and() statements you can construct such a Boolean select index. I'm looking for an elegant solution using indexes & levels to avoid repeatedly using df.reset_index and building Boolean arrays. Also, df.xs() does not work in every situation (see below) and does not exist for Series with MultiIndex.

To explain this lets create a DataFrame with 5 index levels:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: print pd.__version__
0.10.0

In [4]: # Generate Test DataFrame
   ...: NUM_ROWS = 100000
   ...: 

In [5]: NUM_COLS = 10

In [6]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]

In [7]: index_cols = col_names[:5]

In [8]: # Set DataFrame to have 5 level Hierarchical Index.
   ...: # The dtype does not matter try str or np.int64 same results.
   ...: # Sort the index!
   ...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
   ...: 

In [9]: df = df.set_index(index_cols).sort_index()

In [10]: df
Out[10]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 0, 0, 0, 0) to (4, 4, 4, 4, 4)
Data columns:
A5    100000  non-null values
A6    100000  non-null values
A7    100000  non-null values
A8    100000  non-null values
A9    100000  non-null values
dtypes: int64(5)

In [11]: df.index.names
Out[11]: ['A0', 'A1', 'A2', 'A3', 'A4']

Now index on every level and we get back the rows we want :) I love that I get back the complete index too because it may be useful later.

In [12]: df.ix[(0,1,2,3,4)]
Out[12]:                 A5  A6  A7  A8  A9
A0 A1 A2 A3 A4                    
0  1  2  3  4    1   3   1   1   3
            4    4   3   4   2   4
            4    0   2   3   1   3
      ...
            4    1   1   3   4   3
            4    0   1   2   4   1

Now if we index on the first 4 levels we get back something different, a data frame with the first 4 index levels dropped. It would be nice to have the option to keep all index levels even though they are repetitive (like above).

In [13]: df.ix[(0,1,2,3)]
Out[13]: <class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 0 to 4
Data columns:
A5    144  non-null values
A6    144  non-null values
A7    144  non-null values
A8    144  non-null values
A9    144  non-null values
dtypes: int64(5)

Now comes the tricky part. What if I only want to index on the first and last 2 index levels, and want everything from the 3rd level? Empty slicing is not supported.

In [14]: df.ix[(0,1,:,3,4)]
  File "<ipython-input-14-7e44e59fa5b1>", line 1
    df.ix[(0,1,:,3,4)]
               ^
SyntaxError: invalid syntax

In [15]: df.ix[(0,1,slice(None),3,4)]
---------------------------------------------------------------------------
IndexingError                             Traceback (most recent call last)
<ipython-input-15-0e4517ae7fc5> in <module>()
----> 1 df.ix[(0,1,slice(None),3,4)]

C:\Python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
     30                 pass
     31 
---> 32             return self._getitem_tuple(key)
     33         else:
     34             return self._getitem_axis(key, axis=0)

C:\Python27\lib\site-packages\pandas\core\indexing.pyc in _getitem_tuple(self, tup)
    212         for i, key in enumerate(tup):
    213             if i >= self.obj.ndim:
--> 214                 raise IndexingError('Too many indexers')
    215 
    216             if _is_null_slice(key):

IndexingError: Too many indexers

df.xs can somewhat help here but its useless for MultiIndex on a series. And it drops the indexed levels leaving you unsure to what fixed index levels you have drilled to. :(

In [16]: df.xs((0,2,3),level=df.index.names[::2])
Out[16]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 805 entries, (0, 0) to (4, 4)
Data columns:
A5    805  non-null values
A6    805  non-null values
A7    805  non-null values
A8    805  non-null values
A9    805  non-null values
dtypes: int64(5)

Interestingly df.xs() is not consistant, because you cannot explicitly index on every level giving it the list of all level names:

In [17]: df.xs((0,1,2,3,4), level=df.index.names)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-d0e373dfcd5f> in <module>()
----> 1 df.xs((0,1,2,3,4), level=df.index.names)

C:\Python27\lib\site-packages\pandas\core\frame.pyc in xs(self, key, axis, level, copy)
   2233         labels = self._get_axis(axis)
   2234         if level is not None:
-> 2235             loc, new_ax = labels.get_loc_level(key, level=level)
   2236 
   2237             if not copy and not isinstance(loc, slice):

C:\Python27\lib\site-packages\pandas\core\index.pyc in get_loc_level(self, key, level)
   2193 
   2194                 result = loc if result is None else result & loc
-> 2195             return result, _drop_levels(result, level)
   2196 
   2197         level = self._get_level_number(level)

C:\Python27\lib\site-packages\pandas\core\index.pyc in _drop_levels(indexer, levels)
   2177             levels = [self._get_level_number(i) for i in levels]
   2178             for i in sorted(levels, reverse=True):
-> 2179                 new_index = new_index.droplevel(i)
   2180             return new_index
   2181 

AttributeError: 'Int64Index' object has no attribute 'droplevel'

However df.xs without the level attribute on all index levels works as expected...

In [18]: df.xs((0,1,2,3,4))
Out[18]:                 A5  A6  A7  A8  A9
A0 A1 A2 A3 A4                    
0  1  2  3  4    1   3   1   1   3
            4    4   3   4   2   4
            4    0   2   3   1   3
           ...
            4    1   1   3   4   3
            4    0   1   2   4   1

Thoughts:
One (somewhat limiting) solution could be allowing df.ix[(0,1,3,:,4)] to take an empty slice for an index level and return the data frame indexed on only the the passed index levels that are known. Today this capability does not exist, although an ordered partial list of index levels works.

The next and more general approach could be to pass a dict of df.ix[{index_level:value}] pairs and return the rows where the specified index levels equal the passed values. Unspecified levels are not filtered down and we have the option to return all index levels.

@dragoljub
Copy link
Author

This is somewhat related to Issue #3057. Being able to select on arbitrary combinations of index levels and return arbitrary index levels.

@hsharrison
Copy link

@jreback

I did look through the new code and the docs so apologies if I missed this, but it would still be nice to be able to select by index names, as in @dragoljub 's suggestion df.ix[{index_level:value}].

Obviously this is a minor convenience compared to what you just merged, so thank you for that. But maybe this could be left open? Or should a separate issue be opened?

@jreback
Copy link
Contributor

jreback commented Feb 13, 2014

@hsharrison ahh...did even see that, so you want to pass a dict (rather than a nested tuple) bascically? (with keys of the level numbers or names)....

hmm...I don't think that would be that difficult....

can you post an example of using the partial slicing syntax (what I just merged) and your proposed
addition? so we can see side-by-side just what it looks like? thanks...

@jreback jreback reopened this Feb 13, 2014
@hsharrison
Copy link

In [8]: levels = ['A', 'B', 'C']

In [9]: columns = ['n_foos', 'n_bars']

In [10]: index = MultiIndex.from_product([range(3) for _ in levels], names=levels)

In [12]: df = DataFrame(random_integers(0, 100, size=(len(index), len(columns))), index=index, columns=columns)

In [13]: df
Out[13]: 
       n_foos  n_bars
A B C                
0 0 0      35      91
    1      17      89
    2      20      80
  1 0      35      12
    1      14       0
    2      78      23
  2 0      61      93
    1      86      26
    2      78      76
1 0 0      25      93
    1      46      87
    2      14      21
  1 0      78      33
    1      59      54
    2      63       2
  2 0      40      99
    1      57      78
    2      99      22
2 0 0      48      78
    1      18      58
    2      12      87
  1 0      86      96
    1       9      47
    2       8      40
  2 0      93      32
    1      30      69
    2      74      15

[27 rows x 2 columns]

In [15]: df.loc[(1, slice(None), 2), :]
Out[15]: 
       n_foos  n_bars
A B C                
1 0 2      14      21
  1 2      63       2
  2 2      99      22

[3 rows x 2 columns]

And reference by name using a dict:

In [15]: df.loc[dict(A=1, C=2), :]
Out[15]: 
       n_foos  n_bars
A B C                
1 0 2      14      21
  1 2      63       2
  2 2      99      22

[3 rows x 2 columns]

Another possibility is to use keyword arguments in a function instead of a dict in a getitem. Not sure if xs is appropriate, but consider as an example df.xs(A=1, C=2).

(well, maybe not. The conciseness is nice but it can't refer to non-named levels)

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Mar 11, 2014
@SylvainCorlay
Copy link
Contributor

+1 on allowing empty slicing.

@dragoljub
Copy link
Author

Hey cool to see this being revived! 👍

I also like how the curly brace dict notation looks. 😄 Althought the dict(A=1, C=2) seems more natural. Its interesting the parsing the dict constructor does to infer the string column name. However you will not be able to specify the index level with dict(0=3, 2=2), but you could do {0:2, 2:2} if you were so inclined. 😎

In [15]: df.loc[{'A':1, 'C':2}, :]
Out[15]: 
       n_foos  n_bars
A B C                
1 0 2      14      21
  1 2      63       2
  2 2      99      22

[3 rows x 2 columns]

@shoyer
Copy link
Member

shoyer commented Nov 5, 2014

In xray (pandas for N-dim data) we have somewhat similar support for doing indexing with named axes (rather than levels) via the keyword arguments in a function pattern, e.g., ds.sel(time='2000-01-01', letter=slice('a', 'z')). This works pretty well as long as all the levels/dimensions are named (xray enforces this). I have also considered parsing dictionaries in loc as well, but haven't gotten around to implementing it yet.

@jreback
Copy link
Contributor

jreback commented Nov 6, 2014

so we can revisit this

someone want to post a short (but complete) usecase / example and proposed syntax (in a new issue)?

@kianho
Copy link

kianho commented Mar 21, 2016

👍 on this feature, would find it very useful

@toobaz
Copy link
Member

toobaz commented Dec 4, 2017

More related SO questions 1, 2

@mroeschke mroeschke removed the Ideas Long-Term Enhancement Discussions label Apr 4, 2020
@aurelije
Copy link

So is it somehow possible to slice Series on part of multindex without loosing levels that I need for further computation? I have tried slice(None) and IndexSlice... The only thing that really helped was to box Series to DataFrame, somehow the same slicing in dataframe doesn't drop levels... Why do we have such inconsistency between Series and DataFrames?

@toobaz
Copy link
Member

toobaz commented Oct 13, 2020

Why do we have such inconsistency between Series and DataFrames?

@aurelije can you provide a simple reproducible example of such inconsistency? Are you sure you did pass indexers for both axes when indexing the DataFrame?

@aurelije
Copy link

aurelije commented Oct 13, 2020

@toobaz famous Operation Research example:

plants_s = pd.Index(['seattle', 'san-diego'], name='plant')

markets_s = pd.Index(['new-york', 'chicago', 'topeka'], name='market')

unit_cost_from_plant_to_market = pd.Series(
...     [2.5, 1.7, 1.8, 2.5, 1.8, 1.4],
...     index=pd.MultiIndex.from_product([plants_s, markets_s]),
...     name='unit_cost'
... )

unit_cost_from_plant_to_market

plant      market  
seattle    new-york    2.5
           chicago     1.7
           topeka      1.8
san-diego  new-york    2.5
           chicago     1.8
           topeka      1.4
Name: unit_cost, dtype: float64

Now dropping one of the level:

unit_cost_from_plant_to_market.loc[slice(None), 'new-york'] # .loc[(slice(None), 'new-york')] gives the same result

plant
seattle      2.5
san-diego    2.5
Name: unit_cost, dtype: float64

unit_cost_from_plant_to_market.loc['seattle', slice(None)]

market
new-york    2.5
chicago     1.7
topeka      1.8
Name: unit_cost, dtype: float64

Same thing in Dataframe doesn't drop the level:

unit_cost_from_plant_to_market.to_frame().loc[('seattle', slice(None)), :]

                  unit_cost
plant   market             
seattle new-york        2.5
        chicago         1.7
        topeka          1.8

@toobaz
Copy link
Member

toobaz commented Oct 13, 2020

@toobaz famous Operation Research example:

I think this is related to #12827 (comment) more than to this issue. Notice that the behavior you look for (not dropping the level) is what we (pandas) would like to get rid of, but the same result can be easily obtained by wrapping the label in a list:

In [13]: unit_cost_from_plant_to_market.loc[['seattle'], slice(None)]                          
Out[13]: 
plant    market  
seattle  new-york    2.5
         chicago     1.7
         topeka      1.8
Name: unit_cost, dtype: float64

@aurelije
Copy link

Thanks a lot @toobaz, I was not aware of that functionality

@jreback jreback removed their assignment Dec 23, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@mroeschke
Copy link
Member

Looks like this feature request hasn't has much engagement over the years so closing. If there's renewed interest it would be best to have it in a a new issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Enhancement Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants