Proposal: Deprecating support of incomplete indexing on MultiIndexes #10574

Closed
tgarc opened this issue Jul 15, 2015 · 4 comments

Comments

@tgarc

tgarc commented Jul 15, 2015

First, here's the example DataFrame

                    0  1  2  3  4  5  6  7
first second third                        
bar   one    three  4  9  7  8  5  0  7  8
             four   6  8  1  5  9  9  1  7
      two    three  7  5  2  6  7  8  5  9
             four   0  8  8  5  3  5  8  3
baz   one    three  6  0  8  0  0  9  8  8
             four   9  3  2  0  2  7  4  9
      two    three  0  3  7  7  4  3  7  0
             four   6  3  2  8  3  9  7  8
foo   one    three  6  7  3  7  3  0  3  6
             four   5  8  0  8  1  5  1  5
      two    three  2  0  8  2  8  1  8  3
             four   9  0  2  7  0  6  8  3
qux   one    three  1  2  5  5  0  7  0  1
             four   6  6  7  0  0  4  5  3
      two    three  1  8  2  8  7  5  7  5
             four   1  1  3  8  8  6  0  3
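
The code that produced this frame is not shown. A minimal sketch that rebuilds it, assuming the index is the full product of the three label lists in printed order, with the values copied row by row from the printout:

```python
import numpy as np
import pandas as pd

# Assumption: the index is the product of these label lists, in printed order.
index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"], ["three", "four"]],
    names=["first", "second", "third"],
)

# Values copied from the printout, one row per index entry.
values = np.array([
    [4, 9, 7, 8, 5, 0, 7, 8],
    [6, 8, 1, 5, 9, 9, 1, 7],
    [7, 5, 2, 6, 7, 8, 5, 9],
    [0, 8, 8, 5, 3, 5, 8, 3],
    [6, 0, 8, 0, 0, 9, 8, 8],
    [9, 3, 2, 0, 2, 7, 4, 9],
    [0, 3, 7, 7, 4, 3, 7, 0],
    [6, 3, 2, 8, 3, 9, 7, 8],
    [6, 7, 3, 7, 3, 0, 3, 6],
    [5, 8, 0, 8, 1, 5, 1, 5],
    [2, 0, 8, 2, 8, 1, 8, 3],
    [9, 0, 2, 7, 0, 6, 8, 3],
    [1, 2, 5, 5, 0, 7, 0, 1],
    [6, 6, 7, 0, 0, 4, 5, 3],
    [1, 8, 2, 8, 7, 5, 7, 5],
    [1, 1, 3, 8, 8, 6, 0, 3],
])

# Columns default to the integer labels 0..7, matching the printout.
df = pd.DataFrame(values, index=index)
```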

For DataFrames with MultiIndexed rows, pandas allows this type of indexing

df.loc[('foo','bar'), ('one','two'), ('three','four')]

To be taken to mean

df.loc[(('foo','bar'), ('one','two'), ('three','four')), :]

But this type of indexing is ambiguous in the case when the number of indexing tuples is 2 since

df.loc[('foo','bar'), ('one','two')]

could mean incomplete indexing as in

df.loc[(('foo','bar'), ('one','two')),:]

or row,column indexing as in

df.loc[(('foo','bar'),), (('one','two'),)]
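
The root of the ambiguity is in Python's subscript syntax: `obj[a, b]` and `obj[(a, b)]` hand exactly the same tuple to `__getitem__`, so the indexer cannot tell the two spellings apart. A small sketch using a hypothetical `KeyEcho` class (not part of pandas) that simply returns the key it receives:

```python
class KeyEcho:
    """Hypothetical helper: returns whatever key Python passes to __getitem__."""

    def __getitem__(self, key):
        return key


echo = KeyEcho()

# Two comma-separated arguments and one parenthesized tuple are the same key.
two_args = echo[("foo", "bar"), ("one", "two")]
one_tuple = echo[(("foo", "bar"), ("one", "two"))]
assert two_args == one_tuple

# Adding an explicit column slice produces a structurally different key:
# a 2-tuple whose second element is slice(None).
explicit = echo[(("foo", "bar"), ("one", "two")), :]
assert explicit != two_args
```

Only the last form carries enough structure to say which part addresses rows and which addresses columns.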

I appreciate that there is already a warning for this in the documentation, but I wonder if the functionality is worth the complications it adds to the code/docs.

Personally, I would suggest offloading the responsibility of complete indexing on a MultiIndex DataFrame to the user (obviously this doesn't apply to Series as they are 1d so to speak). This would take away the minor syntactical convenience of not specifying the column index, but it simplifies the code and gives the user only one way to index on a MultiIndex DataFrame (which makes usage less confusing).

The consequence to the user in the specific case of selecting multiple levels of a row-MultiIndex on a DataFrame is that instead of writing

df.loc['foo','one']

they would have to write

df.loc[('foo','one'), :]

And, in the syntactically worst case, instead of writing

df.loc[('foo','bar'), ('one','two'), ('three','four')]

they would have to write

df.loc[(('foo','bar'), ('one','two'), ('three','four')), :]
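
A sketch of the explicit forms, assuming a recent pandas version and a frame indexed like the example (the values are placeholders from `np.arange`, and the variable names are illustrative). It uses lists rather than inner tuples to pick several labels per level, which is the form the pandas documentation recommends; that substitution is an assumption of this sketch, not part of the proposal.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"], ["three", "four"]],
    names=["first", "second", "third"],
)
# Placeholder values; sort the index so label-based selection is well defined.
df = pd.DataFrame(np.arange(16 * 8).reshape(16, 8), index=index).sort_index()

# Partial row key plus an explicit column slice: rows with first == "foo"
# and second == "one", all columns.
partial = df.loc[("foo", "one"), :]

# Several labels per level, again with the column axis spelled out.
multi = df.loc[(["foo", "bar"], ["one", "two"], ["three", "four"]), :]
```

In both calls the trailing `:` makes it unambiguous that the tuple addresses rows only.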

I'm fairly new to pandas (don't think I started using it until v0.16), so I realize I may be missing the bigger picture. If so, enlighten me!

@jreback
Contributor

jreback commented Jul 15, 2015

It's not a question of whether we should do this.

It's just ambiguous, and I don't think it's possible in the general case. However, if you would like to try to fix it, be my guest.

@jreback jreback added this to the Someday milestone Jul 15, 2015
@tgarc
Author

tgarc commented Jul 15, 2015

@jreback I'm not sure I understand your comment. What I was saying is that I don't think incomplete indexing can be done in any reasonable way in the general case, which is why I was advocating deprecating it for MultiIndexed DataFrames.

@jreback
Contributor

jreback commented Jul 15, 2015

@tgarc my point is that the following are both completely legitimate, but mean different things. There is no way to disambiguate what is meant except for the user providing context (e.g. both axes). So how do you propose to deprecate this then?

In [35]: df = DataFrame(np.arange(12).reshape(4,3),columns=[0,2,1],index=MultiIndex.from_product([range(2),range(2)],names=['first','second'])).sortlevel()

In [36]: df
Out[36]: 
              0   2   1
first second           
0     0       0   1   2
      1       3   4   5
1     0       6   7   8
      1       9  10  11

In [37]: df.loc[(0,[1]),:]
Out[37]: 
              0  2  1
first second         
0     1       3  4  5

In [38]: df.loc[(0,[1])]
Out[38]: 
        1
second   
0       2
1       5
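
With both axes spelled out there is nothing left to guess. A sketch of the two readings written explicitly, assuming a recent pandas version (where `sort_index` replaces the since-deprecated `sortlevel`); the variable names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    columns=[0, 2, 1],
    index=pd.MultiIndex.from_product(
        [range(2), range(2)], names=["first", "second"]
    ),
).sort_index()

# Reading 1: (0, [1]) is a row key (first == 0, second in [1]); keep all columns.
as_rows = df.loc[(0, [1]), :]

# Reading 2: 0 selects rows on the first level, [1] selects the column labelled 1.
as_row_and_column = df.loc[(0, slice(None)), [1]]
```

The first call returns the single row holding 3, 4, 5; the second returns the values 2 and 5 from the column labelled 1.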

@toobaz
Member

toobaz commented May 18, 2018

Closing since there is no obvious recommendation, other issues (#19110 for instance) face the same problem, and there doesn't seem to be interest in just disabling incomplete indexing. @tgarc feel free to argue the point if you still think this is valid.

@toobaz toobaz closed this as completed May 18, 2018