DEPR/API: tuples in labels #24688

h-vetinari · 2019-01-09T21:08:00Z

I haven't found a previous issue for this despite a bunch of searching, apologies if I overlooked something.

Yes, tuples are hashable and can therefore be used for indexing in principle, but not least with using df.loc[(lvl1, lvl2), col] for MultiIndexing, it becomes a big complexity burden to allow tuples as labels.

There are also a couple of places where some inputs take either labels or arrays (e.g. DataFrame.set_index), and it's a hassle to get all the corner cases right, because one needs to check if a tuple is a valid key (both for the frame and the MI.names), and interpret it as a list-like otherwise (or whatever the precedence might be within the method).

Tuples are also notoriously hard to get into an array, because numpy will often change the semantics if it's not is_scalar.

>>> pd.MultiIndex.from_arrays([[1, 2, 3], [1, (1, 2), 1]])
ValueError: setting an array element with a sequence.

Furthermore, there's lots of bugs hiding because many of these corner cases are not tested:

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [(1, 1), (1, 2), (2, 2)], 'c': [10, 20, 30]})
>>> df
   a       b   c
0  1  (1, 1)  10
1  2  (1, 2)  20
2  3  (2, 2)  30
>>> df2 = df.set_index(['a', 'b'])
>>> df2
           c
a b
1 (1, 1)  10
2 (1, 2)  20
3 (2, 2)  30
>>>
>>> # should ALL be 10
>>> df2.iloc[0, 0]
10
>>> df2.loc[(1, (1, 1)), 'c']
Series([], Name: c, dtype: int64)
>>> df2.loc[1].c.loc[(1, 1)]
TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [1] of <class 'int'>
>>> df2.loc[1].loc[(1, 1), 'c']
KeyError: "None of [Int64Index([1, 1], dtype='int64', name='b')] are in the [index]"
>>> df2.loc[1].at[(1, 1)]
ValueError: At based indexing on an non-integer index can only have non-integer indexers
>>> df2.swaplevel().loc[((1, 1), 1)]
TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [1] of <class 'int'>
>>> df2.swaplevel().loc[(1, 1)].loc[1, 'c']  # Eureka! One that works!
10

I'd bet that there are 100s of similar cases that are hidden because the vast majority of tests does not check tuples as index entries.

@WillAyd just commented on an issue:

Using a tuple as a label is generally not supported, but if you want to take a look PRs are always welcome

Why allow it at all in this case? Why not just deprecate tuples in Index / column names / MultiIndex.names?

The text was updated successfully, but these errors were encountered:

WillAyd · 2019-01-09T21:25:04Z

Worth a discussion for sure. Taking into consideration my own workflow as a bias I've never intentionally found a use case for tuples as a label, though we get a surprising amount of issues pop up as a result so I suspect others do have workflows where this is valid

We just added a .flatten method to MultiIndex recently which intentionally collapses the levels into tuples which would need consideration here

jorisvandenbossche · 2019-01-09T21:50:41Z

As long as we have MultiIndex, I am not sure it is possible to get rid of all cases where you might end up with tuples.

First, when referring to a MultiIndex key, you need to use a tuple. Eg the set_index you mentioned. Even when disallowing tuples as column names, you would still need to use a tuple to refer to a column in case of MultiIndexed columns. So disallowing tuples will not solve the ambiguity problem here I think.

Another aspect is that some operations will give you tuples. flatten is one (recent) example that @WillAyd gave. But eg setting a column as the index (again with MultiIndexed columns), will give you an index level with a tuple as the name:

In [2]: df = pd.DataFrame(np.random.randn(5, 4), columns=pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']]))                                                                  

In [3]: df                                                                                                                                                                      
Out[3]: 
          A                   B          
          a         b         a         b
0 -0.659414  0.898374  1.378103  0.144247
1  0.167956  0.625324  2.122758  0.346808
2  1.282092  0.640114  0.668349 -1.614398
3  1.683959  0.347517 -2.566367  0.326387
4 -0.862953  0.608134 -0.405303 -0.145457

In [4]: df.set_index(('A', 'a'))                                                                                                                                                
Out[4]: 
                  A         B          
                  b         a         b
(A, a)                                 
-0.659414  0.898374  1.378103  0.144247
 0.167956  0.625324  2.122758  0.346808
 1.282092  0.640114  0.668349 -1.614398
 1.683959  0.347517 -2.566367  0.326387
-0.862953  0.608134 -0.405303 -0.145457

In a simplified data model where there would be no multi-indexed columns possible, it would make sense to restrict column names eg to only strings (and disallow tuples), but in the current pandas data model, I am not sure it is possible / desirable.

WillAyd · 2019-01-09T21:56:40Z

Even when disallowing tuples as column names, you would still need to use a tuple to refer to a column in case of MultiIndexed columns. So disallowing tuples will not solve the ambiguity problem here I think.

Wouldn't this exactly solve that? Today it's potentially unclear whether a tuple used as an indexer refers to a location in a MultiIndex or one particular label that happens to be a tuple. If we disallow the latter, the ambiguity is gone, no?

h-vetinari · 2019-01-09T22:00:15Z

@jorisvandenbossche: First, when referring to a MultiIndex key, you need to use a tuple. Eg the set_index you mentioned. Even when disallowing tuples as column names, you would still need to use a tuple to refer to a column in case of MultiIndexed columns. So disallowing tuples will not solve the ambiguity problem here I think.

This was badly formulated on my part. Of course, the tuples would be necessary for accessing a MultiIndex. I was talking about the (say:) atomic unit of labels within Index/MultiIndex. [just saw that @WillAyd posted essentially the same while I'm writing]

Your second example is intriguing and more relevant. Because with reset_index() it'd be back in the columns. Another thing (i.e. setting with MultiColumns) to consider deprecating for df.set_index, haha. ;-)

jorisvandenbossche · 2019-01-09T22:04:04Z

The ambiguity I referred to was the one between a tuple as values vs a tuple as a single label (but indeed, for sure, the amount of ambiguous cases decreases a bit :-))

Whether the tuple label is referring to a tuple key in a flat index or to a single key of multi-level columns, it is still this tuple that is conflicting with tuple as array-like (that is how I interpreted "because one needs to check if a tuple is a valid key (both for the frame and the MI.names), and interpret it as a list-like otherwise")

h-vetinari · 2019-01-09T22:07:31Z

@jorisvandenbossche

@hvetinari: Your second example is intriguing and more relevant. Because with reset_index() it'd be back in the columns.

On second thought, it may be sufficient to allow tuples in MultiIndex.names (which is not an Index), and when doing something like .reset_index the tuple would only be allowed to be inserted back into the columns if those are (as in your example) already a MultiIndex (with the right number of levels)

h-vetinari · 2019-01-09T22:12:33Z

@jorisvandenbossche: Whether the tuple label is referring to a tuple key in a flat index or to a single key of multi-level columns, it is still this tuple that is conflicting with tuple as array-like (that is how I interpreted "because one needs to check if a tuple is a valid key (both for the frame and the MI.names), and interpret it as a list-like otherwise")

You're right there as well, those cases would still need extra logic (for a good reason, as you showed in the example).

But I'm guessing the vast majority of bugs are due to your first point (i.e. using "tuple-as-single-label"), so that would still be a substantial improvement/simplification (considering the complexity that would be needed to not just crash and burn as the OP illustrated)

jorisvandenbossche · 2019-01-09T22:16:31Z

But I'm guessing the vast majority of bugs are due to your first point (i.e. using "tuple-as-single-label"),

Yes, if we are speaking about bugs and not ambiguities/complexities in the API, you are probably right.
Although I think the example bugs you showed is a quite esoteric use case (a MultiIndex where one level contains tuples). I think a more common case is a single flat index / columns that contains tuples (i.e. flattened MI), which might be a bit less buggy (just a guess).

summonholmes · 2019-01-12T17:11:34Z

But I'm guessing the vast majority of bugs are due to your first point (i.e. using "tuple-as-single-label"),

Yes, if we are speaking about bugs and not ambiguities/complexities in the API, you are probably right.
Although I think the example bugs you showed is a quite esoteric use case (a MultiIndex where one level contains tuples). I think a more common case is a single flat index / columns that contains tuples (i.e. flattened MI), which might be a bit less buggy (just a guess).

That's exactly what I have in my bug report and PR. I'd welcome this suggestion.

See the following code:

from pandas._libs import lib

lib.clean_index_list([(('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog'))])[0]

Output:
array([[['Turtle', 'Chicken'], [('Man', 'Monkey'), 'Dog']]], dtype=object)

simonjayhawkins · 2019-01-15T23:31:37Z

.iloc sets the label to a tuple when returning a series from a multi-level indexed DataFrame..

pandas/pandas/tests/indexing/multiindex/test_iloc.py

Lines 23 to 37 in f160c7d

    
           @pytest.mark.parametrize('indexer, expected', [ 
        
               (lambda df: df.iloc[0], 
        
                lambda arr: Series(arr[0], index=[[2, 2, 4], [6, 8, 10]], name=(4, 8))), 
        
               (lambda df: df.iloc[2], 
        
                lambda arr: Series(arr[2], index=[[2, 2, 4], [6, 8, 10]], name=(8, 12))), 
        
               (lambda df: df.iloc[:, 2], 
        
                lambda arr: Series( 
        
                    arr[:, 2], index=[[4, 4, 8], [8, 10, 12]], name=(4, 10))) 
        
           ]) 
        
           def test_iloc_returns_series(indexer, expected, simple_multiindex_dataframe): 
        
               arr = np.random.randn(3, 3) 
        
               df = simple_multiindex_dataframe(arr) 
        
               result = indexer(df) 
        
               expected = expected(arr) 
        
               tm.assert_series_equal(result, expected)

toobaz · 2019-02-04T08:59:42Z

Seeing this now only. I'm against suppressing tuples as values.

I think @h-vetinari 's strongest point is

Tuples are also notoriously hard to get into an array, because numpy will often change the semantics if it's not is_scalar.

... but even that is simply solved by recurring to stuff like construct_1d_object_array_from_listlike. And things will be even better when/if numpy will support ndmax.

Moreover, there is huge evidence of people using tuples as labels in their code. And even evidence of people who have no reasons to stop. And cases we can't provide better solutions.

Then, there is the more general argument of what exactly we would be forbidding. What about types which subclass tuples? Other hashable iterables? As in #24984 , in order to find not properly rewrite some internal code which should definitely be rewrite, we plan to complicate the (description) of the API.

By the way, disallowing tuples would mean at the very minimum showing a warning for every usage, for some time. That's much more annoying for me to implement than a couple of refactorings which we should do anyway.

The one argument that could convince me is that there are many cases of API ambiguity due to accepting tuples. However I can't find any. And on this topic,

and it's a hassle to get all the corner cases right, because one needs to check if a tuple is a valid key (both for the frame and the MI.names), and interpret it as a list-like otherwise (or whatever the precedence might be within the method).

This is wrong, or just backwards-compatible code to be removed. Tuples are not list-likes, and this is orthogonal to having them as atomic labels. And they can't be list-likes because they are MultiIndex keys. This was discussed several times. And yes, must be documented.

jbrockmendel · 2020-09-21T18:18:03Z

The use case I've had where tuple-labels were indispensible is where I had df.columns as a MultiIndex and then had to add another non-tuple column, which causes the MultiIndex to case to tuples.

WillAyd added Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action labels Jan 9, 2019

h-vetinari mentioned this issue Jan 10, 2019

DEPR/API: disallow lists within list for set_index #24697

Closed

5 tasks

summonholmes mentioned this issue Jan 12, 2019

BUG: Prevent 3D-ndarray for nested tuple labels (#24687) #24732

Closed

h-vetinari mentioned this issue Jan 15, 2019

is_list_like should return false for tuples #24702

Open

simonjayhawkins mentioned this issue Jan 15, 2019

Single-level tuple index to multi-level index on pd.concat #24783

Open

summonholmes mentioned this issue Jan 16, 2019

Styler background_gradient cmap with tuple indices sometimes resulting in ValueError #24687

Closed

This was referenced Jan 28, 2019

Regression in DataFrame.set_index with class instance column keys #24969

Closed

Towards "pandas 1.0" #10000

Closed

API: harmonize Index.to_flat_index to MultiIndex.to_flat_index #25094

Closed

simonjayhawkins mentioned this issue Jul 28, 2020

BUG: Shift on a group column when column name is a tuple-of-tuples results in NumPy VisibleDeprecationWarning #35434

Open

simonjayhawkins mentioned this issue Aug 4, 2020

BUG: Cannot select values in series using tuples when indexing with tuples in pandas 1.1.0 #35534

Closed

3 tasks

jbrockmendel added the Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). label Sep 22, 2020

jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Jan 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DEPR/API: tuples in labels #24688

DEPR/API: tuples in labels #24688

h-vetinari commented Jan 9, 2019

WillAyd commented Jan 9, 2019

jorisvandenbossche commented Jan 9, 2019

WillAyd commented Jan 9, 2019

h-vetinari commented Jan 9, 2019

jorisvandenbossche commented Jan 9, 2019

h-vetinari commented Jan 9, 2019

h-vetinari commented Jan 9, 2019 •

edited

Loading

jorisvandenbossche commented Jan 9, 2019 •

edited

Loading

summonholmes commented Jan 12, 2019 •

edited

Loading

simonjayhawkins commented Jan 15, 2019

toobaz commented Feb 4, 2019

jbrockmendel commented Sep 21, 2020

DEPR/API: tuples in labels #24688

DEPR/API: tuples in labels #24688

Comments

h-vetinari commented Jan 9, 2019

WillAyd commented Jan 9, 2019

jorisvandenbossche commented Jan 9, 2019

WillAyd commented Jan 9, 2019

h-vetinari commented Jan 9, 2019

jorisvandenbossche commented Jan 9, 2019

h-vetinari commented Jan 9, 2019

h-vetinari commented Jan 9, 2019 • edited Loading

jorisvandenbossche commented Jan 9, 2019 • edited Loading

summonholmes commented Jan 12, 2019 • edited Loading

simonjayhawkins commented Jan 15, 2019

toobaz commented Feb 4, 2019

jbrockmendel commented Sep 21, 2020

h-vetinari commented Jan 9, 2019 •

edited

Loading

jorisvandenbossche commented Jan 9, 2019 •

edited

Loading

summonholmes commented Jan 12, 2019 •

edited

Loading