Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR/API: tuples in labels #24688

Open
h-vetinari opened this issue Jan 9, 2019 · 12 comments
Open

DEPR/API: tuples in labels #24688

h-vetinari opened this issue Jan 9, 2019 · 12 comments
Labels
Closing Candidate May be closeable, needs more eyeballs Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.).

Comments

@h-vetinari
Copy link
Contributor

I haven't found a previous issue for this despite a bunch of searching, apologies if I overlooked something.

Yes, tuples are hashable and can therefore be used for indexing in principle, but not least with using df.loc[(lvl1, lvl2), col] for MultiIndexing, it becomes a big complexity burden to allow tuples as labels.

There are also a couple of places where some inputs take either labels or arrays (e.g. DataFrame.set_index), and it's a hassle to get all the corner cases right, because one needs to check if a tuple is a valid key (both for the frame and the MI.names), and interpret it as a list-like otherwise (or whatever the precedence might be within the method).

Tuples are also notoriously hard to get into an array, because numpy will often change the semantics if it's not is_scalar.

>>> pd.MultiIndex.from_arrays([[1, 2, 3], [1, (1, 2), 1]])
ValueError: setting an array element with a sequence.

Furthermore, there's lots of bugs hiding because many of these corner cases are not tested:

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [(1, 1), (1, 2), (2, 2)], 'c': [10, 20, 30]})
>>> df
   a       b   c
0  1  (1, 1)  10
1  2  (1, 2)  20
2  3  (2, 2)  30
>>> df2 = df.set_index(['a', 'b'])
>>> df2
           c
a b
1 (1, 1)  10
2 (1, 2)  20
3 (2, 2)  30
>>>
>>> # should ALL be 10
>>> df2.iloc[0, 0]
10
>>> df2.loc[(1, (1, 1)), 'c']
Series([], Name: c, dtype: int64)
>>> df2.loc[1].c.loc[(1, 1)]
TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [1] of <class 'int'>
>>> df2.loc[1].loc[(1, 1), 'c']
KeyError: "None of [Int64Index([1, 1], dtype='int64', name='b')] are in the [index]"
>>> df2.loc[1].at[(1, 1)]
ValueError: At based indexing on an non-integer index can only have non-integer indexers
>>> df2.swaplevel().loc[((1, 1), 1)]
TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [1] of <class 'int'>
>>> df2.swaplevel().loc[(1, 1)].loc[1, 'c']  # Eureka! One that works!
10

I'd bet that there are 100s of similar cases that are hidden because the vast majority of tests does not check tuples as index entries.

@WillAyd just commented on an issue:

Using a tuple as a label is generally not supported, but if you want to take a look PRs are always welcome

Why allow it at all in this case? Why not just deprecate tuples in Index / column names / MultiIndex.names?

@WillAyd
Copy link
Member

WillAyd commented Jan 9, 2019

Worth a discussion for sure. Taking into consideration my own workflow as a bias I've never intentionally found a use case for tuples as a label, though we get a surprising amount of issues pop up as a result so I suspect others do have workflows where this is valid

We just added a .flatten method to MultiIndex recently which intentionally collapses the levels into tuples which would need consideration here

@WillAyd WillAyd added Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action labels Jan 9, 2019
@jorisvandenbossche
Copy link
Member

As long as we have MultiIndex, I am not sure it is possible to get rid of all cases where you might end up with tuples.

First, when referring to a MultiIndex key, you need to use a tuple. Eg the set_index you mentioned. Even when disallowing tuples as column names, you would still need to use a tuple to refer to a column in case of MultiIndexed columns. So disallowing tuples will not solve the ambiguity problem here I think.

Another aspect is that some operations will give you tuples. flatten is one (recent) example that @WillAyd gave. But eg setting a column as the index (again with MultiIndexed columns), will give you an index level with a tuple as the name:

In [2]: df = pd.DataFrame(np.random.randn(5, 4), columns=pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']]))                                                                  

In [3]: df                                                                                                                                                                      
Out[3]: 
          A                   B          
          a         b         a         b
0 -0.659414  0.898374  1.378103  0.144247
1  0.167956  0.625324  2.122758  0.346808
2  1.282092  0.640114  0.668349 -1.614398
3  1.683959  0.347517 -2.566367  0.326387
4 -0.862953  0.608134 -0.405303 -0.145457

In [4]: df.set_index(('A', 'a'))                                                                                                                                                
Out[4]: 
                  A         B          
                  b         a         b
(A, a)                                 
-0.659414  0.898374  1.378103  0.144247
 0.167956  0.625324  2.122758  0.346808
 1.282092  0.640114  0.668349 -1.614398
 1.683959  0.347517 -2.566367  0.326387
-0.862953  0.608134 -0.405303 -0.145457

In a simplified data model where there would be no multi-indexed columns possible, it would make sense to restrict column names eg to only strings (and disallow tuples), but in the current pandas data model, I am not sure it is possible / desirable.

@WillAyd
Copy link
Member

WillAyd commented Jan 9, 2019

Even when disallowing tuples as column names, you would still need to use a tuple to refer to a column in case of MultiIndexed columns. So disallowing tuples will not solve the ambiguity problem here I think.

Wouldn't this exactly solve that? Today it's potentially unclear whether a tuple used as an indexer refers to a location in a MultiIndex or one particular label that happens to be a tuple. If we disallow the latter, the ambiguity is gone, no?

@h-vetinari
Copy link
Contributor Author

@jorisvandenbossche: First, when referring to a MultiIndex key, you need to use a tuple. Eg the set_index you mentioned. Even when disallowing tuples as column names, you would still need to use a tuple to refer to a column in case of MultiIndexed columns. So disallowing tuples will not solve the ambiguity problem here I think.

This was badly formulated on my part. Of course, the tuples would be necessary for accessing a MultiIndex. I was talking about the (say:) atomic unit of labels within Index/MultiIndex. [just saw that @WillAyd posted essentially the same while I'm writing]

Your second example is intriguing and more relevant. Because with reset_index() it'd be back in the columns. Another thing (i.e. setting with MultiColumns) to consider deprecating for df.set_index, haha. ;-)

@jorisvandenbossche
Copy link
Member

The ambiguity I referred to was the one between a tuple as values vs a tuple as a single label (but indeed, for sure, the amount of ambiguous cases decreases a bit :-))

Whether the tuple label is referring to a tuple key in a flat index or to a single key of multi-level columns, it is still this tuple that is conflicting with tuple as array-like (that is how I interpreted "because one needs to check if a tuple is a valid key (both for the frame and the MI.names), and interpret it as a list-like otherwise")

@h-vetinari
Copy link
Contributor Author

@jorisvandenbossche

@hvetinari: Your second example is intriguing and more relevant. Because with reset_index() it'd be back in the columns.

On second thought, it may be sufficient to allow tuples in MultiIndex.names (which is not an Index), and when doing something like .reset_index the tuple would only be allowed to be inserted back into the columns if those are (as in your example) already a MultiIndex (with the right number of levels)

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Jan 9, 2019

@jorisvandenbossche: Whether the tuple label is referring to a tuple key in a flat index or to a single key of multi-level columns, it is still this tuple that is conflicting with tuple as array-like (that is how I interpreted "because one needs to check if a tuple is a valid key (both for the frame and the MI.names), and interpret it as a list-like otherwise")

You're right there as well, those cases would still need extra logic (for a good reason, as you showed in the example).

But I'm guessing the vast majority of bugs are due to your first point (i.e. using "tuple-as-single-label"), so that would still be a substantial improvement/simplification (considering the complexity that would be needed to not just crash and burn as the OP illustrated)

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jan 9, 2019

But I'm guessing the vast majority of bugs are due to your first point (i.e. using "tuple-as-single-label"),

Yes, if we are speaking about bugs and not ambiguities/complexities in the API, you are probably right.
Although I think the example bugs you showed is a quite esoteric use case (a MultiIndex where one level contains tuples). I think a more common case is a single flat index / columns that contains tuples (i.e. flattened MI), which might be a bit less buggy (just a guess).

@summonholmes
Copy link

summonholmes commented Jan 12, 2019

But I'm guessing the vast majority of bugs are due to your first point (i.e. using "tuple-as-single-label"),

Yes, if we are speaking about bugs and not ambiguities/complexities in the API, you are probably right.
Although I think the example bugs you showed is a quite esoteric use case (a MultiIndex where one level contains tuples). I think a more common case is a single flat index / columns that contains tuples (i.e. flattened MI), which might be a bit less buggy (just a guess).

That's exactly what I have in my bug report and PR. I'd welcome this suggestion.

See the following code:

from pandas._libs import lib

lib.clean_index_list([(('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog'))])[0]

Output:
array([[['Turtle', 'Chicken'], [('Man', 'Monkey'), 'Dog']]], dtype=object)

@simonjayhawkins
Copy link
Member

.iloc sets the label to a tuple when returning a series from a multi-level indexed DataFrame..

@pytest.mark.parametrize('indexer, expected', [
(lambda df: df.iloc[0],
lambda arr: Series(arr[0], index=[[2, 2, 4], [6, 8, 10]], name=(4, 8))),
(lambda df: df.iloc[2],
lambda arr: Series(arr[2], index=[[2, 2, 4], [6, 8, 10]], name=(8, 12))),
(lambda df: df.iloc[:, 2],
lambda arr: Series(
arr[:, 2], index=[[4, 4, 8], [8, 10, 12]], name=(4, 10)))
])
def test_iloc_returns_series(indexer, expected, simple_multiindex_dataframe):
arr = np.random.randn(3, 3)
df = simple_multiindex_dataframe(arr)
result = indexer(df)
expected = expected(arr)
tm.assert_series_equal(result, expected)

@toobaz
Copy link
Member

toobaz commented Feb 4, 2019

Seeing this now only. I'm against suppressing tuples as values.

I think @h-vetinari 's strongest point is

Tuples are also notoriously hard to get into an array, because numpy will often change the semantics if it's not is_scalar.

... but even that is simply solved by recurring to stuff like construct_1d_object_array_from_listlike. And things will be even better when/if numpy will support ndmax.

Moreover, there is huge evidence of people using tuples as labels in their code. And even evidence of people who have no reasons to stop. And cases we can't provide better solutions.

Then, there is the more general argument of what exactly we would be forbidding. What about types which subclass tuples? Other hashable iterables? As in #24984 , in order to find not properly rewrite some internal code which should definitely be rewrite, we plan to complicate the (description) of the API.

By the way, disallowing tuples would mean at the very minimum showing a warning for every usage, for some time. That's much more annoying for me to implement than a couple of refactorings which we should do anyway.

The one argument that could convince me is that there are many cases of API ambiguity due to accepting tuples. However I can't find any. And on this topic,

and it's a hassle to get all the corner cases right, because one needs to check if a tuple is a valid key (both for the frame and the MI.names), and interpret it as a list-like otherwise (or whatever the precedence might be within the method).

This is wrong, or just backwards-compatible code to be removed. Tuples are not list-likes, and this is orthogonal to having them as atomic labels. And they can't be list-likes because they are MultiIndex keys. This was discussed several times. And yes, must be documented.

@jbrockmendel
Copy link
Member

The use case I've had where tuple-labels were indispensible is where I had df.columns as a MultiIndex and then had to add another non-tuple column, which causes the MultiIndex to case to tuples.

@jbrockmendel jbrockmendel added the Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). label Sep 22, 2020
@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Jan 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Closing Candidate May be closeable, needs more eyeballs Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.).
Projects
None yet
Development

No branches or pull requests

7 participants