Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: better error-handling for df.set_index #22486

Merged
merged 16 commits into from
Oct 19, 2018

Conversation

h-vetinari
Copy link
Contributor

@h-vetinari h-vetinari commented Aug 23, 2018

split off of #22236, and so builds on top of it.

@h-vetinari h-vetinari force-pushed the df_set_index_warn branch 2 times, most recently from 964e79c to 4027825 Compare August 23, 2018 15:24
@gfyoung gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves Error Reporting Incorrect or improved errors from pandas labels Aug 25, 2018
'following: valid column keys, Series, Index, '
'MultiIndex, list or np.ndarray')

inplace = validate_bool_kwarg(inplace, 'inplace')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leave the inplace invalidation where it was.

list, np.ndarray)) for x in keys):
raise TypeError('keys may only contain a combination of the '
'following: valid column keys, Series, Index, '
'MultiIndex, list or np.ndarray')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a lot of iterations over the keys (iterating over col_labels is equivalent time-complexity-wise). Could we have a single for-loop where we iterate over keys and fail fast?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the number of times iterating over the columns, it would IMO be very complicated and unreadable to do it in one loop, and - to me - not worth the performance gain, because the number of columns used for an index is generally so negligibly small, that traversing that list is a non-issue performance-wise.

@h-vetinari : That's fair, but I'm not sure I agree that the code would become that much more complicated e.g.:

for x in keys:
   if not (is_scalar(x) or isinstance(x, tuple)):
      if not isinstance(x, (ABCSeries, ABCIndexClass, ABCMultiIndex, list, np.ndarray):
         raise TypeError(...)
      else:
         if x not in self:
            raise KeyError(x)

How does that look?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not bad, but doesn't handle the duplicate column keys yet.

Copy link
Member

@gfyoung gfyoung Aug 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite see the difference between what you wrote and what I did. Where do you handle duplicate column keys in your changes above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind. Forgot that this case had been removed now by this PR

Copy link
Member

@gfyoung gfyoung Aug 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, unless there are other objections, I would suggest using the single for-loop instead. It seems pretty readable IMO.

@@ -0,0 +1,121 @@
import pytest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of things about this file:

  • Do you use all of these fixtures in your changes?
  • I would prefer if the naming is a little more consistent e.g.:
  • You have the word "frame" is some of your fixture names but not othres
  • You have underscores between words in some names but not others

Copy link
Member

@gfyoung gfyoung Aug 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is based on #22236, and so it will be easier to review once that is in.

I'm not sure I follow you here. How does adding currently unused fixtures in this PR get explained in #22236? I'm sensing you might want to break off some of the fixtures for a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's buried in the review there and related to #22471 - replacing the attributes of TestData with fixtures. I did it all in one go as a start towards #22471. It's a direct translation of the attributes, including their names.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm...I'm sorry, but I think this is a little too much for one PR. Keep in mind that the purpose of the PR, per your title, is better error reporting for df.set_index. This is a lot of new code that is arguably not pertinent to that aim.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff of this PR will be tiny after #22236. Feel free to comment there about the changes of that PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. We can re-evaluate after #22236 gets merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, I literally have the same questions in that PR as I do in this one for that file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll answer in more detail in the other thread. But I already mentioned what it's about:

It's [...] related to #22471 - replacing the attributes of TestData with fixtures. I did it all in one go as a start towards #22471. It's a direct translation of the attributes, including their names.

@h-vetinari
Copy link
Contributor Author

@gfyoung, thanks for the review. This is based on #22236, and so it will be easier to review once that is in.

About the number of times iterating over the columns, it would IMO be very complicated and unreadable to do it in one loop, and - to me - not worth the performance gain, because the number of columns used for an index is generally so negligibly small, that traversing that list is a non-issue performance-wise.

@h-vetinari
Copy link
Contributor Author

@gfyoung

In that case, unless there are other objections, I would suggest using the single for-loop instead. It seems pretty readable IMO.

Incorporated your feedback, coming with the next commit after rebasing on top of #22236

@pep8speaks
Copy link

pep8speaks commented Sep 15, 2018

Hello @h-vetinari! Thanks for updating the PR.

Comment last updated on October 05, 2018 at 21:48 Hours UTC

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that #22236 is in, we can continue here. :)

names.append(col.name)
elif isinstance(col, Index):
level = col
elif isinstance(col, ABCSeries):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ABC forms are a left-over from the review of #22236 by @jreback (#22236 (comment)):

huh? pls use the ABC version, we do this everywhere else.

with tm.assert_raises_regex(KeyError, '.*'):
df.set_index(keys, drop=drop, append=append)
# can't drop same column twice
first_drop = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you changing these?

these tests are not very hard to interpret

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes because now we're not failing for drop=True and keys=['A', 'A'] (for example), see #22484

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this tests is very confusing. pls make it simpler.

Copy link
Contributor Author

@h-vetinari h-vetinari Sep 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff might be confusing, but it doesn't look so bad in practice:

# == gives ambiguous Boolean for Series

It doesn't get much simpler - to test against previously tested behaviour, I have to do the set_index-calls sequentially, but it's not possible to drop the same key twice (which this test is enabling), so I have to take that into account.

rgx = 'keys may only contain a combination of the following:.*'
# forbidden type, e.g. set
with tm.assert_raises_regex(TypeError, rgx):
df.set_index(set(df['A']), drop=drop, append=append)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you singling out an iterable (set) is there some reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, just any of the types that are not allowed (I've tried to indicate as much by using "e.g.")

@@ -3892,6 +3893,22 @@ def set_index(self, keys, drop=True, append=False, inplace=False,
if not isinstance(keys, list):
keys = [keys]

missing = []
for x in keys:
if not (is_scalar(x) or isinstance(x, tuple)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this overly complicated condition? can u show a simple example which fails now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This came out of the 2./3. outlined in #22484 and then the feedback from @gfyoung further up the thread

h-vetinari added a commit to h-vetinari/pandas that referenced this pull request Sep 18, 2018
@h-vetinari h-vetinari mentioned this pull request Sep 18, 2018
5 tasks
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, theme is reasonable, but need to simplify the code to avoid special casing things

if not (is_scalar(x) or isinstance(x, tuple)):
if not isinstance(x, (ABCSeries, ABCIndexClass, ABCMultiIndex,
list, np.ndarray)):
raise TypeError('keys may only contain a combination of '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not true. either all have to be scalars, or all list-likes (yes they don't have to be the same, just list-convertible).

Please make this more readable from a user perspective. If I got this I would have no idea what to do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See longer comment in the main thread.

Tuples are allowed because they might be valid column labels.

with tm.assert_raises_regex(KeyError, '.*'):
df.set_index(keys, drop=drop, append=append)
# can't drop same column twice
first_drop = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this tests is very confusing. pls make it simpler.

h-vetinari added a commit to h-vetinari/pandas that referenced this pull request Sep 18, 2018
@h-vetinari
Copy link
Contributor Author

h-vetinari commented Sep 18, 2018

this is not true. either all have to be scalars, or all list-likes (yes they don't have to be the same, just list-convertible).

This is wrong on both counts. On master, a mix of column labels and a list:

>>> df = pd.DataFrame({'a': [1, 2], 'b': [11, 12]})
>>> df
   a   b
0  1  11
1  2  12
>>> df.set_index(['a', [101, 102]])
        b
a
1 101  11
2 102  12

Regarding the argument types, the list I wrote (and test) is exhaustive (see https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L3908) - currently only Series/Index/MultiIndex/list/ndarray are allowed, everything else gets tried as a frame key. This is one of the points in #22484.

>>> from pandas.core.dtypes.common import is_list_like
>>> LL = iter([21, 22])
>>> is_list_like(LL)
True
>>> df.set_index(LL)
KeyError: "None of [Int64Index([21, 22], dtype='int64')] are in the [columns]"

@codecov
Copy link

codecov bot commented Sep 19, 2018

Codecov Report

Merging #22486 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22486      +/-   ##
==========================================
+ Coverage   92.19%   92.19%   +<.01%     
==========================================
  Files         169      169              
  Lines       50956    50966      +10     
==========================================
+ Hits        46978    46988      +10     
  Misses       3978     3978
Flag Coverage Δ
#multiple 90.61% <100%> (ø) ⬆️
#single 42.27% <54.16%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/frame.py 97.12% <100%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5e06c84...42d5f2a. Read the comment docs.

@h-vetinari
Copy link
Contributor Author

@jreback
Reworded the warning and rewrote the test - hopefully both clearer now.

@h-vetinari
Copy link
Contributor Author

@jreback
Green

@h-vetinari
Copy link
Contributor Author

@jreback
While you're at it with the reviewing, please don't forget this one. :)

missing = []
for x in keys:
if not (is_scalar(x) or isinstance(x, tuple)):
if not isinstance(x, (ABCSeries, ABCIndexClass, ABCMultiIndex,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use is_list_like here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. How to deal with tuples then. Can be valid column keys and valid list-likes.

Copy link
Contributor Author

@h-vetinari h-vetinari Sep 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would also change the capabilities of set_index. e.g. an iterator has is_list_like = True - do you want to support that? Currently, only the types that I'm instance-checking (plus column keys) are allowed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to explicity list types here as its verbose and fragile, use is_list_like and not is_iterator if you must (though to be honest an iterator is actually ok)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to explicity list types here as its verbose and fragile

I have a feeling the implementation you request will be much more verbose, but OK

OK. How to deal with tuples then. Can be valid column keys and valid list-likes.

You want to try tuples as column keys first and then as an array (would make the most sense to me), or something else?

df.set_index([df['A'], df['B'], 'X'], drop=drop, append=append)

rgx = 'The parameter "keys" may only contain a combination of.*'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rgx -> msg

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Changed. Please let me know what you want me to do about the allowed argument signature. Allowing all list-likes also means checking ndim==1 for np.ndarrays, etc.

See also:
#22486 (comment)
#22486 (comment)

missing = []
for x in keys:
if not (is_scalar(x) or isinstance(x, tuple)):
if not isinstance(x, (ABCSeries, ABCIndexClass, ABCMultiIndex,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to explicity list types here as its verbose and fragile, use is_list_like and not is_iterator if you must (though to be honest an iterator is actually ok)

# tuples that are not column keys are considered list-like,
# not considered missing
missing.append(col)
elif (not is_list_like(col) or isinstance(col, set)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am not averse to excluding, but i also don't want ad-infinitem special cases here. This code is already too complex. We accept an iterable, so this is a valid input, if the user wants to do it that's there issue; these are not currently excluded.

@@ -126,21 +131,29 @@ def test_set_index_pass_single_array(self, frame_of_index_cols,
df.index.name = index_name

key = box(df['B'])
# np.array and list "forget" the name of B
name = [None if box in [np.array, list] else 'B']
if box == list:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why exactly is a list interpreted differently here? this makes this test really really odd. I am worried something changed here, as this appears to work just fine in the master.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will see above (line 123) that list wasn't tested - for precisely this reason. df.set_index(['A', 'B']) interprets a A & B as column keys, so (assuming this df was length 2) it would not use the list ['A', 'B'] as the index. To do that, one would have to pass [['A', 'B']]. This PR proposes to add tests for the current behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback
In case the above was not very clearly worded, this corresponds exactly to behaviour on master:

>>> df = pd.DataFrame(np.random.randn(2,3))
>>> df.set_index(['A', 'B'])
KeyError: 'A'
>>> df.set_index([['A', 'B']])
          0         1         2
A  1.962370 -1.150770  0.843600
B -0.417274  0.509781 -0.697802

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for review; added responses

# tuples that are not column keys are considered list-like,
# not considered missing
missing.append(col)
elif (not is_list_like(col) or isinstance(col, set)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The larger issue IMO is that is_list_like(<set>) should be False and not True. It's the only exclusion here, and it makes sense. On many issues, pandas prides itself on enforcing sensible defaults and behaviour. Why would we want to enable something that's broken by design?

I may also add that this is not something that's currently allowed, and it shouldn't be IMO

@@ -126,21 +131,29 @@ def test_set_index_pass_single_array(self, frame_of_index_cols,
df.index.name = index_name

key = box(df['B'])
# np.array and list "forget" the name of B
name = [None if box in [np.array, list] else 'B']
if box == list:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will see above (line 123) that list wasn't tested - for precisely this reason. df.set_index(['A', 'B']) interprets a A & B as column keys, so (assuming this df was length 2) it would not use the list ['A', 'B'] as the index. To do that, one would have to pass [['A', 'B']]. This PR proposes to add tests for the current behaviour.

@h-vetinari
Copy link
Contributor Author

@jreback
PTAL here :)

@h-vetinari h-vetinari closed this Oct 7, 2018
@h-vetinari h-vetinari reopened this Oct 7, 2018
@jreback
Copy link
Contributor

jreback commented Oct 9, 2018

@h-vetinari why don't you try (separate PR) excluding set from is_list_like and see what the implications of that are.

@h-vetinari
Copy link
Contributor Author

@jreback

@h-vetinari why don't you try (separate PR) excluding set from is_list_like and see what the implications of that are.

I did, seems doable: #23061 #23065

h-vetinari added a commit to h-vetinari/pandas that referenced this pull request Oct 16, 2018
@h-vetinari
Copy link
Contributor Author

h-vetinari commented Oct 18, 2018

@jreback
All green. After merging and incorporating #23065, all comments here should now be addressed.

@jreback jreback merged commit 145c227 into pandas-dev:master Oct 19, 2018
@jreback
Copy link
Contributor

jreback commented Oct 19, 2018

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: better error-handling for df.set_index
4 participants