ENH: Add set_index to Series #22225
look in alter_axes there are a ton of tests |
Codecov Report
@@ Coverage Diff @@
## master #22225 +/- ##
==========================================
- Coverage 92.38% 92.37% -0.01%
==========================================
Files 166 166
Lines 52395 52398 +3
==========================================
+ Hits 48403 48404 +1
- Misses 3992 3994 +2
Continue to review full report at Codecov.
|
6ba8ca9
to
46eee7b
Compare
Hello @h-vetinari! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on January 06, 2019 at 14:40 Hours UTC |
7355d16
to
aadf50b
Compare
eeffe68
to
3edbaea
Compare
@gfyoung @jreback @WillAyd @TomAugspurger |
8ab863b
to
949e699
Compare
Now that all the preps (#22236 #22526 #22486) are out of the way, can finally tackle this seemingly simple PR - adding It's also green (Circle is behind but already passed before the last commit for linting) |
pandas/core/generic.py
Outdated
3 2013 7 84
4 2014 10 31
"""
# parameter keys is checked in Series.set_index / DataFrame.set_index!
I'm not duplicating the - different - checks from `Series.set_index` and `DataFrame.set_index` here - is there any chance that someone calls `NDFrame.set_index`?
no
Nice!
cc @jreback
looks generally ok, need to fix the duplicated doc-strings
pandas/core/frame.py
Outdated
@@ -3980,12 +3991,7 @@ def set_index(self, keys, drop=True, append=False, inplace=False,
2 2014 4 40
3 2013 7 84
4 2014 10 31
is the doc-test run for this doc-string?
Yes
pandas/core/frame.py
Outdated
if not inplace:
    return frame
vi = verify_integrity
don't use abbreviations
this was just for the line length, but ok, re-broke the line differently
pandas/core/series.py
Outdated
def set_index(self, arrays, append=False, inplace=False,
              verify_integrity=False):
    """
    Set the Series index (row labels) using one or more columns.
same comment about the doc-string
pandas/core/series.py
Outdated
@@ -1075,6 +1075,87 @@ def _set_value(self, label, value, takeable=False):
        return self
    _set_value.__doc__ = set_value.__doc__

    def set_index(self, arrays, append=False, inplace=False,
don't change the argument name, it is called keys currently, leave it that way
The argument currently does not exist - it makes sense for `DataFrame` to have it named `keys`, since the main application (arguably) is setting the index to a given column key, but it would be very confusing for `Series`.
The signatures differ already anyway (no `drop` for `Series`), so I'm strongly -1 on naming this parameter `keys`.
that doesn't make any sense, you are conflating 2 different things. change the name to match.
Function signatures of subclasses should match the parent. We should also accept `drop` in the Series version, even if it's not used.
That, to me, would be an argument to not share the implementation in `generic.py`. A parameter named `keys` makes zero sense for series (and neither does `drop` for that matter - or rather it would be very confusing).
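The signature disagreement in this thread can be made concrete with a small sketch. The classes below are hypothetical stand-ins, not pandas code: `NDFrameLike` is assumed to carry the `DataFrame.set_index` signature shown in the diff, and `SeriesLike` carries the `Series.set_index` signature proposed in this PR. The helper `signature_mismatch` is likewise an illustrative name.

```python
import inspect


class NDFrameLike:
    """Hypothetical stand-in for the shared parent class."""

    def set_index(self, keys, drop=True, append=False, inplace=False,
                  verify_integrity=False):
        raise NotImplementedError


class SeriesLike(NDFrameLike):
    """Hypothetical stand-in using the Series signature proposed in this PR."""

    def set_index(self, arrays, append=False, inplace=False,
                  verify_integrity=False):
        raise NotImplementedError


def signature_mismatch(parent, child, name):
    """Report parameters that appear in only one of the two method signatures."""
    parent_params = list(inspect.signature(getattr(parent, name)).parameters)
    child_params = list(inspect.signature(getattr(child, name)).parameters)
    return {
        "missing_in_child": [p for p in parent_params if p not in child_params],
        "extra_in_child": [p for p in child_params if p not in parent_params],
    }


mismatch = signature_mismatch(NDFrameLike, SeriesLike, "set_index")
print(mismatch)
```

Under these assumptions the child lacks `keys` and `drop` and adds `arrays`, which is exactly the divergence the reviewers object to and the author defends.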
@TomAugspurger @jorisvandenbossche |
Apologies for the delay. I think I'm +1 to the idea of this. Taking a look at the diff now. |
def set_index(self, arrays, append=False, inplace=False,
              verify_integrity=False):

    if not isinstance(arrays, list) or all(is_scalar(x) for x in arrays):
What case does the first condition handle here? I'm having trouble finding a case where `Series.set_index([foo])` works but `Series.set_index(foo)` doesn't (or wouldn't without the `[]`).
@TomAugspurger
The first condition makes sure the second one can execute (i.e. iteration). If it is not true and we end up wrapping it in a list, it will fail in the next step.
The point that is crucial though is that only lists are affected by this (and not all list-likes), because `NDFrame.set_index` would try to interpret a list of scalars as a list of keys.
In any case, I added some comments to clarify this behaviour.
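A minimal sketch of the wrapping condition discussed above, kept free of pandas so it runs on its own. `is_scalar_like` is a simplified, hypothetical stand-in for the `is_scalar` used in the diff (it only recognises `None`, strings, bytes and numbers), and `normalize_arrays` is an illustrative name, not a function from the PR.

```python
from numbers import Number


def is_scalar_like(obj):
    # Simplified stand-in for the is_scalar check in the diff (assumption):
    # treats None, str, bytes and numbers as scalars.
    return obj is None or isinstance(obj, (str, bytes, Number))


def normalize_arrays(arrays):
    # Same shape as the condition under review: anything that is not a list
    # is wrapped without being iterated (short-circuit), and a list made up
    # only of scalars is treated as ONE array of labels rather than as a
    # list of several keys.
    if not isinstance(arrays, list) or all(is_scalar_like(x) for x in arrays):
        arrays = [arrays]
    return arrays


print(normalize_arrays(5))                         # non-iterable, wrapped
print(normalize_arrays(("a", "b", "c")))           # tuple, wrapped
print(normalize_arrays(["a", "b", "c"]))           # list of scalars, wrapped
print(normalize_arrays([["a", "b"], ["x", "y"]]))  # list of arrays, unchanged
```

The `isinstance` check runs first, so the `all(...)` iteration is only reached for actual lists. That is the sense in which the first condition lets the second one execute.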
I can't tell if the doc build error from https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=6464 is real, or whether it's a bug in the linter. I can take a look later if you want. |
@TomAugspurger |
Added an explanatory comment.
@datapythonista My only guess would be that As a side note, it would be awesome to see in the log exactly which docstring was causing an error... |
pandas/core/generic.py
Outdated
See Also
--------
%(other_klass)s.set_index: Method adapted for %(other_klass)s.
I think your validation error is that there is no space before the colon on this line
@WillAyd
If that would cause the error then the `[Series|DataFrame].set_index` docstrings should fail too, shouldn't they? Locally they pass, but let's hope you're right. :)
Hmm yea. So it does fail locally on NDFrame.set_index with the same error as shown in Azure logs
Let's see if this works.
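For context on the shared docstring, here is a rough sketch of how a `%(other_klass)s` template gets filled in, using plain `%` formatting rather than any pandas helper. The summary line, the `%(klass)s` placeholder and the helper names are assumptions made for illustration. Only the "See Also" entry comes from the diff, written here with the space before the colon that the review suggests. The thread does not settle which of the two issues actually tripped the validator.

```python
import re

# Hypothetical shared template; only the See Also entry mirrors the diff.
SET_INDEX_DOC = """
Set the %(klass)s index using existing arrays.

See Also
--------
%(other_klass)s.set_index : Method adapted for %(other_klass)s.
"""


def render(template, klass, other_klass):
    """Fill the template for one concrete class."""
    return template % {"klass": klass, "other_klass": other_klass}


def unrendered_placeholders(docstring):
    """List any %(name)s placeholders still present in a docstring."""
    return re.findall(r"%\((\w+)\)s", docstring)


series_doc = render(SET_INDEX_DOC, klass="Series", other_klass="DataFrame")
print(series_doc)
print(unrendered_placeholders(SET_INDEX_DOC))
print(unrendered_placeholders(series_doc))
```

A docstring left attached to the parent class in its raw form still contains the placeholders, which is one way a validator could see something different on `NDFrame.set_index` than on the rendered `Series` and `DataFrame` versions.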
@h-vetinari the error here is from You can create an issue to capture numpydoc parser errors, and reraise the exception adding the failing docstring, that would be great. It's the first time I see it failing, that's why it's not implemented. |
I think the error is because the docstring is actually created against NDFrame. AFAICT other items would add that docstring to
@h-vetinari I know you keep updating this, but I am -1 on this for all of the reasons I have stated above. This completely changes the semantics of DataFrame.set_index by allowing thru ambiguous cases of arrays. Furthermore the semantics of Series are just plain weird, which only accept an array. This is going to lead to lots of confusion down the road. We have set_axis for exactly this purpose, now you are conflating them. So unless you remove this, and I don't know if this is possible for Series, then this is a no-go.
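For readers following the disagreement, the sketch below shows the existing `DataFrame.set_index` and `set_axis` calls being compared, assuming a recent pandas release (1.0 or later, where `set_axis` returns a new object by default). It does not use the `Series.set_index` method proposed in this PR, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# A column key: column "a" becomes the index and is dropped from the columns.
by_key = df.set_index("a")

# An array-like (here an Index): its values become the index, columns are kept.
by_array = df.set_index(pd.Index(["x", "y", "z"]))

# A plain list of scalars is read as a list of column keys, so labels that
# are not column names raise a KeyError.
try:
    df.set_index(["x", "y", "z"])
    list_of_scalars_raises = False
except KeyError:
    list_of_scalars_raises = True

# The set_axis route mentioned above, shown on a Series.
relabeled = pd.Series([10, 20, 30]).set_axis(["x", "y", "z"], axis=0)

print(by_key, by_array, list_of_scalars_raises, relabeled, sep="\n\n")
```

The list-of-scalars case is where the "keys versus arrays" ambiguity shows up: the same labels are accepted when passed as an `Index` but are treated as column names when passed as a plain list.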
I know you're very busy, but I get the feeling you don't read what I've repeated time and time again in this thread: this PR changes absolutely nothing about the behaviour of You have also not commented on #24046, which I opened specifically to discuss your opposition to the actual / current / existing capabilities of
The only difference between
I have responded to this in detail above and in #24046. I'm not conflating them.
I don't know what you're referring to by "this", can you please elaborate? There are two core devs who have approved this PR (not counting @WillAyd's change request since it was only about docstring validation), and you're objecting on grounds that do not reflect the actual state of affairs - so IMO you're quite off-base in claiming this PR is so outlandish. |
I have never said I have a problem with what I will repeat once again. The fact that DataFrame.set_index() happens to accept in a particular case an array proves the point here, it IS ambiguous and has been tried several times before. It is NOT correct. I have pointed you to |
I strongly object to how you're handling this - what does closing this PR achieve? How do you so easily overrule approving reviews (cc @gfyoung @TomAugspurger) and an ongoing discussion?
You are simply wrong about that - it does easily and readily accept arrays of values (and this is a fundamental aspect of its purpose of |
@h-vetinari you are not listening. If you want to raise an issue and comment there, feel free. |
Quoting myself from this comment:
Tagging the participants of #24046 and #24697: @jreback @gfyoung @WillAyd @jorisvandenbossche @TomAugspurger @toobaz @jbrockmendel |
@jreback @gfyoung @WillAyd @jorisvandenbossche @TomAugspurger |
git diff upstream/master -u -- "*.py" | flake8 --diff
Following #21684 (comment), I unified the method to `core/generic.py`. If this is approved in principle, I'll write some tests (both for DF/Series) and the whatsnew. On that note, it seems to me that `df.set_index` - particularly its kwargs - are basically untested. The only tests I found are:
- `pandas/tests/frame/test_indexing.TestDataFrameIndexingDatetimeWithTZ.test_set_reset`
- `pandas/tests/frame/test_indexing.TestDataFrameIndexingUInt64.test_set_reset`

both of which don't test any of the kwargs.