ENH: Add set_index to Series #22225

h-vetinari · 2018-08-06T22:07:25Z

closes ENH: set_index for Series #21684
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
examples for series.set_index
whatsnew entry

Following #21684 (comment), I unified the method in core/generic.py. If this is approved in principle, I'll write some tests (for both DataFrame and Series) and the whatsnew entry. On that note, it seems to me that df.set_index, particularly its kwargs, is basically untested. The only tests I found are:

pandas/tests/frame/test_indexing.TestDataFrameIndexingDatetimeWithTZ.test_set_reset
pandas/tests/frame/test_indexing.TestDataFrameIndexingUInt64.test_set_reset

both of which don't test any of the kwargs.
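
For context, the keyword arguments in question are `drop`, `append` and `verify_integrity` (plus `inplace`). A minimal sketch of what the first three do, assuming a reasonably recent pandas release; the frames and column names below are invented for illustration and are not taken from the test suite:

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({"year": [2012, 2014, 2013],
                   "month": [1, 4, 7],
                   "sale": [55, 40, 84]})

# Default: the key column becomes the index and is dropped from the columns.
dropped = df.set_index("year")

# drop=False keeps the key column alongside the new index.
kept = df.set_index("year", drop=False)

# append=True adds a new level to the existing index instead of replacing it.
appended = df.set_index("year", append=True)

# verify_integrity=True rejects a new index that contains duplicates.
dup = pd.DataFrame({"key": [1, 1], "value": [10, 20]})
try:
    dup.set_index("key", verify_integrity=True)
    raised = False
except ValueError:
    raised = True
```

Each of these branches would need its own test case to count as covered.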

jreback · 2018-08-06T23:15:32Z

look in alter_axes

there are a ton of tests

codecov · 2018-08-07T00:53:33Z

Codecov Report

Merging #22225 into master will decrease coverage by <.01%.
The diff coverage is 97.14%.

@@            Coverage Diff             @@
##           master   #22225      +/-   ##
==========================================
- Coverage   92.38%   92.37%   -0.01%     
==========================================
  Files         166      166              
  Lines       52395    52398       +3     
==========================================
+ Hits        48403    48404       +1     
- Misses       3992     3994       +2

Flag	Coverage Δ
#multiple	`90.8% <97.14%> (-0.01%)`	⬇️
#single	`43.03% <60%> (+0.02%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/frame.py	`96.85% <100%> (-0.08%)`	⬇️
pandas/core/generic.py	`96.69% <100%> (+0.07%)`	⬆️
pandas/core/series.py	`93.73% <90%> (+0.17%)`	⬆️
pandas/core/internals/construction.py	`95.93% <0%> (-0.75%)`	⬇️
pandas/core/arrays/datetimes.py	`97.68% <0%> (-0.34%)`	⬇️
pandas/core/arrays/timedeltas.py	`88.09% <0%> (-0.1%)`	⬇️
pandas/util/testing.py	`88% <0%> (-0.1%)`	⬇️
pandas/io/formats/html.py	`99.34% <0%> (-0.03%)`	⬇️
pandas/core/indexes/datetimelike.py	`98.52% <0%> (-0.01%)`	⬇️
pandas/core/arrays/period.py	`98.52% <0%> (-0.01%)`	⬇️
... and 17 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cb31b2b...7f35d2c.

h-vetinari · 2018-08-07T22:46:53Z

@jreback I cleaned up the tests for df.set_index before I moved to the ones for series.set_index. Since I'm guessing that this would be preferred as separate PRs, here's just the cleanup for df, without any changes to functionality (except some better warnings): #22236

pep8speaks · 2018-09-18T08:41:57Z

Hello @h-vetinari! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 06, 2019 at 14:40 Hours UTC

h-vetinari · 2018-09-18T14:03:32Z

Rebased this on #22236 #22526 and split off the changes from #22486 into a separate commit.

If you want to review this without #22486, just follow this link: https://github.com/pandas-dev/pandas/pull/22225/commits/3edbaeae7731885bf418c5ac50d0a9bcf82c0ed6

h-vetinari · 2018-09-19T06:45:12Z

@gfyoung @jreback @WillAyd @TomAugspurger
Could someone please retrigger the failed travis job? Crazy waiting times on Circle/Appveyor currently.

h-vetinari · 2018-10-19T16:28:27Z

Now that all the preps (#22236 #22526 #22486) are out of the way, I can finally tackle this seemingly simple PR: adding set_index to Series. I moved the code from frame.py to generic.py, as mentioned in the issue by @gfyoung.

It's also green (Circle is behind but already passed before the last commit for linting)

h-vetinari · 2018-10-19T16:30:24Z

pandas/core/generic.py

+        3  2013  7      84
+        4  2014  10     31
+        """
+        # parameter keys is checked in Series.set_index / DataFrame.set_index!


I'm not duplicating the - different - checks from Series.set_index and DataFrame.set_index here - is there any chance that someone calls NDFrame.set_index?

h-vetinari · 2018-10-28T20:35:09Z

@jreback @gfyoung
ping :)

gfyoung

Nice!

cc @jreback

jreback

looks generally ok, need to fix the duplicated doc-strings

jreback · 2018-10-30T12:44:58Z

pandas/core/frame.py

@@ -3980,12 +3991,7 @@ def set_index(self, keys, drop=True, append=False, inplace=False,
        2  2014  4      40
        3  2013  7      84
        4  2014  10     31
-


is the doc-test run for this doc-string?

jreback · 2018-10-30T12:45:44Z

pandas/core/frame.py

-
-        if not inplace:
-            return frame
+        vi = verify_integrity


don't use abbreviations

this was just for the line length, but ok, re-broke the line differently

jreback · 2018-10-30T12:48:06Z

pandas/core/generic.py

+        3  2013  7      84
+        4  2014  10     31
+        """
+        # parameter keys is checked in Series.set_index / DataFrame.set_index!


jreback · 2018-10-30T12:48:38Z

pandas/core/series.py

+    def set_index(self, arrays, append=False, inplace=False,
+                  verify_integrity=False):
+        """
+        Set the Series index (row labels) using one or more columns.


same comment about the doc-string

jreback · 2018-10-30T12:48:57Z

pandas/core/series.py

@@ -1075,6 +1075,87 @@ def _set_value(self, label, value, takeable=False):
        return self
    _set_value.__doc__ = set_value.__doc__

+    def set_index(self, arrays, append=False, inplace=False,


don't change the argument name, it is called keys currently, leave it that way

The argument currently does not exist - it makes sense for DataFrame to have it named keys, since the main application (arguably) is setting the index to a given column key, but it would be very confusing for Series.

The signatures differ already anyway (no drop for Series), so I'm strongly -1 on naming this parameter keys.

that doesn't make any sense, you are conflating 2 different things. change the name to match.

Function signatures of subclasses should match the parent. We should also accept drop in the Series version, even if it's not used.

That, to me, would be an argument to not share the implementation in generic.py. A parameter named keys makes zero sense for series (and neither does drop for that matter - or rather it would be very confusing).

h-vetinari · 2019-01-04T17:00:42Z

@TomAugspurger @jorisvandenbossche
This would be ready (modulo a new docstring warning). Any comment before the cut-off?

TomAugspurger · 2019-01-04T17:10:37Z

Apologies for the delay. I think I'm +1 to the idea of this. Taking a look at the diff now.

TomAugspurger · 2019-01-04T17:19:30Z

pandas/core/series.py

+    def set_index(self, arrays, append=False, inplace=False,
+                  verify_integrity=False):
+
+        if not isinstance(arrays, list) or all(is_scalar(x) for x in arrays):


What case does the first condition handle here? I'm having trouble finding a case where Series.set_index([foo]) works but Series.set_index(foo) doesn't (or wouldn't without the []).

TomAugspurger · 2019-01-04T17:23:29Z

I can't tell if the doc build error from https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=6464 is real, or whether it's a bug in the linter. I can take a look later if you want.

h-vetinari · 2019-01-04T22:47:09Z

@TomAugspurger
The docstring validation errors were real - some problems with the templating + dedent, plus some plain old oversights that were not yet being tested before. Should hopefully be passing now (locally it is).

h-vetinari

Added an explanatory comment.

h-vetinari · 2019-01-04T23:14:20Z

pandas/core/series.py

+    def set_index(self, arrays, append=False, inplace=False,
+                  verify_integrity=False):
+
+        if not isinstance(arrays, list) or all(is_scalar(x) for x in arrays):


@TomAugspurger
The first condition makes sure the second one can execute (i.e. iteration). If it is not true and we end up wrapping it in a list, it will fail in the next step.

The point that is crucial though is that only lists are affected by this (and not all list-likes), because NDFrame.set_index would try to interpret a list of scalars as a list of keys.

In any case, I added some comments to clarify this behaviour.
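
The short-circuit described above can be sketched in plain Python. This is only an illustration: `wrap_arrays` is a hypothetical helper, the wrapping step is an assumption based on the explanation in this comment, and `is_scalar` is a simplified stand-in for `pandas.api.types.is_scalar`.

```python
from numbers import Number


def is_scalar(value):
    # Simplified stand-in for pandas.api.types.is_scalar, for illustration only.
    return value is None or isinstance(value, (str, bytes, Number))


def wrap_arrays(arrays):
    # Hypothetical helper mirroring the condition in the hunk above.
    # The isinstance check runs first, so all(...) only ever iterates over a list.
    if not isinstance(arrays, list) or all(is_scalar(x) for x in arrays):
        # A non-list, or a list of scalars, is treated as ONE array of labels.
        return [arrays]
    # A list containing non-scalars is treated as a list of arrays.
    return arrays


single_tuple = wrap_arrays(("a", "b", "c"))      # tuple: wrapped as one array
scalar_list = wrap_arrays(["a", "b", "c"])       # list of scalars: wrapped
list_of_arrays = wrap_arrays([["a", "b"], [1, 2]])  # left unchanged
non_iterable = wrap_arrays(5)                    # all(...) is never evaluated
```

Without the wrapping, a list such as `["a", "b", "c"]` would reach the shared code path as three separate entries, which is the key-versus-array confusion the comment refers to.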

h-vetinari · 2019-01-04T23:44:26Z

@datapythonista
There's an error in the docstring validation here that I don't understand, even though Series.set_index and DataFrame.set_index pass validation for me locally (the two docstrings affected by this PR).

My only guess would be that NDFrame.set_index is being validated as well, and crashing on the templates. How would you recommend to solve/circumvent that?

As a side note, it would be awesome to see in the log exactly which docstring was causing an error...

WillAyd · 2019-01-04T23:50:08Z

pandas/core/generic.py

+
+        See Also
+        --------
+        %(other_klass)s.set_index: Method adapted for %(other_klass)s.


I think your validation error is that there is no space before the colon on this line

@WillAyd
If that would cause the error then the [Series|DataFrame].set_index docstrings should fail too, shouldn't they? Locally they pass, but let's hope you're right. :)

Hmm yea. So it does fail locally on NDFrame.set_index with the same error as shown in Azure logs

h-vetinari

Let's see if this works.

h-vetinari · 2019-01-05T00:00:08Z

pandas/core/generic.py

+
+        See Also
+        --------
+        %(other_klass)s.set_index: Method adapted for %(other_klass)s.


@WillAyd
If that would cause the error then the [Series|DataFrame].set_index docstrings should fail too, shouldn't they? Locally they pass, but let's hope you're right. :)

datapythonista · 2019-01-05T00:56:16Z

@h-vetinari the error here is from numpydoc, not from our script. We use it to parse the parts of the docstring, and it's failing to parse it here. I guess there are different numpydoc versions in the CI and in your localhost, if it doesn't fail locally.

You can create an issue to capture numpydoc parser errors, and reraise the exception adding the failing docstring, that would be great. It's the first time I see it failing, that's why it's not implemented.

WillAyd · 2019-01-05T01:00:17Z

I think the error is because the docstring is actually created against NDFrame. AFAICT other items would add that docstring to _shared_docs which would then be referenced in the frame and series modules. You might just need to refactor that to get this to pass
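
The refactoring suggested here could look roughly like the sketch below. It is a plain-Python illustration under stated assumptions, not the pandas implementation: the `_shared_docs` dict and the `substitute_doc` decorator are hypothetical stand-ins for the pandas-internal helpers, and the classes are stubs. The point is that the unfilled template lives in a dict rather than on `NDFrame.set_index`, so a docstring validator never sees the raw `%(other_klass)s` placeholders.

```python
from textwrap import dedent

# Hypothetical registry of docstring templates, in the spirit of _shared_docs.
_shared_docs = {}

_shared_docs["set_index"] = dedent("""\
    Set the %(klass)s index.

    See Also
    --------
    %(other_klass)s.set_index : Method adapted for %(other_klass)s.
    """)


def substitute_doc(template, **params):
    """Return a decorator that fills ``template`` and attaches it as ``__doc__``."""
    def decorator(func):
        func.__doc__ = template % params
        return func
    return decorator


class NDFrame:
    def set_index(self, arrays):
        # Deliberately no docstring: the raw template is never attached here.
        raise NotImplementedError


class Series(NDFrame):
    @substitute_doc(_shared_docs["set_index"],
                    klass="Series", other_klass="DataFrame")
    def set_index(self, arrays):
        return arrays
```

With this layout only the filled-in docstrings on the concrete classes are candidates for validation.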

jreback · 2019-01-06T16:30:25Z

@h-vetinari I know you keep updating this, but I am -1 on this for all of the reasons I have stated above. This completely changes the semantics of DataFrame.set_index by allowing thru ambiguous cases of arrays. Furthermore the semantics of Series are just plain weird, which only accept an array. This is going to lead to lots of confusion down the road. We have set_axis for exactly this purpose, now you are conflating them.

So unless you remove this, and I don't know if this is possible for Series, then this is a no-go.

h-vetinari · 2019-01-06T17:02:43Z

@jreback

This completely changes the semantics of DataFrame.set_index by allowing thru ambiguous cases of arrays.

I know you're very busy, but I get the feeling you don't read what I've repeated time and time again in this thread: this PR changes absolutely nothing about the behaviour of DataFrame.set_index - as evidenced by the lack of changes in tests/frame/test_alter_axes.py.

You have also not commented on #24046, which I opened specifically to discuss your opposition to the actual / current / existing capabilities of DataFrame.set_index.

Furthermore the semantics of Series are just plain weird, which only accept an array

The only difference between Series.set_index and DataFrame.set_index in this PR is that the former does not allow column keys - which simply do not make sense for Series.

We have set_axis for exactly this purpose, now you are conflating them.

I have responded to this in detail above and in #24046. I'm not conflating them.

So unless you remove this, and I don't know if this is possible for Series, then this is a no-go.

I don't know what you're referring to by "this", can you please elaborate?

There are two core devs who have approved this PR (not counting @WillAyd's change request since it was only about docstring validation), and you're objecting on grounds that do not reflect the actual state of affairs - so IMO you're quite off-base in claiming this PR is so outlandish.

jreback · 2019-01-06T17:20:10Z

I know you're very busy, but I get the feeling you don't read what I've repeated time and time again in this thread: this PR changes absolutely nothing about the behaviour of DataFrame.set_index - as evidenced by the lack of changes in tests/frame/test_alter_axes.py.

I have never said I have a problem with what DataFrame.set_index does.

I will repeat once again. Series.set_index() has completely different semantics than DataFrame.set_index() which accepts keys (which are column names / levels). NOT an array of values. I am not sure you are getting this. Therefore Series.set_index() is simply not possible the way you have done it here.

The fact that DataFrame.set_index() happens to accept in a particular case an array proves the point here, it IS ambiguous and has been tried several times before. It is NOT correct.

I have pointed you to .set_axis() which does exactly what you are proposing for Series.set_index(). The problem is that the name .set_index() takes keys and that is violated for Series, as there is no such thing as keys.
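
For readers following the disagreement, the calls being compared can be sketched as follows. This assumes pandas 1.0 or later, where `set_axis` returns a new object by default; the data is invented for illustration.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"year": [2012, 2014, 2013], "sale": [55, 40, 84]})
labels = pd.Index(["a", "b", "c"], name="label")

# A column key: the "keys" usage referred to above.
by_key = df.set_index("year")

# An array-like of the same length as the frame; no column is consumed.
by_array = df.set_index(labels)

# For a Series, set_axis replaces the index with the given labels
# and leaves the original object untouched.
s = pd.Series([55, 40, 84])
relabelled = s.set_axis(["a", "b", "c"], axis=0)
```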

h-vetinari · 2019-01-06T17:33:10Z

I strongly object to how you're handling this - what does closing this PR achieve? How do you so easily overrule approving reviews (cc @gfyoung @TomAugspurger) and an ongoing discussion?

@jreback This completely changes the semantics of DataFrame.set_index ...

@h-vetinari this PR changes absolutely nothing about the behaviour of DataFrame.set_index

@jreback I have never said I have a problem with what DataFrame.set_index does.
[...] Series.set_index() has completely different semantics than DataFrame.set_index() which accepts keys (which are column names / levels). NOT an array of values.

You are simply wrong about that - it does easily and readily accept arrays of values (and this is a fundamental aspect of its purpose of setting the index).

jreback · 2019-01-06T17:41:51Z

@h-vetinari you are not listening. If you want to raise an issue and comment there, feel free.

h-vetinari · 2019-05-31T13:13:37Z

Quoting myself from this comment:

@h-vetinari: After not getting a response in #24697, I'd like to ask again how to proceed here. Deprecating lists within lists would have IMO been the most elegant solution, and was the only complaint against adding Series.set_index. Now that the decision has been taken to live with this ambiguity, can we please reopen #22225?

Tagging the participants of #24046 and #24697: @jreback @gfyoung @WillAyd @jorisvandenbossche @TomAugspurger @toobaz @jbrockmendel

h-vetinari · 2019-06-21T10:41:00Z

@jreback @gfyoung @WillAyd @jorisvandenbossche @TomAugspurger
Please reopen this PR.

gfyoung added the Enhancement and Indexing (related to indexing on series/frames, not to indexes themselves) labels Aug 7, 2018

h-vetinari mentioned this pull request Aug 7, 2018

TST/CLN: break up & parametrize tests for df.set_index #22236

Merged

This was referenced Aug 28, 2018

ENH: add return_inverse to duplicated for DataFrame/Series/Index/MultiIndex #21645

Closed

TST: fixturize series/test_alter_axes.py #22526

Merged

h-vetinari force-pushed the series_set_index branch from 6ba8ca9 to 46eee7b Compare September 18, 2018 08:41

h-vetinari force-pushed the series_set_index branch 4 times, most recently from 7355d16 to aadf50b Compare September 18, 2018 09:00

h-vetinari changed the title ~~ENH: Add set_index to Series (WIP)~~ ENH: Add set_index to Series Sep 18, 2018

h-vetinari force-pushed the series_set_index branch 6 times, most recently from eeffe68 to 3edbaea Compare September 18, 2018 11:04

h-vetinari force-pushed the series_set_index branch 2 times, most recently from 8ab863b to 949e699 Compare October 19, 2018 14:44

h-vetinari commented Oct 19, 2018

gfyoung approved these changes Oct 28, 2018

jreback requested changes Oct 30, 2018

TomAugspurger approved these changes Jan 4, 2019

h-vetinari added 2 commits January 4, 2019 22:36

Merge remote-tracking branch 'upstream/master' into series_set_index

2428e09

Fix docstrings

31b5990

Review (TomAugspurger)

aaa8644

h-vetinari commented Jan 4, 2019

WillAyd requested changes Jan 4, 2019

Review (WillAyd)

81cfd82

h-vetinari commented Jan 5, 2019

Refactor docstring into _shared_docs

7f35d2c

jreback closed this Jan 6, 2019

pandas-dev locked as resolved and limited conversation to collaborators Jan 6, 2019

pandas-dev unlocked this conversation Jan 7, 2019

h-vetinari mentioned this pull request Jan 16, 2019

DOC: update DF.set_index #24762

Merged

h-vetinari mentioned this pull request Feb 1, 2019

API/ERR: allow iterators in df.set_index & improve errors #24984

Merged

3 tasks

h-vetinari mentioned this pull request Mar 10, 2019

DEPR/API: disallow lists within list for set_index #24697

Closed

5 tasks

ghost mentioned this pull request Jul 31, 2019

ENH: Add Series.set_index #27504

Closed

4 tasks
