Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP [DOC/EA]: developer docs for implementing Series.round/sum/etc in EA #26918

Closed
wants to merge 31 commits into from
Closed

WIP [DOC/EA]: developer docs for implementing Series.round/sum/etc in EA #26918

wants to merge 31 commits into from

Conversation

ghost
Copy link

@ghost ghost commented Jun 18, 2019

This summarized my current understanding of how the dispatch of Series methods should work for EA (related #26913, #26730, and #26835) and It also provides helpful warnings against the several pitfalls I fell into while figuring things out. There are a lot more details involved then I foresaw.

The document is sloppy pseudo-sphinx for now (I haven't even compiled it yet), just to see if you agree with the content before spending the effort to polish it. I will, if/when the content represent a consensus instead of just my own take on the issue.

The description of how things work doesn't match the current state of the code right now, because there are a lot of little cleanups to be made in various functions:

  • Getting rid of com.values_for
  • Getting rid of _values and maybe even _data in some cases
  • Getting rid of redundant calls to np.asarray(self) or np.asarray(self._values) and such.

but I don't think there's actually any radical change implied. It's just about making methods consistent in how they delegate to numpy, so that EA developers can count on predictable behavior and so reason about how the pandas, numpy, and their own classes interact.

So I hope this will actually seem natural and undramatic, but I may have overlooked one or more vital issues.

One important realization I've come to is that the recent discussion regarding NEP18 was somewhat
of a red herring rabbit-hole. IMO, pandas should simply provide the explicit guarantee that some things go via _reduce (or a future __reduce_pandas__#26915), while the others go to np.<func>(array), and that the return value type should be an ExtensionArray when applicable. How the EA author deals with numpy's dispatch logic is nothing we need or should want to to speak out on. That's between the EA author and numpy.

Avoiding NEP18 altogether means that #26817, the round() PR, can be simplified to just a one line change to Series.round, and adding a new 4-line round method on DecimalArray as an example.
This would be the typical change needed to support (and provde examples) for other Series methods which are currently out of alignment. So going down the NEP18 (as suggested to me) seems like it was a misstep IMO, now we know.

Possible Concerns for future (not covered in doc):

  • How should we support EA for np operation which Series does not provide, like ceil (discussed in Series doesn't implement floor/ceil ops (+EA support needed) #26892). We may need to modify Series.apply with a special case for EA, so that s.apply(np.floor) works for EA. I'm not sure.
  • What if future operations are added to the Series namespace? If we ever add a new Series method, which is not a reduction (going via _reduce), and is not a convenience around some
    numpy function (so that the guarantee mentioned above doesn't hold), we will need to have a new
    mechanism in order to allow EA to support it. if this is a real concern, and there's no good answer, we might want to think about having a more general dispatching interface in place right now, instead of settling on the numpy dispatch guarantee mentioned, so as to be handle both existing and future operations.

As an imaginary example, if next year we add the Series.fnorb method, which fnorbs a numeric array using a new, cutting-edge, cython-optimized fnorbing algorithm, how do we then allow EA authors to provide an implementation for their own int ndarray backed EA?

cc @jorisvandenbossche, @TomAugspurger.

@codecov
Copy link

codecov bot commented Jun 18, 2019

Codecov Report

Merging #26918 into master will decrease coverage by 50.77%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #26918       +/-   ##
===========================================
- Coverage   91.87%   41.09%   -50.78%     
===========================================
  Files         180      180               
  Lines       50712    50712               
===========================================
- Hits        46590    20841    -25749     
- Misses       4122    29871    +25749
Flag Coverage Δ
#multiple ?
#single 41.09% <ø> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/plotting/_matplotlib/__init__.py 0% <0%> (-100%) ⬇️
pandas/io/gcs.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/io/s3.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.1%) ⬇️
... and 132 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update baa77c3...92d4b9b. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 18, 2019

Codecov Report

Merging #26918 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26918      +/-   ##
==========================================
- Coverage   91.87%   91.86%   -0.01%     
==========================================
  Files         180      180              
  Lines       50712    50746      +34     
==========================================
+ Hits        46590    46620      +30     
- Misses       4122     4126       +4
Flag Coverage Δ
#multiple 90.46% <ø> (ø) ⬆️
#single 41.09% <ø> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/frame.py 96.89% <0%> (-0.12%) ⬇️
pandas/core/strings.py 98.93% <0%> (ø) ⬆️
pandas/core/indexes/interval.py 96.44% <0%> (ø) ⬆️
pandas/core/indexes/base.py 96.71% <0%> (ø) ⬆️
pandas/core/internals/managers.py 95.21% <0%> (ø) ⬆️
pandas/compat/numpy/__init__.py 93.1% <0%> (ø) ⬆️
pandas/core/generic.py 94.2% <0%> (ø) ⬆️
pandas/core/series.py 93.66% <0%> (+0.01%) ⬆️
pandas/core/arrays/sparse.py 93.73% <0%> (+0.02%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update baa77c3...1f689c1. Read the comment docs.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jun 18, 2019

One important realization I've come to is that the recent discussion regarding NEP18 was somewhat
of a red herring rabbit-hole. IMO, pandas should simply provide the explicit guarantee that some things go via _reduce (or a future reduce_pandas#26915), while the others go to np.(array), ...How the EA author deals with numpy's dispatch logic is nothing we need or should want to to speak out on. That's between the EA author and numpy.

I agree that, from the pandas side, all we should expect is that the EA is handled by a numpy func and the EA can determine itself in which way it wants to be handled (for the non-ufuncs: only __arrray__, having a similar named method that is being dispatched to by numpy, or __array_function__).

But, that said, I think the exercise with __array_function__ is still very helpful and interesting, and we still might want to include a test case with that, as it seems, at this moment at least, the "right thing to do" given the direction numpy is taking (instead of the method dispatch)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the write up. Didn't comment in detail yet, but I think it is certainly valuable to include this in some form (although it could probably be made a bit less verbose I think)

doc/source/development/extending.rst Outdated Show resolved Hide resolved
Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many typos, please review it yourself, and let us know when it's ready for someone else revision.

doc/source/development/extending.rst Outdated Show resolved Hide resolved
doc/source/development/extending.rst Outdated Show resolved Hide resolved
doc/source/development/extending.rst Outdated Show resolved Hide resolved
doc/source/development/extending.rst Outdated Show resolved Hide resolved
doc/source/development/extending.rst Outdated Show resolved Hide resolved
@datapythonista datapythonista added Docs ExtensionArray Extending pandas with custom dtypes or arrays. labels Jun 18, 2019
@ghost
Copy link
Author

ghost commented Jun 19, 2019

xref #26935, motivated by how messy this is.

@ghost
Copy link
Author

ghost commented Jun 19, 2019

Thanks for the review, @datapythonista. I think I fixed most of the remaining typos too. Let me know if you have more suggestions.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jun 19, 2019

The document is sloppy pseudo-sphinx for now (I haven't even compiled it yet), just to see if you agree with the content before spending the effort to polish it.

@pilkibun in this case, I find it perfectly reasonable to have such a draft PR to get early feedback.
But just a tip to make this more clear to potential reviewers: you can put WIP: or [WIP] in the beginning of the title.
And, github now also has a feature to make "draft" PRs (but this can only be done at the moment of opnening the PR, I think)

doc/source/development/extending.rst Outdated Show resolved Hide resolved
doc/source/development/extending.rst Outdated Show resolved Hide resolved
doc/source/development/extending.rst Outdated Show resolved Hide resolved
doc/source/development/extending.rst Outdated Show resolved Hide resolved
doc/source/development/extending.rst Outdated Show resolved Hide resolved
doc/source/development/extending.rst Outdated Show resolved Hide resolved
def round(self, decimals=0, **kwds):
pass

An alternative approach to implementing individual functions, is to override
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't recommend this.

doc/source/development/extending.rst Outdated Show resolved Hide resolved
doc/source/development/extending.rst Outdated Show resolved Hide resolved
doc/source/development/extending.rst Outdated Show resolved Hide resolved
@ghost ghost changed the title DOC/EA: developer docs for implementing Series.round/sum/etc in EA WIP [DOC/EA]: developer docs for implementing Series.round/sum/etc in EA Jun 19, 2019
@ghost
Copy link
Author

ghost commented Jun 19, 2019

Thanks for the detailed review @TomAugspurger, very helpful.

@ghost
Copy link
Author

ghost commented Jun 30, 2019

Pandas is moving in a more specific direction with this, and I don't have have a full picture of it. This already seems out of date. I'm closing, but I'll leave the branch in case someone wants to pick it up instead of starting from scratch.

@ghost ghost closed this Jun 30, 2019
@ghost ghost deleted the doc_implemnting_EA_ops branch July 31, 2019 16:20
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants