Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Support operators for ExtensionArray #20889

Closed
wants to merge 3 commits into from

Conversation

Dr-Irv
Copy link
Contributor

@Dr-Irv Dr-Irv commented Apr 30, 2018

Some Notes

  • @jorisvandenbossche suggested that I put up an initial implementation for review, so I expect there will be lots of suggestions for doing it more, better, or different.
  • There is a hook in there that is untested, where if an implementer (like me) wants to have the logical operators not return a Series with bool, they can do that. I'm testing it with my library, but once we agree on this basic implementation, I will add tests for that specific use case.
  • The idea is that you first see if the ExtensionArray class has implemented the operator. If not, then you just go element by element.

@codecov
Copy link

codecov bot commented May 1, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@93e7123). Click here to learn what that means.
The diff coverage is 93.81%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #20889   +/-   ##
=========================================
  Coverage          ?   91.78%           
=========================================
  Files             ?      153           
  Lines             ?    49423           
  Branches          ?        0           
=========================================
  Hits              ?    45362           
  Misses            ?     4061           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.17% <93.81%> (?)
#single 41.89% <10.3%> (?)
Impacted Files Coverage Δ
pandas/util/testing.py 84.43% <100%> (ø)
pandas/core/arrays/base.py 84.7% <100%> (ø)
pandas/core/indexes/base.py 96.59% <91.66%> (ø)
pandas/core/series.py 93.87% <91.66%> (ø)
pandas/core/ops.py 96.19% <94.11%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 93e7123...8aea5cf. Read the comment docs.

@jorisvandenbossche jorisvandenbossche added the ExtensionArray Extending pandas with custom dtypes or arrays. label May 2, 2018
@jreback
Copy link
Contributor

jreback commented May 5, 2018

just an FYI, rather than a single PR, this needs to be a series of smaller, self-consistent PR's. much easier to review and revise, otherwise this will get bogged down.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented May 5, 2018

@jreback Right now, I had to include some code from an early version of #20885. Once that is approved, that will shorten this one a little bit. I can then first submit something that deals with #20825 . And once that is approved, I can submit the things related to ops.

Let me know if this plan works for you.

@jbrockmendel
Copy link
Member

just an FYI, rather than a single PR, this needs to be a series of smaller, self-consistent PR's. much easier to review and revise, otherwise this will get bogged down.

@Dr-Irv There are a bunch of planned/discussed steps between the status quo and the functionality in this PR. If you're up to help with the whole transition that'd be great. If you're mostly interested in getting this specific feature in place ASAP, we can talk about workarounds that you can apply in the interim.

@jreback LMK if I'm overstepping by offering to shepherd this process. It's up my alley.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented May 21, 2018

@jbrockmendel I'd like both. I need the operator functionality for ExtensionArray . So if you guys don't want to accept this PR, then I'd like to see the workaround so I can get my use case working, and since I have put a lot of thought into getting this PR to work, I can also help with the transition.

@TomAugspurger
Copy link
Contributor

@jreback LMK if I'm overstepping by offering to shepherd this process. It's up my alley.

I'd appreciate your expertise here @jbrockmendel.

FWIW, I'll have some availability this week, but very little next week.

@jbrockmendel
Copy link
Member

then I'd like to see the workaround so I can get my use case working

The workaround I have in mind assumes you are mostly interested in patching Series arithmetic and logical operators. Is that accurate?

and since I have put a lot of thought into getting this PR to work, I can also help with the transition

@Dr-Irv sounds good. The first thing I'd suggest is figuring out if there is anything in this PR that is not strictly necessary for the feature(s) you need or can otherwise be separated into independent PRs. Based on a very quick glance it looks like the tests for get or combine might be independent of the rest of the PR.

Are the changes to boolean ops logically independent of the changes to arithmetic ops? If so, I suggest focusing on the boolean ops first, since they aren't nearly so wound up with Index vs Series shenanigans. There might be something in #19795 worth resurrecting.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented May 21, 2018

@jbrockmendel

The workaround I have in mind assumes you are mostly interested in patching Series arithmetic and logical operators. Is that accurate?

Yes, that is correct. I also need the results of arithmetic and logical operators to be Series of objects.

This PR has 3 things in it.

  1. Code from BUG: Fix Series.get() for ExtensionArray and Categorical #20885 that was still in review mode when I submitted the present PR. The tests for get that you mention fall into that category. I can remove that to make this PR more up-to-date.
  2. Fixes to support combine correctly. This addresses BUG: Series.combine() fails with ExtensionArray inside of Series #20825 . I can break that out into a separate PR. But that is needed for testing the operator stuff.
  3. The code to support the arithmetic and logical operators for ExtensionArray. It is the same code that does both.

So, if you want, I can create a new PR that only works on combine. Then this one we are discussing here will only have the operator support, the meat of which is in the new internal method dispatch_to_extension_ops.

@jreback
Copy link
Contributor

jreback commented May 21, 2018

FYI I have this almost all implemented for integer na support: jreback@6fc19f9 (this is a bit stale), will be updated soon.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented May 22, 2018

I think it is still worthwhile to push forward this PR to implement this separately from the integer-array PR (which might take more discussion, and it also a very big one right now).

@jreback can you give more concrete feedback on the way @Dr-Irv implemented things here, based on what you did in the other PR? (for the changes in ops.py)
(once the unrelated changes are removed (for combine and get), I don't think this PR needs to be splitted)

There are a bunch of planned/discussed steps between the status quo and the functionality in this PR.

@jbrockmendel can you elaborate on this a bit? (regarding this PR, what would you do instead)

However, if the underlying ExtensionDtype overrides the logical
operators, then the implementer may want to have an ExtensionArray
subclass contain the result. This can be done by changing the property
_logical_result from its default value of None to the _from_sequence
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it needed to have this property? Can't we simply detect whether the result is a boolean numpy array or again an ExtensionArray ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this use-case a bit more? I think we will certainly want Series <compare> Series to always be an ndarray of booleans.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't speak for the author, but my assumption was that this has to do with some of the spaghetti-code in ops._bool_method_SERIES, where sometimes a bool-dtype is returned and other times an int-dtype is returned (and datetimelike are currently all broken, see #19972, #19759). Straightening out this mess independently of EA implementations is part of the plan referred to above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, in my use case, I need the boolean operators to return an object that represents the relation. I'm using pandas on top of 2 different libraries (that functionally are the same) where the operators (x <= y), (x >= y) and (x == y) are not booleans, but objects representing the relations.

# ------------------------------------------------------------------------
# Utilities for use by subclasses
# ------------------------------------------------------------------------
def is_sequence_of_dtype(self, seq):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what is this needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, this is expected to always be true. If it isn't I'd recommend making a superclass that has all the scalar types like I do in https://github.com/ContinuumIO/cyberpandas/blob/c66bbecaf5193bd284a0fddfde65395d119aad41/cyberpandas/ip_array.py#L22

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's suppose that you don't implement the ExtensionArray operators as methods of the subclass of ExtensionArray, but you let the underlying ExtensionDtype handle the operators for you. (This is what I used for Decimal). Some of the operators will return a sequence containing all objects of ExtensionDtype. Some operators (e.g., logical ones), will not. So internally, it's useful to have a test to know whether a sequence has objects of the corresponding ExtensionDtype so that you can then return an ExtensionArray as a result, otherwise, you just let things get coerced based on the type in the sequence.

@@ -990,6 +991,93 @@ def _construct_divmod_result(left, result, index, name, dtype):
)


def dispatch_to_extension_op(left, right, op_name=None, is_logical=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the dispatch_to_index_op uses op instead of op_name. Is there a reason for this difference? (and I mean in the actual implementation here it assumes a method name, and not an operator function that can be called)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for the difference is as follows. The way I implemented this, we first look to see if the operator is defined for the ExtensionArray subclass. If not, then we use the implementation of the operator on the underlying ExtensionDtype. So if you pass op, you get the operator bound to a specific class. If you have op_name, then we can translate to either the ExtensionArray subclass implementation or the ExtensionDtype implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the op is not defined on ExtensionArray, calling it directly (op(left_values, right_values)) will raise a TypeError that you can catch (which you already do), so I don't really see the difference

except TypeError:
pass
except Exception as e:
raise e
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not doing op(left.values, right/right.values)? What does this manual checking/trying does that the former does not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above. In the code above, I'm testing to see if the ExtensionArray subclass has the operator defined. op and method are the same, as method is computed as the operator on left.values. I could change the name of the variable from method to op to make this clearer.

However, if the underlying ExtensionDtype overrides the logical
operators, then the implementer may want to have an ExtensionArray
subclass contain the result. This can be done by changing the property
_logical_result from its default value of None to the _from_sequence
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this use-case a bit more? I think we will certainly want Series <compare> Series to always be an ndarray of booleans.

# ------------------------------------------------------------------------
# Utilities for use by subclasses
# ------------------------------------------------------------------------
def is_sequence_of_dtype(self, seq):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, this is expected to always be true. If it isn't I'd recommend making a superclass that has all the scalar types like I do in https://github.com/ContinuumIO/cyberpandas/blob/c66bbecaf5193bd284a0fddfde65395d119aad41/cyberpandas/ip_array.py#L22

deflen = len(left)
excons = type(left.values)._from_sequence
exclass = type(left.values)
testseq = left.values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having trouble understanding these names. (method makes sense).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excons is the constructor used to construct a result from a sequence that has all its elements of type ExtensionDtype. (so "ex" for "Extension" and "cons" for "constructor") exclass is the underlying class of the ExtensionArray subclass.

ovalues = [parm] * deflen
return ovalues

if res is NotImplemented:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain this fallback a bit more?

If the EA doesn't define ops, then I'm perfectly fine with raising NotImplementedError at the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above. The idea here is that either the EA defines the ops, or the ExtensionDtype defines the ops.

The idea here is that if you know that your underlying ExtensionDtype already has the ops defined, you don't have to implement each of the ops at the ExtensionArray level.

I used DecimalArray as an example. The operators are already defined for Decimal so there is no reason to have to implement the operators for an array of Decimal.

@jbrockmendel
Copy link
Member

There are a bunch of planned/discussed steps between the status quo and the functionality in this PR.

can you elaborate on this a bit? (regarding this PR, what would you do instead)

@jorisvandenbossche

@jorisvandenbossche
Copy link
Member

@jbrockmendel but is it needed to fix all of this (make sure all of Series ops are dispatched correctly etc, resolve duplication between index and series, ...) before getting basic ops working for ExtensionArrays?
It's true there is a lot of work to do the full internal transition, but that doesn't mean we can already support basic ops for external extension arrays.

@jreback
Copy link
Contributor

jreback commented May 23, 2018

echo the above comments, this has a lot of othrogonal changes going on. let's strip out things to the minimal changes.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented May 23, 2018

In response to @jbrockmendel:

The place to define the element-by-element fallback (essentially object-dtype behavior) is in the downstream EA subclass.

I really disagree with this. If you already have the operators defined for the ExtensionDtype, why should you have to implement the operators for ExtensionArray? My implementation gives the person implementing an ExtensionArray subclass the option of defining the operators at the EA level, or letting the operators defined (if they are defined) for the ExtensionDtype do all of the work.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented May 23, 2018

@jreback Since you wanted something smaller, I have split off the code that fixes #20825 into #21183 . This is to handle the issue with combine. I suggest that people reviewing this PR look at #21183 first. Once it is accepted, I will then update this PR to only have the ops functionality.

Note that the combine bug has to be fixed in order for the tests in this PR to work right. So hopefully you will accept #21183, and then this PR for ops will end up being smaller.

@jorisvandenbossche
Copy link
Member

Why do the tests need the combine fix?

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented May 23, 2018

@jorisvandenbossche To test the operator functionality, I am leveraging the code that tests operators for Series that is in pandas/tests/series/test_operators.py:TestSeriesOperators, and those tests use combine to verify the results. By fixing the bug in combine, then the tests for operators are leveraging existing test code.

@jbrockmendel
Copy link
Member

I really disagree with this. If you already have the operators defined for the ExtensionDtype, why should you have to implement the operators for ExtensionArray? My implementation gives the person implementing an ExtensionArray subclass the option of defining the operators at the EA level, or letting the operators defined (if they are defined) for the ExtensionDtype do all of the work.

Implementing it on the DType is fine, I guess. It should be really easy to go from there to implementing it on the EA subclass. Once that is done, the changes to core.ops (and I hope/think Series._binop) can pretty much all be skipped and just call dispatch_to_index_op with the appropriate EA subclass (this is already done with pd.Categorical on L1193)

In most cases returning NotImplemented means "try the reversed operation". I'd like to avoid giving that return value a double meaning if possible.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented May 24, 2018

@jbrockmendel You wrote:

Implementing it on the DType is fine, I guess. It should be really easy to go from there to implementing it on the EA subclass.

Are you suggesting that for someone implementing their own EA subclass, it should be really easy for them to implement the operators on the EA subclass? Or are you suggesting that pandas provide default operators for ExtensionArray that are defined based on the operators on the ExtensionDtype? If the former, I don't see how we can do that. If the latter, the implementation in this PR does that.

The changes in this PR that are in Series._binop are now moved to #21183 in order to make combine work.

I can investigate whether dispatch_to_index_ops can do what dispatch_to_extension_ops does, but I'm doubtful, because my goal was to allow supporting the operators whether they were defined by the user at the EA subclass level or at the ExtensionDtype subclass level.

@jorisvandenbossche
Copy link
Member

To try to take a step back from the discussion:

  • We want that it is easy for ExtensionArray ops to just fall back to the scalar ops (the ExtensionDtype.type objects might already define ops, so use that).
    • For this reason, @Dr-Irv implemented this fall back inside ops.py in case the ExtensionArray has no method for the specific op.
  • On the other hand, we can also argue that it might be easier that the ops are always defined on the ExtensionArray. That would make the implementation inside ops.py a lot easier, as the fallback to the ExtensionDtype.type scalars could be removed.
  • I would like to add that for my use case, I want that it is possible for the ExtensionArray to indicate that no such fallback should be tried (as this can be expensive). Defining the op and letting it raise might be enough for that (however, currently TypeError is catched and then still the fallback is tried. I think we should not do that).

Given the above, I think we can try to harmonize the goal of making this fallback both easy as explicit (1 and 3), and keep the ops defined on the ExtensionArray and ops.py simple (2).
What if we provide a mixin that defines all those basic fallbacks? In that case, if the extension author decides it would like to enable the fallback, it can simply inherit from the mixin. Also, if then the behaviour of certain methods (like the booleans) are not exactly what he/she wants, just overwrite that specific method again in your array class.

@TomAugspurger
Copy link
Contributor

Regarding the elementwise fallback to ExtensionDtype.type, isn't that relatively easy to define on the EA subclass?

# in MyEASubclass
def __lt__(self, other):
    return np.array([a < b for a, b in zip(self, other)])

likewise for the other methods.

I would like to add that for my use case, I want that it is possible for the ExtensionArray to indicate that no such fallback should be tried (as this can be expensive).

For cyberpandas, the scalar fall back would be incorrect. We use IPAddress(0) as the NA value, so comparisions with it should be NA to match pandas behavior.

In [3]: arr = IPArray([0, 1, 2, 3])

In [4]: arr <= arr
Out[4]: array([False,  True,  True,  True])

In [5]: arr.astype(object) <= arr.astype(object)
Out[5]: array([ True,  True,  True,  True])

A mixin seems like a reasonable compromise.

@jorisvandenbossche
Copy link
Member

Regarding the elementwise fallback to ExtensionDtype.type, isn't that relatively easy to define on the EA subclass?

It's indeed relatively straightforward, although a bit more complicated as you small example if you want to cover all cases (scalar vs array other etc). Therefore providing a default implementation available as mixin would make this easier.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented May 24, 2018

@TomAugspurger You wrote:

Regarding the elementwise fallback to ExtensionDtype.type, isn't that relatively easy to define on the EA subclass?

Yes, but having to do it for all of the operators is painful. What I implemented avoids that. Take the example of DecimalArray. You'd like to have to avoid implementing every single operator supported by decimal.Decimal in DecimalArray.

I like the idea of using a mixin as proposed by @jorisvandenbossche. However, it seems that I ought to wait until @jreback finishes #21191 and #21160 and it is merged in, and then I can see where things are and make modifications as needed for my use case.

@jorisvandenbossche
Copy link
Member

Yes, but having to do it for all of the operators is painful.

It should be possible to write a function that generates all the different operators (which is how we often do it inside pandas). But we can use this approach for the mixin.

However, it seems that I ought to wait until @jreback finishes #21191 and #21160 and it is merged in, and then I can see where things are and make modifications as needed for my use case.

I don't think you need to wait with eg writing the mixin with fallback (as that use case will not be covered in #21191).

@jbrockmendel
Copy link
Member

@Dr-Irv

Are you suggesting that for someone implementing their own EA subclass, it should be really easy for them to implement the operators on the EA subclass?

Yes, this.

Or are you suggesting that pandas provide default operators for ExtensionArray that are defined based on the operators on the ExtensionDtype?

The element-by-element implementation in the PR is effectively the object-dtype behavior, and I'd be fine with that being a default at the BaseArray level (@TomAugspurger thoughts? It would also be reasonable for BaseArray to implement these ops to raise com.AbtractMethodError).

The changes in this PR that are in Series._binop are now moved to #21183 in order to make combine work.

I'll take a closer look at #21883, thanks.

BTW, thanks for pushing forward on this. You've clearly put some thought+effort into this and it is very much appreciated.

@jorisvandenbossche

but is it needed to fix all of this (make sure all of Series ops are dispatched correctly etc, resolve duplication between index and series, ...) before getting basic ops working for ExtensionArrays?

Not at all, the steps I mentioned were very much for the "what would you do instead" question. I do think that implementing ops at the array level will minimize duplicated effort in core.ops (a lot of effort went in to un-tangling older special-casing in core.ops, a process which I'm not eager to repeat).

What if we provide a mixin that defines all those basic fallbacks? In that case, if the extension author decides it would like to enable the fallback, it can simply inherit from the mixin. Also, if then the behaviour of certain methods (like the booleans) are not exactly what he/she wants, just overwrite that specific method again in your array class.

+1.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented May 24, 2018

My plan right now is to look at the mixin approach on May 29 or later.

When I created this PR, the whole point was to inspire discussion on the design and implementation, so I think this discussion has been healthy in terms of us moving forward.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented May 30, 2018

Closing this, as there is a new proposed implementation in #21261 .

@Dr-Irv Dr-Irv closed this May 30, 2018
@Dr-Irv Dr-Irv deleted the neweapos branch June 29, 2018 13:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ExtensionArray support for comparisons to fundamental types and lists of those types
5 participants