-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Support ExtensionArray operators via a mixin #21261
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some quick feedback
pandas/core/arrays/base.py
Outdated
ovalues = [parm] * len(self) | ||
return ovalues | ||
lvalues = convert_values(self) | ||
rvalues = convert_values(other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably do alignment as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't need to, for a few reasons:
- The
_binop
method from here is a method onExtensionArray
(i.e.,isinstance(self, ExtensionArray) == True
- This method is called by
dispatch_to_extension_ops
inops.py
, which is called after alignment has already been done in_arith_method_SERIES
and alignment is ensured in_comp_method_SERIES
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, OK, this is also tested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche Yes, this is tested because I am using the same tests that are used for general operators on Series, which tests things that are misaligned, etc.
See pandas/tests/extension/tests/decimal/test_decimal.py
that leverages the TestSeriesOperators.test_operators
method from pandas\tests\series\test_operators
.
As an aside, doing it this way uncovered the issues with .get
and .combine
that I submitted.
pandas/core/arrays/base.py
Outdated
return f(b) | ||
else: | ||
return NotImplemented | ||
res = [callfunc(a, b) for (a, b) in zip(lvalues, rvalues)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we would do op(a, b)
directly (so in practice doing a + b
instead of a.__add__(b)
), you would not need the checking for NotImplemented below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar as
pandas/pandas/core/indexes/category.py
Line 793 in c85ab08
def _make_compare(op): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check for NotImplemented
is needed because the underlying ExtensionDtype
class (which are the objects in lvalues
and rvalues
above) may not implement the specific operator.
Not sure what you are referring to when you wrote "similar as" the code in `pandas/core/indexes/category.py". That implementation is not doing things element by element, which is what I'm doing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you doing it this way, dynamically generating this? there is a limited set of ops, just write them out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check for NotImplemented is needed because the underlying ExtensionDtype class (which are the objects in lvalues and rvalues above) may not implement the specific operator.
Yes, but if you do op(a, b)
, then this will always raise a TypeError, and not return NotImplemented, and then you don't need to check manually whether any NotImplemented object was returned (and the generated error message will already be fine).
Not sure what you are referring to when you wrote "similar as" the code in `pandas/core/indexes/category.py". That implementation is not doing things element by element, which is what I'm doing here.
I simply referenced it because it does something very similar (generating the comparison methods), and does this by using the operator.
methods (which is what I mean with op(a, b)
above).
Whether it then does not do it element-wise does is a detail.
So what I mean is the following. You basically do
op_name = '__add__'
f = getattr(a, op_name, None)
f(a, b)
which translates to doing a.__add__(b)
and might return NotImplemented.
What I meant with my comment was doing instead:
f = operator.add
f = f(a, b)
which translates to a + b
and will directly raise the appropriate error.
And the code I referenced is an example how the the __add__
and operator.add
are linked in generating the methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you doing it this way, dynamically generating this? there is a limited set of ops, just write them out.
@jreback are you speaking about this create_method
function that you commented on? Or the fact below the create_method
function is used in a loop instead of manually repeated?
Because if it is the first: that is how we do it almost everywhere in the pandas codebase. If it is the second: yes that is true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is what I mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback This is a bit of the "what comes first" discussion held at #20889 (comment) with @jorisvandenbossche . My goal with the mixin is that if you already have the ExtensionDtype
subclass (e.g. Decimal
) that has the operators defined, you just pull in the mixin, and you're all set with the operators.
In 21d76b3 you are creating the basis for a more general operator support. I could make the mixin use the more general framework you've proposed, but I guess that should be merged in before I do the mixin. (That's the "what comes first" issue)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche My latest commit does as you suggest. This removes the tests for NotImplemented
@jreback My latest commit uses a similar pattern to what you had in the referenced commit, just creating an operator one by one instead of using a loop
Codecov Report
@@ Coverage Diff @@
## master #21261 +/- ##
==========================================
+ Coverage 91.9% 91.9% +<.01%
==========================================
Files 154 154
Lines 49557 49625 +68
==========================================
+ Hits 45544 45608 +64
- Misses 4013 4017 +4
Continue to review full report at Codecov.
|
pandas/core/arrays/base.py
Outdated
return f(b) | ||
else: | ||
return NotImplemented | ||
res = [callfunc(a, b) for (a, b) in zip(lvalues, rvalues)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you doing it this way, dynamically generating this? there is a limited set of ops, just write them out.
@jreback The idea is to use one piece of code for all of the operators, and to use the implementation of the operator on the underlying Also, this is following the pattern used in |
pandas/core/arrays/base.py
Outdated
return f(b) | ||
else: | ||
return NotImplemented | ||
res = [callfunc(a, b) for (a, b) in zip(lvalues, rvalues)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Orthogonal to @jorisvandenbossche 's comment, instead of [callfunc(a, b) for (a, b) in zip(lvalues, rvalues)]
consider libops.vec_binop(lvalues, rvalues, callfunc)
. Should be equivalent but more performant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't seem to work properly?
In [64]: a = np.random.randn(100000).astype(object)
In [65]: b = np.random.randn(100000).astype(object)
In [66]: def func(a, b):
...: return a + b
...:
...:
In [67]: a + b
Out[67]:
array([-1.7373671804522872, -0.44608341233368065, -3.5963900176223706, ...,
-1.0538754426986237, -2.0261407625013654, -0.6206810372160029], dtype=object)
In [68]: pd._libs.ops.vec_binop(a, b, func)
Out[68]: array([nan, nan, nan, ..., nan, nan, nan], dtype=object)
In [69]: np.array([func(aa, bb) for aa, bb in zip(a, b)], dtype=object)
Out[69]:
array([-1.7373671804522872, -0.44608341233368065, -3.5963900176223706, ...,
-1.0538754426986237, -2.0261407625013654, -0.6206810372160029], dtype=object)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's because it does a maybe_convert_bool
at then end, I suppose. Is there a version that does not do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, I guess that explains the "bin" part of "vec_binop". So it looks like a) the original suggestion is incorrect and b) it would probably be easy to implement and may improve perf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought "binary" meant that it is an op between 2 values, not that the result in boolean. So I think the naming is just confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed on both counts.
pandas/core/arrays/base.py
Outdated
@@ -610,3 +615,91 @@ def _ndarray_values(self): | |||
used for interacting with our indexers. | |||
""" | |||
return np.array(self) | |||
|
|||
|
|||
def ExtensionOpsMixin(include_arith_ops, include_logic_ops): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a compelling reason to make a factory for this instead of just making two mixin classes?
class ExtensionArithmeticMixin(...):
class ExtensionLogicMixin(...):
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel I am sharing the code of the operator between arithmetic and comparison operators. Alternatively, I could just create a base mixin class, and have ExtensionArithmeticMixin
and ExtensionLogicMixin
inherit from the base.
If you guys want me to make that change, let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel I implemented your suggestion. It does look cleaner. It's in the latest commit
pandas/core/arrays/base.py
Outdated
|
||
# We can't use (NotImplemented in res) because the | ||
# results might be objects that have overridden __eq__ | ||
if any(isinstance(r, type(NotImplemented)) for r in res): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the comment above. Is there a scenario in which isinstance(r, type(NotImplemented)) and not (r is NotImplemented)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comment was about using in
. Having said that, I think I can use (r is NotImplemented)
vs. isinstance(r, type(NotImplemented))
But now I have removed that in my working version based on @jorisvandenbossche suggestion to use the op rather than the op_name
pandas/core/arrays/base.py
Outdated
for op_name in arithops: | ||
setattr(_ExtensionOpsMixin, op_name, create_method(op_name)) | ||
|
||
if include_logic_ops: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can these be called comparison_ops
instead of logic_ops
? I tend to think of the latter as meaning `&|^'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pandas/core/indexes/base.py
Outdated
# try that | ||
try: | ||
iloc = self.get_loc(key) | ||
return s[iloc] | ||
except KeyError: | ||
if is_integer(key): | ||
if (len(self) > 0 and | ||
self.inferred_type in ['integer', 'boolean']): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What makes this necessary? I'm not clear on how it relates to the Mixin classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this you can comment on the relevant other PR (#21260)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel the other PR (#21260) was needed to make the tests here work right.
pandas/core/ops.py
Outdated
@@ -990,6 +991,20 @@ def _construct_divmod_result(left, result, index, name, dtype): | |||
) | |||
|
|||
|
|||
def dispatch_to_extension_op(left, right, op_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be accomplished with:
return dispatch_to_index_op(op, left, right, left.values.__class__)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless we require that the ExtensionArray constructor handles objects of itself (and does not copy them), we cannot do the your example code (and currently we don't require this from extension array implementations I think.
It is also don't think it necessarily improves readability of the code by reusing that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. Eventually dispatch_to_index_op is intended to give way to an EA-based dispatch.
@jbrockmendel @jreback @jorisvandenbossche My latest commit also does something with the JSON example to see that we trap different kinds of exceptions correctly. I'm not sure if I should put that there or not. It also illustrates that you can define just single methods using the mixin, i.e., if your underlying dtype only supports addition, but nothing else, you can then define an operator for addition and not define the other ones. |
pandas/core/arrays/base.py
Outdated
|
||
return res_values | ||
|
||
name = '__{name}__'.format(name=op_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you defining create_method as anything other than NotImplementedError
this is way too opionated - if u want to define a function to do this and an author can use this ok
but it should not be the default
maybe if u want to create an ExtensionOpsMixinForScalars (long name though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please take a closer look at the diff if you give such an opinionated (pun intended :-)) comment
- this is not the default, it's an optional mixin extension authors can reuse (so what you suggest?)
- so it is exactly the point to be opinionated and fallback to the scalars (we might try to think of a better name that reflects this, as you suggested)
- it is what we discussed in the other PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche maybe I wasn't clear.
ExtensionOpsBase
should simply have NotImplementedError for the actual implementation of the arithmethic and comaprison ops. Scalar implementations of this will IMHO rarely be used and simply cluttering things up. This can be put as a separate MixIn class if you wish for inclusion by an author, including it here is forcing a pretty bad idiom: coupling the interface and implementation, and is quite error prone because its not opt in. If an author extends and adds ops, they immediate get an implementation, which could very easiy be buggy / not work / work in unexpected ways.
I think the ops implementation is so important that this must be an explicit choice.
The point of the mixin is to avoid the boilerplate of writing __div__ - ....
and so on. NOT the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be put as a separate MixIn class if you wish for inclusion by an author,
Again, this is what this PR is doing.
To repeat from above, is might be it is just the name of ExtensionOpsBase which is not clear? (with which I agree)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback @jorisvandenbossche Maybe the naming is misleading. So let me clarify the intent of the PR.
The goal here is that if someone extends ExtensionArray
(let's call it MyArray
for sake of discussion) where they are storing objects (let's call them MyClass
) that already have the arithmetic and/or comparison operators defined for the class MyClass
, they can use these mixins to get those operators defined on the MyArray
subclass by simply adding the mixin as one of the base classes of MyArray
.
An example is Decimal
. The operators are defined there, so the mixin gives you the operators on DecimalArray
"for free". I also have my own application that extends ExtensionArray
and it uses the same principal.
The use of these 3 classes (ExtensionOpsBase
, ExtensionArithmeticMixin
and ExtensionComparisonMixin
) is optional for people who are writing their own subclass of ExtensionArray
. The ExtensionOpsBase
class is just a base class for the 2 mixin classes. It contains the create_method()
method that allows one to create an individual operator, under the assumption that MyClass
has the operator defined. Aside from the use of the two mixin classes that define all of the operators, ExtensionOpsBase.createmethod()
can be used to define individual methods. So, for example, if the MyClass
class only supports addition, you can choose to not use the mixin, and just do something like the following in MyArray
to create the operator.
__add__ = ExtensionOpsBase.create_method(operator.add)
IMHO, it seems that @jorisvandenbossche understands what I'm trying to do, but @jreback may not fully understand it.
If you think there is a better name for ExtensionOpsBase
, I'm open to suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche I've tried a variety of implementations without the base class, and couldn't get it to work. I'll rename as you suggest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renaming looks ok to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Next commit will use the renaming
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed on the naming conventions. Maybe the NotImplemented versions of the methods belong on the base ExtensionArray
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel The NotImplemented versions are there by default for any class, so I don't think we need to do anything. In other words, if you write a class MyClass
that extends ExtensionArray
and you don't define the methods for operators, they are by default NotImplemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates!
(didn't look at the tests yet)
pandas/core/arrays/base.py
Outdated
operator using the class method create_method() | ||
""" | ||
@classmethod | ||
def create_method(cls, op): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this a private method? (_create_method
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche See my reply here #21261 (comment) as to why I think it should not be private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fine to have it private for users (and I think that is better, as a end-user should never have to use this), but have it documented for extension array authors. We have other methods like that on the ExtensionArray class as well, like _from_sequence
.
We just need to clearly document it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche OK, that makes sense. I'll make that change.
pandas/core/arrays/base.py
Outdated
Can also be used to define an individual method for a specific | ||
operator using the class method create_method() | ||
""" | ||
@classmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line above this for PEP8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It passed the PEP8 tests....... But I will add it in.
pandas/core/arrays/base.py
Outdated
def create_method(cls, op): | ||
""" | ||
A class method that returns a method that will correspond to an | ||
operator for an ExtensionArray subclass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mention that is to dispatch to the relevant operator defined on the scalars
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pandas/core/arrays/base.py
Outdated
----- | ||
Given an ExtensionArray subclass called MyClass, use | ||
|
||
mymethod = create_method(my_operator) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would give here an actual example (eg __add__ = ExtensionOpsBase._create_method(operators.add)
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pandas/core/arrays/base.py
Outdated
def _binop(self, other): | ||
def convert_values(parm): | ||
if isinstance(parm, ExtensionArray): | ||
ovalues = list(parm) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it needed to convert to a list? ExtensionArrays support iterating through them (which returns the scalars)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
pandas/core/arrays/base.py
Outdated
except TypeError: | ||
msg = ("ExtensionDtype invalid operation " + | ||
"{opn} between {one} and {two}") | ||
raise TypeError(msg.format(opn=op_name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I would not care catching the error.
Eg for the decimal test case:
In [8]: import decimal
In [9]: d = decimal.Decimal('1.1')
In [10]: d + 'a'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-c0db2fc01a1f> in <module>()
----> 1 d + 'a'
TypeError: unsupported operand type(s) for +: 'decimal.Decimal' and 'str'
which is already a nice error (and IMO even clearer, because it does not contain references to "Extension" which normal users don't necessarily are familiar with)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Fixed.
pandas/core/arrays/base.py
Outdated
one=type(lvalues), | ||
two=type(rvalues))) | ||
|
||
res_values = com._values_from_object(res) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this does anything for a list (it just passes it through), so this can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I removed it and things still work. Had copied this from elsewhere.
pandas/core/ops.py
Outdated
|
||
method = getattr(left.values, op_name, None) | ||
if method is not None: | ||
res_values = method(right) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as before, can we do here op(left, right)
instead of getting the method manually ? (similar as what happens in dispatch_to_index_op
) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made that change. Thanks.
pandas/core/ops.py
Outdated
res_values = method(right) | ||
if method is None or res_values is NotImplemented: | ||
msg = "ExtensionArray invalid operation {opn} between {one} and {two}" | ||
raise TypeError(msg.format(opn=op_name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given my comment above, this should not be needed (as the ExtensionArray op will raise the appropriate error)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that makes sense. I changed it.
pandas/core/arrays/base.py
Outdated
|
||
try: | ||
res_values = self._from_sequence(res_values) | ||
except TypeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it makes sense to have a separate _make_comparision_op
and _make_aithmetics_op
?
The the comparison one should not need to try the conversion to ExtensionArray (if you want this to happen like in your case, you can still use the arithmetic one to add the comparison ops manually on your class)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. But if someone else had the same use case as me, where the comparison ops need to return an object rather than a boolean, then they'd have to know the workaround you suggest, and so then the question is whether we document it.
I'm leaving this as is for now, but will accept if you really want me to change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One reason to split it up would be to avoid the extra overhead to try to convert it to an ExtensionArray in case of the boolean ops (although for a good implementation this probably should not be too much overhead ..)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've done the split for the next commit. I added a parameter to create_method()
to specify whether to try the coercion to the underlying dtype, and said to not do it for the comparison operators.
This needs some tests, seee: https://github.com/pandas-dev/pandas/pull/21160/files#diff-f12c8797eb9d98e36fd196d5e0821f75 for a generic way to do this. FInally I would suggest that the implementation of Categorical use this (can be a followup PR) to validate this approach. Too much code has been added w/o adequate testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs quite a bit of splitting / work
pandas/core/arrays/base.py
Outdated
|
||
return res_values | ||
|
||
name = '__{name}__'.format(name=op_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche maybe I wasn't clear.
ExtensionOpsBase
should simply have NotImplementedError for the actual implementation of the arithmethic and comaprison ops. Scalar implementations of this will IMHO rarely be used and simply cluttering things up. This can be put as a separate MixIn class if you wish for inclusion by an author, including it here is forcing a pretty bad idiom: coupling the interface and implementation, and is quite error prone because its not opt in. If an author extends and adds ops, they immediate get an implementation, which could very easiy be buggy / not work / work in unexpected ways.
I think the ops implementation is so important that this must be an explicit choice.
The point of the mixin is to avoid the boilerplate of writing __div__ - ....
and so on. NOT the implementation.
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -14,6 +14,7 @@ Other Enhancements | |||
^^^^^^^^^^^^^^^^^^ | |||
- :func:`to_datetime` now supports the ``%Z`` and ``%z`` directive when passed into ``format`` (:issue:`13486`) | |||
- :func:`to_csv` now supports ``compression`` keyword when a file handle is passed. (:issue:`21227`) | |||
- ``ExtensionArray`` has a ``ExtensionOpsMixin`` factory that allows default operators to be defined (:issue:`20659`, :issue:`19577`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make a separate sub-section. pls expand this a bit, I know what you mean, but I doubt the average reader does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the note here still accurate or has the factory been changed to just mixins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've written a section for whatsnew
pandas/core/arrays/base.py
Outdated
@@ -610,3 +618,121 @@ def _ndarray_values(self): | |||
used for interacting with our indexers. | |||
""" | |||
return np.array(self) | |||
|
|||
|
|||
class ExtensionOpsBase(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should not contain a scalar based implemetation, if you really want to add that make it a separate MixIin (inherit from object)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note this is currently not a "base" class as what you have in mind, so it might be you just find the naming wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it IS a Base class for the arithemetic / comparison ops. This is the wrong place to put this logic. This needs to return NotImplementeError. as I have said multiple times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback see my comment #21261 (comment) . The class ExtensionOpsBase
(and I'm open to new names) is not meant to be used directly as a parent class. So why would it need to return NotImplementedError? In fact, if someone just puts ExtensionOpsBase
as a parent, it will do nothing beyond defining create_method()
.
pandas/core/arrays/base.py
Outdated
return set_function_name(_binop, name, cls) | ||
|
||
|
||
class ExtensionArithmeticMixin(ExtensionOpsBase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inherit from object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will need to be clearer what your objection is. In the way @Dr-Irv implemented it, he needs to subclass the other class, to have the create_method
implementation. If you don't like the way how it is currently split in three classes, be specific about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same point as above create_method
needs to return NotImplementedError in this class, hence you don't need a base class. I am being pedantic on purpose here. What you are doing is coupling way to tight the scalar implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback can't inherit from object, as the set_function_name
call in create_method()
needs to know the parent class. (or maybe you know some other way to do this)
I don't see the point that create_method
needs to return NotImplementedError, as the whole point of what I'm doing here is to just provide the capability to have the operators on ExtensionArray
use the underlying operators of the elements on the array.
pandas/core/arrays/base.py
Outdated
__div__ = ExtensionOpsBase.create_method(operator.div) | ||
__rdiv__ = ExtensionOpsBase.create_method(ops.rdiv) | ||
|
||
__divmod__ = ExtensionOpsBase.create_method(divmod) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rdivmod is a thing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
pandas/core/series.py
Outdated
if self_is_ext and not is_categorical_dtype(self.values): | ||
try: | ||
new_values = self._values._from_sequence(new_values) | ||
except TypeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you catching a TypeError?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is also for the other PR, but: because the result might not necessarily be an ExtensionArray.
This is of course a bit an unclear area of combine
, but currently there is no restriction on the function that the dtype needs to be preserved in the result of the function.
assert all(isinstance(v, decimal.Decimal) for v in values) | ||
for val in values: | ||
if not isinstance(val, self.dtype.type): | ||
raise TypeError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
erorr message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed (and also in pandas/tests/extension/json/array.py)
|
||
|
||
class TestOperator(BaseDecimal, TestSeriesOperators): | ||
@cache_readonly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
never do this. use a fixture
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unavoidable, because I am subclassing the tests.series.tests_operators.TestSeriesOperators
class. That class inherits from tests.series.common.TestData
, which defines ts
as a @cache_readonly
property. The method I use for testing (TestSeriesOperators.test_operators()
) has references to self.ts
, i.e., it doesn't use a fixture, and I can't see how to override that definition so we get a DecimalArray
series and reuse all that testing code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind, I redid the tests as you suggested elsewhere
return pd.Series(vals._from_sequence([abs(i) for i in vals])) | ||
else: | ||
return abs(v) | ||
context = decimal.getcontext() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what the heck is all this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am reusing the tests that are in the class pandas/tests/series/test_operators/TestSeriesOperators
. For decimal, the default when dividing by zero is to raise an error, as opposed to in numpy, where inf
is returned. So by changing the defaults for Decimal
, we get the same behavior as numpy.
There is a related issue with respect to abs()
, because the tests in TestSeriesOperators
call np.abs()
, but that doesn't work for the Decimal
class.
@@ -230,3 +230,21 @@ def test_groupby_extension_agg(self, as_index, data_for_grouping): | |||
super(TestGroupby, self).test_groupby_extension_agg( | |||
as_index, data_for_grouping | |||
) | |||
|
|||
|
|||
def test_ops(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than try to create one off tests, you need a genearlized testing mechanism dispatched in BaseOps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I copied your code and made some modifications to it to have a general testing mechanism.
Hello @Dr-Irv! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on June 29, 2018 at 00:18 Hours UTC |
@jreback @jorisvandenbossche @jbrockmendel I have just pushed a new commit that addresses a number of issues raised above, including:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates!
Added some more comments
pandas/core/arrays/base.py
Outdated
|
||
Parameters | ||
---------- | ||
op: An operator that takes arguments op(a, b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use numpydoc standard formatting?
op : function
An operator that takes arguments op(a, b)
coerce_to_dtype : bool
boolean indicating whether to attempt to convert ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pandas/core/arrays/base.py
Outdated
def convert_values(parm): | ||
if isinstance(parm, ExtensionArray) or is_list_like(parm): | ||
ovalues = parm | ||
elif is_extension_array_dtype(parm): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A Series is also 'list-like', so I think this elif
will not be reached.
But given that you simply iterate through it (which will be the same for series or extensionarray), I think you can simply remove that part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pandas/core/arrays/base.py
Outdated
else: # Assume its an object | ||
ovalues = [parm] * len(self) | ||
return ovalues | ||
lvalues = convert_values(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be needed as the calling object will always be already an ExtensionArray?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Fixed
pandas/core/arrays/base.py
Outdated
op_name = ops._get_op_name(op, False) | ||
|
||
def _binop(self, other): | ||
def convert_values(parm): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor detail, but I find 'values' (or 'param') is clearer than 'parm'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
pandas/core/arrays/base.py
Outdated
|
||
return res | ||
|
||
name = '__{name}__'.format(name=op_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can also pass True to _get_op_name
, and then that function does this formatting for you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
pandas/tests/extension/base/ops.py
Outdated
# scalar | ||
op = all_arithmetic_operators | ||
s = pd.Series(data) | ||
self.check_op(s, op, 1, exc=TypeError) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use s[0]
as the scalar? (so to have the same type) That might be more general likely to work.
pandas/tests/extension/base/ops.py
Outdated
def _compare_other(self, data, op, other): | ||
s = pd.Series(data) | ||
|
||
if op in '__eq__': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in -> == ?
But it's not fully clear to me what you are testing here. You are testing that the equality operators are not implemented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is from the @jreback implementation. Let me study it in more detail and try it another way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche So I studied this, and I think the intent here is to say that by default, the ExtensionArray
eq
and ne
operators are not implemented, but that doing an eq
or ne
operation on a Series
containing the ExtensionArray
produces a result.
But I have made some changes to these tests to clarify that we are testing operator names versus operators.
pandas/tests/extension/base/ops.py
Outdated
from .base import BaseExtensionTests | ||
|
||
|
||
class BaseOpsTests(BaseExtensionTests): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can split this in arithmetic and compare ops? So it is easier to fully skip one of the two, depending on what your class implements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will do that.
# scalar | ||
op_name = all_arithmetic_operators | ||
s = pd.Series(data) | ||
self.check_op(s, op_name, decimal.Decimal(1.5)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you update the scalar in the base test, I think this one can be inherited?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that was possible
@@ -1216,11 +1216,11 @@ def test_neg(self): | |||
def test_invert(self): | |||
assert_series_equal(-(self.series < 0), ~(self.series < 0)) | |||
|
|||
def test_operators(self): | |||
def test_operators(self, absfunc=np.abs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this change related to the PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was leftover from the previous method of testing, so I will revert it back.
doc/source/whatsnew/v0.24.0.txt
Outdated
``ExtensionArray`` operator support | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
An ``ExtensionArray`` subclass consisting of objects that have arithmetic and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you include most of this in extending.rst
as well?
It'd be good to mention the performance implications of using these mixins.
@jorisvandenbossche @TomAugspurger Latest commit reorganizes the tests (based on your suggestions), adds docs to extending.rst, and most importantly, fixes a bug in how the reverse operators were called. |
as i did in my impl this is so much simpler to just make class methods then you opt in by MyExtensionArray._add_numeric_ops() and just override the _create_method |
@jreback YOu wrote:
OK, but that's not a true mixin. I'll do as you suggest. |
@jreback so what you propose (from a user perspective) is to have a user do
(where the Mixin implements the
to be clear on the API you propose? |
@jbrockmendel wrote:
Yes, that is exactly my use case! For @jreback and @jorisvandenbossche , latest commit does it the way that @jreback suggests |
pandas/core/arrays/base.py
Outdated
def returnNotImplemented(self, other): | ||
return NotImplemented | ||
def addArithmeticOps(cls): | ||
cls.__add__ = cls._create_method(operator.add) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name these like
_add_arithmetic_ops and so on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a user perspective, I personally slightly prefer the mixin classes to having to call the class method. But if this is a way we can agree on to move forward, then that's fine!
pandas/core/arrays/base.py
Outdated
A base class for linking the operators to their dunder names | ||
""" | ||
@classmethod | ||
def addArithmeticOps(cls): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use PEP8 and make it private (_add_arithmetic_ops
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Somehow flake8 didn't pick it up for me when I ran it this time.
pandas/core/arrays/base.py
Outdated
cls.__lt__ = cls._create_method(operator.lt, False) | ||
cls.__gt__ = cls._create_method(operator.gt, False) | ||
cls.__le__ = cls._create_method(operator.le, False) | ||
cls.__ge__ = cls._create_method(operator.ge, False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The False
here is quite specific to the _create_method
implementation of the ScalarOpsMixin, so this doesn't feel completely "clean".
So you could also have a _create_arithmetic_method
and _create_comparison_method
class methods, and in your scalar mixing below you can reuse the same function for both those class methods (with a different value of the keyword).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some quick comments, didn't look in detail again
doc/source/whatsnew/v0.24.0.txt
Outdated
For the second approach, which is appropriate if your ExtensionArray contains | ||
elements that already have the operators | ||
defined on a per-element basis, pandas provides a mixin, | ||
:class:`ExtensionScalarOpsMixin` that you can use that can |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"that you can use that can define" -> "that you can use to define"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
doc/source/extending.rst
Outdated
for ``MyExtensionElement``, the second approach will automatically | ||
define the operators for ``MyExtensionArray``. | ||
|
||
A mixin class, :class:`~pandas.api.extensions.ExtensionScalarOpscMixin` supports this second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo in "Opsc"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pandas/core/arrays/base.py
Outdated
insert the lines | ||
|
||
MyExtensionArray.addArithmeticOperators() | ||
MyExtensionArray.addComparisonOperators() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left-over of the old name
else: | ||
# We know that left is not ExtensionArray and is Series and right is | ||
# ExtensionArray. Want to force ExtensionArray op to get called | ||
res_values = op(list(left.values), right.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was this needed to fix failing tests? (this was not here in a previous version I think? and eg the dispatch to index is only dealing with the left case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was added a few commits ago. I discovered that the tests were not properly testing the reverse operators. So there were a couple of changes related to that:
- Changed
pandas/tests/extension/base/ops.py:BaseOpsUtil
to properly test the reverse operator. - Changed the tests in
pandas/core/ops.py:_arith_method_SERIES.wrapper()
pandas/core/ops.py:_comp_method_SERIES.wrapper()
to check if the "right" argument is an extension array - The change you quoted above to detect when the reverse operator is called.
Without these changes, an operator such as pd.Series([1,2,3]) + pd.Series(ExtensionArray(data))
would return an object
dtype rather than the ExtensionArray dtype.
As best as I can tell, dispatch_to_index_op
is only used on datetime, timedelta and Categorical objects. In those cases, if the index is on the right and some other type is on the left, the other type returns NotImplemented
, which causes python to then call the reverse operator on the right argument. With extension types, I think we should allow a Series of any object to be on the left, and then have the extension operator figure out if the operation is valid.
Another possible implementation might be to add a test in the wrappers that says that if is_extension_dtype(right) return NotImplemented, which will then make python call the reverse operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possible implementation might be to add a test in the wrappers that says that if is_extension_dtype(right) return NotImplemented
This is much closer to what we've had in mind with the recent refactoring in ops
. Would this look something like splitting up the conditions on 1082-1085? We might need to up-cast right
to be a Series
in order for the deference chain to go through; I'll need to double-check.
Why is the list(left.values)
necessary? I would have thought left.values
would get the job done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel Wrote:
Another possible implementation might be to add a test in the wrappers that says that if is_extension_dtype(right) return NotImplemented
This is much closer to what we've had in mind with the recent refactoring in ops. Would this look something like splitting up the conditions on 1082-1085? We might need to up-cast right to be a Series in order for the deference chain to go through; I'll need to double-check.
Yes, I think it would be something like:
elif is_extension_array_dtype(left):
return dispatch_to_extension_op(op, left, right)
elif (is_extension_array_dtype(right) and
not is_categorical_dtype(right)):
return NotImplemented
But I would have to test the concept. I don't think the upcast is necessary. We know that left is a Series. So python would then see that op(left, right)
returns NotImplemented
, and then try to call reverse_op(right, left)
, so that the reverse operator is called, but using the implementation of right
.
I'm not going to try making that change without feedback from others.
Why is the list(left.values) necessary? I would have thought left.values would get the job done.
Because left.values
could be any type (numpy.ndarray, pd.Categorical, etc.), and we don't want the version of op to be called that is associated with the class of left.values . So by making it a list, we are forcing the reverse operator to be called on right.values, which is the EA implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possible implementation might be to add a test in the wrappers that says that if is_extension_dtype(right) return NotImplemented, which will then make python call the reverse operator.
I think this is the better way to go.
For me it is fine to not yet do this in this PR, but then I would also not include the above change (and live with that the first iteration in this PR is not yet ideal for cases where the ExtensionArray is the right variable)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I make that change (using NotImplemented)?
I would at least try this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche Yes, I will try this in a new PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche (and @jbrockmendel) I tried the NotImplemented
idea and it doesn't work. I think the reason is as follows. Let's say that we have one Series
called s1
that has a normal (int
or double
) dtype. And s2
has a dtype of the EA. Let's say you write the expression s1 + s2
. Python sees that you are adding two Series
objects. So internally, if we return NotImplemented
because s2
is an EA, then Python will say "Well, you don't know how to add a Series to a Series, so I'm not going to bother calling the reverse operator." In other words, the reverse operator is only called when the two types passed to the operator are different, and NotImplemented
is returned when applying the operator to the first argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure I follow. Are you saying that both s1.__add__(s2)
and s2.__add__(s1)
return NotImplemented
?
The dispatch logic should do something like
out = s1.values + s2
return Series(out, name=...)
It is s1.values.__add__
that should return NotImplemented
in this case. Is this not an option for some reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel I've tried two different alternatives (regarding replacing this code
else:
# We know that left is not ExtensionArray and is Series and right is
# ExtensionArray. Want to force ExtensionArray op to get called
res_values = op(list(left.values), right.values)
that this discussion refers to). Note the section of code is hit when left.values
is not an EA, but right.values
is an EA.
- Use
else:
return NotImplemented
- Use
else:
res_values = op(left.values, right.values)
In the first case, python sees that s1.__add__(s2)
is NotImplemented
, and because s1
and s2
are both of type Series
, then s2.__radd__(s1)
is never called.
In the second case, left.values
is a numpy
array, so then the numpy
__add__
method is called, and it is perfectly happy to return an array of dtype
that is set to object
rather than the dtype of the EA. numpy
ends up calling the __add__
method for each of the elements.
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -12,6 +12,45 @@ New features | |||
|
|||
.. _whatsnew_0240.enhancements.other: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this label belongs to the title below your added content
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pandas/core/arrays/base.py
Outdated
|
||
Parameters | ||
---------- | ||
op: function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space after "op" and before the colon (numpy docstring standard, long version: http://pandas.pydata.org/pandas-docs/stable/contributing_docstring.html)
and the same below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
||
def _binop(self, other): | ||
def convert_values(param): | ||
if isinstance(param, ExtensionArray) or is_list_like(param): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make this is_list_like(param)
more strict as is_array_like
?
For example, if you create ExtensionArray of sets, and do an operation where the right value is a single set, the current code will not work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or similar with dicts (for which we already have a dummy implementation in the tests)
(I am also fine with leaving this for later, as there are other places where we have problems with iterable scalar elements)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche I think I want to leave it for now. Because you'd like to be able to do an operation such as EABackedSeries + list(objects)
and using is_array_type
means you have to have a dtype.
try: | ||
res = self._from_sequence(res) | ||
except TypeError: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this fails, I think we should still convert it to an array instead of keeping it as a list?
Or does that happen on another level?
(anyway, also for usability with the ExtensionArray itself, I think coercing it to an array makes more sense)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we convert to an array, then we could have a dtype problem. This allows the result to be of any type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we convert to an array, then we could have a dtype problem
It will be converted to an array anyhow, if not here, then at the level above when the res
is passed to the series constructor. So personally I would already do the conversion here (we can use the same logic / code as what is done in the series constructor to coerce a passed list)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche But why repeat that logic? If we leave it as a list, then the Series constructor will do the inference on the dtype.
|
||
return res | ||
|
||
op_name = ops._get_op_name(op, True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use a parameter name instead of the positional argument ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could do that, but not specifying the parameter is consistent with all the other usages of _get_op_name
, so I think I should be consistent with code that is elsewhere.
pandas/tests/extension/base/ops.py
Outdated
|
||
def test_compare_scalar(self, data, all_compare_operators): | ||
op_name = all_compare_operators | ||
self._compare_other(data, op_name, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the arithmetic tests above you convert data
to s = pd.Series(data)
to perform the test on. Should do that here as well?
Also, should we compare with data[0]
instead of 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see that here that is done in the _compare_other
. However, I would move that here to a) be consistent and b) be more explicit by looking at the single test code that it are the Series methods we are testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to compare to data[0]
, because these are the default tests, and if we use data[0]
, then the tests for __ne__
and __eq__
become messed up. By using 0 by default, nothing should be equal to 0, making the __ne__
and __eq__
tests easier to do. See test_categorical.py
for where we just override _compare_other()
to make it work.
pandas/tests/extension/base/ops.py
Outdated
else: | ||
|
||
# array | ||
getattr(data, op_name)(other) is NotImplementedError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this return NotImplementedError?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copied code from @jreback . Changed now to have an assert and that it is NotImplemented
pandas/tests/extension/base/ops.py
Outdated
def _compare_other(self, data, op_name, other): | ||
s = pd.Series(data) | ||
|
||
if op_name == '__eq__': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should do here something similar as in the Arithmetic tests to convert the __eq__
to operator.eq
(so to not test s.__eq__(other)
directly but to test s == other
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a mix. We want to test that by default the attributes of the op_name are not there, but we still want to call the operators directly and see that they raise TypeError
(except for eq and ne which are defined)
|
||
def check_opname(self, s, op_name, other, exc=None): | ||
super(TestArithmeticOps, self).check_opname(s, op_name, | ||
other, exc=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed? It seems you are not changing anything in the implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is needed because the super check_opname
has a default of exc=NotImplementedError
. So this overrides that default.
@@ -1247,6 +1248,10 @@ def assert_series_equal(left, right, check_dtype=True, | |||
right = pd.IntervalIndex(right) | |||
assert_index_equal(left, right, obj='{obj}.index'.format(obj=obj)) | |||
|
|||
elif (is_extension_array_dtype(left) and not is_categorical_dtype(left) and | |||
is_extension_array_dtype(right) and not is_categorical_dtype(right)): | |||
return assert_extension_array_equal(left.values, right.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should only do this if check_dtype=True
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow your logic for this. Before, we had no code that really compared two EA's for testing. However, that code doesn't work correctly if they are categorical dtypes. The check_dtype
is a separate check, so if check_dtype
was False, you'd still do this test the same way.
@Dr-Irv ok rebased. ping on green. |
thanks! |
2018-06-29 15:51 GMT+02:00 Dr. Irv <notifications@github.com>:
***@***.**** commented on this pull request.
------------------------------
In pandas/core/arrays/base.py
<#21261 (comment)>:
> + ovalues = param
+ else: # Assume its an object
+ ovalues = [param] * len(self)
+ return ovalues
+ lvalues = self
+ rvalues = convert_values(other)
+
+ # If the operator is not defined for the underlying objects,
+ # a TypeError should be raised
+ res = [op(a, b) for (a, b) in zip(lvalues, rvalues)]
+
+ if coerce_to_dtype:
+ try:
+ res = self._from_sequence(res)
+ except TypeError:
+ pass
@jorisvandenbossche <https://github.com/jorisvandenbossche> But why
repeat that logic? If we leave it as a list, then the Series constructor
will do the inference on the dtype.
Because IMO the ExtensionArray itself should also do the sensible thing
(eg sometimes you want to act on the values, eg if you don't want
alignment). And returning a list is then not consistent and not useful.
Note that it's not fully repeating the logic, it's only reusing the
underlying functionality (in principle it should be a single helper
function, but would need to look that up), and no inference needs to happen
in the Series constructor.
|
git diff upstream/master -u -- "*.py" | flake8 --diff
Based on a discussion in #20889, this provides a mixin (via a mixin factory) that provides a default implementation of operators for
ExtensionArray
using the operators defined on the underlyingExtensionDtype
. Tested using theDecimalArray
implementation.NOTE: This requires #21183
and #21260to be accepted into master, and so the changes from those pull requests are included here.Comments from @jorisvandenbossche @TomAugspurger @jreback @jbrockmendel are welcome.