ENH: Support operators for ExtensionArray #20889

Closed
wants to merge 3 commits
21 changes: 21 additions & 0 deletions pandas/core/arrays/base.py
@@ -53,6 +53,13 @@ class ExtensionArray(object):
* factorize / _values_for_factorize
* argsort / _values_for_argsort

For logical operators, the default is to return a Series of booleans.
However, if the underlying ExtensionDtype overrides the logical
operators, then the implementer may want to have an ExtensionArray
subclass contain the result. This can be done by changing the property
_logical_result from its default value of None to the _from_sequence
Member

Why is this property needed? Can't we simply detect whether the result is a boolean numpy array or an ExtensionArray?

Contributor

Can you explain this use-case a bit more? I think we will certainly want Series <compare> Series to always be an ndarray of booleans.

Member

I can't speak for the author, but my assumption was that this has to do with some of the spaghetti-code in ops._bool_method_SERIES, where sometimes a bool-dtype is returned and other times an int-dtype is returned (and datetimelike are currently all broken, see #19972, #19759). Straightening out this mess independently of EA implementations is part of the plan referred to above.

Contributor Author

Actually, in my use case, I need the boolean operators to return an object that represents the relation. I'm using pandas on top of two different libraries (that are functionally the same) where the operators (x <= y), (x >= y) and (x == y) are not booleans, but objects representing the relations.
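A minimal sketch of that use case, with hypothetical names that are not part of this PR: comparisons yield relation objects instead of booleans, and the array opts in to keeping them wrapped by pointing _logical_result at its own _from_sequence.

from pandas.api.extensions import ExtensionArray

class Relation(object):
    # hypothetical scalar: records a comparison instead of evaluating it
    def __init__(self, left, right, op_name):
        self.left, self.right, self.op_name = left, right, op_name

class RelationArray(ExtensionArray):
    # dtype, __len__, __getitem__, _from_sequence, ... omitted for brevity
    pass

# opt in: logical/comparison results are re-wrapped via _from_sequence
RelationArray._logical_result = RelationArray._from_sequence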

method of the ExtensionArray subclass.

This class does not inherit from 'abc.ABCMeta' for performance reasons.
Methods and properties required by the interface raise
``pandas.errors.AbstractMethodError`` and no ``register`` method is
@@ -567,6 +574,9 @@ def copy(self, deep=False):
"""
raise AbstractMethodError(self)

# See documentation above
_logical_result = None

# ------------------------------------------------------------------------
# Block-related methods
# ------------------------------------------------------------------------
@@ -610,3 +620,14 @@ def _ndarray_values(self):
used for interacting with our indexers.
"""
return np.array(self)

# ------------------------------------------------------------------------
# Utilities for use by subclasses
# ------------------------------------------------------------------------
def is_sequence_of_dtype(self, seq):
Member

For what is this needed?

Contributor

Indeed, this is expected to always be true. If it isn't, I'd recommend making a superclass that has all the scalar types, like I do in https://github.com/ContinuumIO/cyberpandas/blob/c66bbecaf5193bd284a0fddfde65395d119aad41/cyberpandas/ip_array.py#L22
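A rough sketch of the superclass approach being suggested (hypothetical names, loosely following the linked cyberpandas code): give every scalar representation a common base class and point the dtype's type attribute at it, so a single isinstance check covers all variants.

class IPScalarBase(object):
    # hypothetical common base for every scalar form the array can return
    pass

class IPv4Scalar(IPScalarBase):
    def __init__(self, value):
        self.value = value

class IPv6Scalar(IPScalarBase):
    def __init__(self, value):
        self.value = value

# the ExtensionDtype would then declare:  type = IPScalarBase
# so isinstance(item, dtype.type) is True for both scalar variants.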

Contributor Author

Let's suppose that you don't implement the ExtensionArray operators as methods of the subclass of ExtensionArray, but instead let the underlying ExtensionDtype handle the operators for you (this is what I used for Decimal). Some operators will return a sequence whose elements are all of the ExtensionDtype; some operators (e.g., logical ones) will not. So internally it's useful to have a test for whether a sequence contains objects of the corresponding ExtensionDtype, so that you can return an ExtensionArray as the result; otherwise you just let things get coerced based on the types in the sequence.
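A hedged sketch of how that check could be used by a subclass that delegates operators to its scalar type (the method name _apply_op is illustrative and not part of this diff):

import numpy as np

def _apply_op(self, other, op):
    # assumes the scalar type (e.g. decimal.Decimal) already implements op
    result = [op(a, b) for a, b in zip(self, other)]
    if self.is_sequence_of_dtype(result):
        # every element is a scalar of this dtype -> re-wrap as an ExtensionArray
        return self._from_sequence(result)
    # comparisons return plain bools; let numpy coerce those instead
    return np.asarray(result)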

"""
Given a sequence, determine whether all members have the appropriate
type for this instance of an ExtensionArray
"""
thistype = self.dtype.type
return all(isinstance(i, thistype) for i in seq)
22 changes: 16 additions & 6 deletions pandas/core/indexes/base.py
@@ -3081,13 +3081,23 @@ def get_value(self, series, key):
# if we have something that is Index-like, then
# use this, e.g. DatetimeIndex
s = getattr(series, '_values', None)
if isinstance(s, (ExtensionArray, Index)) and is_scalar(key):
try:
return s[key]
except (IndexError, ValueError):
if is_scalar(key):
if isinstance(s, Index):
try:
return s[key]
except (IndexError, ValueError):

# invalid type as an indexer
pass
# invalid type as an indexer
pass
elif isinstance(s, ExtensionArray):
try:
# This should call the ExtensionArray __getitem__
iloc = self.get_loc(key)
return s[iloc]
except (IndexError, ValueError):

# invalid type as an indexer
pass

s = com._values_from_object(series)
k = com._values_from_object(key)
96 changes: 95 additions & 1 deletion pandas/core/ops.py
@@ -6,6 +6,7 @@
# necessary to enforce truediv in Python 2.X
from __future__ import division
import operator
import inspect

import numpy as np
import pandas as pd
@@ -30,7 +31,7 @@
is_bool_dtype,
is_list_like,
is_scalar,
_ensure_object)
_ensure_object, is_extension_array_dtype)
from pandas.core.dtypes.cast import (
maybe_upcast_putmask, find_common_type,
construct_1d_object_array_from_listlike)
@@ -990,6 +991,93 @@ def _construct_divmod_result(left, result, index, name, dtype):
)


def dispatch_to_extension_op(left, right, op_name=None, is_logical=False):
Member

the dispatch_to_index_op uses op instead of op_name. Is there a reason for this difference? (and I mean in the actual implementation here it assumes a method name, and not an operator function that can be called)

Contributor Author

The reason for the difference is as follows. The way I implemented this, we first look to see if the operator is defined for the ExtensionArray subclass. If not, then we use the implementation of the operator on the underlying ExtensionDtype. So if you pass op, you get the operator bound to a specific class. If you have op_name, then we can translate to either the ExtensionArray subclass implementation or the ExtensionDtype implementation.
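In other words, the lookup goes roughly like this (simplified sketch; left, rvalues and op_name are assumed to be the Series, the unboxed right-hand values, and the method name such as '__add__'):

method = getattr(left.values, op_name, None)     # op defined on the ExtensionArray subclass?
if method is not None:
    res = method(rvalues)                        # the subclass implements the op itself
else:
    # otherwise defer to the scalar ExtensionDtype objects, element by element
    res = [getattr(a, op_name)(b) for a, b in zip(left.values, rvalues)]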

Member

If the op is not defined on ExtensionArray, calling it directly (op(left_values, right_values)) will raise a TypeError that you can catch (which you already do), so I don't really see the difference

"""
Assume that left is a Series backed by an ExtensionArray,
apply the operator defined by op_name.
"""

method = getattr(left.values, op_name, None)
deflen = len(left)
excons = type(left.values)._from_sequence
exclass = type(left.values)
testseq = left.values
Contributor

I'm having trouble understanding these names. (method makes sense).

Contributor Author

excons is the constructor used to build a result from a sequence whose elements are all of the ExtensionDtype (so "ex" for "Extension" and "cons" for "constructor"). exclass is the underlying class of the ExtensionArray subclass.
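Spelled out for a Series backed by a hypothetical DecimalArray-style subclass (illustrative only):

exclass = type(left.values)        # DecimalArray, the ExtensionArray subclass
excons = exclass._from_sequence    # rebuilds a DecimalArray from a list of Decimals
deflen = len(left)                 # broadcast length used when right is a scalar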


if is_logical:
if exclass._logical_result is not None:
excons = exclass._logical_result
else:
excons = None # Indicates boolean

# The idea here is as follows. First we see if the op is
# defined in the ExtensionArray subclass, and returns a
# result that is not NotImplemented. If so, we use that
# result. If that fails, then we try an
# element by element operator, invoking the operator
# on each element

# First see if the extension array object supports the op
res = NotImplemented
if method is not None and inspect.ismethod(method):
rvalues = right
if is_extension_array_dtype(right) and isinstance(right, ABCSeries):
rvalues = right.values
try:
res = method(rvalues)
except TypeError:
pass
except Exception as e:
raise e
Member

Why not do op(left.values, right/right.values)? What does this manual checking/trying do that the former does not?

Contributor Author

See above. In the code above, I'm testing to see if the ExtensionArray subclass has the operator defined. op and method are the same, as method is computed as the operator on left.values. I could change the name of the variable from method to op to make this clearer.


def convert_values(parm):
if is_extension_array_dtype(parm):
ovalues = parm.values
elif is_list_like(parm):
ovalues = parm
else:  # Assume it's an object
ovalues = [parm] * deflen
return ovalues

if res is NotImplemented:
Contributor

Could you explain this fallback a bit more?

If the EA doesn't define ops, then I'm perfectly fine with raising NotImplementedError at the end.

Contributor Author

See above. The idea here is that either the EA defines the ops, or the ExtensionDtype defines the ops.

The idea here is that if you know that your underlying ExtensionDtype already has the ops defined, you don't have to implement each of the ops at the ExtensionArray level.

I used DecimalArray as an example. The operators are already defined for Decimal, so there is no reason to implement them again for an array of Decimals.
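For instance, a quick sketch with the standard decimal module (not the literal test code from this PR):

from decimal import Decimal

left = [Decimal('1.5'), Decimal('2.5')]
right = [Decimal('0.5'), Decimal('0.5')]
# Decimal already implements the arithmetic operators, so the element-wise
# fallback can just call them and collect the results:
results = [a + b for a, b in zip(left, right)]   # [Decimal('2.0'), Decimal('3.0')]
# DecimalArray._from_sequence(results) would then re-wrap them as an array.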

# Try it on each element. Support operation to another
# ExtensionArray, or something that is list like, or
# a single object. This allows a result of an operator
# to be an object or any type
lvalues = convert_values(left)
rvalues = convert_values(right)

# Get the method for each object.
def callfunc(a, b):
f = getattr(a, op_name, None)
if f is not None:
return f(b)
else:
return NotImplemented
res = [callfunc(a, b) for (a, b) in zip(lvalues, rvalues)]

# We can't use (NotImplemented in res) because the
# results might be objects that have overridden __eq__
if any(isinstance(r, type(NotImplemented)) for r in res):
msg = "invalid operation {opn} between {one} and {two}"
raise TypeError(msg.format(opn=op_name,
one=type(lvalues),
two=type(rvalues)))

# At this point we have the result
# always return a full value series here
res_values = com._values_from_object(res)
if excons is not None:
if testseq.is_sequence_of_dtype(res_values):
# Convert to the ExtensionArray type if each result is of that
# type. If _logical_result was not None, this will then use
# the function set there to return an appropriate result
res_values = excons(res_values)

res_name = get_op_result_name(left, right)
return left._constructor(res_values, index=left.index,
name=res_name)


def _arith_method_SERIES(cls, op, special):
"""
Wrapper function for Series arithmetic operations, to avoid
Expand Down Expand Up @@ -1058,6 +1146,9 @@ def wrapper(left, right):
raise TypeError("{typ} cannot perform the operation "
"{op}".format(typ=type(left).__name__, op=str_rep))

elif is_extension_array_dtype(left):
return dispatch_to_extension_op(left, right, op_name)

lvalues = left.values
rvalues = right
if isinstance(rvalues, ABCSeries):
Expand Down Expand Up @@ -1208,6 +1299,9 @@ def wrapper(self, other, axis=None):
return self._constructor(res_values, index=self.index,
name=res_name)

elif is_extension_array_dtype(self):
return dispatch_to_extension_op(self, other, op_name, True)

elif isinstance(other, ABCSeries):
# By this point we have checked that self._indexed_same(other)
res_values = na_op(self.values, other.values)
48 changes: 40 additions & 8 deletions pandas/core/series.py
@@ -2185,18 +2185,34 @@ def _binop(self, other, func, level=None, fill_value=None):

this_vals, other_vals = ops.fill_binop(this.values, other.values,
fill_value)

with np.errstate(all='ignore'):
result = func(this_vals, other_vals)
name = ops.get_op_result_name(self, other)

if is_extension_array_dtype(this) or is_extension_array_dtype(other):
try:
result = func(this_vals, other_vals)
except TypeError:
result = NotImplemented
except Exception as e:
raise e

if result is NotImplemented:
result = [func(a, b) for a, b in zip(this_vals, other_vals)]
if is_extension_array_dtype(this):
excons = type(this_vals)._from_sequence
else:
excons = type(other_vals)._from_sequence
result = excons(result)
else:
with np.errstate(all='ignore'):
result = func(this_vals, other_vals)
result = self._constructor(result, index=new_index, name=name)
result = result.__finalize__(self)
if name is None:
# When name is None, __finalize__ overwrites current name
result.name = None
return result

def combine(self, other, func, fill_value=np.nan):
def combine(self, other, func, fill_value=None):
"""
Perform elementwise binary operation on two Series using given function
with optional fill value when an index is missing from one Series or
@@ -2208,6 +2224,9 @@ def combine(self, other, func, fill_value=np.nan):
func : function
Function that takes two scalars as inputs and return a scalar
fill_value : scalar value
The default is to use np.nan, unless self is backed by an
ExtensionArray, in which case the ExtensionArray's na_value is used.

Returns
-------
@@ -2227,20 +2246,33 @@
Series.combine_first : Combine Series values, choosing the calling
Series's values first
"""
self_is_ext = is_extension_array_dtype(self)
if fill_value is None:
if self_is_ext:
fill_value = self.dtype.na_value
else:
fill_value = np.nan
if isinstance(other, Series):
new_index = self.index.union(other.index)
new_name = ops.get_op_result_name(self, other)
new_values = np.empty(len(new_index), dtype=self.dtype)
new_values = []
for i, idx in enumerate(new_index):
lv = self.get(idx, fill_value)
rv = other.get(idx, fill_value)
with np.errstate(all='ignore'):
new_values[i] = func(lv, rv)
new_values.append(func(lv, rv))
else:
new_index = self.index
with np.errstate(all='ignore'):
new_values = func(self._values, other)
if not self_is_ext:
with np.errstate(all='ignore'):
new_values = func(self._values, other)
else:
new_values = [func(lv, other) for lv in self._values]
new_name = self.name

if (self_is_ext and self.values.is_sequence_of_dtype(new_values)):
new_values = self._values._from_sequence(new_values)

return self._constructor(new_values, index=new_index, name=new_name)

def combine_first(self, other):
12 changes: 12 additions & 0 deletions pandas/tests/extension/base/getitem.py
@@ -117,6 +117,18 @@ def test_getitem_slice(self, data):
result = data[slice(1)] # scalar
assert isinstance(result, type(data))

def test_get(self, data):
# GH 20882
s = pd.Series(data, index=[2 * i for i in range(len(data))])
assert s.get(4) == s.iloc[2]

result = s.get([4, 6])
expected = s.iloc[[2, 3]]
self.assert_series_equal(result, expected)

s = pd.Series(data[:6], index=list('abcdef'))
assert s.get('c') == s.iloc[2]

def test_take_sequence(self, data):
result = pd.Series(data)[[0, 1, 3]]
assert result.iloc[0] == data[0]
13 changes: 13 additions & 0 deletions pandas/tests/extension/category/test_categorical.py
@@ -2,6 +2,9 @@

import pytest
import numpy as np
import pandas as pd

import pandas.util.testing as tm

from pandas.api.types import CategoricalDtype
from pandas import Categorical
@@ -157,3 +160,13 @@ def test_value_counts(self, all_data, dropna):

class TestCasting(base.BaseCastingTests):
pass


def test_combine():
orig_data1 = make_data()
orig_data2 = make_data()
s1 = pd.Series(Categorical(orig_data1, ordered=True))
s2 = pd.Series(Categorical(orig_data2, ordered=True))
result = s1.combine(s2, lambda x1, x2: x1 <= x2)
expected = pd.Series([a <= b for (a, b) in zip(orig_data1, orig_data2)])
tm.assert_series_equal(result, expected)