Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Series.combine() fails with ExtensionArray inside of Series #21183

Merged
merged 18 commits into from
Jun 8, 2018
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.1.txt
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,7 @@ Bug Fixes
~~~~~~~~~

- tab completion on :class:`Index` in IPython no longer outputs deprecation warnings (:issue:`21125`)
- Series.combine() no longer fails with ExtensionArray inside of Series (:issue:`20825`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Series.combine() --> :meth:`Series.combine`
  • ExtensionArray --> :class:`~pandas.api.extensions.ExtensionArray`
  • Series --> :class:`Series`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


Groupby/Resample/Rolling
^^^^^^^^^^^^^^^^^^^^^^^^
Expand Down
11 changes: 11 additions & 0 deletions pandas/core/arrays/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -610,3 +610,14 @@ def _ndarray_values(self):
used for interacting with our indexers.
"""
return np.array(self)

# ------------------------------------------------------------------------
# Utilities for use by subclasses
# ------------------------------------------------------------------------
def is_sequence_of_dtype(self, seq):
"""
Given a sequence, determine whether all members have the appropriate
type for this instance of an ExtensionArray
"""
thistype = self.dtype.type
return all(isinstance(i, thistype) for i in seq)
48 changes: 40 additions & 8 deletions pandas/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -2185,18 +2185,34 @@ def _binop(self, other, func, level=None, fill_value=None):

this_vals, other_vals = ops.fill_binop(this.values, other.values,
fill_value)

with np.errstate(all='ignore'):
result = func(this_vals, other_vals)
name = ops.get_op_result_name(self, other)

if is_extension_array_dtype(this) or is_extension_array_dtype(other):
try:
result = func(this_vals, other_vals)
except TypeError:
result = NotImplemented
except Exception as e:
raise e
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this last except block is redundant, since non-TypeError exceptions will be raised regardless of if it that block is present or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


if result is NotImplemented:
result = [func(a, b) for a, b in zip(this_vals, other_vals)]
if is_extension_array_dtype(this):
excons = type(this_vals)._from_sequence
else:
excons = type(other_vals)._from_sequence
result = excons(result)
else:
with np.errstate(all='ignore'):
result = func(this_vals, other_vals)
result = self._constructor(result, index=new_index, name=name)
result = result.__finalize__(self)
if name is None:
# When name is None, __finalize__ overwrites current name
result.name = None
return result

def combine(self, other, func, fill_value=np.nan):
def combine(self, other, func, fill_value=None):
"""
Perform elementwise binary operation on two Series using given function
with optional fill value when an index is missing from one Series or
Expand All @@ -2208,6 +2224,9 @@ def combine(self, other, func, fill_value=np.nan):
func : function
Function that takes two scalars as inputs and return a scalar
fill_value : scalar value
The default specifies to use np.nan unless self is
backed by ExtensionArray, in which case the ExtensionArray
na_value is used.

Returns
-------
Expand All @@ -2227,20 +2246,33 @@ def combine(self, other, func, fill_value=np.nan):
Series.combine_first : Combine Series values, choosing the calling
Series's values first
"""
self_is_ext = is_extension_array_dtype(self)
if fill_value is None:
if self_is_ext:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thikn na_value_for_dtype is appropriate here, probably with compat=False.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made that change.

fill_value = self.dtype.na_value
else:
fill_value = np.nan
if isinstance(other, Series):
new_index = self.index.union(other.index)
new_name = ops.get_op_result_name(self, other)
new_values = np.empty(len(new_index), dtype=self.dtype)
new_values = []
for i, idx in enumerate(new_index):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the enumerate is no longer necessary since i isn't used, so you should be able to directly iterate over new_index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

lv = self.get(idx, fill_value)
rv = other.get(idx, fill_value)
with np.errstate(all='ignore'):
new_values[i] = func(lv, rv)
new_values.append(func(lv, rv))
else:
new_index = self.index
with np.errstate(all='ignore'):
new_values = func(self._values, other)
if not self_is_ext:
with np.errstate(all='ignore'):
new_values = func(self._values, other)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't really understand (but this is related with the current implementation, not your changes) why we don't do it element-wise here (no loop over the values as is the case if other is Series).

For me, this seems like a bug in the current implementation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche You're correct. I created a new issue #21248 . I will fix that here.

else:
new_values = [func(lv, other) for lv in self._values]
new_name = self.name

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put a comment on what is going on here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if (self_is_ext and self.values.is_sequence_of_dtype(new_values)):
new_values = self._values._from_sequence(new_values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under what conditions is self.values.is_sequence_of_dtype(new_values) false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

combine takes a func as an argument. The func may not necessarily return objects of the ExtensionDtype. For example, consider a logical operator. So here self.values is the original ExtensionArray, but new_values is the result of applying the func element by element, and those individual results aren't necessarily of the corresponding ExtensionDtype.

Copy link
Contributor

@TomAugspurger TomAugspurger May 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

How important is it to allow coercion of output type? The previous code certainly considered dtype-preserving functions to be the expected case, since the pre-allocated new_dtype had the same dtype as self. Only if necessary was it coerced.

Without having studied the uses of combine, can we say "extension arrays are not allowed to upcast in combine?". i.e. Series[ExtensionArray].combine(Series[Any]) -> Series[ExtensionArray]. That doesn't sound quiet right...

Anyway, my aversion is to having to perform a full scan of the data just to determine the dtype. That's what types are for :) Can we ask the user to provide an output_dtype argument?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider the following (using the implementation in this PR):

In [1]: import pandas as pd
   ...: from pandas.tests.extension.decimal.array import DecimalArray, make_dat
   ...: a
   ...:

In [2]: da1= make_data()
   ...: da2= make_data()
   ...: da1[:5], da2[:5]
   ...:
Out[2]:
([Decimal('0.9100374922132099531069115982973016798496246337890625'),
  Decimal('0.559003605365540945371094494475983083248138427734375'),
  Decimal('0.31951366993722529752375294265220873057842254638671875'),
  Decimal('0.238154945978455767630066475248895585536956787109375'),
  Decimal('0.5851317119327676952167394119896925985813140869140625')],
 [Decimal('0.12525543165059660477567149428068660199642181396484375'),
  Decimal('0.68213474143447905273518472313298843801021575927734375'),
  Decimal('0.29982507800002611286771525556105189025402069091796875'),
  Decimal('0.297029189226840184545608281041495501995086669921875'),
  Decimal('0.5969224093736389402664599401759915053844451904296875')])

In [3]: s1 = pd.Series(DecimalArray(da1))
   ...: s2 = pd.Series(DecimalArray(da2))
   ...: pd.DataFrame({'s1':s1, 's2':s2}).head()
   ...:
Out[3]:
                                                  s1
                     s2
0  0.91003749221320995310691159829730167984962463...  0.12525543165059660477567149428068660199642181...
1  0.55900360536554094537109449447598308324813842...  0.68213474143447905273518472313298843801021575...
2  0.31951366993722529752375294265220873057842254...  0.29982507800002611286771525556105189025402069...
3  0.23815494597845576763006647524889558553695678...  0.29702918922684018454560828104149550199508666...
4  0.58513171193276769521673941198969259858131408...  0.59692240937363894026645994017599150538444519...

In [4]: s1.dtype, s2.dtype
Out[4]:
(<pandas.tests.extension.decimal.array.DecimalDtype at 0x15bc5ef1048>,
 <pandas.tests.extension.decimal.array.DecimalDtype at 0x15bc5ef1048>)

In [5]: cores = s1.combine(s2, lambda x1, x2: x1 if x1 < x2 else x2)
   ...: cores.head()
   ...:
Out[5]:
0    0.12525543165059660477567149428068660199642181...
1    0.55900360536554094537109449447598308324813842...
2    0.29982507800002611286771525556105189025402069...
3    0.23815494597845576763006647524889558553695678...
4    0.58513171193276769521673941198969259858131408...
dtype: decimal

In [6]: clres = s1.combine(s2, lambda x1, x2: (x1 < x2))
   ...: clres.head()
   ...:
Out[6]:
0    False
1     True
2    False
3     True
4     True
dtype: bool

Note that with the implementation as in this PR, we get a Series of dtype decimal since the results are all Decimal objects, and we get a boolean Series when the results are logical.

The previous behavior would product object, which is not what you want. Given a sequence s, if you do pd.Series(s), the internals of pandas look at the sequence s to determine the dtype. But those internals won't pick things up correctly for things that are ExtensionDtype. So somehow, we need to know to call ExtensionArray._from_sequence() to get the result of combine into the right type.

The implementation is already doing this element-by-element, so we are doing a full scan of both the left and right arrays. This is an extra scan on the result.

We can add an output_dtype argument to combine, but then we'd be changing the API of combine and would need to decide what the default value of that parameter is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather not do this here at all, prefering instead to dispatch to the EA itself for a .combine() method, this will be much simpler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of introducing the is_sequence_of_dtype check, an alternative might be to try: .. except _from_sequence()?

I would prefer not to add combine logic to the extension array

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche My most recent commit has this change you suggested and removes is_sequence_of_dtype. It required an extra check to see if it is categorical. I also added a test which revealed a couple of other bugs.

In terms of whether combine should be in the EA class or in series.py, you guys will have to decide on that.


return self._constructor(new_values, index=new_index, name=new_name)

def combine_first(self, other):
Expand Down
13 changes: 13 additions & 0 deletions pandas/tests/extension/category/test_categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,9 @@

import pytest
import numpy as np
import pandas as pd

import pandas.util.testing as tm

from pandas.api.types import CategoricalDtype
from pandas import Categorical
Expand Down Expand Up @@ -157,3 +160,13 @@ def test_value_counts(self, all_data, dropna):

class TestCasting(base.BaseCastingTests):
pass


def test_combine():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be moved to be within the TestMethods class? And can you reference the GitHub issue number in a comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

orig_data1 = make_data()
orig_data2 = make_data()
s1 = pd.Series(Categorical(orig_data1, ordered=True))
s2 = pd.Series(Categorical(orig_data2, ordered=True))
result = s1.combine(s2, lambda x1, x2: x1 <= x2)
expected = pd.Series([a <= b for (a, b) in zip(orig_data1, orig_data2)])
tm.assert_series_equal(result, expected)
7 changes: 2 additions & 5 deletions pandas/tests/test_window.py
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,7 @@ def win_types(request):
return request.param


@pytest.fixture(params=['kaiser', 'gaussian', 'general_gaussian', 'slepian'])
@pytest.fixture(params=['kaiser', 'gaussian', 'general_gaussian'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge conflict?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, something went wrong with git. I didn't touch that code, but did a rebase, and I think it pulled that in.

def win_types_special(request):
return request.param

Expand Down Expand Up @@ -1079,8 +1079,7 @@ def test_cmov_window_special(self, win_types_special):
kwds = {
'kaiser': {'beta': 1.},
'gaussian': {'std': 1.},
'general_gaussian': {'power': 2., 'width': 2.},
'slepian': {'width': 0.5}}
'general_gaussian': {'power': 2., 'width': 2.}}

vals = np.array([6.95, 15.21, 4.72, 9.12, 13.81, 13.49, 16.68, 9.48,
10.63, 14.48])
Expand All @@ -1090,8 +1089,6 @@ def test_cmov_window_special(self, win_types_special):
13.65671, 12.01002, np.nan, np.nan],
'general_gaussian': [np.nan, np.nan, 9.85011, 10.71589, 11.73161,
13.08516, 12.95111, 12.74577, np.nan, np.nan],
'slepian': [np.nan, np.nan, 9.81073, 10.89359, 11.70284, 12.88331,
12.96079, 12.77008, np.nan, np.nan],
'kaiser': [np.nan, np.nan, 9.86851, 11.02969, 11.65161, 12.75129,
12.90702, 12.83757, np.nan, np.nan]
}
Expand Down
13 changes: 9 additions & 4 deletions pandas/util/testing.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@
is_categorical_dtype,
is_interval_dtype,
is_sequence,
is_list_like)
is_list_like, is_extension_array_dtype)
from pandas.io.formats.printing import pprint_thing
from pandas.core.algorithms import take_1d
import pandas.core.common as com
Expand Down Expand Up @@ -1118,10 +1118,12 @@ def assert_extension_array_equal(left, right):
right_na = right.isna()
assert_numpy_array_equal(left_na, right_na)

left_valid = left[~left_na].astype(object)
right_valid = right[~right_na].astype(object)
if len(left_na) > 0 and len(right_na) > 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the changes here? left_na is an array of booleans, so it'll always be the same length as left.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If left has length zero, then the statement that follows, i.e.

left_valid = left[~left_na].astype(object)

fails. There was some test case that was failing without this change (but I don't remember which one, and it may have been in the ops stuff). I will leave this change out and see if this fix for combine still works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this works

In [9]: pd.Series()[np.array([], dtype='bool')].astype(object)
Out[9]: Series([], dtype: object)

then we should ensure that Series[extension_array] works too. If you find a reproducible example, could you make a new issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took this code out of this PR. It's not needed here. When we get to the ops stuff, I'll have to reintroduce it, and figure out the boundary case that required this particular case.


assert_numpy_array_equal(left_valid, right_valid)
left_valid = left[~left_na].astype(object)
right_valid = right[~right_na].astype(object)

assert_numpy_array_equal(left_valid, right_valid)


# This could be refactored to use the NDFrame.equals method
Expand Down Expand Up @@ -1224,6 +1226,9 @@ def assert_series_equal(left, right, check_dtype=True,
left = pd.IntervalIndex(left)
right = pd.IntervalIndex(right)
assert_index_equal(left, right, obj='{obj}.index'.format(obj=obj))
elif (is_extension_array_dtype(left) and not is_categorical_dtype(left) and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the not is_categorical_dtype needed?

Copy link
Contributor Author

@Dr-Irv Dr-Irv May 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was left over from the ops stuff, so I have removed it.

is_extension_array_dtype(right) and not is_categorical_dtype(right)):
return assert_extension_array_equal(left.values, right.values)

else:
_testing.assert_almost_equal(left.get_values(), right.get_values(),
Expand Down