Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/Perf: Support ExtensionArrays in where #24114

Merged
merged 26 commits into from
Dec 10, 2018

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Dec 5, 2018

We need some way to do .where on EA object for DatetimeArray. Adding it
to the interface is, I think, the easiest way.

Initially I started to write a version on ExtensionBlock, but it proved
to be unwieldy. to write a version that performed well for all types.
It may be possible to do using _ndarray_values but we'd need a few more
things around that (missing values, converting an arbitrary array to the
"same' ndarary_values, error handling, re-constructing). It seemed easier
to push this down to the array.

The implementation on ExtensionArray is readable, but likely slow since
it'll involve a conversion to object-dtype.

Closes #24077
Closes #24142
Closes #16983

We need some way to do `.where` on EA object for DatetimeArray. Adding it
to the interface is, I think, the easiest way.

Initially I started to write a version on ExtensionBlock, but it proved
to be unwieldy. to write a version that performed well for all types.
It *may* be possible to do using `_ndarray_values` but we'd need a few more
things around that (missing values, converting an arbitrary array to the
"same' ndarary_values, error handling, re-constructing). It seemed easier
to push this down to the array.

The implementation on ExtensionArray is readable, but likely slow since
it'll involve a conversion to object-dtype.

Closes pandas-dev#24077
@TomAugspurger TomAugspurger added Indexing Related to indexing on series/frames, not to indexes themselves ExtensionArray Extending pandas with custom dtypes or arrays. labels Dec 5, 2018
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Dec 5, 2018
@pep8speaks
Copy link

Hello @TomAugspurger! Thanks for submitting the PR.

@TomAugspurger
Copy link
Contributor Author

categorical perf:

master

In [2]: import pandas as pd

In [3]: cser = pd.Series(pd.Categorical(['a', 'b', 'c'] * 10000))

In [4]: %timeit cser.where(cser == 'a', 'c')
1.18 ms ± 37.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

PR:

In [3]: %timeit cser.where(cser == 'a', 'c')
755 µs ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

* Ensure data generated OK.
* Remove erroneous comments about alignment. That was user error.
@codecov
Copy link

codecov bot commented Dec 5, 2018

Codecov Report

Merging #24114 into master will increase coverage by <.01%.
The diff coverage is 94.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24114      +/-   ##
==========================================
+ Coverage    92.2%    92.2%   +<.01%     
==========================================
  Files         162      162              
  Lines       51714    51782      +68     
==========================================
+ Hits        47682    47747      +65     
- Misses       4032     4035       +3
Flag Coverage Δ
#multiple 90.6% <94.44%> (ø) ⬆️
#single 43% <12.5%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/base.py 100% <100%> (ø) ⬆️
pandas/core/arrays/base.py 96.81% <100%> (+0.04%) ⬆️
pandas/core/indexes/category.py 97.88% <100%> (-0.02%) ⬇️
pandas/core/arrays/interval.py 93.16% <100%> (+0.17%) ⬆️
pandas/core/arrays/sparse.py 91.92% <88.88%> (-0.04%) ⬇️
pandas/core/internals/blocks.py 93.71% <93.33%> (-0.01%) ⬇️
pandas/core/arrays/categorical.py 95.37% <94.11%> (-0.03%) ⬇️
pandas/core/arrays/period.py 98.31% <94.44%> (-0.16%) ⬇️
pandas/util/testing.py 87.51% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f2c716...56470c3. Read the comment docs.

1 similar comment
@codecov
Copy link

codecov bot commented Dec 5, 2018

Codecov Report

Merging #24114 into master will increase coverage by <.01%.
The diff coverage is 94.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24114      +/-   ##
==========================================
+ Coverage    92.2%    92.2%   +<.01%     
==========================================
  Files         162      162              
  Lines       51714    51782      +68     
==========================================
+ Hits        47682    47747      +65     
- Misses       4032     4035       +3
Flag Coverage Δ
#multiple 90.6% <94.44%> (ø) ⬆️
#single 43% <12.5%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/base.py 100% <100%> (ø) ⬆️
pandas/core/arrays/base.py 96.81% <100%> (+0.04%) ⬆️
pandas/core/indexes/category.py 97.88% <100%> (-0.02%) ⬇️
pandas/core/arrays/interval.py 93.16% <100%> (+0.17%) ⬆️
pandas/core/arrays/sparse.py 91.92% <88.88%> (-0.04%) ⬇️
pandas/core/internals/blocks.py 93.71% <93.33%> (-0.01%) ⬇️
pandas/core/arrays/categorical.py 95.37% <94.11%> (-0.03%) ⬇️
pandas/core/arrays/period.py 98.31% <94.44%> (-0.16%) ⬇️
pandas/util/testing.py 87.51% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f2c716...56470c3. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 5, 2018

Codecov Report

Merging #24114 into master will increase coverage by <.01%.
The diff coverage is 97.43%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24114      +/-   ##
==========================================
+ Coverage   92.21%   92.21%   +<.01%     
==========================================
  Files         162      162              
  Lines       51723    51761      +38     
==========================================
+ Hits        47694    47731      +37     
- Misses       4029     4030       +1
Flag Coverage Δ
#multiple 90.61% <97.43%> (ø) ⬆️
#single 43% <7.69%> (-0.03%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/base.py 97.41% <ø> (ø) ⬆️
pandas/core/indexes/category.py 97.9% <ø> (ø) ⬆️
pandas/core/arrays/sparse.py 92.08% <ø> (ø) ⬆️
pandas/core/internals/blocks.py 93.81% <100%> (+0.13%) ⬆️
pandas/core/arrays/categorical.py 95.3% <83.33%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 029cde2...539d3cb. Read the comment docs.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Dec 5, 2018 via email

Series.where : Similar method for Series.
DataFrame.where : Similar method for DataFrame.
"""
return type(self)._from_sequence(np.where(cond, self, other),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm this turns it into an array. we have much special handling for this (e.g. see .where for DTI). i think this needs to dispatch somehow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I see you override things. ok then.

@TomAugspurger TomAugspurger mentioned this pull request Dec 5, 2018
12 tasks
@TomAugspurger
Copy link
Contributor Author

@jorisvandenbossche do you have any objections to adding ExtensionArray.where to the interface? Or do the behavior change on Series[category].where?

@jorisvandenbossche
Copy link
Member

Just for context: how is this different from eaarray[cond] = other ?

The behaviour change to keep the categorical dtype is certainly fine.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Dec 6, 2018 via email

@jorisvandenbossche
Copy link
Member

But other than that they should be identical for 1-d arrays, right?

And since EAs are 1D, and our internal EAs support setitem, why is the new code needed? Or what in setitem is not working as it should right now? (maybe I am missing some context)

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Dec 6, 2018 via email

@@ -661,6 +662,42 @@ def take(self, indices, allow_fill=False, fill_value=None):
# pandas.api.extensions.take
raise AbstractMethodError(self)

def where(self, cond, other):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other implementations of where (DataFrame.where, Index.where, etc.) have other default to NA. Do we want to maintain that convention here too?

lother = other.left
rother = other.right
left = np.where(cond, self.left, lother)
right = np.where(cond, self.right, rother)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left/right should have a where method, so might be a bit safer to do something like:

left = self.left.where(cond, lother)
right = self.right.where(cond, rother)

np.where looks like it can cause some problems depending on what left/right are:

In [2]: left = pd.date_range('2018', periods=3); left
Out[2]: DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq='D')

In [3]: np.where([True, False, True], left, pd.NaT)
Out[3]: array([1514764800000000000, NaT, 1514937600000000000], dtype=object)

@@ -777,6 +777,17 @@ def take(self, indices, allow_fill=False, fill_value=None, axis=None,

return self._shallow_copy(left_take, right_take)

def where(self, cond, other):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to have IntervalIndex use this implementation instead of the naive object array based implementation that it currently uses. Can certainly leave that for a follow-up PR though, and I'd be happy to do it.

@@ -777,6 +777,17 @@ def take(self, indices, allow_fill=False, fill_value=None, axis=None,

return self._shallow_copy(left_take, right_take)

def where(self, cond, other):
if is_scalar(other) and isna(other):
lother = rother = other
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be safe, I think this should be lother = rother = self.left._na_value to ensure that we're filling left/right with the correct NA value. If we use left/right.where instead of np.where this should be handled automatically iirc, so could maybe just do that instead.

def where(self, cond, other):
if is_scalar(other) and isna(other):
lother = rother = other
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this an elif that checks that other is interval-like (something like isinstance(other, Interval) or is_interval_dtype(other)), then have an else clause that raises a ValueError saying other must be interval-like?

As written I think this would raise a somewhat unclear AttributeError in self._check_closed_matches since it assumes other.closed exists.

@TomAugspurger
Copy link
Contributor Author

On further reflection, I realize that ndarrays don't have a where method, so I don't think we should add ExtensionArray.where.

I'll see if setitem on a copy is sufficient.

@@ -501,10 +501,13 @@ def _can_reindex(self, indexer):

@Appender(_index_shared_docs['where'])
def where(self, cond, other=None):
# TODO: Investigate an alternative implementation with
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# for the type
other = self.dtype.na_value

if is_sparse(self.values):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this, we fail in the

            result = self._holder._from_sequence(
                 np.where(cond, self.values, other),
                 dtype=dtype,

since the where may change the dtype, if NaN is introduced.

Implementing SparseArray.__setitem__ would allow us to remove this block.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be an overriding method in Sparse then, not here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have a SparseBlock anymore. I can add one back if you want, but I figured it'd be easier not to since implementing SparseArray.__setitem__ will remove the need for this, and we'd just have to remove SparseBlock again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty hacky. This was why we had originally a .get_values() methon on Sparse to do things like this. We need something to give back the underlying type of the object, which is useful for Categorical as well. Would rather create a generalized soln than hack it like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we don't need this. I think we can just re-infer the dtype from the output of np.where.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so is this changing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing from master? Yes, in the sense that it'll return a SparseArray. But it still densifies when np.where is called.

If you mean "is this changing in the future", yes it'll be removed when SparseArray.__setitem__ is implemented.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh ok, can you add a TODO comment

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Dec 7, 2018

Or what in setitem is not working as it should right now? (maybe I am missing some context)

@jorisvandenbossche OK, here's some context :) The most immediate failure is mismatched block dimensions / shapes for in DataFrame.where for EA (not just DatetimeArray, but that was tested). This uses Categorical

In [8]: df = pd.DataFrame({"A": pd.Categorical([1, 2, 3])})

In [9]: df.where(pd.DataFrame({"A": [True, False, True]}))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-56dcebf7e672> in <module>
----> 1 df.where(pd.DataFrame({"A": [True, False, True]}))

... 

~/sandbox/pandas/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
     84             raise ValueError(
     85                 'Wrong number of items passed {val}, placement implies '
---> 86                 '{mgr}'.format(val=len(self.values), mgr=len(self.mgr_locs)))
     87
     88     def _check_ndim(self, values, ndim):

ValueError: Wrong number of items passed 3, placement implies 1

The broadcasting is all messed up since the shapes aren't right (we're using Block.where).

ipdb> cond
array([[ True],
       [False],
       [ True]])
ipdb> values
[1, 2, 3]
Categories (3, int64): [1, 2, 3]
ipdb> other
nan

A hacky, but shorter fix is to use the following (this is in Block.where)

diff --git a/pandas/core/internals/blocks.py b/pandas/core/internals/blocks.py
index 618b9eb12..2356a226d 100644
--- a/pandas/core/internals/blocks.py
+++ b/pandas/core/internals/blocks.py
@@ -1319,12 +1319,20 @@ class Block(PandasObject):
 
         values = self.values
         orig_other = other
+        if not self._can_consolidate:
+            transpose = False
+
         if transpose:
             values = values.T
 
         other = getattr(other, '_values', getattr(other, 'values', other))
         cond = getattr(cond, 'values', cond)
 
+        if not self._can_consolidate:
+            if cond.ndim == 2:
+                assert cond.shape[-1] == 1
+                cond = cond.ravel()
+
         # If the default broadcasting would go in the wrong direction, then
         # explicitly reshape other instead
         if getattr(other, 'ndim', 0) >= 1:

That fixes most of the issues I was having on the DTA branch. Still running the tests to see if any were re-broken.
(edit: it's slightly more complicated, have to handle reshaping other as well, so ~15 more LOC).


So, in summary

  1. we need to support EAs in DataFrame.where
  2. The diff just above is a hacky way of achieving just that (no more).
  3. This PR will still be useful for avoiding the conversion to object (DatetimeTZBlock will avoid it even without this PR, thanks to _try_coerce_args coercing datetimes to ints)

@TomAugspurger TomAugspurger changed the title API: Added ExtensionArray.where API/BUG/Perf: Support ExtensionArrays in where Dec 7, 2018
@TomAugspurger TomAugspurger changed the title API/BUG/Perf: Support ExtensionArrays in where BUG/Perf: Support ExtensionArrays in where Dec 7, 2018
pandas/core/internals/blocks.py Outdated Show resolved Hide resolved
# for the type
other = self.dtype.na_value

if is_sparse(self.values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be an overriding method in Sparse then, not here

else:
dtype = self.dtype

# rough heuristic to see if the other array implements setitem
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again you don't actually need to do this here, rather override in the appropriate class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will still need the check for extension, even if we create SparseBlock again.

@@ -122,6 +162,60 @@ def test_get_indexer_non_unique(self, idx_values, key_values, key_class):
tm.assert_numpy_array_equal(expected, result)
tm.assert_numpy_array_equal(exp_miss, res_miss)

def test_where_unobserved_nan(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is where all of the where tests are?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There weren't any previously (we used to fall back to object).

* Unxfail (most) of the new combine_first tests
* Removed stale comment
* group conditions
@TomAugspurger
Copy link
Contributor Author

Updated. Main outstanding point is whether or not we should create a SparseBlock just for this. I don't have a preference.

if isinstance(other, (ABCIndexClass, ABCSeries)):
other = other.array

elif isinstance(other, ABCDataFrame):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some comments here

# for the type
other = self.dtype.na_value

if is_sparse(self.values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty hacky. This was why we had originally a .get_values() methon on Sparse to do things like this. We need something to give back the underlying type of the object, which is useful for Categorical as well. Would rather create a generalized soln than hack it like this.


# rough heuristic to see if the other array implements setitem
if (self._holder.__setitem__ == ExtensionArray.__setitem__
or self._holder.__setitem__ == SparseArray.__setitem__):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what the heck is this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general block is to check if the block implements __setitem__. That specific line is backwards compat for SparseArray, which implements __setitem__ to raise a TypeError instead of a NotImplementedError.

I suppose it'd be cleaner to do this in a try / except block...

@TomAugspurger
Copy link
Contributor Author

Cleaned things up a bit I think.

@TomAugspurger
Copy link
Contributor Author

All green.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty reasonable. question about the sparse checks.

# for the type
other = self.dtype.na_value

if is_sparse(self.values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so is this changing?

if lib.is_scalar(other):
msg = object_msg.format(other)
else:
msg = compat.reprlib.repr(other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we don't blow up with a long message for large categoricals. I messed it up though, one sec.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed all this stuff and just print out the text of the message.

With a bit of effort we could figure out exactly which of the new values is causing the fallback of object, but that'd take some work (we don't know the exact type /dtype of other here, so there will be a lot of conditions). Not a high priority.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

k cool

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small additional comments, lgtm otherwise. ping on green.

return np.random.choice(list(string.ascii_letters), size=100)
while True:
values = np.random.choice(list(string.ascii_letters), size=100)
# ensure we meet the requirement
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no repeated values but duplicates allowed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just that the first two are distinct., since the where test requires that data[0] != data[1].

@@ -11,7 +11,11 @@ def dtype():

@pytest.fixture
def data():
"""Length-100 array for this type."""
"""Length-100 array for this type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you copy this doc-string to the categorical one

@@ -2658,6 +2708,32 @@ def concat_same_type(self, to_concat, placement=None):
values, placement=placement or slice(0, len(values), 1),
ndim=self.ndim)

def where(self, other, cond, align=True, errors='raise',
try_cast=False, axis=0, transpose=False):
# This can all be deleted in favor of ExtensionBlock.where once
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add TODO(EA) or someting here so we know to remove this

# for the type
other = self.dtype.na_value

if is_sparse(self.values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh ok, can you add a TODO comment

@TomAugspurger
Copy link
Contributor Author

All green.

@jreback jreback merged commit baad046 into pandas-dev:master Dec 10, 2018
@jreback
Copy link
Contributor

jreback commented Dec 10, 2018

thanks!

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants