ENH: Implement Kleene logic for BooleanArray #29842

Merged

Conversation

TomAugspurger
Contributor

xref #29556

I have a few TODOs, and a few tests that I need to unxfail. Putting this up now so that @jorisvandenbossche can take a look.

pandas/tests/arrays/test_boolean.py
        other = pd.array([True] * len(data), dtype="boolean")
        self._compare_other(data, op_name, other)
        other = np.array([True] * len(data))
        self._compare_other(data, op_name, other)
        other = pd.Series([True] * len(data), dtype="boolean")
        self._compare_other(data, op_name, other)

    def test_kleene_or(self):
Contributor Author

A careful review of these new test cases would be greatly appreciated. I've tried to make them as clear as possible, while covering all the cases.

Member

I went through the tests, very clear, added a few comments, for the rest looks good to me!

@jorisvandenbossche jorisvandenbossche added the ExtensionArray and Missing-data labels Nov 27, 2019
@jorisvandenbossche jorisvandenbossche added this to the 1.0 milestone Nov 27, 2019
Member

@jorisvandenbossche jorisvandenbossche left a comment

I started adding some docs in #29597 as well (on the NA scalar), so need to think about potential overlap. Now, I think we need to explain it in both places anyhow.


@TomAugspurger
Contributor Author

Sorry @jorisvandenbossche, I pushed a somewhat major refactor after you reviewed :/ Should be done now.

High-level overview:

  • Split into dedicated methods kleene_or, kleene_and, and kleene_xor, rather than doing everything in _create_logical_method.
  • Removed the _compare_other, test_scalar, and test_array tests. Since they were parameterized over all the ops, we would have needed to essentially re-implement the Kleene logic in _compare_other to get the masking right.
  • Added tests for each op against both arrays and scalars (True, False, np.nan)
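
For reference, the three operators named above follow Kleene's three-valued truth tables. A minimal pure-Python sketch, assuming None as a stand-in for the missing value; the scalar helpers below are hypothetical and are not the kleene_or, kleene_and, and kleene_xor functions added in this PR:

```python
from typing import Optional

Tri = Optional[bool]  # None stands in for the missing value (assumption)


def scalar_kleene_or(a: Tri, b: Tri) -> Tri:
    # True dominates: a known True on either side decides the result.
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def scalar_kleene_and(a: Tri, b: Tri) -> Tri:
    # False dominates: a known False on either side decides the result.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def scalar_kleene_xor(a: Tri, b: Tri) -> Tri:
    # XOR never short-circuits: any missing operand makes the result missing.
    if a is None or b is None:
        return None
    return a != b
```

The array-level versions have to reproduce these tables through a data array plus a mask, which is where the masking questions in the rest of this thread come from.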

My main concern right now is that I may be assuming that masked values are false in a few places.


One important comment was buried

But that gives the question: what do we want a | np.nan to do?

Right now, I've adopted pd.NA semantics. I think we do that or raise, I don't have a preference. Easy to support both.

@jorisvandenbossche
Member

My main concern right now is that I may be assuming that masked values are false in a few places.

How would you be assuming that? Is there a place that you "uncover" masked values?

@jorisvandenbossche
Member

Right now, I've adopted pd.NA semantics. I think we do that or raise, I don't have a preference. Easy to support both.

I would maybe rather raise an error then. As otherwise you have a np.nan that behaves differently depending on the context.
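
A rough sketch of what the "raise" option could look like, assuming a hypothetical validation helper. The name, signature, and error messages are illustrative only and are not taken from the PR; None stands in for pd.NA:

```python
import math


def validate_logical_scalar(other):
    """Accept only a bool or the missing-value stand-in (None here).

    Hypothetical helper: float NaN is rejected instead of being
    silently treated as missing.
    """
    if other is None or isinstance(other, bool):
        return other
    if isinstance(other, float) and math.isnan(other):
        raise TypeError("NaN is ambiguous in a logical operation; use the NA scalar")
    raise TypeError(f"expected a bool or NA, got {type(other).__name__}")
```

Rejecting NaN up front keeps a single meaning for np.nan regardless of which operation it appears in.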

@TomAugspurger
Contributor Author

How would you be assuming that? Is there a place that you "uncover" masked values?

Still thinking through it, but we do things roughly like

result = left & right
...
mask[result] = False

i.e. we update the mask based on the result. I'll add some tests where I manually modify the _data of masked values to be True.
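
To make the concern concrete, here is a small NumPy-only sketch (not the PR's code) in which the data stored under a masked slot happens to be True. The variable names are illustrative:

```python
import numpy as np

# Element 0 of `left` is missing, but its underlying data is True.
left = np.array([True, True, False])
left_mask = np.array([True, False, False])
right = np.array([True, True, True])
right_mask = np.array([False, False, False])

# Naive approach: unmask wherever the raw result is True.
result = left & right
naive_mask = left_mask | right_mask
naive_mask[result] = False  # wrongly unmasks element 0 (NA & True should be NA)

# Safer approach: only values that are known (unmasked) may unmask a result.
left_false = ~left & ~left_mask
right_false = ~right & ~right_mask
safe_mask = (left_mask & ~right_false) | (right_mask & ~left_false)
```

With the naive update, element 0 ends up unmasked with value True; the second variant keeps it masked because it never reads data hidden behind a mask.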

@jorisvandenbossche
Member

i.e. we update the mask based on the result.

Ah, I see. Yes, that shouldn't be done. Or it should combine with mask from the original values first, I think?

@TomAugspurger
Contributor Author

Another batch of changes

  • raise for NaN: 7f78a64
  • More tests (empty arrays, inplace mutation) 747e046, 36b171b
  • Ensure we aren't relying on the assumption that masked values are False: d0a8cca

The tests in d0a8cca are a bit tricky... Hopefully they're comprehensive and make sense.

pandas/core/arrays/boolean.py
mask = left_mask

if right_mask is not None:
    mask = mask | right_mask
Member

Is this still needed with the new code below to create the mask?

Contributor Author

I believe so, though I may be wrong...

Contributor Author

Ah sorry, not needed after using your code. Thanks.

# return result, mask

result = left | right
mask[left & ~left_mask] = False
Member

A logical op is quite a bit faster than a setitem like this, so if we can get the final result by combining different logical ops, that might be more performant
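
One way the OR case might be written with logical ops only, as a sketch under the assumption that both operands are arrays with explicit masks. The function name is hypothetical and this is not necessarily what was merged:

```python
import numpy as np


def kleene_or_sketch(left, left_mask, right, right_mask):
    # Values that are known (unmasked) and True.
    left_true = left & ~left_mask
    right_true = right & ~right_mask
    # A known True on either side decides the result.
    result = left_true | right_true
    # The result is missing unless the other side is a known True.
    mask = (left_mask & ~right_true) | (right_mask & ~left_true)
    return result, mask


# [True, False, NA] | [NA, NA, NA]
values, missing = kleene_or_sketch(
    np.array([True, False, False]),
    np.array([False, False, True]),
    np.array([False, False, False]),
    np.array([True, True, True]),
)
```

Because the result is built from left_true and right_true rather than the raw data, whatever is stored under a masked slot cannot leak into the output.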

result = left & right
# unmask where either left or right is False
mask[~left & ~left_mask] = False
mask[~right & ~right_mask] = False
Member

Similar comment here, I think that something like this can be faster:

left_false = ~left & ~left_mask
right_false = ~right & ~right_mask

mask = (left_mask & ~right_false) | (right_mask & ~left_false)

(avoiding setitem)

And need to think if we can avoid some ~

Member

Further optimization:

left_false = ~(left | left_mask)
right_false = ~(right | right_mask)
mask = (left_mask & ~right_false) | (right_mask & ~left_false)

Timing comparison:

left = np.random.randint(0, 2, 1000).astype(bool)
right = np.random.randint(0, 2, 1000).astype(bool) 
left_mask = np.random.randint(0, 2, 1000).astype(bool) 
right_mask = np.random.randint(0, 2, 1000).astype(bool)
In [47]: %%timeit 
    ...: mask = left_mask | right_mask 
    ...: mask[~left & ~left_mask] = False 
    ...: mask[~right & ~right_mask] = False 

7.2 µs ± 106 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [58]: %%timeit 
    ...: left_false = ~(left | left_mask) 
    ...: right_false = ~(right | right_mask)
    ...:  
    ...: mask = (left_mask & ~right_false) | (right_mask & ~left_false)
3.73 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
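
Before comparing timings it is worth confirming that the two variants compute the same mask. A quick check on random data, with the seed and array size chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
left, right, left_mask, right_mask = (
    rng.integers(0, 2, 1000).astype(bool) for _ in range(4)
)

# Variant 1: start from the union of the masks, then unmask via setitem.
mask_setitem = left_mask | right_mask
mask_setitem[~left & ~left_mask] = False
mask_setitem[~right & ~right_mask] = False

# Variant 2: logical ops only.
left_false = ~(left | left_mask)
right_false = ~(right | right_mask)
mask_logical = (left_mask & ~right_false) | (right_mask & ~left_false)

assert np.array_equal(mask_setitem, mask_logical)
```

The two agree because a masked left value can never count as a known False, so both expressions reduce to (left_mask | right_mask) & ~left_false & ~right_false.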

Member

And on bigger arrays, the difference is much bigger, it seems. For 100_000 elements, I get 775 µs vs 45 µs

Contributor Author

Thanks. I'll also add an asv for these ops.
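
A benchmark along these lines might look roughly as follows. This is a hypothetical sketch that follows asv's naming conventions and times only the raw mask computation on NumPy arrays; the class name and sizes are assumptions, and it is not the benchmark added to asv_bench:

```python
import numpy as np


class KleeneMaskOps:
    # asv conventions: `params`/`param_names` parametrize the benchmark,
    # `setup` prepares data, and methods prefixed with `time_` are timed.
    params = [1_000, 100_000]
    param_names = ["n"]

    def setup(self, n):
        rng = np.random.default_rng(0)
        self.left, self.right, self.left_mask, self.right_mask = (
            rng.integers(0, 2, n).astype(bool) for _ in range(4)
        )

    def time_and_mask_logical(self, n):
        # Mask for Kleene AND computed with logical ops only.
        left_false = ~(self.left | self.left_mask)
        right_false = ~(self.right | self.right_mask)
        (self.left_mask & ~right_false) | (self.right_mask & ~left_false)
```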

Contributor

@jreback jreback left a comment

some initial comments, haven't looked at the tests yet

@@ -740,6 +742,171 @@ def boolean_arithmetic_method(self, other):
return set_function_name(boolean_arithmetic_method, name, cls)


def kleene_or(
Contributor

would these be better in a separate module (in case we decide to re-use them) and this becomes less cluttered

@TomAugspurger
Contributor Author

@jorisvandenbossche what did you do about NumPy scalars not returning NotImplemented for certain operations? We're getting the wrong answer for NumPy scalars, since our implementation isn't called.

In [2]: a = pd.array([True, False, None])

In [3]: a | np.bool_(True)
Out[3]:
<BooleanArray>
[True, True, True]
Length: 3, dtype: boolean

In [4]: np.bool_(True) | a
Out[4]:
<BooleanArray>
[True, True, NA]
Length: 3, dtype: boolean

@jorisvandenbossche
Member

Hmm, for the NA scalar itself I skipped the tests, as I thought there was nothing to do about this:

def test_comparison_ops():
    for other in [NA, 1, 1.0, "a", np.int64(1), np.nan, np.bool_(True)]:
        assert (NA == other) is NA
        assert (NA != other) is NA
        assert (NA > other) is NA
        assert (NA >= other) is NA
        assert (NA < other) is NA
        assert (NA <= other) is NA
        if isinstance(other, (np.int64, np.bool_)):
            # for numpy scalars we get a deprecation warning and False as result
            # for equality or error for larger/lesser than
            continue
        assert (other == NA) is NA
        assert (other != NA) is NA
        assert (other > NA) is NA
        assert (other >= NA) is NA
        assert (other < NA) is NA
        assert (other <= NA) is NA

But for the array-level ops this seems even more annoying (we are more likely to run into it, as we do return numpy scalars from indexing/ops).
But I thought that __array_priority__ should ensure that ops are directed to the BooleanArray? (numpy scalars should have a very low array priority)

@jorisvandenbossche
Member

I am actually linking to the comparison ops tests, for logical ops I apparently didn't add tests for numpy scalars (something to improve!)

@TomAugspurger
Contributor Author

Ahh, we're going through __array_ufunc__, which is not using Kleene logic. Should have been obvious that we got control somehow, as that returned a BooleanArray.

Will fix that.

@jorisvandenbossche
Member

Should have been obvious that we got control somehow, as that returned a BooleanArray.

Aha, yes missed that as well :)

We probably need to fix this for the other ops as well (can be in another PR)

@TomAugspurger
Contributor Author

OK, NumPy bools should be handled now. I did that by handling or, xor, and and in maybe_dispatch_ufunc_to_dunder_op. Hopefully that doesn't break anything else.
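
The general idea of routing a ufunc call back to the operator method can be sketched with a toy class. MaskedBool and everything inside it is hypothetical, handles scalar OR only, and is not pandas' maybe_dispatch_ufunc_to_dunder_op:

```python
import numpy as np


class MaskedBool:
    """Toy stand-in for a masked boolean array: data plus mask (True = missing)."""

    def __init__(self, data, mask):
        self.data = np.asarray(data, dtype=bool)
        self.mask = np.asarray(mask, dtype=bool)

    def __or__(self, other):
        # Scalar-only Kleene OR, enough for illustration.
        if isinstance(other, np.bool_):
            other = other.item()
        if other is True:
            # Anything OR True is True, including missing values.
            return MaskedBool(np.ones_like(self.data), np.zeros_like(self.mask))
        if other is False:
            # Known values pass through; missing values stay missing.
            return MaskedBool(self.data & ~self.mask, self.mask.copy())
        return NotImplemented

    __ror__ = __or__  # OR is commutative

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Send np.bitwise_or back through the Kleene-aware operator.
        if ufunc is np.bitwise_or and method == "__call__" and not kwargs:
            left, right = inputs
            other = right if left is self else left
            return self.__or__(other)
        return NotImplemented


arr = MaskedBool([True, False, False], [False, False, True])
out = np.bitwise_or(np.bool_(True), arr)
```

Here out has every slot unmasked and True, matching what arr | True gives, because the ufunc call ends up in the same code path as the operator.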

@TomAugspurger
Contributor Author

All green. Should be good to go.

Contributor

@jreback jreback left a comment

lgtm. just a couple of questions.

asv_bench/benchmarks/boolean.py
pandas/core/ops/mask_ops.py
@jreback
Contributor

jreback commented Dec 5, 2019

@TomAugspurger happy to merge if @jorisvandenbossche ok

@@ -184,6 +184,9 @@ class BooleanArray(ExtensionArray, ExtensionOpsMixin):
represented by 2 numpy arrays: a boolean array with the data and
a boolean array with the mask (True indicating missing).

BooleanArray implements Kleene logic (sometimes called three-value
logic) for logical operations. See :ref:`` for more.
Member

is "ref" here a placeholder?

    other, mask = coerce_to_array(other, copy=False)
elif isinstance(other, np.bool_):
    other = other.item()
Member

this is to convert to a python bool? why not just bool(other)? item i usually think of as being an array method

Member

item is the general method to get a python scalar (here we of course know we want a bool).

But Tom, why is it exactly needed to convert this? I would think the numpy operations later on work fine with a numpy scalar as well?

Contributor Author

IIRC, we do things like if right is False or if right is True, which will fail for numpy booleans. I don't want to have to worry about checking both, so easier to convert here.
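
A short illustration of why identity checks need the conversion (variable names are only for the example):

```python
import numpy as np

flag = np.bool_(True)

# Equality works, but identity does not: the NumPy scalar is a distinct object.
equal_to_true = bool(flag == True)  # noqa: E712
identical_to_true = flag is True

# .item() yields a plain Python bool, for which identity checks behave as expected.
converted = flag.item()
converted_is_true = converted is True
```

After the conversion, a single `is True` / `is False` branch covers both Python and NumPy booleans.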

[
True,
False,
# pd.NA
Member

is this a comment or commented-out code?

@jbrockmendel
Member

what did you do about NumPy scalars not returning NotImplemented for certain operations? We're getting the wrong answer for NumPy scalars, since our implementation isn't called.

Not clear that it helps here, but it might be relevant that pd.NA has no __array_priority__ attr

@jorisvandenbossche
Member

it might be relevant that pd.NA has no __array_priority__ attr

It indeed doesn't help here (since the object is BooleanArray, not pd.NA, and BooleanArray already has an __array_priority__ set).
But I tried it yesterday for pd.NA, as it would be nice to solve those inconsistencies, and I don't see any difference.

So @jbrockmendel is the following correct? My understanding is that __array_priority__ ensures that if an operation is done between two objects that both have an __array_priority__ set, that the object with the highest priority can perform the operation first, even if it is the right hand side object?

That can ensure that both these operations return a Timestamp:

In [15]: pd.Timestamp("2012-01-01") + np.timedelta64(1, 'h')
Out[15]: Timestamp('2012-01-01 01:00:00')

In [16]: np.timedelta64(1, 'h') + pd.Timestamp("2012-01-01") 
Out[16]: Timestamp('2012-01-01 01:00:00')

And, indeed, if I comment out __array_priority__ on Timestamp, we get:

In [1]: pd.Timestamp("2012-01-01") + np.timedelta64(1, 'm')
Out[1]: Timestamp('2012-01-01 00:01:00')

In [2]: np.timedelta64(1, 'h') + pd.Timestamp("2012-01-01")
---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
<ipython-input-2-0de73ced9bd3> in <module>
----> 1 np.timedelta64(1, 'h') + pd.Timestamp("2012-01-01")

UFuncTypeError: ufunc 'add' cannot use operands with types dtype('<m8[h]') and dtype('O')

But if I add __array_priority__ = 100 to the NAType class, I still see this inconsistent behaviour:

In [1]: pd.NA == np.bool_(True) 
Out[1]: NA

In [2]: np.bool_(True) == pd.NA 
Out[2]: False

So it seems that the decision who gets priority or which operation is executed first is more complicated (or array priority is not honoured here, which is maybe something to report to numpy?)

@TomAugspurger
Contributor Author

TomAugspurger commented Dec 6, 2019

@jorisvandenbossche does it make sense to define NA.__array_ufunc__? I don't know if the NumPy scalar is invoking the ufunc machinery for binary ops but it may be. That'd also get pd.NA working better with raw ndarrays.

@jbrockmendel
Member

My understanding is that __array_priority__ ensures that if an operation is done between two objects that both have an __array_priority__ set, that the object with the highest priority can perform the operation first, even if it is the right hand side object?

That's right if left is an ndarray (maybe some other numpy types?) but not for e.g. Series, which has __array_priority__

@jorisvandenbossche
Member

That's right if left is an ndarray (maybe some other numpy types?) but not for e.g. Series, which has __array_priority__

@jbrockmendel it's about the case where left is a numpy scalar, not ndarray or Series

@TomAugspurger ah, good idea! That seems to give us more control. Will further comment on the other issue.

@jorisvandenbossche jorisvandenbossche merged commit 17f2ef3 into pandas-dev:master Dec 9, 2019
@jorisvandenbossche
Member

@TomAugspurger Thanks a lot for this !

Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
5 participants