ENH: Implement Kleene logic for BooleanArray #29842

TomAugspurger · 2019-11-25T21:58:26Z

I have a few TODOs, and a few tests that I need to unxfail. Putting this up now so that @jorisvandenbossche can take a look.

pandas/core/arrays/boolean.py

pandas/tests/arrays/test_boolean.py

TomAugspurger · 2019-11-25T22:01:04Z

pandas/tests/arrays/test_boolean.py

        other = pd.array([True] * len(data), dtype="boolean")
        self._compare_other(data, op_name, other)
        other = np.array([True] * len(data))
        self._compare_other(data, op_name, other)
        other = pd.Series([True] * len(data), dtype="boolean")
        self._compare_other(data, op_name, other)

+    def test_kleene_or(self):


A careful review of these new test cases would be greatly appreciated. I've tried to make them as clear as possible, while covering all the cases.

I went through the tests, very clear, added a few comments, for the rest looks good to me!

jorisvandenbossche

I started adding some docs in #29597 as well (on the NA scalar), so need to think about potential overlap. Now, I think we need to explain it in both places anyhow.

doc/source/user_guide/boolean.rst

pandas/core/arrays/boolean.py

pandas/tests/arrays/test_boolean.py

jorisvandenbossche · 2019-11-27T08:50:32Z

pandas/tests/arrays/test_boolean.py

        other = pd.array([True] * len(data), dtype="boolean")
        self._compare_other(data, op_name, other)
        other = np.array([True] * len(data))
        self._compare_other(data, op_name, other)
        other = pd.Series([True] * len(data), dtype="boolean")
        self._compare_other(data, op_name, other)

+    def test_kleene_or(self):


I went through the tests, very clear, added a few comments, for the rest looks good to me!

TomAugspurger · 2019-11-27T15:07:59Z

Sorry @jorisvandenbossche, I pushed a somewhat major refactor after you reviewed :/ Should be done now.

High-level overview:

Split into dedicated methods kleene_or, kleene_and, and kleene_xor, rather than doing everything in _create_logical_method.
Removed the _compare_other, test_scalar, and test_array tests. Since they were parametrized over all the ops, we would have needed to essentially re-implement the Kleene logic in _compare_other to get the masking right.
Added tests for each op against both arrays and scalars (True, False, np.nan)
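For readers less familiar with Kleene logic, here is a minimal scalar sketch of the three truth tables. It uses None as a stand-in for pd.NA, and the scalar_kleene_* helpers are hypothetical names for illustration only, not the array-level functions added in this PR.

```python
from itertools import product

NA = None  # assumption: None stands in for pd.NA in this sketch


def scalar_kleene_or(a, b):
    # A known True decides the result even if the other side is missing.
    if a is True or b is True:
        return True
    if a is NA or b is NA:
        return NA
    return False


def scalar_kleene_and(a, b):
    # A known False decides the result even if the other side is missing.
    if a is False or b is False:
        return False
    if a is NA or b is NA:
        return NA
    return True


def scalar_kleene_xor(a, b):
    # xor can never be decided when either side is missing.
    if a is NA or b is NA:
        return NA
    return a != b


# Print the full 3x3 truth table for each operation.
for a, b in product([True, False, NA], repeat=2):
    print(a, b, scalar_kleene_or(a, b), scalar_kleene_and(a, b), scalar_kleene_xor(a, b))
```

The asymmetry is the point: or can "see through" a missing value when the other operand is True, and likewise for and with False, while xor always propagates the missing value.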

My main concern right now is that I may be assuming that masked values are false in a few places.

One important comment was buried

But that gives the question: what do we want a | np.nan to do?

Right now, I've adopted pd.NA semantics. I think we do that or raise, I don't have a preference. Easy to support both.

jorisvandenbossche · 2019-11-27T15:14:54Z

My main concern right now is that I may be assuming that masked values are false in a few places.

How would you be assuming that? Is there a place that you "uncover" masked values?

jorisvandenbossche · 2019-11-27T15:15:58Z

Right now, I've adopted pd.NA semantics. I think we do that or raise, I don't have a preference. Easy to support both.

I would maybe rather raise an error then. As otherwise you have a np.nan that behaves differently depending on the context.

TomAugspurger · 2019-11-27T15:21:23Z

How would you be assuming that? Is there a place that you "uncover" masked values?

Still thinking through it, but we do things roughly like

result = left & right
...
mask[result] = False

i.e. we update the mask based on the result. I'll add some tests where I manually modify the _data of masked values to be True.

jorisvandenbossche · 2019-11-27T15:23:37Z

i.e. we update the mask based on the result.

Ah, I see. Yes, that shouldn't be done. Or it should combine with mask from the original values first, I think?

pandas/core/arrays/boolean.py

TomAugspurger · 2019-11-27T16:15:16Z

Another batch of changes

raise for NaN: 7f78a64
More tests (empty arrays, inplace mutation) 747e046, 36b171b
Ensure we aren't relying on the assumption that masked values are False: d0a8cca

The tests in d0a8cca are a bit tricky... Hopefully they're comprehensive and make sense.

pandas/core/arrays/boolean.py

jorisvandenbossche · 2019-11-27T16:55:18Z

pandas/core/arrays/boolean.py

+    mask = left_mask
+
+    if right_mask is not None:
+        mask = mask | right_mask


Is this still needed with the new code below to create the mask?

I believe so, though I may be wrong...

Ah sorry, not needed after using your code. Thanks.

jorisvandenbossche · 2019-11-27T16:58:08Z

pandas/core/arrays/boolean.py

+    #     return result, mask
+
+    result = left | right
+    mask[left & ~left_mask] = False


A logical op is quite a bit faster than a setitem like this, so if we can get the final result by combining different logical ops, that might be more performant
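As a rough illustration of that suggestion, the sketch below computes the mask for the or case purely with logical ops. The function name kleene_or_sketch is hypothetical, and treating the right-hand side symmetrically to the left is an assumption; this is not the code from the PR. The third element deliberately stores True underneath the mask, to show that the result does not rely on masked values being False.

```python
import numpy as np


def kleene_or_sketch(left, left_mask, right, right_mask):
    """Hypothetical helper: Kleene `or` on (data, mask) pairs of bool ndarrays."""
    result = left | right
    # Positions where a side is *known* to be True (True and not masked).
    left_true = left & ~left_mask
    right_true = right & ~right_mask
    # A position is NA if one side is NA and the other is not a known True.
    mask = (left_mask & ~right_true) | (right_mask & ~left_true)
    return result, mask


left = np.array([True, False, True, False])
left_mask = np.array([False, False, True, True])  # last two entries are NA
right = np.array([False, False, False, True])
right_mask = np.zeros(4, dtype=bool)

result, mask = kleene_or_sketch(left, left_mask, right, right_mask)
# Elementwise: True | False, False | False, NA | False, NA | True
print(result)
print(mask)
```

Only the third position stays masked (NA | False is NA); the fourth is unmasked because NA | True is True.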

pandas/core/arrays/boolean.py

pandas/tests/arrays/test_boolean.py

jorisvandenbossche · 2019-11-27T17:12:45Z

pandas/core/arrays/boolean.py

+        result = left & right
+        # unmask where either left or right is False
+        mask[~left & ~left_mask] = False
+        mask[~right & ~right_mask] = False


Similar comment here, I think that something like this can be faster:

left_false = ~left & ~left_mask
right_false = ~right & ~right_mask
mask = (left_mask & ~right_false) | (right_mask & ~left_false)

(avoiding setitem)

And need to think if we can avoid some ~

Further optimization:

left_false = ~(left | left_mask)
right_false = ~(right | right_mask)
mask = (left_mask & ~right_false) | (right_mask & ~left_false)

Timing comparison:

left = np.random.randint(0, 2, 1000).astype(bool)
right = np.random.randint(0, 2, 1000).astype(bool)
left_mask = np.random.randint(0, 2, 1000).astype(bool)
right_mask = np.random.randint(0, 2, 1000).astype(bool)

In [47]: %%timeit
    ...: mask = left_mask | right_mask
    ...: mask[~left & ~left_mask] = False
    ...: mask[~right & ~right_mask] = False
7.2 µs ± 106 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [58]: %%timeit
    ...: left_false = ~(left | left_mask)
    ...: right_false = ~(right | right_mask)
    ...:
    ...: mask = (left_mask & ~right_false) | (right_mask & ~left_false)
3.73 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And on bigger arrays, the difference is much bigger, it seems. For 100_000 elements, I get 775 µs vs 45 µs

Thanks. I'll also add an asv for these ops.

jreback

some initial comments, haven't looked at the tests yet

doc/source/user_guide/boolean.rst

jreback · 2019-11-27T17:02:52Z

pandas/core/arrays/boolean.py

@@ -740,6 +742,171 @@ def boolean_arithmetic_method(self, other):
        return set_function_name(boolean_arithmetic_method, name, cls)


+def kleene_or(


would these be better in a separate module (in case we decide to re-use them) and this becomes less cluttered

TomAugspurger · 2019-12-05T14:12:47Z

@jorisvandenbossche what did you do about NumPy scalars not returning NotImplemented for certain operations? We're getting the wrong answer for NumPy scalars, since our implementation isn't called.

In [2]: a = pd.array([True, False, None])

In [3]: a | np.bool_(True)
Out[3]:
<BooleanArray>
[True, True, True]
Length: 3, dtype: boolean

In [4]: np.bool_(True) | a
Out[4]:
<BooleanArray>
[True, True, NA]
Length: 3, dtype: boolean

jorisvandenbossche · 2019-12-05T14:23:10Z

Hmm, for the NA scalar itself I skipped the tests, as I thought there was nothing to do about this:

pandas/pandas/tests/scalar/test_na_scalar.py

Lines 49 to 69 in ee6e6b3

    
    def test_comparison_ops():

        for other in [NA, 1, 1.0, "a", np.int64(1), np.nan, np.bool_(True)]:
            assert (NA == other) is NA
            assert (NA != other) is NA
            assert (NA > other) is NA
            assert (NA >= other) is NA
            assert (NA < other) is NA
            assert (NA <= other) is NA

            if isinstance(other, (np.int64, np.bool_)):
                # for numpy scalars we get a deprecation warning and False as result
                # for equality or error for larger/lesser than
                continue

            assert (other == NA) is NA
            assert (other != NA) is NA
            assert (other > NA) is NA
            assert (other >= NA) is NA
            assert (other < NA) is NA
            assert (other <= NA) is NA

But for the array-level ops this seems even more annoying .. (more likely to run into, as we do return numpy scalars from indexing/ops).
But I thought that __array_priority__ should ensure that ops are directed to the BooleanArray? (numpy scalars should have a very low array priority)

jorisvandenbossche · 2019-12-05T14:24:31Z

I am actually linking to the comparison ops tests, for logical ops I apparently didn't add tests for numpy scalars (something to improve!)

TomAugspurger · 2019-12-05T14:34:07Z

Ahh, we're going through __array_ufunc__, which is not using Kleene logic. Should have been obvious that we got control somehow, as that returned a BooleanArray.

Will fix that.

jorisvandenbossche · 2019-12-05T14:39:58Z

Should have been obvious that we got control somehow, as that returned a BooleanArray.

Aha, yes missed that as well :)

We probably need to fix this for the other ops as well (can be in another PR)

TomAugspurger · 2019-12-05T14:40:37Z

OK, NumPy bools should be handled now. I did that by handling or, xor, and and in maybe_dispatch_ufunc_to_dunder_op. Hopefully that doesn't break anything else.
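To illustrate the general idea of routing a ufunc call back to the operator methods, here is a toy sketch. The Recorder class and the _UFUNC_TO_DUNDER table are hypothetical and exist only for this example; this is not the body of maybe_dispatch_ufunc_to_dunder_op, just the same pattern under simplified assumptions (binary `__call__` only, no keyword arguments).

```python
import numpy as np

# Hypothetical lookup table: ufunc -> (forward dunder, reflected dunder).
_UFUNC_TO_DUNDER = {
    np.bitwise_or: ("__or__", "__ror__"),
    np.bitwise_and: ("__and__", "__rand__"),
    np.bitwise_xor: ("__xor__", "__rxor__"),
}


class Recorder:
    """Toy object that records which operator method ended up being called."""

    def __init__(self):
        self.calls = []

    def _record(self, name):
        self.calls.append(name)
        return self

    def __or__(self, other):
        return self._record("__or__")

    def __ror__(self, other):
        return self._record("__ror__")

    def __and__(self, other):
        return self._record("__and__")

    def __rand__(self, other):
        return self._record("__rand__")

    def __xor__(self, other):
        return self._record("__xor__")

    def __rxor__(self, other):
        return self._record("__rxor__")

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        names = _UFUNC_TO_DUNDER.get(ufunc)
        if names is None or method != "__call__" or len(inputs) != 2 or kwargs:
            return NotImplemented
        forward, reflected = names
        if inputs[0] is self:
            return getattr(self, forward)(inputs[1])
        return getattr(self, reflected)(inputs[0])


rec = Recorder()
np.bitwise_or(np.bool_(True), rec)   # NumPy scalar on the left
np.bitwise_and(rec, np.bool_(True))  # our object on the left
print(rec.calls)
```

With this in place, a ufunc call that would otherwise bypass the operator methods ends up in the same code path as the plain Python operator.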

TomAugspurger · 2019-12-05T15:26:07Z

All green. Should be good to go.

jreback

lgtm. just a couple of questions.

asv_bench/benchmarks/boolean.py

pandas/core/ops/mask_ops.py

jreback · 2019-12-05T15:44:12Z

@TomAugspurger happy to merge if @jorisvandenbossche ok

jbrockmendel · 2019-12-06T02:58:35Z

pandas/core/arrays/boolean.py

@@ -184,6 +184,9 @@ class BooleanArray(ExtensionArray, ExtensionOpsMixin):
    represented by 2 numpy arrays: a boolean array with the data and
    a boolean array with the mask (True indicating missing).

+    BooleanArray implements Kleene logic (sometimes called three-value
+    logic) for logical operations. See :ref:`` for more.


is "ref" here a placeholder?

jbrockmendel · 2019-12-06T02:59:54Z

pandas/core/arrays/boolean.py

                other, mask = coerce_to_array(other, copy=False)
+            elif isinstance(other, np.bool_):
+                other = other.item()


this is to convert to a python bool? why not just bool(other)? item i usually think of as being an array method

item is the general method to get a python scalar (here we of course know we want a bool).

But Tom, why is it exactly needed to convert this? I would think the numpy operations later on work fine with a numpy scalar as well?

IIRC, we do things like if right is False or if right is True, which will fail for numpy booleans. I don't want to have to worry about checking both, so easier to convert here.
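A quick sketch of why the identity checks are the issue: a NumPy boolean compares equal to the Python singleton but is a different object, so `is` checks do not match until it is converted. The variable names are only for illustration.

```python
import numpy as np

flag = np.bool_(False)

is_identical = flag is False               # identity check against the Python singleton
is_equal = bool(flag == False)             # value comparison still works
item_identical = flag.item() is False      # .item() returns a Python bool
bool_identical = bool(flag) is False       # so does bool(...)

print(is_identical, is_equal, item_identical, bool_identical)
# prints: False True True True
```

Either conversion works for a known boolean; .item() is the more general way to get a Python scalar out of any NumPy scalar.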

jbrockmendel · 2019-12-06T03:01:26Z

pandas/tests/arrays/test_boolean.py

+        [
+            True,
+            False,
+            # pd.NA


is this a comment or commented-out code?

jbrockmendel · 2019-12-06T03:04:19Z

what did you do about NumPy scalars not returning NotImplemented for certain operations? We're getting the wrong answer for NumPy scalars, since our implementation isn't called.

Not clear that it helps here, but it might be relevant that pd.NA has no __array_priority__ attr

jorisvandenbossche · 2019-12-06T07:35:20Z

it might be relevant that pd.NA has no __array_priority__ attr

It indeed doesn't help here (since the object is BooleanArray, not pd.NA, and BooleanArray already has an __array_priority__ set).
But, I tried it yesterday for pd.NA, as it would be nice to solve those inconsistencies. But I don't see any difference ..

So @jbrockmendel is the following correct? My understanding is that __array_priority__ ensures that if an operation is done between two objects that both have an __array_priority__ set, that the object with the highest priority can perform the operation first, even if it is the right hand side object?

That can ensure that both these operations return a Timestamp:

In [15]: pd.Timestamp("2012-01-01") + np.timedelta64(1, 'h')
Out[15]: Timestamp('2012-01-01 01:00:00')

In [16]: np.timedelta64(1, 'h') + pd.Timestamp("2012-01-01") 
Out[16]: Timestamp('2012-01-01 01:00:00')

And, indeed, if I comment out __array_priority__ on Timestamp, we get:

In [1]: pd.Timestamp("2012-01-01") + np.timedelta64(1, 'm')
Out[1]: Timestamp('2012-01-01 00:01:00')

In [2]: np.timedelta64(1, 'h') + pd.Timestamp("2012-01-01")
---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
<ipython-input-2-0de73ced9bd3> in <module>
----> 1 np.timedelta64(1, 'h') + pd.Timestamp("2012-01-01")

UFuncTypeError: ufunc 'add' cannot use operands with types dtype('<m8[h]') and dtype('O')

But if I add __array_priority__ = 100 to the NAType class, I still see this inconsistent behaviour:

In [1]: pd.NA == np.bool_(True) 
Out[1]: NA

In [2]: np.bool_(True) == pd.NA 
Out[2]: False

So it seems that the decision who gets priority or which operation is executed first is more complicated (or array priority is not honoured here, which is maybe something to report to numpy?)

TomAugspurger · 2019-12-06T13:39:48Z

@jorisvandenbossche does it make sense to define NA.__array_ufunc__? I don't know if the NumPy scalar is invoking the ufunc machinery for binary ops but it may be. That'd also get pd.NA working better with raw ndarrays.

jbrockmendel · 2019-12-06T15:36:56Z

My understanding is that __array_priority__ ensures that if an operation is done between two objects that both have an __array_priority__ set, that the object with the highest priority can perform the operation first, even if it is the right hand side object?

That's right if left is an ndarray (maybe some other numpy types?) but not for e.g. Series, which has __array_priority__

jorisvandenbossche · 2019-12-09T08:49:17Z

That's right if left is an ndarray (maybe some other numpy types?) but not for e.g. Series, which has __array_priority__

@jbrockmendel it's about the case where left is a numpy scalar, not ndarray or Series

@TomAugspurger ah, good idea! That seems to give us more control. Will further comment on the other issue.

jorisvandenbossche · 2019-12-09T08:54:21Z

@TomAugspurger Thanks a lot for this !

ENH: add BooleanArray extension array (pandas-dev#29555)

bb904cb

TomAugspurger commented Nov 25, 2019

TomAugspurger added 4 commits November 26, 2019 07:37

move

13c7ea3

doc fixup

fff786f

Merge remote-tracking branch 'upstream/master' into boolean-array-kleene

4067e7f

working

708c553

jorisvandenbossche added the ExtensionArray and Missing-data labels Nov 27, 2019

jorisvandenbossche added this to the 1.0 milestone Nov 27, 2019

jorisvandenbossche reviewed Nov 27, 2019

TomAugspurger added 3 commits November 27, 2019 08:12

Merge remote-tracking branch 'upstream/master' into boolean-array-kleene

c56894e

updates

2e9d547

updates

373aaab

jorisvandenbossche reviewed Nov 27, 2019

TomAugspurger added 4 commits November 27, 2019 09:42

Raise for NaN

7f78a64

added tests for empty

36b171b

added tests for inplace mutation

747e046

Do not assume masked values are False

d0a8cca

TomAugspurger added 3 commits November 27, 2019 11:00

Merge remote-tracking branch 'upstream/master' into boolean-array-kleene

fe061b0

mypy

9f9e44c

doc fixups

0a34257

jorisvandenbossche reviewed Nov 27, 2019

jreback requested changes Nov 27, 2019

Added benchmarks

2ba0034

Merge remote-tracking branch 'upstream/master' into boolean-array-kleene

5a2c81c

TomAugspurger added 3 commits December 5, 2019 08:15

move

7032318

numpy scalars

bbb7f9b

doc note

ce763b4

handle numpy bool

5bc5328

jreback approved these changes Dec 5, 2019

jbrockmendel reviewed Dec 6, 2019

TomAugspurger added 2 commits December 6, 2019 15:38

Merge remote-tracking branch 'upstream/master' into boolean-array-kleene

457bd08

cleanup

31c2bc6

jorisvandenbossche approved these changes Dec 9, 2019

jorisvandenbossche merged commit 17f2ef3 into pandas-dev:master Dec 9, 2019

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

ENH: Implement Kleene logic for BooleanArray (pandas-dev#29842)

de6d4ef

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

ENH: Implement Kleene logic for BooleanArray (pandas-dev#29842)

1549306

jorisvandenbossche mentioned this pull request Mar 23, 2020

PERF: masked ops for reductions (sum) #30982

Merged
