-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Implement Kleene logic for BooleanArray #29842
ENH: Implement Kleene logic for BooleanArray #29842
Conversation
other = pd.array([True] * len(data), dtype="boolean") | ||
self._compare_other(data, op_name, other) | ||
other = np.array([True] * len(data)) | ||
self._compare_other(data, op_name, other) | ||
other = pd.Series([True] * len(data), dtype="boolean") | ||
self._compare_other(data, op_name, other) | ||
|
||
def test_kleene_or(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A careful review of these new test cases would be greatly appreciated. I've tried to make them as clear as possible, while covering all the cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through the tests, very clear, added a few comments, for the rest looks good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started adding some docs in #29597 as well (on the NA scalar), so need to think about potential overlap. Now, I think we need to explain it in both places anyhow.
other = pd.array([True] * len(data), dtype="boolean") | ||
self._compare_other(data, op_name, other) | ||
other = np.array([True] * len(data)) | ||
self._compare_other(data, op_name, other) | ||
other = pd.Series([True] * len(data), dtype="boolean") | ||
self._compare_other(data, op_name, other) | ||
|
||
def test_kleene_or(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through the tests, very clear, added a few comments, for the rest looks good to me!
Sorry @jorisvandenbossche, I pushed a somewhat major refactor after you reviewed :/ Should be done now. High-level overview:
My main concern right now is that I may be assuming that masked values are false in a few places. One important comment was buried
Right now, I've adopted |
How would you be assuming that? Is there a place that you "uncover" masked values? |
I would maybe rather raise an error then. As otherwise you have a |
Still thinking through it, but we do things rougly like result = left & right
...
mask[result] = False i.e. we update the mask based on the result. I'll add some tests where I manually modify the |
Ah, I see. Yes, that shouldn't be done. Or it should combine with mask from the original values first, I think? |
pandas/core/arrays/boolean.py
Outdated
mask = left_mask | ||
|
||
if right_mask is not None: | ||
mask = mask | right_mask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still needed with the new code below to create the mask?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe so, though I may be wrong...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry, not needed after using your code. Thanks.
pandas/core/arrays/boolean.py
Outdated
# return result, mask | ||
|
||
result = left | right | ||
mask[left & ~left_mask] = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A logical op is quite a bit faster than a setitem like this, so if we can get the final result by combining different logical ops, that might be more performant
pandas/core/arrays/boolean.py
Outdated
result = left & right | ||
# unmask where either left or right is False | ||
mask[~left & ~left_mask] = False | ||
mask[~right & ~right_mask] = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment here, I think that something like this can be faster:
left_false = ~left & ~left_mask
right_false= ~right & ~right_mask
mask = (left_mask & ~right_false) | (right_mask & ~left_false)
(avoiding setitem)
And need to think if we can avoid some ~
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further optimization:
left_false = ~(left | left_mask)
right_false= ~(right | right_mask)
mask = (left_mask & ~right_false) | (right_mask & ~left_false)
Timing comparison:
left = np.random.randint(0, 2, 1000).astype(bool)
right = np.random.randint(0, 2, 1000).astype(bool)
left_mask = np.random.randint(0, 2, 1000).astype(bool)
right_mask = np.random.randint(0, 2, 1000).astype(bool)
In [47]: %%timeit
...: mask = left_mask | right_mask
...: mask[~left & ~left_mask] = False
...: mask[~right & ~right_mask] = False
7.2 µs ± 106 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [58]: %%timeit
...: left_false = ~(left | left_mask)
...: right_false= ~(right | right_mask)
...:
...: mask = (left_mask & ~right_false) | (right_mask & ~left_false)
3.73 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And on bigger arrays, the difference is much bigger, it seems. For 100_000 elements, I get 775 µs vs 45 µs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I'll also add an asv for these ops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some initial comments, haven't looked at the tests yet
pandas/core/arrays/boolean.py
Outdated
@@ -740,6 +742,171 @@ def boolean_arithmetic_method(self, other): | |||
return set_function_name(boolean_arithmetic_method, name, cls) | |||
|
|||
|
|||
def kleene_or( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would these be better in a separate module (in case we decide to re-use them) and this becomes less cluttered
@jorisvandenbossche what did you do about NumPy scalars not returning NotImplemented for certain operations? We're getting the wrong answer for NumPy scalars, since our implementation isn't called. In [2]: a = pd.array([True, False, None])
In [3]: a | np.bool_(True)
Out[3]:
<BooleanArray>
[True, True, True]
Length: 3, dtype: boolean
In [4]: np.bool_(True) | a
Out[4]:
<BooleanArray>
[True, True, NA]
Length: 3, dtype: boolean |
Hmm, for the NA scalar itself I skipped the tests, as I thought there was nothing to do about this: pandas/pandas/tests/scalar/test_na_scalar.py Lines 49 to 69 in ee6e6b3
But for the array-level ops this seems even more annoying .. (more likely to run into, as we do return numpy scalars from indexing/ops). |
I am actually linking to the comparison ops tests, for logical ops I apparently didn't add tests for numpy scalars (something to improve!) |
Ahh, we're going through Will fix that. |
Aha, yes missed that as well :) We probably need to fix this for the other ops as well (can be in another PR) |
OK, NumPy bools should be handled now. I did that by handling |
All green. Should be good to go. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. just a couple of questions.
@TomAugspurger happy to merge if @jorisvandenbossche ok |
pandas/core/arrays/boolean.py
Outdated
@@ -184,6 +184,9 @@ class BooleanArray(ExtensionArray, ExtensionOpsMixin): | |||
represented by 2 numpy arrays: a boolean array with the data and | |||
a boolean array with the mask (True indicating missing). | |||
|
|||
BooleanArray implements Kleene logic (sometimes called three-value | |||
logic) for logical operations. See :ref:`` for more. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is "ref" here a placeholder?
other, mask = coerce_to_array(other, copy=False) | ||
elif isinstance(other, np.bool_): | ||
other = other.item() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is to convert to a python bool? why not just bool(other)
? item
i usually think of as being an array method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
item
is the general method to get a python scalar (here we of course know we want a bool).
But Tom, why is it exactly needed to convert this? I would think the numpy operations later on work fine with a numpy scalar as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, we do things like if right is False
or if right is True
, which will fail for numpy booleans. I don't want to have to worry about checking both, so easier to convert here.
pandas/tests/arrays/test_boolean.py
Outdated
[ | ||
True, | ||
False, | ||
# pd.NA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a comment or commented-out code?
Not clear that it helps here, but it might be relevant that pd.NA has no |
It indeed doesn't help here (since the object is BooleanArray, not pd.NA, and BooleaArray already has an array_priority set). So @jbrockmendel is the following correct? My understanding is that That can ensure that both these operations return a Timestamp:
And, indeed, if I comment out
But if I add
So it seems that the decision who gets priority or which operation is executed first is more complicated (or array priority is not honoured here, which is maybe something to report to numpy?) |
@jorisvandenbossche does it make sense to define |
That's right if |
@jbrockmendel it's about the case where @TomAugspurger ah, good idea! That seems to give us more control. Will further comment on the other issue. |
@TomAugspurger Thanks a lot for this ! |
xref #29556
I have a few TODOs, and a few tests that I need to unxfail. Putting this up now so that @jorisvandenbossche can take a look.