Use new NA scalar in BooleanArray #29961

Merged

Conversation

jorisvandenbossche (Member)

Follow-up on #29597 and #29555, now actually using the pd.NA scalar in BooleanArray.

@jorisvandenbossche jorisvandenbossche added ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Dec 2, 2019
@jorisvandenbossche jorisvandenbossche added this to the 1.0 milestone Dec 2, 2019
@TomAugspurger (Contributor) left a comment

LGTM at a glance. Request for a few more tests / confirmation that these are already tested:

  1. Test to ensure that array([True, False, None, np.nan, pd.NA], dtype="boolean") correctly sanitizes all the NA-like values to NA.
  2. Test in BooleanArray.__setitem__ ensuring that arr[0] = np.nan, etc., always inserts NA.
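
A minimal sketch of what such tests could look like (the test names and parametrization here are hypothetical, not necessarily what landed in the PR):

import numpy as np
import pandas as pd
import pytest

@pytest.mark.parametrize("na", [None, np.nan, pd.NA])
def test_constructor_sanitizes_na_likes(na):
    # every NA-like input should end up as pd.NA in the BooleanArray
    arr = pd.array([True, False, na], dtype="boolean")
    assert arr[2] is pd.NA

@pytest.mark.parametrize("na", [None, np.nan, pd.NA])
def test_setitem_sanitizes_na_likes(na):
    # setting any NA-like value should insert pd.NA
    arr = pd.array([True, False, True], dtype="boolean")
    arr[0] = na
    assert arr[0] is pd.NA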

pandas/core/arrays/boolean.py (resolved)
@jorisvandenbossche (Member Author)

One question that comes up here (but the same question applies to string and integer arrays): what "NA value" do we want to use when converting to object-dtype numpy arrays? None or pd.NA?

In the initial BooleanArray PR, I used None (since pd.NA was not there yet), so you get a numpy array like np.array([True, False, None]).
Now we can start using pd.NA, which is closer to the pandas representation (and to what you would get from iteration or conversion to a list, e.g. np.array([i for i in arr])). But on the other hand, None can be easier to handle in cases where you need a numpy array (functionality that needs numpy arrays will typically not recognize or correctly handle pd.NA).

@TomAugspurger (Contributor)

I'm not sure, but my initial preference is for pd.NA, with an option to get other values for NA upon request. I think that means we should have a somewhat standard to_numpy method:

def to_numpy(self, dtype=object, na_value=pd.NA):
    # convert to a NumPy ndarray, filling missing values with ``na_value``
    ...

Then if the user wants None / NaN, they can request it relatively easily.
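
For illustration, usage under that signature could look like this (a sketch, assuming the na_value keyword lands as proposed):

import numpy as np
import pandas as pd

arr = pd.array([True, False, None], dtype="boolean")

arr.to_numpy(dtype=object)                  # array([True, False, <NA>], dtype=object)
arr.to_numpy(dtype=object, na_value=None)   # array([True, False, None], dtype=object)
arr.to_numpy(dtype=float, na_value=np.nan)  # array([1., 0., nan])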

> what "NA value" do we want to use when converting to object-dtype numpy arrays?

What are the user actions that hit this?

  1. np.array(boolarray, dtype=...)
  2. boolarray.astype(np_dtype)
  3. ...?

@jorisvandenbossche (Member Author)

Examples where pd.NA in a numpy array gives problems: #29976 (pyarrow's C conversion code does not know it) and the factorize errors in #29964 (our cython hashing code for object arrays cannot handle it).

Now, both are easy to solve on our side (the hashing code can recognize pd.NA, and for the conversion to pyarrow we can ensure None is used before passing the array along). But they are examples of how other code can break.

I'm still not sure that we should therefore use None as the default in __array__, but at least I think it is important to have this to_numpy(.., na_value=..).
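
For example, the pyarrow case could then be handled roughly like this (a sketch, assuming the to_numpy(..., na_value=...) API discussed above):

import pandas as pd
import pyarrow as pa

arr = pd.array([True, False, None], dtype="boolean")
# pyarrow's C conversion code does not recognize pd.NA, so hand it an
# object array that uses None for the missing values instead
pa_arr = pa.array(arr.to_numpy(dtype=object, na_value=None))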

@TomAugspurger (Contributor) left a comment

For the "what value to use when converting", I'm not sure that downstream projects not understanding pd.NA should drive our decision here. We'll be living with this decision for a while, and projects will have time to adapt.

@@ -281,7 +281,9 @@ def __getitem__(self, item):
             return self._data[item]
         return type(self)(self._data[item], self._mask[item])
 
-    def _coerce_to_ndarray(self, force_bool: bool = False):
+    def _coerce_to_ndarray(
+        self, force_bool: bool = False, na_value: "Scalar" = lib._no_default
Contributor

Any reason to prefer lib._no_default to just libmissing.NA directly?

Member Author

Hmm, not sure. I was probably sidetracked by the idea that I could not use None as the default, since that is a valid value as well ...
(If we want to have this generic / shared with other arrays, using self._na_value might be useful, but I don't think we will share this with arrays that don't use pd.NA, so ...)
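
For reference, the sentinel-default pattern under discussion looks roughly like this (a generic sketch; the PR itself uses lib._no_default as the sentinel):

import pandas as pd

_no_default = object()  # sentinel: distinguishes "argument not passed" from None

def _coerce_to_ndarray(self, force_bool: bool = False, na_value=_no_default):
    # None is itself a valid na_value, so it cannot serve as the default;
    # only when nothing was passed do we fall back to pd.NA
    if na_value is _no_default:
        na_value = pd.NA
    ...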

@jorisvandenbossche (Member Author)

@TomAugspurger updated this

@TomAugspurger (Contributor)

See #30043 for the CI failures. OK to ignore for now.

@TomAugspurger (Contributor) left a comment

I think we can move forward with this, despite the ongoing API discussion about .to_numpy / .astype(float) / np.asarray(arr), since that will affect all of IntegerArray / StringArray / BooleanArray.

        if is_integer_dtype(dtype):
            if self.isna().any():
                raise ValueError("cannot convert NA to integer")
        # for float dtype, ensure we use np.nan before casting (numpy cannot
        # deal with pd.NA)
Contributor

Pending the discussion in #30038.

@TomAugspurger (Contributor)

One more note: we'll need to handle reductions like any and all. They look somewhat buggy on master though.

In [16]: pd.array([True, None])._reduce('all')
Out[16]: True

In [17]: pd.array([True, None])._reduce('all', skipna=False)
Out[17]: True

In [18]: pd.array([False, None])._reduce('all', )
Out[18]: False

In [19]: pd.array([False, None])._reduce('all', skipna=False)
Out[19]: False

Do we want to do that here or as a followup?
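
(For reference, here is what one would expect under Kleene logic: a sketch of the intended results, not master's output at the time, and what the later any/all follow-up addresses.)

pd.array([True, None])._reduce('all')                 # True  (skipna=True ignores the NA)
pd.array([True, None])._reduce('all', skipna=False)   # <NA>  (the result depends on the missing value)
pd.array([False, None])._reduce('all')                # False
pd.array([False, None])._reduce('all', skipna=False)  # False (a False forces `all` to False regardless)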

@jorisvandenbossche (Member Author)

Let's do that as a follow-up; it's a leftover from the initial implementation and is noted as a to-do item in #29556.

@TomAugspurger (Contributor)

Sounds good to me.

@jorisvandenbossche (Member Author)

OK, going to merge this then, so the Kleene PR can be updated.

@jorisvandenbossche jorisvandenbossche merged commit e73ed45 into pandas-dev:master Dec 4, 2019
@jorisvandenbossche jorisvandenbossche deleted the boolean-use-NA branch December 4, 2019 19:27
@jorisvandenbossche (Member Author)

Once the IntegerArray PR is in, I will also take a look at consolidating both classes.

@TomAugspurger (Contributor)

I'll update the Kleene PR now, and the Integer one after if I have a chance.

@jorisvandenbossche (Member Author)

I did a quick PR for the any/all: #30062

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019