
Implement some reductions for string Series #31757

Closed
wants to merge 30 commits into from

Conversation

@dsaxton (Member) commented Feb 6, 2020

@WillAyd (Member) commented Feb 6, 2020

Just as a counterpoint do we really want to support all of these? sum in particular is strange to me to support on a string dtype

@jorisvandenbossche (Member):

The value of sum is certainly less clear, but I think min/max are nice to have?

@dsaxton (Member, Author) commented Feb 6, 2020

> Just as a counterpoint do we really want to support all of these? sum in particular is strange to me to support on a string dtype

Agreed summing strings is a little odd, but is it worth implementing for the sake of consistency with Series of object dtype (for which this is a valid operation)?

@WillAyd (Member) commented Feb 6, 2020

I don’t think consistency with object dtype is a goal for the string dtype. Even for min/max I’m not sure what those mean in a lot of cases, unless the answer is to fall back to Python semantics.

My concern is that that answer conflicts with the goal of creating a native string type.

@dsaxton (Member, Author) commented Feb 6, 2020

> I don’t think consistency with object dtype is a goal for the string dtype. Even for min/max I’m not sure what those mean in a lot of cases, unless the answer is to fall back to Python semantics.
>
> My concern is that that answer conflicts with the goal of creating a native string type.

Fair point about consistency not being important. I do think though that strings having an "order" to them is a pretty useful / natural concept (we'd probably want to allow sorting of strings, in which case we'd also want min and max)

@jorisvandenbossche (Member):

> Even for min/max I’m not sure what those mean in a lot of cases, unless the answer is to fall back to Python semantics.

You can sort strings, and then min/max have a rather clear meaning IMO

But it's true that we certainly don't need to do this for consistency with object dtype; we should do it because we think it is useful
(we only need to check / fix consistency between the DataFrame and Series operations)
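The ordering argument can be seen with plain Python, no pandas involved: strings compare lexicographically by code point, so min and max are well defined.

```python
# Strings have a total (lexicographic, codepoint-based) order in Python,
# so min/max have a clear meaning without any numeric interpretation.
words = ["pear", "apple", "banana"]
smallest = min(words)   # "apple"
largest = max(words)    # "pear"
print(smallest, largest)
```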

@jreback (Contributor) commented Feb 6, 2020

I would certainly add operations that work now on object dtype; otherwise the new dtypes won’t be used generally, which is not great

@jorisvandenbossche (Member):

More thoughts on adding min/max and/or sum: I would like to see min/max added, but care less about sum.

@jorisvandenbossche jorisvandenbossche modified the milestones: 1.1, 1.0.2 Feb 11, 2020
@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data labels Feb 12, 2020
@@ -274,7 +274,16 @@ def astype(self, dtype, copy=True):
         return super().astype(dtype, copy)

     def _reduce(self, name, skipna=True, **kwargs):
-        raise TypeError(f"Cannot perform reduction '{name}' with string dtype")
+        if name in ["min", "max", "sum"]:
+            na_mask = isna(self)
Contributor:

the masking should be done inside the methods themselves, _reduce just dispatches

Member (Author):

Should we implement these methods for StringArray in that case? The NA handling for PandasArray seems to be broken for string inputs, so it might have to get handled within each method

@jorisvandenbossche (Member), Feb 13, 2020:

Yes, I would say don't care about PandasArray too much (since PandasArray is not using pd.NA), and just implement the methods here on StringArray.
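The suggestion of implementing the reductions on StringArray itself can be sketched with a toy class; the class, names, and structure here are hypothetical stand-ins for illustration, not the actual pandas implementation:

```python
import numpy as np
import pandas as pd

class TinyStringArray:
    """Toy stand-in for StringArray: just enough to show the dispatch."""

    def __init__(self, values):
        self._ndarray = np.asarray(values, dtype=object)

    def _reduce(self, name, skipna=True):
        if name in ("min", "max"):
            mask = pd.isna(self._ndarray)
            data = self._ndarray[~mask] if skipna else self._ndarray
            # Missing values remain (skipna=False) or nothing is left: NA out
            if len(data) == 0 or pd.isna(data).any():
                return pd.NA
            return getattr(data, name)()
        raise TypeError(f"Cannot perform reduction '{name}' with string dtype")

arr = TinyStringArray(["b", None, "a"])
print(arr._reduce("min"))  # "a"
```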

Member (Author):

I think the reason why the NA-handling wasn't working was due to an apparently long-standing bug in nanops.nanminmax which I think we can fix here: #18588. Basically we are filling NA with infinite values when taking the min or max, but this doesn't make sense for object dtypes and an error gets raised even if skipna is True.

If we fix that by explicitly masking the missing values instead, I believe we can just use this function directly in StringArray methods.
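The masking fix described above could look roughly like this; `masked_min` is a hypothetical helper, not pandas internals, and it only handles the 1-D case:

```python
import numpy as np
import pandas as pd

def masked_min(values, skipna=True):
    """Hypothetical sketch: reduce over a 1-D object array by masking NA
    out first, instead of filling NA with +/-inf (which fails for strings)."""
    mask = pd.isna(values)
    if skipna:
        values = values[~mask]
    elif mask.any():
        return np.nan          # NA present and not skipped
    if len(values) == 0:
        return np.nan          # nothing left after masking
    return values.min()

result = masked_min(np.array(["b", np.nan, "a"], dtype=object))  # "a"
```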

pandas/core/arrays/string_.py (resolved thread)
pandas/tests/arrays/string_/test_string.py (resolved thread)
-none_in_first_column_result = getattr(df[["A", "B"]], method)()
-none_in_second_column_result = getattr(df[["B", "A"]], method)()
+none_in_first_column_result = getattr(df[["A", "B"]], method)().sort_index()
+none_in_second_column_result = getattr(df[["B", "A"]], method)().sort_index()
Member (Author):

Previously the column with the missing value was getting dropped from the result so it only had a single row and the order didn't matter

@jorisvandenbossche (Member) left a comment:

Thanks for the update!


def max(self, axis=None, out=None, keepdims=False, skipna=True):
nv.validate_max((), dict(out=out, keepdims=keepdims))
result = nanops.nanmax(self._ndarray, axis=axis, skipna=skipna)
Member:

There should be no need to explicitly pass through the axis keyword, I think

pandas/core/arrays/string_.py (resolved thread)
elif is_object_dtype(dtype) and values.ndim == 1 and na_mask.any():
# Need to explicitly mask NA values for object dtypes
if skipna:
result = getattr(values[~na_mask], meth)(axis)
Member:

This masking could also be done in the min/max functions? (as you had before?)

Or, another option might be to add a min/max function to mask_ops.py, similarly as I am doing for sum in #30982 (but it should be simpler for min/max, as those don't need to handle the min_count)

Member (Author):

I think a benefit of having it here is that this also fixes a bug for Series: pd.Series(["a", np.nan]).min() currently raises even though it shouldn't

Member:

Ah, that's a good point. Can you add a test for that, then?

Now, that aside, I think longer term we still want the separate min/max in mask_ops.py, so it can also be used for the int dtypes. But that can then certainly be done for a separate PR.

"min",
"max",
]:
pytest.skip("These reductions are implemented")
Member:

Can you see if you can update this in test_string.py instead? It might be that we now need to subclass the ReduceTests instead of the NoReduceTests.
(ideally the base tests remain dtype agnostic)

Member (Author):

By updating in test_string.py do you mean adding tests using the fixtures data and all_numeric_reductions, only checking for the "correct" output (and skipping over those reductions that aren't yet implemented)?

Member:

Hmm, actually looking at the base reduction tests now: they are not really written in a way that will pass for strings.

But you can copy this test to tests/extension/test_strings.py (and so override the base one), and then do the string-array-specific adaptation there. It gives some duplication of the test code, but it's not long, and it gives a clearer separation of concerns (the changes for string array stay in test_string)

Member (Author):

Ok, so we can remove the special cases for StringArray in BaseNoReduceTests without getting test failures, as long as they're handled in TestNoReduce in test_string.py? I'm not too familiar with how these particular tests actually get executed during CI

pandas/tests/frame/test_apply.py (resolved thread)
doc/source/whatsnew/v1.0.2.rst (resolved thread)
@@ -28,6 +28,11 @@ Fixed regressions
Bug fixes
~~~~~~~~~

**ExtensionArray**

- Fixed issue where taking the minimum or maximum of a ``StringArray`` or ``Series`` with ``StringDtype`` type would raise. (:issue:`31746`)
Contributor:

say .min() or .max()

@@ -854,6 +854,8 @@ def reduction(
mask: Optional[np.ndarray] = None,
) -> Dtype:

na_mask = isna(values)
Contributor:

you should already have the mask (pass it in when you call this).

@@ -864,6 +866,12 @@ def reduction(
result.fill(np.nan)
except (AttributeError, TypeError, ValueError):
result = np.nan
elif is_object_dtype(dtype) and values.ndim == 1 and na_mask.any():
Contributor:

do you have a test case that fails on non ndim==1?

Member (Author):

Yes, was getting a couple test failures otherwise, I think for reductions when the entire DataFrame has object dtype (I can't recall which tests exactly). I figured the subsetting values[~mask] is only going to make sense if values has one dimension.

"min",
"max",
]:
pytest.skip("These reductions are implemented")
Member:

Can you add here a comment saying that those are tested in tests/arrays/test_string.py ?

dsaxton and others added 3 commits February 15, 2020 07:46
Co-Authored-By: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jreback (Contributor) left a comment:

more comments

doc/source/whatsnew/v1.0.2.rst (resolved thread)
pandas/core/arrays/string_.py (resolved thread)
@@ -228,7 +228,9 @@ def _maybe_get_mask(
         # Boolean data cannot contain nulls, so signal via mask being None
         return None

-    if skipna:
+    if skipna or is_object_dtype(values.dtype):
+        # The masking in nanminmax does not work for object dtype, so always
Contributor:

Rather than do this, what exactly is the issue? 'does not work' is not very descriptive, and generally we don't put comments like this; we simply fix it

Member (Author):

So what nanops.nanminmax appears to do when taking the min or max in the presence of missing values is to fill them with an appropriate infinite number that has the effect of ignoring those missing values (if we're taking the minimum, replace with positive infinity; if we're taking the max, replace with negative infinity). The problem is that this makes no sense for strings (there is, as far as I know, no "infinite string"), and that's why we get the error about comparing floats (infinity) and strings. The easiest workaround seems to be to mask them out instead.

To make things more complicated, the _get_values function in nanminmax apparently doesn't bother to calculate a missing value mask when skipna is False, because it relies on the trick above working. Since it won't work here, I'm making sure that we always get a mask for object dtypes.
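The failure mode described above can be reproduced in a few lines; this is an illustrative sketch of the fill-with-infinity trick, not the actual nanops code:

```python
import numpy as np

values = np.array(["b", np.nan, "a"], dtype=object)
mask = np.array([False, True, False])

# The nanops-style fill: replace NaN with +inf before taking the min.
filled = values.copy()
filled[mask] = np.inf
try:
    filled.min()            # compares "b" < inf: str vs float raises
    fill_trick_raises = False
except TypeError:
    fill_trick_raises = True

# Masking the missing values out instead works fine.
masked_result = values[~mask].min()   # "a"
```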

@jreback (Contributor), Feb 16, 2020:

Then let's actually fix this properly.

This is going to need a separate branch for object dtypes, and perf tests.

Member:

Similarly to #30982, I would personally rather have a separate implementation using masks instead of the filling logic of the nanops (sharing helper functions where appropriate), rather than trying to fit it into the existing nanops code (which gives those special cases / complex if checks like the one below)

@dsaxton (Member, Author), Mar 3, 2020:

@jorisvandenbossche Fair point for the string dtype, although I think some kind of logic like this would be necessary to fix min / max for object strings.

Edit: Actually looks like maybe you're already addressing this in your PR.

Member:

> some kind of logic like this would be necessary to fix min / max for object strings.

Yep, indeed; that's the argument you mentioned before for doing it in nanops as well, so the non-extension-array object dtype would benefit too. For this, we could also add a check for object dtype, and then calculate the mask and use the masked op implementation.

Member (Author):

Do you think it might also be worth trying to refactor nanminmax in nanops to use masking in general instead of the current filling approach (from what I could tell this was only really needed for arrays with more than one dimension)?

Contributor:

yes

and mask is not None
and mask.any()
):
# Need to explicitly mask NA values for object dtypes
Contributor:

why?

@@ -865,6 +867,17 @@ def reduction(
result.fill(np.nan)
except (AttributeError, TypeError, ValueError):
result = np.nan
elif (
is_object_dtype(dtype)
Contributor:

Why do you need all of these conditions? This is complicated.

Member (Author):

A few reasons:

- Only looking at objects is for the reason above.
- values.ndim == 1 is because I think we can get here even if values is not vector-valued, which is more or less what we're assuming when we just mask them out (without this we get test failures when we have an all-object DataFrame).
- mask.any() is because this function already works if no values are missing, so there's no reason to do masking.
- mask is not None is to please mypy (we've already guaranteed that mask isn't None for objects above, but mypy doesn't know this).

I can try to be a bit more explicit in the comments if that would be helpful.

@jreback jreback removed this from the 1.0.2 milestone Feb 16, 2020
@jreback (Contributor) commented Feb 16, 2020

Moving this off 1.0.2 as it has raised some non-trivial issues to solve. If you want a bug fix out of this, OK, but then this needs to decouple the issues.

@@ -59,6 +59,10 @@ Previously indexing with a nullable Boolean array containing ``NA`` would raise
Bug fixes
~~~~~~~~~

**ExtensionArray**
Contributor:

This is likely too invasive for 1.0.2; move to 1.1.

@dsaxton (Member, Author) commented Mar 5, 2020

Going to close this for now since it looks to be quite a bit larger than I originally thought (seems like fixing nanmin / nanmax might be the right approach).

I'm not sure how best to implement this, but another idea might be to fill with an arbitrary non-NaN value from each array along the given axis before taking the min or max; then I think it should work for any dtype (though it might be a little slower).
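The fill-with-an-existing-value idea could look roughly like this; `fill_then_reduce` is a hypothetical sketch for the 1-D case only (the real nanops code would also need to handle the axis argument):

```python
import numpy as np
import pandas as pd

def fill_then_reduce(values, how="min"):
    """Replace each NA with an arbitrary real element from the same array,
    so no mixed-type comparison happens and min/max are unaffected."""
    mask = pd.isna(values)
    if mask.all():
        return pd.NA           # nothing to reduce over
    fill_value = values[~mask][0]   # any non-missing element works
    filled = np.where(mask, fill_value, values)
    return getattr(filled, how)()

result = fill_then_reduce(np.array(["b", np.nan, "a"], dtype=object))  # "a"
```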
