
Implement some reductions for string Series #31757

Closed
wants to merge 30 commits into from

Conversation

@dsaxton (Member) commented Feb 6, 2020

@WillAyd (Member) commented Feb 6, 2020

Just as a counterpoint do we really want to support all of these? sum in particular is strange to me to support on a string dtype

@jorisvandenbossche (Member):

The value of sum is certainly less clear, but I think min/max are nice to have?

@dsaxton (Member, Author) commented Feb 6, 2020

> Just as a counterpoint do we really want to support all of these? sum in particular is strange to me to support on a string dtype

Agreed summing strings is a little odd, but is it worth implementing for the sake of consistency with Series of object dtype (for which this is a valid operation)?

@WillAyd (Member) commented Feb 6, 2020

I don’t think consistency with object dtype is a goal for the string dtype. Even for min/max I’m not sure what those mean in a lot of cases, unless the answer is to fall back to Python semantics.

My concern is that that answer conflicts with the goal of creating a native string type.

@dsaxton (Member, Author) commented Feb 6, 2020

> I don’t think consistency with object dtype is a goal for the string dtype. Even for min/max I’m not sure what those mean in a lot of cases, unless the answer is to fall back to Python semantics.
>
> My concern is that that answer conflicts with the goal of creating a native string type.

Fair point about consistency not being important. I do think though that strings having an "order" to them is a pretty useful / natural concept (we'd probably want to allow sorting of strings, in which case we'd also want min and max)

@jorisvandenbossche (Member):

> Even for min/max I’m not sure what those mean in a lot of cases, unless the answer is to fall back to Python semantics.

You can sort strings, and then min/max have a rather clear meaning IMO

But it's true that we certainly don't need to do this for consistency with object dtype; we should do it because we think it is useful
(we only need to check / fix consistency between the DataFrame and Series operations)
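The ordering argument can be seen with plain Python, no pandas involved: strings compare lexicographically by code point, so min and max are well defined.

```python
# Strings have a total (lexicographic, codepoint-based) order in Python,
# so min/max have a clear meaning without any numeric interpretation.
words = ["pear", "apple", "banana"]
smallest = min(words)   # "apple"
largest = max(words)    # "pear"
print(smallest, largest)
```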

@jreback (Contributor) commented Feb 6, 2020

I would certainly add operations that work now on object dtype; otherwise the new dtypes won’t be used generally, which is not great

@jorisvandenbossche (Member):

More thoughts on adding min/max and/or sum: I would like to see min/max added, but care less about sum.

@jorisvandenbossche jorisvandenbossche modified the milestones: 1.1, 1.0.2 Feb 11, 2020
@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data labels Feb 12, 2020
@@ -274,7 +274,16 @@ def astype(self, dtype, copy=True):
         return super().astype(dtype, copy)

     def _reduce(self, name, skipna=True, **kwargs):
-        raise TypeError(f"Cannot perform reduction '{name}' with string dtype")
+        if name in ["min", "max", "sum"]:
+            na_mask = isna(self)
Contributor:

the masking should be done inside the methods themselves, _reduce just dispatches

Member (Author):

Should we implement these methods for StringArray in that case? The NA handling for PandasArray seems to be broken for string inputs, so it might have to get handled within each method

@jorisvandenbossche (Member), Feb 13, 2020:

Yes, I would say don't care about PandasArray too much (since PandasArray is not using pd.NA), and just implement the methods here on StringArray.
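The suggestion of implementing the reductions on StringArray itself can be sketched with a toy class; the class, names, and structure here are hypothetical stand-ins for illustration, not the actual pandas implementation:

```python
import numpy as np
import pandas as pd

class TinyStringArray:
    """Toy stand-in for StringArray: just enough to show the dispatch."""

    def __init__(self, values):
        self._ndarray = np.asarray(values, dtype=object)

    def _reduce(self, name, skipna=True):
        if name in ("min", "max"):
            mask = pd.isna(self._ndarray)
            data = self._ndarray[~mask] if skipna else self._ndarray
            # Missing values remain (skipna=False) or nothing is left: NA out
            if len(data) == 0 or pd.isna(data).any():
                return pd.NA
            return getattr(data, name)()
        raise TypeError(f"Cannot perform reduction '{name}' with string dtype")

arr = TinyStringArray(["b", None, "a"])
print(arr._reduce("min"))  # "a"
```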

Member (Author):

I think the reason why the NA-handling wasn't working was due to an apparently long-standing bug in nanops.nanminmax which I think we can fix here: #18588. Basically we are filling NA with infinite values when taking the min or max, but this doesn't make sense for object dtypes and an error gets raised even if skipna is True.

If we fix that by explicitly masking the missing values instead, I believe we can just use this function directly in StringArray methods.
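The masking fix described above could look roughly like this; `masked_min` is a hypothetical helper, not pandas internals, and it only handles the 1-D case:

```python
import numpy as np
import pandas as pd

def masked_min(values, skipna=True):
    """Hypothetical sketch: reduce over a 1-D object array by masking NA
    out first, instead of filling NA with +/-inf (which fails for strings)."""
    mask = pd.isna(values)
    if skipna:
        values = values[~mask]
    elif mask.any():
        return np.nan          # NA present and not skipped
    if len(values) == 0:
        return np.nan          # nothing left after masking
    return values.min()

result = masked_min(np.array(["b", np.nan, "a"], dtype=object))  # "a"
```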

pandas/core/arrays/string_.py (resolved thread)
pandas/tests/arrays/string_/test_string.py (resolved thread)
-none_in_first_column_result = getattr(df[["A", "B"]], method)()
-none_in_second_column_result = getattr(df[["B", "A"]], method)()
+none_in_first_column_result = getattr(df[["A", "B"]], method)().sort_index()
+none_in_second_column_result = getattr(df[["B", "A"]], method)().sort_index()
Member (Author):

Previously the column with the missing value was getting dropped from the result so it only had a single row and the order didn't matter

@jorisvandenbossche (Member) left a comment:

Thanks for the update!


def max(self, axis=None, out=None, keepdims=False, skipna=True):
nv.validate_max((), dict(out=out, keepdims=keepdims))
result = nanops.nanmax(self._ndarray, axis=axis, skipna=skipna)
Member:

There should be no need to explicitly pass through the axis keyword, I think

pandas/core/arrays/string_.py (resolved thread)
elif is_object_dtype(dtype) and values.ndim == 1 and na_mask.any():
# Need to explicitly mask NA values for object dtypes
if skipna:
result = getattr(values[~na_mask], meth)(axis)
Member:

This masking could also be done in the min/max functions? (as you had before?)

Or, another option might be to add a min/max function to mask_ops.py, similarly as I am doing for sum in #30982 (but it should be simpler for min/max, as those don't need to handle the min_count)

Member (Author):

I think a benefit of having it here is that this also fixes a bug for Series: pd.Series(["a", np.nan]).min() currently raises even though it shouldn't

Member:

Ah, that's a good point. Can you add a test for that, then?

Now, that aside, I think longer term we still want the separate min/max in mask_ops.py, so it can also be used for the int dtypes. But that can then certainly be done for a separate PR.

"min",
"max",
]:
pytest.skip("These reductions are implemented")
Member:

Can you see if you can update this in test_string.py instead? It might be that we now need to subclass the ReduceTests instead of the NoReduceTests.
(ideally the base tests remain dtype agnostic)

Member (Author):

By updating in test_string.py do you mean adding tests using the fixtures data and all_numeric_reductions, only checking for the "correct" output (and skipping over those reductions that aren't yet implemented)?

Member:

Hmm, actually looking at the base reduction tests now: they are not really written in a way that will pass for strings.

But you can copy this test to tests/extension/test_strings.py (and so override the base one), and then do the string-array-specific adaptation there. It gives some duplication of the test code, but it's not long, and it gives a clearer separation of concerns (the changes for string array stay in test_string)

Member (Author):

Ok, so we can remove the special cases for StringArray in BaseNoReduceTests without getting test failures, as long as they're handled in TestNoReduce in test_string.py? I'm not too familiar with how these particular tests actually get executed during CI

pandas/tests/frame/test_apply.py (resolved thread)
doc/source/whatsnew/v1.0.2.rst (resolved thread)
@@ -28,6 +28,11 @@ Fixed regressions
Bug fixes
~~~~~~~~~

**ExtensionArray**

- Fixed issue where taking the minimum or maximum of a ``StringArray`` or ``Series`` with ``StringDtype`` type would raise. (:issue:`31746`)
Contributor:

say .min() or .max()

@@ -854,6 +854,8 @@ def reduction(
mask: Optional[np.ndarray] = None,
) -> Dtype:

na_mask = isna(values)
Contributor:

you should already have the mask (pass it in when you call this).

@@ -864,6 +866,12 @@ def reduction(
result.fill(np.nan)
except (AttributeError, TypeError, ValueError):
result = np.nan
elif is_object_dtype(dtype) and values.ndim == 1 and na_mask.any():
Contributor:

do you have a test case that fails on non ndim==1?

Member (Author):

Yes, was getting a couple test failures otherwise, I think for reductions when the entire DataFrame has object dtype (I can't recall which tests exactly). I figured the subsetting values[~mask] is only going to make sense if values has one dimension.

"min",
"max",
]:
pytest.skip("These reductions are implemented")
Member:

Can you add here a comment saying that those are tested in tests/arrays/test_string.py ?

dsaxton and others added 3 commits February 15, 2020 07:46
Co-Authored-By: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jreback (Contributor) left a comment:

more comments

doc/source/whatsnew/v1.0.2.rst (resolved thread)
pandas/core/arrays/string_.py (resolved thread)
@@ -228,7 +228,9 @@ def _maybe_get_mask(
         # Boolean data cannot contain nulls, so signal via mask being None
         return None

-    if skipna:
+    if skipna or is_object_dtype(values.dtype):
+        # The masking in nanminmax does not work for object dtype, so always
Contributor:

Rather than do this, what exactly is the issue? 'does not work' is not very descriptive, and generally we don't put comments like this; we simply fix it

Member (Author):

So what nanops.nanminmax appears to do when taking the min or max in the presence of missing values is to fill them with an appropriate infinite number that has the effect of ignoring those missing values (if we're taking the minimum, replace with positive infinity; if we're taking the max, replace with negative infinity). The problem is that this makes no sense for strings (there is, as far as I know, no "infinite string"), and that's why we get the error about comparing floats (infinity) and strings. The easiest workaround seems to be to mask them out instead.

To make things more complicated, the _get_values function in nanminmax apparently doesn't bother to calculate a missing value mask when skipna is False, because it relies on the trick above working. Since it won't work here, I'm making sure that we always get a mask for object dtypes.
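The failure mode described above can be reproduced in a few lines; this is an illustrative sketch of the fill-with-infinity trick, not the actual nanops code:

```python
import numpy as np

values = np.array(["b", np.nan, "a"], dtype=object)
mask = np.array([False, True, False])

# The nanops-style fill: replace NaN with +inf before taking the min.
filled = values.copy()
filled[mask] = np.inf
try:
    filled.min()            # compares "b" < inf: str vs float raises
    fill_trick_raises = False
except TypeError:
    fill_trick_raises = True

# Masking the missing values out instead works fine.
masked_result = values[~mask].min()   # "a"
```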

@jreback (Contributor), Feb 16, 2020:

Then let's actually fix this properly.

This is going to need a separate branch for object dtypes, and perf tests.

Member:

Similarly to #30982, I would personally rather have a separate implementation using masks instead of the filling logic of the nanops (sharing helper functions where appropriate), rather than trying to fit it into the existing nanops code (which gives those special cases / complex if checks like the one below)

@dsaxton (Member, Author), Mar 3, 2020:

@jorisvandenbossche Fair point for the string dtype, although I think some kind of logic like this would be necessary to fix min / max for object strings.

Edit: Actually looks like maybe you're already addressing this in your PR.

Member:

> some kind of logic like this would be necessary to fix min / max for object strings.

Yep, indeed; that's the argument you mentioned before for doing it in nanops as well, so the non-extension-array object dtype would benefit too. For this, we could also add a check for object dtype, and then calculate the mask and use the masked op implementation.

Member (Author):

Do you think it might also be worth trying to refactor nanminmax in nanops to use masking in general instead of the current filling approach (from what I could tell this was only really needed for arrays with more than one dimension)?

Contributor:

yes

and mask is not None
and mask.any()
):
# Need to explicitly mask NA values for object dtypes
Contributor:

why?

@@ -865,6 +867,17 @@ def reduction(
result.fill(np.nan)
except (AttributeError, TypeError, ValueError):
result = np.nan
elif (
is_object_dtype(dtype)
Contributor:

Why do you need all of these conditions? This is complicated.

Member (Author):

A few reasons:

- Only looking at objects is for the reason above.
- values.ndim == 1 is because I think we can get here even if values is not vector-valued, which is more or less what we're assuming when we just mask them out (without this we get test failures when we have an all-object DataFrame).
- mask.any() is because this function already works if no values are missing, so there's no reason to do masking.
- mask is not None is to please mypy (we've already guaranteed that mask isn't None for objects above, but mypy doesn't know this).

I can try to be a bit more explicit in the comments if that would be helpful.

@jreback jreback removed this from the 1.0.2 milestone Feb 16, 2020
@jreback (Contributor) commented Feb 16, 2020

Moving this off 1.0.2 as it has raised some non-trivial issues to solve. If you want a bug fix out of this, OK, but then this needs to decouple the issues.

@@ -59,6 +59,10 @@ Previously indexing with a nullable Boolean array containing ``NA`` would raise
Bug fixes
~~~~~~~~~

**ExtensionArray**
Contributor:

This is likely too invasive for 1.0.2; move to 1.1.

@dsaxton (Member, Author) commented Mar 5, 2020

Going to close this for now since it looks to be quite a bit larger than I originally thought (seems like fixing nanmin / nanmax might be the right approach).

I'm not sure how best to implement this, but another idea might be to fill with an arbitrary non-NaN value from each array along the given axis before taking the min or max; then I think it should work for any dtype (though it might be a little slower).
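The fill-with-an-existing-value idea could look roughly like this; `fill_then_reduce` is a hypothetical sketch for the 1-D case only (the real nanops code would also need to handle the axis argument):

```python
import numpy as np
import pandas as pd

def fill_then_reduce(values, how="min"):
    """Replace each NA with an arbitrary real element from the same array,
    so no mixed-type comparison happens and min/max are unaffected."""
    mask = pd.isna(values)
    if mask.all():
        return pd.NA           # nothing to reduce over
    fill_value = values[~mask][0]   # any non-missing element works
    filled = np.where(mask, fill_value, values)
    return getattr(filled, how)()

result = fill_then_reduce(np.array(["b", np.nan, "a"], dtype=object))  # "a"
```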
