Implement some reductions for string Series #31757

Status: Closed. Wants to merge 30 commits; showing changes from 5 commits.
4 changes: 3 additions & 1 deletion doc/source/whatsnew/v1.0.2.rst
@@ -30,7 +30,7 @@ Bug fixes

**ExtensionArray**
> **jreback (Contributor):** this is likely too invasive for 1.0.2, move to 1.1


- - Fixed issue where taking the minimum or maximum of a ``StringArray`` or ``Series`` with ``StringDtype`` type would raise. (:issue:`31746`)
+ - Fixed issue where taking :meth:`min` or :meth:`max` of a ``StringArray`` or ``Series`` with ``StringDtype`` type would raise. (:issue:`31746`)
-

**Categorical**
@@ -42,7 +42,9 @@ Bug fixes
- Using ``pd.NA`` with :meth:`DataFrame.to_json` now correctly outputs a null value instead of an empty object (:issue:`31615`)
- Fixed bug in parquet roundtrip with nullable unsigned integer dtypes (:issue:`31896`).

**Reduction**

- Fixed bug where :meth:`Series.min` and :meth:`Series.max` would raise a ``TypeError`` in the presence of ``NaN`` for object dtype. (:issue:`18588`)

**Experimental dtypes**

4 changes: 2 additions & 2 deletions pandas/core/arrays/string_.py
@@ -282,12 +282,12 @@ def _reduce(self, name, skipna=True, **kwargs):

def min(self, axis=None, out=None, keepdims=False, skipna=True):
nv.validate_min((), dict(out=out, keepdims=keepdims))
-        result = nanops.nanmin(self._ndarray, axis=axis, skipna=skipna)
+        result = nanops.nanmin(self._ndarray, axis=axis, skipna=skipna, mask=isna(self))
return libmissing.NA if isna(result) else result

def max(self, axis=None, out=None, keepdims=False, skipna=True):
        nv.validate_max((), dict(out=out, keepdims=keepdims))
-        result = nanops.nanmax(self._ndarray, axis=axis, skipna=skipna)
+        result = nanops.nanmax(self._ndarray, axis=axis, skipna=skipna, mask=isna(self))
return libmissing.NA if isna(result) else result

def value_counts(self, dropna=False):
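For context, a quick usage sketch of the behavior the `string_.py` change above enables. This assumes a pandas version in which min/max reductions for `StringDtype` are implemented (this PR targeted that; the feature shipped in later pandas releases):

```python
import pandas as pd

# Series backed by StringArray; pd.NA is the missing-value marker.
s = pd.Series(["apple", pd.NA, "banana"], dtype="string")

# With skipna=True (the default) the missing value is ignored.
print(s.min())  # "apple"
print(s.max())  # "banana"

# With skipna=False the reduction propagates the missing value as pd.NA.
print(s.min(skipna=False))  # <NA>
```

Before this change, the same calls raised because `nanops` tried to compare strings against infinity, as discussed in the thread below.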
15 changes: 10 additions & 5 deletions pandas/core/nanops.py
@@ -228,7 +228,9 @@ def _maybe_get_mask(
# Boolean data cannot contain nulls, so signal via mask being None
return None

-    if skipna:
+    if skipna or is_object_dtype(values.dtype):
+        # The masking in nanminmax does not work for object dtype, so always
> **jreback (Contributor):** rather than do this, what exactly is the issue? "does not work" is not very descriptive, and generally we don't put comments like this; we simply fix it.

> **dsaxton (Member, Author):** What `nanops.nanminmax` does when taking the min or max in the presence of missing values is fill them with an appropriate infinity that has the effect of ignoring them (positive infinity when taking the minimum, negative infinity when taking the maximum). The problem is that this makes no sense for strings (as far as I know there is no "infinite string"), and that's why we get the error about comparing floats (infinity) and strings. The easiest workaround seems to be to mask the missing values out instead.
>
> To make things more complicated, the `_get_values` function in `nanminmax` doesn't bother to calculate a missing-value mask when `skipna` is False, because it relies on the fill trick working. Since it won't here, I'm making sure that we always get a mask for object dtypes.
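The failure mode described above can be reproduced directly. This is a hypothetical illustration with plain NumPy, not the actual pandas code: it contrasts the fill-with-infinity trick against the mask-and-drop workaround this PR uses.

```python
import numpy as np

# An object array of strings with one missing value, plus its NA mask.
values = np.array(["a", np.nan, "b"], dtype=object)
mask = np.array([False, True, False])

# Fill approach: replace missing entries with +inf so min() would skip them.
# This works for numeric dtypes but raises for strings, because the reduction
# ends up comparing float("inf") against str.
filled = np.where(mask, np.inf, values)
try:
    filled.min()
except TypeError:
    print("fill approach raises TypeError for object-dtype strings")

# Mask approach (the workaround in this PR): drop the missing entries first.
print(values[~mask].min())  # "a"
```

The mask approach costs an extra boolean indexing pass but never mixes types, which is why the diff below forces a mask to be computed for object dtype even when `skipna` is False.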

> **jreback (Contributor, Feb 16, 2020):** Let's actually fix this properly. This is going to need a branch for object dtypes, and perf tests.

> **jorisvandenbossche (Member):** Similarly to #30982, I would personally rather have a separate implementation using masks instead of the filling logic of the nanops (sharing helper functions where appropriate), rather than trying to fit it into the existing nanops code (which gives those special cases / complex if-checks like the one below).

> **dsaxton (Member, Author, Mar 3, 2020):** @jorisvandenbossche Fair point for the string dtype, although I think some kind of logic like this would be necessary to fix min / max for object strings.
>
> Edit: Actually, it looks like you're already addressing this in your PR.

> **jorisvandenbossche (Member):**
> > some kind of logic like this would be necessary to fix min / max for object strings.
>
> Yep, indeed, that's the argument you mentioned before as well for doing it in nanops, so the non-extension-array object dtype would benefit too. For this, we could also add a check for object dtype, then calculate the mask and use the masked-op implementation.

> **dsaxton (Member, Author):** Do you think it might also be worth refactoring `nanminmax` in nanops to use masking in general, instead of the current filling approach (from what I could tell, the filling was only really needed for arrays with more than one dimension)?

> **jreback (Contributor):** yes
+        # compute a mask for use later when skipna is False
        mask = isna(values)

    return mask
@@ -854,8 +856,6 @@ def reduction(
mask: Optional[np.ndarray] = None,
) -> Dtype:

-    na_mask = isna(values)

values, mask, dtype, dtype_max, fill_value = _get_values(
values, skipna, fill_value_typ=fill_value_typ, mask=mask
)
Expand All @@ -866,10 +866,15 @@ def reduction(
result.fill(np.nan)
except (AttributeError, TypeError, ValueError):
result = np.nan
-    elif is_object_dtype(dtype) and values.ndim == 1 and na_mask.any():
+    elif (
+        is_object_dtype(dtype)
> **jreback (Contributor):** why do you need all of these conditions? this is complicated

> **dsaxton (Member, Author):** Only looking at objects is for the reason above; `values.ndim == 1` is because I think we can get here even when `values` is not vector-valued, which is more or less what we're assuming when we just mask values out (without this we can get test failures on an all-object DataFrame); `mask.any()` is because this function already works when no values are missing, so there's no reason to mask; and `mask is not None` is to please mypy (we've already guaranteed that the mask isn't None for objects above, but mypy doesn't know this).
>
> I can try to be a bit more explicit in the comments if that would be helpful.
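The four conditions justified above can be sketched in isolation. This is a hypothetical, simplified helper, not the actual pandas code (the real logic lives inside `nanops.nanminmax` and handles many more cases); each comment records the reason dsaxton gives for that condition.

```python
import numpy as np

def masked_object_min(values, mask, skipna=True):
    """Simplified sketch of the object-dtype branch in the diff below."""
    if (
        values.dtype == object   # the inf-fill trick only breaks for object dtype
        and values.ndim == 1     # masking below assumes a flat vector of values
        and mask is not None     # guaranteed for objects above, but mypy needs it
        and mask.any()           # with no missing values the plain path works
    ):
        if skipna:
            return values[~mask].min()  # drop missing entries, then reduce
        return np.nan                   # any missing value poisons the result
    return values.min()                 # fall back to the ordinary reduction

vals = np.array(["x", np.nan, "y"], dtype=object)
msk = np.array([False, True, False])
print(masked_object_min(vals, msk))                # "x"
print(masked_object_min(vals, msk, skipna=False))  # nan
```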

+        and values.ndim == 1
+        and mask is not None
+        and mask.any()
+    ):
+        # Need to explicitly mask NA values for object dtypes

> **jreback (Contributor):** why?
        if skipna:
-            result = getattr(values[~na_mask], meth)(axis)
+            result = getattr(values[~mask], meth)(axis)
        else:
            result = np.nan
else:
6 changes: 0 additions & 6 deletions pandas/tests/extension/base/reduce.py
@@ -25,12 +25,6 @@ class BaseNoReduceTests(BaseReduceTests):

@pytest.mark.parametrize("skipna", [True, False])
def test_reduce_series_numeric(self, data, all_numeric_reductions, skipna):
-        if isinstance(data, pd.arrays.StringArray) and all_numeric_reductions in [
-            "min",
-            "max",
-        ]:
-            pytest.skip("These reductions are implemented")

op_name = all_numeric_reductions
s = pd.Series(data)

14 changes: 13 additions & 1 deletion pandas/tests/extension/test_string.py
@@ -77,7 +77,19 @@ class TestMissing(base.BaseMissingTests):


class TestNoReduce(base.BaseNoReduceTests):
-    pass
+    @pytest.mark.parametrize("skipna", [True, False])
+    def test_reduce_series_numeric(self, data, all_numeric_reductions, skipna):
+        if isinstance(data, pd.arrays.StringArray) and all_numeric_reductions in [
+            "min",
+            "max",
+        ]:
+            pytest.skip("These reductions are implemented")
> **jorisvandenbossche (Member):** Can you add a comment here saying that those are tested in tests/arrays/test_string.py?


+        op_name = all_numeric_reductions
+        s = pd.Series(data)
+
+        with pytest.raises(TypeError):
+            getattr(s, op_name)(skipna=skipna)


class TestMethods(base.BaseMethodsTests):