Implement some reductions for string Series #31757

Closed · wants to merge 30 commits
8 changes: 8 additions & 0 deletions doc/source/whatsnew/v1.0.2.rst
@@ -59,6 +59,10 @@ Previously indexing with a nullable Boolean array containing ``NA`` would raise
Bug fixes
~~~~~~~~~

**ExtensionArray**
Contributor:

this is likely too invasive for 1.0.2; move to 1.1


- Fixed issue where taking ``min`` or ``max`` of a ``StringArray`` or ``Series`` with ``StringDtype`` type would raise. (:issue:`31746`)

**Datetimelike**

- Bug in :meth:`DataFrame.reindex` and :meth:`Series.reindex` when reindexing with a tz-aware index (:issue:`26683`)
@@ -74,6 +78,10 @@ Bug fixes
- Using ``pd.NA`` with :meth:`DataFrame.to_json` now correctly outputs a null value instead of an empty object (:issue:`31615`)
- Fixed bug in parquet roundtrip with nullable unsigned integer dtypes (:issue:`31896`).

**Reduction**

- Fixed bug where :meth:`Series.min` and :meth:`Series.max` would raise a ``TypeError`` in the presence of ``NaN`` for object dtype. (:issue:`18588`)

**Experimental dtypes**

- Fix bug in :meth:`DataFrame.convert_dtypes` for columns that were already using the ``"string"`` dtype (:issue:`31731`).
16 changes: 15 additions & 1 deletion pandas/core/arrays/string_.py
@@ -4,6 +4,7 @@
import numpy as np

from pandas._libs import lib, missing as libmissing
from pandas.compat.numpy import function as nv

from pandas.core.dtypes.base import ExtensionDtype
from pandas.core.dtypes.common import pandas_dtype
@@ -12,7 +13,7 @@
from pandas.core.dtypes.inference import is_array_like

from pandas import compat
from pandas.core import ops
from pandas.core import nanops, ops
from pandas.core.arrays import PandasArray
from pandas.core.construction import extract_array
from pandas.core.indexers import check_array_indexer
@@ -274,8 +275,21 @@ def astype(self, dtype, copy=True):
        return super().astype(dtype, copy)

    def _reduce(self, name, skipna=True, **kwargs):
        if name in ["min", "max"]:
            return getattr(self, name)(skipna=skipna, **kwargs)

        raise TypeError(f"Cannot perform reduction '{name}' with string dtype")

    def min(self, axis=None, out=None, keepdims=False, skipna=True):
        nv.validate_min((), dict(out=out, keepdims=keepdims))
        result = nanops.nanmin(self._ndarray, axis=axis, skipna=skipna, mask=isna(self))
        return libmissing.NA if isna(result) else result

    def max(self, axis=None, out=None, keepdims=False, skipna=True):
        nv.validate_max((), dict(out=out, keepdims=keepdims))
        result = nanops.nanmax(self._ndarray, axis=axis, skipna=skipna, mask=isna(self))
        return libmissing.NA if isna(result) else result

    def value_counts(self, dropna=False):
        from pandas import value_counts

15 changes: 14 additions & 1 deletion pandas/core/nanops.py
@@ -228,7 +228,9 @@ def _maybe_get_mask(
            # Boolean data cannot contain nulls, so signal via mask being None
            return None

        if skipna:
        if skipna or is_object_dtype(values.dtype):
            # The masking in nanminmax does not work for object dtype, so always
Contributor:

rather than do this, what exactly is the issue? "does not work" is not very descriptive, and generally we don't put comments like this; we simply fix it.

Member Author:

What nanops.nanminmax appears to do when taking the min or max in the presence of missing values is to fill them with an appropriate infinite value so that they are effectively ignored (positive infinity when taking the minimum, negative infinity when taking the maximum). The problem is that this makes no sense for strings (there is, as far as I know, no "infinite string"), which is why we get the error about comparing floats (infinity) and strings. The easiest workaround seems to be to mask the missing values out instead.

To make things more complicated, the _get_values function used by nanminmax doesn't bother to calculate a missing-value mask when skipna is False, because it relies on the filling trick above. Since that won't work here, I'm making sure we always get a mask for object dtypes.
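A small self-contained sketch of the two strategies being discussed (illustrative only, not pandas internals; the array and mask below are made up for the example):

```python
import numpy as np

values = np.array(["a", "b", np.nan], dtype=object)
mask = np.array([False, False, True])

# Fill-based approach: replace missing values with +inf so min() ignores them.
# This works for numeric data but raises for strings, because a float (infinity)
# cannot be compared with str.
try:
    np.where(mask, np.inf, values).min()
except TypeError as err:
    print("fill approach fails:", err)

# Mask-based approach: drop the missing positions before reducing.
print("masked approach:", values[~mask].min())  # -> "a"
```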

@jreback (Contributor), Feb 16, 2020:

Let's actually fix this properly.

This is going to need a branch for object dtypes, and perf tests.

Member:

Similarly to #30982, I would personally rather have a separate implementation using masks instead of the filling logic of nanops (sharing helper functions where appropriate), rather than trying to fit this into the existing nanops code (which leads to special cases / complex if-checks like the one below).

@dsaxton (Member Author), Mar 3, 2020:

@jorisvandenbossche Fair point for the string dtype, although I think some logic like this would still be necessary to fix min / max for object-dtype strings.

Edit: Actually it looks like you're already addressing this in your PR.

Member:

> some kind of logic like this would be necessary to fix min / max for object strings.

Yep, indeed; that's also the argument you mentioned before for doing it in nanops, so that plain (non-extension-array) object dtype would benefit as well. For this, we could add a check for object dtype, then calculate the mask and use the masked-op implementation.
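A rough sketch of that idea, under the assumption that it would live alongside the nanops helpers (the name masked_minmax is hypothetical, not part of this PR):

```python
import numpy as np
from pandas import isna
from pandas.api.types import is_object_dtype


def masked_minmax(values: np.ndarray, meth: str = "min", skipna: bool = True):
    # Hypothetical helper: branch on object dtype, compute the NA mask, and
    # reduce over the unmasked values instead of filling with +/- infinity.
    if is_object_dtype(values.dtype):
        mask = isna(values)
        if mask.any():
            if not skipna:
                return np.nan
            values = values[~mask]
    return getattr(values, meth)()


masked_minmax(np.array(["a", np.nan, "b"], dtype=object), "max")  # -> "b"
```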

Member Author:

Do you think it might also be worth refactoring nanminmax in nanops to use masking in general instead of the current filling approach (from what I can tell, the filling was only really needed for arrays with more than one dimension)?

Contributor:

yes

            # compute a mask for use later when skipna is False
            mask = isna(values)

    return mask
@@ -865,6 +867,17 @@ def reduction(
                result.fill(np.nan)
            except (AttributeError, TypeError, ValueError):
                result = np.nan
        elif (
            is_object_dtype(dtype)
Contributor:

why do you need all of these conditions? this is complicated

Member Author:

- Only looking at objects is for the reason above.
- values.ndim == 1 is because I think we can get here even when values is not vector-valued, which is more or less what we're assuming when we just mask them out (without this we can get test failures for an all-object DataFrame).
- mask.any() is because this function already works when no values are missing, so there's no reason to do masking.
- mask is not None is to please mypy (we've already guaranteed that mask isn't None for objects above, but mypy doesn't know this).

I can try to be a bit more explicit in the comments if that would be helpful.

            and values.ndim == 1
            and mask is not None
            and mask.any()
        ):
            # Need to explicitly mask NA values for object dtypes
Contributor:

why?

            if skipna:
                result = getattr(values[~mask], meth)(axis)
            else:
                result = np.nan
        else:
            result = getattr(values, meth)(axis)

28 changes: 28 additions & 0 deletions pandas/tests/arrays/string_/test_string.py
@@ -269,3 +269,31 @@ def test_value_counts_na():
    result = arr.value_counts(dropna=True)
    expected = pd.Series([2, 1], index=["a", "b"], dtype="Int64")
    tm.assert_series_equal(result, expected)


@pytest.mark.parametrize("func", ["min", "max"])
@pytest.mark.parametrize("skipna", [True, False])
def test_reduction(func, skipna):
    s = pd.Series(["x", "y", "z"], dtype="string")
    result = getattr(s, func)(skipna=skipna)
    expected = "x" if func == "min" else "z"

    assert result == expected


@pytest.mark.parametrize("func", ["min", "max"])
@pytest.mark.parametrize("skipna", [True, False])
@pytest.mark.parametrize("box_in_series", [True, False])
def test_reduction_with_na(func, skipna, box_in_series):
    data = pd.array([pd.NA, "y", "z"], dtype="string")

    if box_in_series:
        data = pd.Series(data)

    result = getattr(data, func)(skipna=skipna)

    if skipna:
        expected = "y" if func == "min" else "z"
        assert result == expected
    else:
        assert result is pd.NA
15 changes: 14 additions & 1 deletion pandas/tests/extension/test_string.py
@@ -77,7 +77,20 @@ class TestMissing(base.BaseMissingTests):


class TestNoReduce(base.BaseNoReduceTests):
    pass
    @pytest.mark.parametrize("skipna", [True, False])
    def test_reduce_series_numeric(self, data, all_numeric_reductions, skipna):
        if isinstance(data, pd.arrays.StringArray) and all_numeric_reductions in [
            "min",
            "max",
        ]:
            # Tested in pandas/tests/arrays/string_/test_string.py
            pytest.skip("These reductions are implemented")
Member:

Can you add a comment here saying that those are tested in tests/arrays/test_string.py?


        op_name = all_numeric_reductions
        s = pd.Series(data)

        with pytest.raises(TypeError):
            getattr(s, op_name)(skipna=skipna)


class TestMethods(base.BaseMethodsTests):
4 changes: 2 additions & 2 deletions pandas/tests/frame/test_apply.py
@@ -1406,8 +1406,8 @@ def test_apply_datetime_tz_issue(self):
@pytest.mark.parametrize("method", ["min", "max", "sum"])
def test_consistency_of_aggregates_of_columns_with_missing_values(self, df, method):
# GH 16832
none_in_first_column_result = getattr(df[["A", "B"]], method)()
none_in_second_column_result = getattr(df[["B", "A"]], method)()
none_in_first_column_result = getattr(df[["A", "B"]], method)().sort_index()
none_in_second_column_result = getattr(df[["B", "A"]], method)().sort_index()
Member Author:

Previously the column with the missing value was getting dropped from the result, so it only had a single row and the order didn't matter.
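A hypothetical illustration of why the .sort_index() calls are needed now (the real df fixture used by this test isn't shown here; the made-up frame below just shows the shape of the behaviour change):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": ["a", np.nan]})  # made-up frame

# Previously the object column "B" raised on min()/max() and was silently dropped,
# so both results had a single entry. With the fix, both columns survive, and the
# only difference between the two calls is the index order -- hence sort_index().
print(df[["A", "B"]].min())  # index order: A, B
print(df[["B", "A"]].min())  # index order: B, A
```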


        tm.assert_series_equal(
            none_in_first_column_result, none_in_second_column_result
13 changes: 13 additions & 0 deletions pandas/tests/reductions/test_reductions.py
@@ -172,6 +172,19 @@ def test_numpy_reduction_with_tz_aware_dtype(self, tz_aware_fixture, func):
        result = getattr(np, func)(expected, expected)
        tm.assert_series_equal(result, expected)

    @pytest.mark.parametrize("func", ["min", "max"])
    @pytest.mark.parametrize("skipna", [True, False])
    def test_reduce_object_with_na(self, func, skipna):
        # https://github.com/pandas-dev/pandas/issues/18588
        s = pd.Series(np.array(["a", "b", np.nan], dtype=object))
        result = getattr(s, func)(skipna=skipna)

        if skipna:
            expected = "a" if func == "min" else "b"
            assert result == expected
        else:
            assert result is np.nan


class TestIndexReductions:
    # Note: the name TestIndexReductions indicates these tests