-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement some reductions for string Series #31757
Changes from 21 commits
c87ce47
5365fa1
460030f
6497e16
c052feb
4312b4f
b67cc56
214995f
95c83fb
4082ae3
12a2323
4a61b81
c6465e5
5c1a5fb
ca99650
18800e1
3eddb73
0c1a2a3
5714c06
5e01c3f
0e4a4e2
ed01138
1128950
7c0f05d
d84f5bd
7db2052
e19bbe1
df99b4f
ba20705
d69b587
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -28,6 +28,11 @@ Fixed regressions | |
Bug fixes | ||
~~~~~~~~~ | ||
|
||
**ExtensionArray** | ||
|
||
- Fixed issue where taking the minimum or maximum of a ``StringArray`` or ``Series`` with ``StringDtype`` type would raise. (:issue:`31746`) | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. say |
||
- | ||
|
||
**Categorical** | ||
|
||
- Fixed bug where :meth:`Categorical.from_codes` improperly raised a ``ValueError`` when passed nullable integer codes. (:issue:`31779`) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,6 +4,7 @@ | |
import numpy as np | ||
|
||
from pandas._libs import lib, missing as libmissing | ||
from pandas.compat.numpy import function as nv | ||
|
||
from pandas.core.dtypes.base import ExtensionDtype | ||
from pandas.core.dtypes.common import pandas_dtype | ||
|
@@ -12,7 +13,7 @@ | |
from pandas.core.dtypes.inference import is_array_like | ||
|
||
from pandas import compat | ||
from pandas.core import ops | ||
from pandas.core import nanops, ops | ||
from pandas.core.arrays import PandasArray | ||
from pandas.core.construction import extract_array | ||
from pandas.core.indexers import check_array_indexer | ||
|
@@ -274,8 +275,21 @@ def astype(self, dtype, copy=True): | |
return super().astype(dtype, copy) | ||
|
||
def _reduce(self, name, skipna=True, **kwargs): | ||
if name in ["min", "max"]: | ||
return getattr(self, name)(skipna=skipna, **kwargs) | ||
|
||
raise TypeError(f"Cannot perform reduction '{name}' with string dtype") | ||
|
||
def min(self, axis=None, out=None, keepdims=False, skipna=True): | ||
nv.validate_min((), dict(out=out, keepdims=keepdims)) | ||
result = nanops.nanmin(self._ndarray, axis=axis, skipna=skipna) | ||
return libmissing.NA if isna(result) else result | ||
|
||
def max(self, axis=None, out=None, keepdims=False, skipna=True): | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
nv.validate_max((), dict(out=out, keepdims=keepdims)) | ||
result = nanops.nanmax(self._ndarray, axis=axis, skipna=skipna) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There should be no need to explicitly pass through the axis keyword, I think |
||
return libmissing.NA if isna(result) else result | ||
|
||
def value_counts(self, dropna=False): | ||
from pandas import value_counts | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -854,6 +854,8 @@ def reduction( | |
mask: Optional[np.ndarray] = None, | ||
) -> Dtype: | ||
|
||
na_mask = isna(values) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you should already have the mask (pass it in when you call this). |
||
|
||
values, mask, dtype, dtype_max, fill_value = _get_values( | ||
values, skipna, fill_value_typ=fill_value_typ, mask=mask | ||
) | ||
|
@@ -864,6 +866,12 @@ def reduction( | |
result.fill(np.nan) | ||
except (AttributeError, TypeError, ValueError): | ||
result = np.nan | ||
elif is_object_dtype(dtype) and values.ndim == 1 and na_mask.any(): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do you have a test case that fails on non ndim==1? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, was getting a couple test failures otherwise, I think for reductions when the entire |
||
# Need to explicitly mask NA values for object dtypes | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why? |
||
if skipna: | ||
result = getattr(values[~na_mask], meth)(axis) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This masking could also be done in the Or, another option might be to add a min/max function to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think a benefit of having it here is that this also fixes a bug for Series: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, that's a good point. Can you add a test for that, then? Now, that aside, I think longer term we still want the separate min/max in mask_ops.py, so it can also be used for the int dtypes. But that can then certainly be done for a separate PR. |
||
else: | ||
result = np.nan | ||
else: | ||
result = getattr(values, meth)(axis) | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,6 +25,12 @@ class BaseNoReduceTests(BaseReduceTests): | |
|
||
@pytest.mark.parametrize("skipna", [True, False]) | ||
def test_reduce_series_numeric(self, data, all_numeric_reductions, skipna): | ||
if isinstance(data, pd.arrays.StringArray) and all_numeric_reductions in [ | ||
"min", | ||
"max", | ||
]: | ||
pytest.skip("These reductions are implemented") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you see if you can rather update this in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. By updating in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, actually looking at the base reduction tests now: they are not really written in a way that they will pass for strings. But so you can copy this test to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, so we can remove the special cases for |
||
|
||
op_name = all_numeric_reductions | ||
s = pd.Series(data) | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1406,8 +1406,8 @@ def test_apply_datetime_tz_issue(self): | |
@pytest.mark.parametrize("method", ["min", "max", "sum"]) | ||
def test_consistency_of_aggregates_of_columns_with_missing_values(self, df, method): | ||
# GH 16832 | ||
none_in_first_column_result = getattr(df[["A", "B"]], method)() | ||
none_in_second_column_result = getattr(df[["B", "A"]], method)() | ||
none_in_first_column_result = getattr(df[["A", "B"]], method)().sort_index() | ||
none_in_second_column_result = getattr(df[["B", "A"]], method)().sort_index() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Previously the column with the missing value was getting dropped from the result so it only had a single row and the order didn't matter
jorisvandenbossche marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
tm.assert_series_equal( | ||
none_in_first_column_result, none_in_second_column_result | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is likely too invasive for 1.02, move to 1.1