-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement some reductions for string Series #31757
Changes from 26 commits
c87ce47
5365fa1
460030f
6497e16
c052feb
4312b4f
b67cc56
214995f
95c83fb
4082ae3
12a2323
4a61b81
c6465e5
5c1a5fb
ca99650
18800e1
3eddb73
0c1a2a3
5714c06
5e01c3f
0e4a4e2
ed01138
1128950
7c0f05d
d84f5bd
7db2052
e19bbe1
df99b4f
ba20705
d69b587
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -228,7 +228,9 @@ def _maybe_get_mask( | |
# Boolean data cannot contain nulls, so signal via mask being None | ||
return None | ||
|
||
if skipna: | ||
if skipna or is_object_dtype(values.dtype): | ||
# The masking in nanminmax does not work for object dtype, so always | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. rather than do this, what exactly s the issue? 'does not work' is not very descriptive and generally we don't put comments like this, we simply fix it There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So what nanops.nanminmax appears to do when taking the min or max in the presence of missing values is to fill them with an appropriate infinite number that has the effect of ignoring those missing values (if we're taking the minimum replace with positive infinity, if we're taking the max replace with negative infinity). The problem is that this makes no sense for strings (there is as far as I know no "infinite string"), and that's why we get the error about comparing floats (infinity) and strings. The easiest workaround seems to be to mask them out instead. To make things more complicated the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. len let's actualy fix this properly. this is going to need either a branch for object dtypes and perf tests. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Similarly as #30982, I would personally rather have a separate implementation for one using masks instead of the filling logic of the nanops (with sharing helper functions where appropriate), instead of trying to fit it into the existing nanops code (which gives those special cases / complex if checks like the one below) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jorisvandenbossche Fair point for the string dtype, although I think some kind of logic like this would be necessary to fix Edit: Actually looks like maybe you're already addressing this in your PR. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yep, indeed, that's what you mentioned as argument before as well for doing it in nanops, so the non-extension array object dtype would benefit as well. For this, we could also add a check for object dtype, and then calculate the mask and use the masked op implementation. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you think it might also be worth trying to refactor nanminmax in nanops to use masking in general instead of the current filling approach (from what I could tell this was only really needed for arrays with more than one dimension)? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes |
||
# compute a mask for use later when skipna is False | ||
mask = isna(values) | ||
|
||
return mask | ||
|
@@ -864,6 +866,17 @@ def reduction( | |
result.fill(np.nan) | ||
except (AttributeError, TypeError, ValueError): | ||
result = np.nan | ||
elif ( | ||
is_object_dtype(dtype) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why do you need all of these condtions? this is complicated There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Only looking at objects is for the reason above, I can try to be a bit more explicit in the comments if that would be helpful. |
||
and values.ndim == 1 | ||
and mask is not None | ||
and mask.any() | ||
): | ||
# Need to explicitly mask NA values for object dtypes | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why? |
||
if skipna: | ||
result = getattr(values[~mask], meth)(axis) | ||
else: | ||
result = np.nan | ||
else: | ||
result = getattr(values, meth)(axis) | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -77,7 +77,19 @@ class TestMissing(base.BaseMissingTests): | |
|
||
|
||
class TestNoReduce(base.BaseNoReduceTests): | ||
pass | ||
@pytest.mark.parametrize("skipna", [True, False]) | ||
def test_reduce_series_numeric(self, data, all_numeric_reductions, skipna): | ||
if isinstance(data, pd.arrays.StringArray) and all_numeric_reductions in [ | ||
"min", | ||
"max", | ||
]: | ||
pytest.skip("These reductions are implemented") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add here a comment saying that those are tested in tests/arrays/test_string.py ? |
||
|
||
op_name = all_numeric_reductions | ||
s = pd.Series(data) | ||
|
||
with pytest.raises(TypeError): | ||
getattr(s, op_name)(skipna=skipna) | ||
|
||
|
||
class TestMethods(base.BaseMethodsTests): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1406,8 +1406,8 @@ def test_apply_datetime_tz_issue(self): | |
@pytest.mark.parametrize("method", ["min", "max", "sum"]) | ||
def test_consistency_of_aggregates_of_columns_with_missing_values(self, df, method): | ||
# GH 16832 | ||
none_in_first_column_result = getattr(df[["A", "B"]], method)() | ||
none_in_second_column_result = getattr(df[["B", "A"]], method)() | ||
none_in_first_column_result = getattr(df[["A", "B"]], method)().sort_index() | ||
none_in_second_column_result = getattr(df[["B", "A"]], method)().sort_index() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Previously the column with the missing value was getting dropped from the result so it only had a single row and the order didn't matter
jorisvandenbossche marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
tm.assert_series_equal( | ||
none_in_first_column_result, none_in_second_column_result | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is likely too invasive for 1.02, move to 1.1