Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: inconsistent DataFrame.agg behavoir when passing as kwargs numeric_only=True #49534

Conversation

thlautenschlaeger
Copy link

@thlautenschlaeger thlautenschlaeger commented Nov 4, 2022

@thlautenschlaeger thlautenschlaeger changed the title add func list handling for numeric_only BUG: inconsistent DataFrame.agg behavoir when passing as kwargs numeric_only=True Nov 4, 2022
@mroeschke mroeschke added Apply Apply, Aggregate, Transform, Map Warnings Warnings that appear or should be added to pandas and removed Warnings Warnings that appear or should be added to pandas labels Nov 7, 2022
@thlautenschlaeger
Copy link
Author

@rhshadrach can you please have a look at the PR since v1.5.2 will be released.

@jreback
Copy link
Contributor

jreback commented Nov 11, 2022

this doesn't have any tests

@thlautenschlaeger
Copy link
Author

I wasn't sure if they are covered by default. Will add them. Thanks for your response.

@thlautenschlaeger
Copy link
Author

To check in advance, is the appropriate location to add the test in the script below?:

pandas/tests/apply/test_frame_apply.py

@rhshadrach
Copy link
Member

To check in advance, is the appropriate location to add the test in the script below?:

pandas/tests/apply/test_frame_apply.py

Yes - that's correct.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

For the overall the behavior here, I didn't realize commenting in the issue that this approach would be something of a hidden feature. That is, numeric_only isn't in the arguments of the function, but we're essentially implementing it as if it were.

I think this would be (at least more) okay if we were only exercising numeric_only when it came to string arguments. E.g. ['sum', lambda x, numeric_only: x.sum(), 'mean']` would do the computation on the numeric portion of the DataFrame for 'sum' and 'mean', but would attempt to pass every Series that makes up the frame to the lambda. This is admittedly not great - I think the only proper resolution is to use DataFrame methods in agg for list/dict like arguments, but that can't happen until 2.0.

cc @mroeschke for any thoughts.

pandas/core/apply.py Outdated Show resolved Hide resolved
@mroeschke
Copy link
Member

If it were better documented that passing the string alias of the DataFrame methods accepts the associated keyword arguments then I supposed it makes sense to support this, especially if the user passes multiple strings aliases into agg.

@rhshadrach
Copy link
Member

Thanks @mroeschke - @thlautenschlaeger can you add a description of whats happening here to the notes section of agg and apply.

@thlautenschlaeger
Copy link
Author

thlautenschlaeger commented Nov 16, 2022

Thanks @mroeschke - @thlautenschlaeger can you add a description of whats happening here to the notes section of agg and apply.

@rhshadrach Added the description to that section. Also added an example to the doc for agg. Hope everything is alright now.

... columns=['A', 'B', 'C'])

Works equivalently as above. Add argument `numeric_only=True` to avoid
exceptions or warnings.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The warning will go away soon (in 2.0) so I am not sure it's worth including here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you're just talking about the or warnings bit, is that right @mroeschke? Agreed - I think the more general to aggregate only numeric columns. would be sufficient - just like the apply docstring.

@@ -27,7 +27,7 @@ Fixed regressions
Bug fixes
~~~~~~~~~
- Bug in the Copy-on-Write implementation losing track of views in certain chained indexing cases (:issue:`48996`)
-
- Bug in inconsistent DataFrame.agg behavior when passing as kwargs ``numeric_only=True`` (:issue:`49352`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Bug in inconsistent DataFrame.agg behavior when passing as kwargs ``numeric_only=True`` (:issue:`49352`
- Bug in inconsistent DataFrame.agg behavior when passing as kwargs ``numeric_only=True`` with :class:`DataFrame` methods or their associated string alias as functions (:issue:`49352`)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't we should be mentioning DataFrame methods here (#49352 (comment) and #49352 (comment)). +1 on the "string alias" bit though.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhshadrach I'm assuming the other kwargs like skipna and min_periods have the same issue here in agg, but those don't need to be addressed in 1.5.x since those don't affect the FutureWarning?

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good - a few requests below. Also, when mentioning the numeric_only in the docs, can you state it applies only to string arguments. That it works for e.g. DataFrame.mean is part of the standard behavior of agg (to pass any keywords through to the provided function), so I don't think that needs to be particularly called out.

@@ -284,6 +284,14 @@ def transform_str_or_callable(self, func) -> DataFrame | Series:
except Exception:
return func(obj, *args, **kwargs)

def _filter_numeric_only(self) -> list[Any]:
if "numeric_only" in self.kwargs and self.kwargs["numeric_only"] is True:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is True here? I haven't checked all, but I believe DataFrame methods use "truthy" (i.e. and self.kwargs["numeric_only"]):

df = pd.DataFrame({'a': [1, 1, 2], 'b': list('xyz')})
print(df.sum(numeric_only="string"))
# a    4
# dtype: int64

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe there is any need for the is True here. This would give surprising results if the user passes e.g. np.bool_(True).

Also, can you use kwargs.get("numeric_only", False)

pandas/core/apply.py Outdated Show resolved Hide resolved
... columns=['A', 'B', 'C'])

Works equivalently as above. Add argument `numeric_only=True` to avoid
exceptions or warnings.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you're just talking about the or warnings bit, is that right @mroeschke? Agreed - I think the more general to aggregate only numeric columns. would be sufficient - just like the apply docstring.

pandas/tests/apply/test_frame_apply.py Outdated Show resolved Hide resolved
pandas/tests/apply/test_frame_apply.py Outdated Show resolved Hide resolved
@@ -27,7 +27,7 @@ Fixed regressions
Bug fixes
~~~~~~~~~
- Bug in the Copy-on-Write implementation losing track of views in certain chained indexing cases (:issue:`48996`)
-
- Bug in inconsistent DataFrame.agg behavior when passing as kwargs ``numeric_only=True`` (:issue:`49352`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't we should be mentioning DataFrame methods here (#49352 (comment) and #49352 (comment)). +1 on the "string alias" bit though.

pandas/core/frame.py Outdated Show resolved Hide resolved
@rhshadrach
Copy link
Member

@rhshadrach I'm assuming the other kwargs like skipna and min_periods have the same issue here in agg, but those don't need to be addressed in 1.5.x since those don't affect the FutureWarning?

Correct. While I do think things like skipna should work, they are not part of the regression due to the numeric_only changes.

pandas/core/apply.py Outdated Show resolved Hide resolved
@thlautenschlaeger
Copy link
Author

Thanks for your reviews @mroeschke @rhshadrach
I will go through your comments and integrate the change requests.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to only filter the DataFrame due to numeric_only on string arguments in the list-like. Otherwise, we will be creating an inconsistency between list and callable arguments. For example, with f a callable user-defined function

df.agg(f, numeric_only=True)

will pass numeric_only to the function f whereas

df.agg([f], numeric_only=True)

will filter the DataFrame prior to calling f.

pandas/core/apply.py Outdated Show resolved Hide resolved
pandas/core/apply.py Outdated Show resolved Hide resolved
pandas/core/frame.py Outdated Show resolved Hide resolved
pandas/core/frame.py Show resolved Hide resolved
pandas/core/shared_docs.py Show resolved Hide resolved
@@ -284,6 +284,14 @@ def transform_str_or_callable(self, func) -> DataFrame | Series:
except Exception:
return func(obj, *args, **kwargs)

def _filter_numeric_only(self) -> list[Any]:
if "numeric_only" in self.kwargs and self.kwargs["numeric_only"] is True:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe there is any need for the is True here. This would give surprising results if the user passes e.g. np.bool_(True).

Also, can you use kwargs.get("numeric_only", False)

... ['c', 3, 6]],
... columns=['A', 'B', 'C'])

Works equivalently as above. Add argument `numeric_only=True` to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "Works equivalently as above." mean here?

@@ -294,7 +299,7 @@ def agg_list_like(self) -> DataFrame | Series:
"""
from pandas.core.reshape.concat import concat

obj = self.obj
obj = self._maybe_filter_numeric_only()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will cause different behaviors for list and non-list arguments that contain a callable. E.g.

df.agg(lambda x: x.sum(), numeric_only=True))
df.agg([lambda x: x.sum()], numeric_only=True)

Currently in this PR, I believe the first line will pass numeric_only to the supplied UDF (and this fails as the UDF doesn't take a numeric_only argument). The second line will subset the DataFrame to only numeric columns and succeed. These should have the same behavior with regards to how the numeric_only argument is treated.

I suggest only using the DataFrame filtered to numeric_only when a given argument in the list is a string; otherwise to use the full DataFrame and pass the numeric_only kwarg to the supplied UDF.

@@ -858,6 +858,22 @@ def test_with_dictlike_columns_with_infer():
tm.assert_frame_equal(result, expected)


def test_with_dictlike_functions():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're no longer changing dictlike behavior, I think this can be removed. It should fail, but that is another (unrelated) issue.

@github-actions
Copy link
Contributor

github-actions bot commented Jan 1, 2023

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 1, 2023
@rhshadrach
Copy link
Member

@thlautenschlaeger - friendly ping. Are you interested in continuing?

@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Jan 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: inconsistent DataFrame.agg behavoir when passing as kwargs numeric_only=True
4 participants