Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: add numeric_only support to groupby agg #58132

Conversation

nicholas-ys-tan
Copy link

@nicholas-ys-tan nicholas-ys-tan commented Apr 3, 2024

Currently in early phase seeking prelim feedback.

@nicholas-ys-tan nicholas-ys-tan force-pushed the issue56946-add_numeric_only_support_to_groupby_agg branch from e11a6d2 to 181f593 Compare April 3, 2024 12:52
@nicholas-ys-tan
Copy link
Author

Currently still not operating on the right numbers on test case numeric_only True - 1_grouper_str - numpy_mean.

Need to add test cases with expected data.

@nicholas-ys-tan
Copy link
Author

nicholas-ys-tan commented Apr 4, 2024

Currently still not operating on the right numbers on test case numeric_only True - 1_grouper_str - numpy_mean.

Need to add test cases with expected data.

Failing may be an existing bug - documented here: #58146

Also mangled up the ResamplerApply - need to rethink strategy

@nicholas-ys-tan nicholas-ys-tan force-pushed the issue56946-add_numeric_only_support_to_groupby_agg branch from 181f593 to 6f39c4f Compare April 5, 2024 13:53
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! It looks like you are using **kwargs to pass numeric_only in the call to agg. These kwargs are meant for the user to pass through to the function and I do not think we should be allowing "built-in" arguments into them. Instead, numeric_only should be added to the signature of agg.

In addition, it appears that there are various paths through agg that do not respect numeric_only, e.g. df.groupby(...).agg("sum", numeric_only=True). because **kwargs is passed through, this was wrong!

@@ -219,6 +219,25 @@ def _obj_with_exclusions(self):
else:
return self.obj

@final
@cache_readonly
def _obj_numeric_only_with_exclusions(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the performance of this change needs to be closely studied to see if it is really worth the complexity (and extra memory usage). In any case, I do not think it should be introduced here, but rather in its own PR. Can you follow the existing pattern of subsetting the data directly in the agg methods instead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related to comment I put below:

#58132 (comment)

obj = self._obj_with_exclusions

mgr = self._get_data_to_aggregate(numeric_only=numeric_only)
data = self._wrap_agged_manager(mgr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any advantage (or disadvantage - for that matter) of calling this obj vs data. Because of this, I think we should leave it as obj (to avoid the code churn).

If you see a reason to prefer data over obj, let me know!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, the issue I was having on this point was because _obj_with_exclusions doesn't update when the dataframe is updated to remove non-numeric columns. I assume this is related to it being a @cache_readonly in pandas.core.base.py. My tentative workaround was to generate the data again.

On my local branch, I simplified this by creating another object:

@cache_readonly
def _obj_numeric_only_with_exclusions(self):

and in those segments:

if numeric_only:
    obj = self._obj_numeric_only_with_exclusions(self):
else:
    obj = self._obj_with_exclusions

Do you have any thoughts on this approach?

Comment on lines +1682 to +1684
" string_mean ",
" numpy_mean ",
" list_of_str_and_str ",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the spaces?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think it was ease for reading for myself when it was showing up on my tests, so it was easier to read

xxx - yyyy - zzzz

Will make sure to cleanup before taking off draft marker though.

}
)
if numeric_only or isinstance(aggfunc, dict):
df.groupby(by=groupers).agg(func=aggfunc, numeric_only=numeric_only)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to check that the result is as expected. See other tests above for examples.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, sorry, I was in the middle of this ticket when I encountered the bug around #39169 which was an issue because numeric_only being fed as a kwarg sends it down this bugged path, then raised #58132 hoping to address that. We've discussed over there that this silent fix isn't ideal and we will be deprecating.

I will look into submitting a fix with an explicit numeric_only argument, as that can change the code path management.


def compute_list_like(
self,
op_name: Literal["agg", "apply"],
selected_obj: Series | DataFrame,
kwargs: dict[str, Any],
**kwargs: dict[str, Any],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you revert this change. If you make it **kwargs, then users will run into an error if they try to pass through e.g. selected_obj as a keyword argument.

@nicholas-ys-tan
Copy link
Author

nicholas-ys-tan commented Apr 17, 2024

Thanks for the PR! It looks like you are using **kwargs to pass numeric_only in the call to agg. These kwargs are meant for the user to pass through to the function and I do not think we should be allowing "built-in" arguments into them. Instead, numeric_only should be added to the signature of agg.

In addition, it appears that there are various paths through agg that do not respect numeric_only, e.g. df.groupby(...).agg("sum", numeric_only=True). because **kwargs is passed through, this was wrong!

Thank you for this note, I had assumed that we did not want to explicitly take numeric_only as a kwarg based on this comment:
#49352 (comment)

If taking numeric_only as an arguement is the preferred method, this would probably greatly simplify things. One of my big struggles was the paths changing with and without kwargs but that can be more easily managed if we aren't feeding through kwargs.

Note that this was a half-cooked draft PR because I ran into #39169 bug, and had submitted PR #58132, but you mentioned we will be deprecating before doing a more complete fix.

@rhshadrach
Copy link
Member

Thank you for this note, I had assumed that we did not want to explicitly take numeric_only as a kwarg based on this comment:
#49352 (comment)

That was if we were to fix it in a patch version (1.5.2). For 3.0, I think we should prefer a new argument.

If taking numeric_only as an arguement is the preferred method, this would probably greatly simplify things.

It would also be more straightforward for users!

Note that this was a half-cooked draft PR because I ran into #39169 bug

Yes - I think I failed to notice it was in draft status when I first reviewed. My bad.

Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label May 20, 2024
@nicholas-ys-tan
Copy link
Author

Closing as much of this will be addressed with a different PR, a fundamental misunderstanding of whether numeric_only should be its own kwarg, and the changes here being breaking changes that require more performance assessment.

Thank you to @rhshadrach for your time and your feedback on this PR. Will keep the above in mind for any future contributions I make.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants