-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: add numeric_only support to groupby agg #58132
API: add numeric_only support to groupby agg #58132
Conversation
e11a6d2
to
181f593
Compare
Currently still not operating on the right numbers on test case Need to add test cases with expected data. |
Failing may be an existing bug - documented here: #58146 Also mangled up the ResamplerApply - need to rethink strategy |
181f593
to
6f39c4f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! It looks like you are using **kwargs
to pass numeric_only
in the call to agg
. These kwargs
are meant for the user to pass through to the function and I do not think we should be allowing "built-in" arguments into them. Instead, numeric_only
should be added to the signature of agg
.
In addition, it appears that there are various paths through because agg
that do not respect numeric_only
, e.g. df.groupby(...).agg("sum", numeric_only=True)
.**kwargs
is passed through, this was wrong!
@@ -219,6 +219,25 @@ def _obj_with_exclusions(self): | |||
else: | |||
return self.obj | |||
|
|||
@final | |||
@cache_readonly | |||
def _obj_numeric_only_with_exclusions(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the performance of this change needs to be closely studied to see if it is really worth the complexity (and extra memory usage). In any case, I do not think it should be introduced here, but rather in its own PR. Can you follow the existing pattern of subsetting the data directly in the agg
methods instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
related to comment I put below:
obj = self._obj_with_exclusions | ||
|
||
mgr = self._get_data_to_aggregate(numeric_only=numeric_only) | ||
data = self._wrap_agged_manager(mgr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any advantage (or disadvantage - for that matter) of calling this obj vs data. Because of this, I think we should leave it as obj (to avoid the code churn).
If you see a reason to prefer data over obj, let me know!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, the issue I was having on this point was because _obj_with_exclusions
doesn't update when the dataframe is updated to remove non-numeric columns. I assume this is related to it being a @cache_readonly
in pandas.core.base.py
. My tentative workaround was to generate the data again.
On my local branch, I simplified this by creating another object:
@cache_readonly
def _obj_numeric_only_with_exclusions(self):
and in those segments:
if numeric_only:
obj = self._obj_numeric_only_with_exclusions(self):
else:
obj = self._obj_with_exclusions
Do you have any thoughts on this approach?
" string_mean ", | ||
" numpy_mean ", | ||
" list_of_str_and_str ", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the spaces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think it was ease for reading for myself when it was showing up on my tests, so it was easier to read
xxx - yyyy - zzzz
Will make sure to cleanup before taking off draft marker though.
} | ||
) | ||
if numeric_only or isinstance(aggfunc, dict): | ||
df.groupby(by=groupers).agg(func=aggfunc, numeric_only=numeric_only) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to check that the result is as expected. See other tests above for examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, sorry, I was in the middle of this ticket when I encountered the bug around #39169 which was an issue because numeric_only
being fed as a kwarg
sends it down this bugged path, then raised #58132 hoping to address that. We've discussed over there that this silent fix isn't ideal and we will be deprecating.
I will look into submitting a fix with an explicit numeric_only
argument, as that can change the code path management.
|
||
def compute_list_like( | ||
self, | ||
op_name: Literal["agg", "apply"], | ||
selected_obj: Series | DataFrame, | ||
kwargs: dict[str, Any], | ||
**kwargs: dict[str, Any], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you revert this change. If you make it **kwargs
, then users will run into an error if they try to pass through e.g. selected_obj
as a keyword argument.
Thank you for this note, I had assumed that we did not want to explicitly take If taking Note that this was a half-cooked draft PR because I ran into #39169 bug, and had submitted PR #58132, but you mentioned we will be deprecating before doing a more complete fix. |
That was if we were to fix it in a patch version (1.5.2). For 3.0, I think we should prefer a new argument.
It would also be more straightforward for users!
Yes - I think I failed to notice it was in draft status when I first reviewed. My bad. |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
Closing as much of this will be addressed with a different PR, a fundamental misunderstanding of whether Thank you to @rhshadrach for your time and your feedback on this PR. Will keep the above in mind for any future contributions I make. |
Currently in early phase seeking prelim feedback.
numeric_only
in groupby ops #56946, relevant to API: GroupBy.agg() numeric_only deprecation with custom function #50538doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.