-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: fix aggregation when using udf with kwarg #58170
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1546,14 +1546,14 @@ def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs) | |
if self._grouper.nkeys > 1: | ||
# test_groupby_as_index_series_scalar gets here with 'not self.as_index' | ||
return self._python_agg_general(func, *args, **kwargs) | ||
elif args or kwargs: | ||
elif (args or kwargs) and (len(self._obj_with_exclusions.columns) == 1): | ||
# test_pass_args_kwargs gets here (with and without as_index) | ||
# can't return early | ||
result = self._aggregate_frame(func, *args, **kwargs) | ||
|
||
else: | ||
# try to treat as if we are passing a list | ||
gba = GroupByApply(self, [func], args=(), kwargs={}) | ||
gba = GroupByApply(self, [func], args=args, kwargs=kwargs) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The removal of args and kwargs at this point seems to have been a design decision made in 2014 in #6718 to address a bug when using UDF There have been considerable changes since and There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed that this is okay - but not just because all tests pass. It is not good behavior to silently ignore args/kwargs, and it seems to me users are reasonable to expect that these would be passed through. |
||
try: | ||
result = gba.agg() | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need this extra condition? With this modification, I am thinking we should hold off on merging this PR.
_aggregate_frame
passes the entire DataFrame rather than column-by-column that the rest of the code paths do. So this could cause breakage in user code.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Richard,
That was the intent, because the bug currently occurs when multiple columns to be aggregated are passed into
_aggregate_frame
, the aggregation operates on all columns and provides duplicate (and incorrect) values such as this (full example in #58146 - and is the output of the test case in this PR for the main branch):Instead of using that method when there are multiple columns, I thought passing it to the next code path that does column-by-column would address the issue.
That said, I don't have the skills and knowledge of the codebase the maintainers have to fully appreciate the extent of impact this change might have on the users' codebase, but I couldn't find examples where it would be desired behaviour to pass in the entire DataFrame to
_aggregate_frame
when there are multiple columns to aggregate (with the exception of passing innumpy
functions with kwargaxis=0
which manages it well).Just thought I might put this idea forward as a change that would address the issue of multiple columns getting aggregated together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see - thanks! Yes, we discussed this on the last dev meeting. I think we want to look to see if we can deprecate this behavior instead of changing it as a bug fix because it has the risk of quietly changing the value in the result without failing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, no worries! Just thought it might be an appropriate workaround
I was hoping to get a workaround in so I might be able to work on the
numeric_only
support as having a kwarg pushes the code down this pathway more frequently. Which then would allow me to address an issue in Geopandas. It sounds like it might be best to wait for the deprecation before working on the downstream items.I'l update my PR to only address the
GroupByApply
.Thanks for your time and feedback.