Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix aggregation when using udf with kwarg #58170

Conversation

nicholas-ys-tan
Copy link

@nicholas-ys-tan nicholas-ys-tan commented Apr 6, 2024

closes #39169

# test_pass_args_kwargs gets here (with and without as_index)
# can't return early
result = self._aggregate_frame(func, *args, **kwargs)

else:
# try to treat as if we are passing a list
gba = GroupByApply(self, [func], args=(), kwargs={})
gba = GroupByApply(self, [func], args=args, kwargs=kwargs)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removal of args and kwargs at this point seems to have been a design decision made in 2014 in #6718 to address a bug when using UDF

There have been considerable changes since and agg generally seems to pass on kwargs well now. When updating just this line, all tests passed so I assume this change would be okay.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that this is okay - but not just because all tests pass. It is not good behavior to silently ignore args/kwargs, and it seems to me users are reasonable to expect that these would be passed through.

@nicholas-ys-tan
Copy link
Author

#58146 (comment)

Noted comment that the changes to agg will be breaking changes and I would need a workaround for #58132. This is fairly important as adding a numeric_only kwarg will more often push the code down DataFrameGroupBy._aggregate_frame.

I was hoping to get your view on this PR as a temporary fix with minimalist code, it changes the path the code will take to use GroupByApply.agg as a list if there is more than one column to operate on - hence the aggregation done as loop of SeriesGroupBy if I understood the code correctly.

@nicholas-ys-tan nicholas-ys-tan marked this pull request as ready for review April 7, 2024 09:19
@mroeschke mroeschke added Groupby Apply Apply, Aggregate, Transform, Map labels Apr 9, 2024
@mroeschke
Copy link
Member

Could you add a whatsnew note to v3.0.0.rst (under Bug FIx in the groupby section)?

# test_pass_args_kwargs gets here (with and without as_index)
# can't return early
result = self._aggregate_frame(func, *args, **kwargs)

else:
# try to treat as if we are passing a list
gba = GroupByApply(self, [func], args=(), kwargs={})
gba = GroupByApply(self, [func], args=args, kwargs=kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that this is okay - but not just because all tests pass. It is not good behavior to silently ignore args/kwargs, and it seems to me users are reasonable to expect that these would be passed through.

@@ -1546,14 +1546,14 @@ def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs)
if self._grouper.nkeys > 1:
# test_groupby_as_index_series_scalar gets here with 'not self.as_index'
return self._python_agg_general(func, *args, **kwargs)
elif args or kwargs:
elif (args or kwargs) and (len(self._obj_with_exclusions.columns) == 1):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this extra condition? With this modification, I am thinking we should hold off on merging this PR. _aggregate_frame passes the entire DataFrame rather than column-by-column that the rest of the code paths do. So this could cause breakage in user code.

Copy link
Author

@nicholas-ys-tan nicholas-ys-tan Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Richard,

That was the intent, because the bug currently occurs when multiple columns to be aggregated are passed into _aggregate_frame, the aggregation operates on all columns and provides duplicate (and incorrect) values such as this (full example in #58146 - and is the output of the test case in this PR for the main branch):

A	B
groupby1		
diamond	13.25	13.25
spade	27.00	27.00

Instead of using that method when there are multiple columns, I thought passing it to the next code path that does column-by-column would address the issue.

That said, I don't have the skills and knowledge of the codebase the maintainers have to fully appreciate the extent of impact this change might have on the users' codebase, but I couldn't find examples where it would be desired behaviour to pass in the entire DataFrame to _aggregate_frame when there are multiple columns to aggregate (with the exception of passing in numpy functions with kwarg axis=0 which manages it well).

Just thought I might put this idea forward as a change that would address the issue of multiple columns getting aggregated together.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see - thanks! Yes, we discussed this on the last dev meeting. I think we want to look to see if we can deprecate this behavior instead of changing it as a bug fix because it has the risk of quietly changing the value in the result without failing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, no worries! Just thought it might be an appropriate workaround

I was hoping to get a workaround in so I might be able to work on the numeric_only support as having a kwarg pushes the code down this pathway more frequently. Which then would allow me to address an issue in Geopandas. It sounds like it might be best to wait for the deprecation before working on the downstream items.

I'l update my PR to only address the GroupByApply.

Thanks for your time and feedback.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The passing through args/kwargs here to GroupByApply looks great, but think the other change (the extra condition commented on below) should be deprecated instead.

@@ -1546,14 +1546,14 @@ def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs)
if self._grouper.nkeys > 1:
# test_groupby_as_index_series_scalar gets here with 'not self.as_index'
return self._python_agg_general(func, *args, **kwargs)
elif args or kwargs:
elif (args or kwargs) and (len(self._obj_with_exclusions.columns) == 1):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see - thanks! Yes, we discussed this on the last dev meeting. I think we want to look to see if we can deprecate this behavior instead of changing it as a bug fix because it has the risk of quietly changing the value in the result without failing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Groupby
Projects
None yet
3 participants