BUG: fix aggregation when using udf with kwarg #58170

nicholas-ys-tan · 2024-04-06T03:56:11Z

closes BUG: incorrect aggregation of dataframe when using UDF with kwarg #58146
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
[ Not applicable ] Added type annotations to new arguments/methods/functions.
[ Pending review ] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

nicholas-ys-tan · 2024-04-06T04:07:35Z

pandas/core/groupby/generic.py

                # test_pass_args_kwargs gets here (with and without as_index)
                # can't return early
                result = self._aggregate_frame(func, *args, **kwargs)

            else:
                # try to treat as if we are passing a list
-                gba = GroupByApply(self, [func], args=(), kwargs={})
+                gba = GroupByApply(self, [func], args=args, kwargs=kwargs)


The removal of args and kwargs at this point seems to have been a design decision made in 2014 in #6718 to address a bug when using UDF

There have been considerable changes since and agg generally seems to pass on kwargs well now. When updating just this line, all tests passed so I assume this change would be okay.

Agreed that this is okay - but not just because all tests pass. It is not good behavior to silently ignore args/kwargs, and it seems to me users are reasonable to expect that these would be passed through.

nicholas-ys-tan · 2024-04-06T04:13:02Z

#58146 (comment)

Noted comment that the changes to agg will be breaking changes and I would need a workaround for #58132. This is fairly important as adding a numeric_only kwarg will more often push the code down DataFrameGroupBy._aggregate_frame.

I was hoping to get your view on this PR as a temporary fix with minimalist code, it changes the path the code will take to use GroupByApply.agg as a list if there is more than one column to operate on - hence the aggregation done as loop of SeriesGroupBy if I understood the code correctly.

mroeschke · 2024-04-09T16:53:23Z

Could you add a whatsnew note to v3.0.0.rst (under Bug FIx in the groupby section)?

rhshadrach · 2024-04-10T01:41:46Z

pandas/core/groupby/generic.py

                # test_pass_args_kwargs gets here (with and without as_index)
                # can't return early
                result = self._aggregate_frame(func, *args, **kwargs)

            else:
                # try to treat as if we are passing a list
-                gba = GroupByApply(self, [func], args=(), kwargs={})
+                gba = GroupByApply(self, [func], args=args, kwargs=kwargs)


Agreed that this is okay - but not just because all tests pass. It is not good behavior to silently ignore args/kwargs, and it seems to me users are reasonable to expect that these would be passed through.

rhshadrach · 2024-04-10T01:41:54Z

pandas/core/groupby/generic.py

@@ -1546,14 +1546,14 @@ def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs)
            if self._grouper.nkeys > 1:
                # test_groupby_as_index_series_scalar gets here with 'not self.as_index'
                return self._python_agg_general(func, *args, **kwargs)
-            elif args or kwargs:
+            elif (args or kwargs) and (len(self._obj_with_exclusions.columns) == 1):


Why do you need this extra condition? With this modification, I am thinking we should hold off on merging this PR. _aggregate_frame passes the entire DataFrame rather than column-by-column that the rest of the code paths do. So this could cause breakage in user code.

Thanks Richard,

That was the intent, because the bug currently occurs when multiple columns to be aggregated are passed into _aggregate_frame, the aggregation operates on all columns and provides duplicate (and incorrect) values such as this (full example in #58146 - and is the output of the test case in this PR for the main branch):

A B groupby1 diamond 13.25 13.25 spade 27.00 27.00

Instead of using that method when there are multiple columns, I thought passing it to the next code path that does column-by-column would address the issue.

That said, I don't have the skills and knowledge of the codebase the maintainers have to fully appreciate the extent of impact this change might have on the users' codebase, but I couldn't find examples where it would be desired behaviour to pass in the entire DataFrame to _aggregate_frame when there are multiple columns to aggregate (with the exception of passing in numpy functions with kwarg axis=0 which manages it well).

Just thought I might put this idea forward as a change that would address the issue of multiple columns getting aggregated together.

I see - thanks! Yes, we discussed this on the last dev meeting. I think we want to look to see if we can deprecate this behavior instead of changing it as a bug fix because it has the risk of quietly changing the value in the result without failing.

I see, no worries! Just thought it might be an appropriate workaround

I was hoping to get a workaround in so I might be able to work on the numeric_only support as having a kwarg pushes the code down this pathway more frequently. Which then would allow me to address an issue in Geopandas. It sounds like it might be best to wait for the deprecation before working on the downstream items.

I'l update my PR to only address the GroupByApply.

Thanks for your time and feedback.

rhshadrach

The passing through args/kwargs here to GroupByApply looks great, but think the other change (the extra condition commented on below) should be deprecated instead.

rhshadrach · 2024-04-10T02:26:03Z

pandas/core/groupby/generic.py

@@ -1546,14 +1546,14 @@ def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs)
            if self._grouper.nkeys > 1:
                # test_groupby_as_index_series_scalar gets here with 'not self.as_index'
                return self._python_agg_general(func, *args, **kwargs)
-            elif args or kwargs:
+            elif (args or kwargs) and (len(self._obj_with_exclusions.columns) == 1):


I see - thanks! Yes, we discussed this on the last dev meeting. I think we want to look to see if we can deprecate this behavior instead of changing it as a bug fix because it has the risk of quietly changing the value in the result without failing.

BUG: fix aggregation when using udf with kwarg

95966c9

nicholas-ys-tan commented Apr 6, 2024

View reviewed changes

nicholas-ys-tan marked this pull request as ready for review April 7, 2024 09:19

nicholas-ys-tan requested a review from rhshadrach as a code owner April 7, 2024 09:19

mroeschke added Groupby Apply Apply, Aggregate, Transform, Map labels Apr 9, 2024

rhshadrach reviewed Apr 10, 2024

View reviewed changes

nicholas-ys-tan closed this Apr 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: fix aggregation when using udf with kwarg #58170

BUG: fix aggregation when using udf with kwarg #58170

nicholas-ys-tan commented Apr 6, 2024 •

edited by mroeschke

Loading

nicholas-ys-tan Apr 6, 2024

rhshadrach Apr 10, 2024

nicholas-ys-tan commented Apr 6, 2024

mroeschke commented Apr 9, 2024

rhshadrach Apr 10, 2024

rhshadrach Apr 10, 2024

nicholas-ys-tan Apr 10, 2024 •

edited

Loading

rhshadrach Apr 10, 2024

nicholas-ys-tan Apr 10, 2024

rhshadrach left a comment

rhshadrach Apr 10, 2024

BUG: fix aggregation when using udf with kwarg #58170

BUG: fix aggregation when using udf with kwarg #58170

Conversation

nicholas-ys-tan commented Apr 6, 2024 • edited by mroeschke Loading

nicholas-ys-tan Apr 6, 2024

Choose a reason for hiding this comment

rhshadrach Apr 10, 2024

Choose a reason for hiding this comment

nicholas-ys-tan commented Apr 6, 2024

mroeschke commented Apr 9, 2024

rhshadrach Apr 10, 2024

Choose a reason for hiding this comment

rhshadrach Apr 10, 2024

Choose a reason for hiding this comment

nicholas-ys-tan Apr 10, 2024 • edited Loading

Choose a reason for hiding this comment

rhshadrach Apr 10, 2024

Choose a reason for hiding this comment

nicholas-ys-tan Apr 10, 2024

Choose a reason for hiding this comment

rhshadrach left a comment

Choose a reason for hiding this comment

rhshadrach Apr 10, 2024

Choose a reason for hiding this comment

nicholas-ys-tan commented Apr 6, 2024 •

edited by mroeschke

Loading

nicholas-ys-tan Apr 10, 2024 •

edited

Loading