Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix aggregation when using udf with kwarg #58170

Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions pandas/core/groupby/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -1546,14 +1546,14 @@ def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs)
if self._grouper.nkeys > 1:
# test_groupby_as_index_series_scalar gets here with 'not self.as_index'
return self._python_agg_general(func, *args, **kwargs)
elif args or kwargs:
elif (args or kwargs) and (len(self._obj_with_exclusions.columns) == 1):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this extra condition? With this modification, I am thinking we should hold off on merging this PR. _aggregate_frame passes the entire DataFrame rather than column-by-column that the rest of the code paths do. So this could cause breakage in user code.

Copy link
Author

@nicholas-ys-tan nicholas-ys-tan Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Richard,

That was the intent, because the bug currently occurs when multiple columns to be aggregated are passed into _aggregate_frame, the aggregation operates on all columns and provides duplicate (and incorrect) values such as this (full example in #58146 - and is the output of the test case in this PR for the main branch):

A	B
groupby1		
diamond	13.25	13.25
spade	27.00	27.00

Instead of using that method when there are multiple columns, I thought passing it to the next code path that does column-by-column would address the issue.

That said, I don't have the skills and knowledge of the codebase the maintainers have to fully appreciate the extent of impact this change might have on the users' codebase, but I couldn't find examples where it would be desired behaviour to pass in the entire DataFrame to _aggregate_frame when there are multiple columns to aggregate (with the exception of passing in numpy functions with kwarg axis=0 which manages it well).

Just thought I might put this idea forward as a change that would address the issue of multiple columns getting aggregated together.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see - thanks! Yes, we discussed this on the last dev meeting. I think we want to look to see if we can deprecate this behavior instead of changing it as a bug fix because it has the risk of quietly changing the value in the result without failing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, no worries! Just thought it might be an appropriate workaround

I was hoping to get a workaround in so I might be able to work on the numeric_only support as having a kwarg pushes the code down this pathway more frequently. Which then would allow me to address an issue in Geopandas. It sounds like it might be best to wait for the deprecation before working on the downstream items.

I'l update my PR to only address the GroupByApply.

Thanks for your time and feedback.

# test_pass_args_kwargs gets here (with and without as_index)
# can't return early
result = self._aggregate_frame(func, *args, **kwargs)

else:
# try to treat as if we are passing a list
gba = GroupByApply(self, [func], args=(), kwargs={})
gba = GroupByApply(self, [func], args=args, kwargs=kwargs)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removal of args and kwargs at this point seems to have been a design decision made in 2014 in #6718 to address a bug when using UDF

There have been considerable changes since and agg generally seems to pass on kwargs well now. When updating just this line, all tests passed so I assume this change would be okay.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that this is okay - but not just because all tests pass. It is not good behavior to silently ignore args/kwargs, and it seems to me users are reasonable to expect that these would be passed through.

try:
result = gba.agg()

Expand Down
21 changes: 21 additions & 0 deletions pandas/tests/groupby/aggregate/test_aggregate.py
Original file line number Diff line number Diff line change
Expand Up @@ -1663,3 +1663,24 @@ def func(x):
msg = "length must not be 0"
with pytest.raises(ValueError, match=msg):
df.groupby("A", observed=False).agg(func)


def test_aggregation_of_UDF_with_kwargs():
df = DataFrame(
{
"A": [1, 2, 3, 4, 5],
"B": [10, 20, 30, 40, 50],
"groupby1": ["diamond", "diamond", "spade", "spade", "spade"],
}
)

def user_defined_aggfunc(df, addition=0):
return np.mean(df) + addition

out_df = df.groupby(by="groupby1").agg(func=user_defined_aggfunc, addition=5)
expected_df = DataFrame(
{"A": [6.5, 9.0], "B": [20.0, 45.0]},
index=Index(["diamond", "spade"], name="groupby1"),
)

tm.assert_frame_equal(out_df, expected_df)