
BUG: groupby.agg should always agg #57706

Open · wants to merge 11 commits into base: main
Conversation

rhshadrach
Member

@rhshadrach rhshadrach commented Mar 2, 2024

Built on top of #57671; the diff should get better once that's merged. Still plan on splitting part of this up as a precursor (and perhaps multiple).

For the closed issues above, tests here still likely need to be added.

The goal here is to make groupby.agg more consistently handle UDFs. Currently:

  • We sometimes raise if a UDF returns a NumPy ndarray
  • We sometimes treat non-scalars as transforms
  • We sometimes fail (non-purposefully) on non-scalars
  • We sometimes pass the entire group to the UDF rather than column-by-column

My opinion is that we should treat all UDFs as reducers, regardless of what they return. Some alternatives:

  1. If we detect something as being a non-scalar, try treating it as a transform
  2. Raise on anything detected as being a non-scalar

For 1, we will sometimes guess wrong, and transforming isn't something we should be doing in a method called agg anyway. For 2, we would be restricting what I think are valid use cases for aggregation, e.g. gb.agg(np.array) or gb.agg(list).
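As a minimal sketch of the use case that option 2 would break (assuming current pandas 2.x behavior with a toy frame), aggregating each group's column into a list is a common idiom that produces a non-scalar per group:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

# each group's column reduces to a single (non-scalar) value: a list
out = df.groupby("key")["val"].agg(list)
print(out)
```

Raising on all non-scalar results would rule this out, even though it is a genuine reduction (one value per group).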

In implementing this, I ran into two issues:

  • _aggregate_frame fails if non-scalars are returned by the UDF, and also passes all of the selected columns as a single DataFrame to the UDF. It is called when there is a single grouping and args or kwargs are provided, or when there is a single grouping and passing the UDF each column individually fails with ValueError("No objects to concatenate"). This does not seem possible to fix, would be hard to deprecate (we could add a new argument or use a future option?), and is bad enough behavior that it seems to me we should just rip the band-aid off here for 3.0.
  • Resampler.apply is an alias for Resampler.agg, and we do not want to impact Resampler.apply with these changes. For this, I kept the old paths through groupby specifically for resampler, and plan to properly deprecate the current method and implement apply (by calling groupby's apply) as part of 3.x development. (Ref: BUG: resample apply is actually aggregate #38463)
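The first issue can be sketched with a hypothetical udf and toy frame (behavior described here is pandas 2.x, where extra args route a single-grouping call through _aggregate_frame, which historically hands the UDF the whole selected frame rather than each column):

```python
import pandas as pd

seen = []

def udf(x, *args):
    # record what pandas actually hands the UDF on each call
    seen.append(type(x).__name__)
    # always reduce to a scalar so both code paths succeed
    return float(x.to_numpy().sum())

df = pd.DataFrame({"g": ["x", "x", "y"], "a": [1, 2, 3], "b": [4, 5, 6]})

df.groupby("g")[["a", "b"]].agg(udf)      # no extra args: column-by-column
no_args = set(seen)
seen.clear()
df.groupby("g")[["a", "b"]].agg(udf, 0)   # extra positional arg: may take the frame path
with_args = set(seen)
print(no_args, with_args)
```

Whether the two call forms print the same type is exactly the inconsistency the PR targets.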

@rhshadrach rhshadrach added the Enhancement, Groupby, API Design, Needs Discussion (Requires discussion from core team before further action), and Apply (Apply, Aggregate, Transform, Map) labels Mar 2, 2024
@rhshadrach rhshadrach added this to the 3.0 milestone Mar 2, 2024
@rhshadrach
Member Author

I think this is ready for review. Assuming the direction this is moving us is good, we still need to decide whether we are okay with this being a breaking change in 3.0 (my preference) or whether it should be deprecated. If we do go the deprecation route, it will be noisy (many cases where the results will be the same, but we can't tell, so we need to warn). The only way I see a deprecation working is if we add an option, e.g. future_groupby_agg, so that users can opt in to the new implementation.

cc @jorisvandenbossche @MarcoGorelli @Dr-Irv @mroeschke for any thoughts.

@mroeschke
Member

I could generally be OK with making this a "breaking bug change" for 3.0. Just 2 points:

  1. Is DataFrame.agg already strict like this?
  2. For Raise on anything detected as being a non-scalar, I would be open to still allowing .agg to store a non-scalar, i.e. nested value, as a result element and return dtype=object. But these values should never expand the dimensions of the result when using agg
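The second point can be illustrated with a hypothetical UDF that reduces each group to a tuple (pandas 2.x already stores it as a nested element with object dtype rather than expanding the result's dimensions):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

# the UDF reduces each group's column to one nested value
out = df.groupby("key")["val"].agg(lambda s: tuple(s))
print(out.dtype)  # object: one tuple per group, dimensions not expanded
```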

@Dr-Irv
Contributor

Dr-Irv commented Mar 26, 2024

I'm not going to review the whole code change - it's beyond what I understand about how this all works - but I think the example I wrote here should be in the tests:
#33242 (comment)

@rhshadrach
Member Author

rhshadrach commented Mar 27, 2024

@mroeschke

  1. Is DataFrame.agg already strict like this?

Great question - unfortunately the answer is no. We use DataFrame.apply under the hood. Perhaps if we are going to do this with groupby, we should also do it across the board. That would make me lean more toward introducing something like a future.agg option.
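The non-strictness is easy to see with an identity UDF (a toy example; pandas 2.x behavior). Because DataFrame.agg falls back to DataFrame.apply, a UDF that returns a whole column is silently accepted as a transform rather than raising:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# an "aggregation" that does not reduce at all is accepted
out = df.agg(lambda s: s)
print(out.shape)  # same shape as the input: nothing was aggregated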

2. For Raise on anything detected as being a non-scalar, I would be open to still allowing .agg to store a non-scalar, i.e. nested value, as a result element and return dtype=object. But these values should never expand the dimensions of the result when using agg

Agreed. I believe that's the case here for a UDF, but not strings (e.g. cumsum). This is #44845 - but I'd like to focus on UDFs here and work on string arguments separately.
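For reference, the string case mentioned above can be sketched as follows (pandas 2.x): passing "cumsum" to agg performs a transform, returning a result the same length as the input rather than one row per group, which is #44845:

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1, 2, 3]})

# agg with the string "cumsum" transforms instead of aggregating
out = df.groupby("g")["v"].agg("cumsum")
print(out.tolist())  # [1, 3, 3] - same length as the input
```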

@Dr-Irv

I'm not going to review the whole code change - beyond what I understand about how this all works, but I think the example I wrote here should be in the tests:

Certainly - I added this as test_unused_kwargs. In that test, we use np.sum(np.sum(data)). This way works both on column and frame inputs, and would raise if we pass a frame instead of column-by-column. Is this sufficient?
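A sketch of why the double application is shape-sensitive (not the actual test code): on a Series the inner np.sum already yields a scalar, while on a DataFrame it yields per-column totals that the outer np.sum then collapses, so the result reveals what the UDF received:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

print(np.sum(np.sum(s)))   # 6: the inner sum is already a scalar
print(np.sum(np.sum(df)))  # 10: inner sum gives per-column totals, outer collapses them
```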

@Dr-Irv Dr-Irv marked this pull request as ready for review March 27, 2024 13:33
@Dr-Irv
Contributor

Dr-Irv commented Mar 27, 2024

Certainly - I added this as test_unused_kwargs. In that test, we use np.sum(np.sum(data)). This way works both on column and frame inputs, and would raise if we pass a frame instead of column-by-column. Is this sufficient?

I'm not sure. In the example I created, you had 2 functions, one with 1 argument, the other with 2 arguments, and what was being passed to those 2 functions was different because of the number of arguments. I don't see how the test you created confirms that Series are passed independent of the function declaration.

So maybe you should have a test like this that addresses the particular issue in #33242 :

def twoargs(x, y):
    assert isinstance(x, pd.Series)
    return x.sum()

def test_two_args():
    df = pd.DataFrame(
        {
            "a": [1, 2, 3, 4, 5, 6],
            "b": [1, 1, 0, 1, 1, 0],
            "c": ["x", "x", "x", "z", "z", "z"],
            "d": ["s", "s", "s", "d", "d", "d"],
        }
    )
    df.groupby("c")[["a", "b"]].agg(twoargs, 0)

@rhshadrach
Member Author

I don't see how the test you created confirms that Series are passed independent of the function declaration.

We don't ever inspect the UDF to see what arguments it can take - our logic branches on whether additional arguments are passed in the call to agg: .agg(func, 0) vs .agg(func) previously resulted in two different paths, one of which passed the DataFrame and one of which passed each Series.

Still, no opposition to an additional test here. Will add.
