
BUG: groupby.agg should always agg #57706

Open · wants to merge 11 commits into base: main
Conversation

rhshadrach
Member

@rhshadrach rhshadrach commented Mar 2, 2024

Built on top of #57671; the diff should get better once that's merged. Still plan on splitting part of this up as a precursor (and perhaps multiple).

For the closed issues above, tests here still likely need to be added.

The goal here is to make groupby.agg more consistently handle UDFs. Currently:

  • We sometimes raise if a UDF returns a NumPy ndarray
  • We sometimes treat non-scalars as transforms
  • We sometimes fail (non-purposefully) on non-scalars
  • We sometimes pass the entire group to the UDF rather than column-by-column

My opinion is that we should treat all UDFs as reducers, regardless of what they return. Some alternatives:

  1. If we detect something as being a non-scalar, try treating it as a transform
  2. Raise on anything detected as being a non-scalar

For 1, we will sometimes guess wrong, and transforming isn't something we should be doing in a method called agg anyway. For 2, we would be restricting what I think are valid use cases for aggregation, e.g. gb.agg(np.array) or gb.agg(list).
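As a minimal sketch of the use case that option 2 would break (assuming current pandas 2.x behavior with a toy frame), aggregating each group's column into a list is a common idiom that produces a non-scalar per group:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

# each group's column reduces to a single (non-scalar) value: a list
out = df.groupby("key")["val"].agg(list)
print(out)
```

Raising on all non-scalar results would rule this out, even though it is a genuine reduction (one value per group).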

In implementing this, I ran into two issues:

  • _aggregate_frame fails if non-scalars are returned by the UDF, and also passes all of the selected columns as a single DataFrame to the UDF. It is called when there is a single grouping and args or kwargs are provided, or when there is a single grouping and passing the UDF each column individually fails with ValueError("No objects to concatenate"). This does not seem possible to fix, would be hard to deprecate (we could add a new argument or use a future option?), and is bad enough behavior that it seems to me we should just rip the band-aid off here for 3.0.
  • Resampler.apply is an alias for Resampler.agg, and we do not want to impact Resampler.apply with these changes. For this, I kept the old paths through groupby specifically for resampler, and plan to properly deprecate the current method and implement apply (by calling groupby's apply) as part of 3.x development. (Ref: BUG: resample apply is actually aggregate #38463)
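The first issue can be sketched with a hypothetical udf and toy frame (behavior described here is pandas 2.x, where extra args route a single-grouping call through _aggregate_frame, which historically hands the UDF the whole selected frame rather than each column):

```python
import pandas as pd

seen = []

def udf(x, *args):
    # record what pandas actually hands the UDF on each call
    seen.append(type(x).__name__)
    # always reduce to a scalar so both code paths succeed
    return float(x.to_numpy().sum())

df = pd.DataFrame({"g": ["x", "x", "y"], "a": [1, 2, 3], "b": [4, 5, 6]})

df.groupby("g")[["a", "b"]].agg(udf)      # no extra args: column-by-column
no_args = set(seen)
seen.clear()
df.groupby("g")[["a", "b"]].agg(udf, 0)   # extra positional arg: may take the frame path
with_args = set(seen)
print(no_args, with_args)
```

Whether the two call forms print the same type is exactly the inconsistency the PR targets.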

@rhshadrach rhshadrach added the Enhancement, Groupby, API Design, Needs Discussion (Requires discussion from core team before further action), and Apply (Apply, Aggregate, Transform, Map) labels Mar 2, 2024
@rhshadrach rhshadrach added this to the 3.0 milestone Mar 2, 2024
@rhshadrach
Member Author

I think this is ready for review. Assuming the direction this is moving us is good, we still need to decide whether we are okay with this being a breaking change in 3.0 (my preference) or whether it should be deprecated. If we do go the deprecation route, it will be noisy (many cases where the results will be the same, but we can't tell, so we need to warn). The only way I see a deprecation working is if we add an option, e.g. future_groupby_agg, so that users can opt in to the new implementation.

cc @jorisvandenbossche @MarcoGorelli @Dr-Irv @mroeschke for any thoughts.

@mroeschke
Member

I could generally be OK with making this a "breaking bug change" for 3.0. Just 2 points:

  1. Is DataFrame.agg already strict like this?
  2. For Raise on anything detected as being a non-scalar, I would be open to still allowing .agg to store a non-scalar, i.e. nested value, as a result element and return dtype=object. But these values should never expand the dimensions of the result when using agg
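The second point can be illustrated with a hypothetical UDF that reduces each group to a tuple (pandas 2.x already stores it as a nested element with object dtype rather than expanding the result's dimensions):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

# the UDF reduces each group's column to one nested value
out = df.groupby("key")["val"].agg(lambda s: tuple(s))
print(out.dtype)  # object: one tuple per group, dimensions not expanded
```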

@Dr-Irv
Contributor

Dr-Irv commented Mar 26, 2024

I'm not going to review the whole code change - it's beyond what I understand about how this all works - but I think the example I wrote here should be in the tests:
#33242 (comment)

@rhshadrach
Member Author

rhshadrach commented Mar 27, 2024

@mroeschke

  1. Is DataFrame.agg already strict like this?

Great question - unfortunately the answer is no. We use DataFrame.apply under the hood. Perhaps if we are going to do this with groupby, we should also do it across the board. That would make me lean more toward introducing something like a future.agg option.
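The non-strictness is easy to see with an identity UDF (a toy example; pandas 2.x behavior). Because DataFrame.agg falls back to DataFrame.apply, a UDF that returns a whole column is silently accepted as a transform rather than raising:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# an "aggregation" that does not reduce at all is accepted
out = df.agg(lambda s: s)
print(out.shape)  # same shape as the input: nothing was aggregated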

2. For Raise on anything detected as being a non-scalar, I would be open to still allowing .agg to store a non-scalar, i.e. nested value, as a result element and return dtype=object. But these values should never expand the dimensions of the result when using agg

Agreed. I believe that's the case here for a UDF, but not strings (e.g. cumsum). This is #44845 - but I'd like to focus on UDFs here and work on string arguments separately.
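For reference, the string case mentioned above can be sketched as follows (pandas 2.x): passing "cumsum" to agg performs a transform, returning a result the same length as the input rather than one row per group, which is #44845:

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1, 2, 3]})

# agg with the string "cumsum" transforms instead of aggregating
out = df.groupby("g")["v"].agg("cumsum")
print(out.tolist())  # [1, 3, 3] - same length as the input
```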

@Dr-Irv

I'm not going to review the whole code change - beyond what I understand about how this all works, but I think the example I wrote here should be in the tests:

Certainly - I added this as test_unused_kwargs. In that test, we use np.sum(np.sum(data)). This way works both on column and frame inputs, and would raise if we pass a frame instead of column-by-column. Is this sufficient?
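A sketch of why the double application is shape-sensitive (not the actual test code): on a Series the inner np.sum already yields a scalar, while on a DataFrame it yields per-column totals that the outer np.sum then collapses, so the result reveals what the UDF received:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

print(np.sum(np.sum(s)))   # 6: the inner sum is already a scalar
print(np.sum(np.sum(df)))  # 10: inner sum gives per-column totals, outer collapses them
```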

@Dr-Irv Dr-Irv marked this pull request as ready for review March 27, 2024 13:33
@Dr-Irv
Contributor

Dr-Irv commented Mar 27, 2024

Certainly - I added this as test_unused_kwargs. In that test, we use np.sum(np.sum(data)). This way works both on column and frame inputs, and would raise if we pass a frame instead of column-by-column. Is this sufficient?

I'm not sure. In the example I created, you had 2 functions, one with 1 argument, the other with 2 arguments, and what was being passed to those 2 functions was different because of the number of arguments. I don't see how the test you created confirms that Series are passed independent of the function declaration.

So maybe you should have a test like this that addresses the particular issue in #33242 :

def twoargs(x, y):
    assert isinstance(x, pd.Series)
    return x.sum()

def test_two_args():
    df = pd.DataFrame(
        {
            "a": [1, 2, 3, 4, 5, 6],
            "b": [1, 1, 0, 1, 1, 0],
            "c": ["x", "x", "x", "z", "z", "z"],
            "d": ["s", "s", "s", "d", "d", "d"],
        }
    )
    df.groupby("c")[["a", "b"]].agg(twoargs, 0)

@rhshadrach
Member Author

I don't see how the test you created confirms that Series are passed independent of the function declaration.

We don't ever inspect the UDF to see what arguments it can take - our logic branches on whether additional arguments are passed in the call to agg: .agg(func, 0) vs .agg(func) previously resulted in two different paths, one of which passed the DataFrame and one of which passed each Series.

Still, no opposition to an additional test here. Will add.
