ENH: groupby.apply axis=1 behavior #38042

Closed
rhshadrach opened this issue Nov 24, 2020 · 3 comments
Labels: API - Consistency (Internal Consistency of API/Behavior), Apply (Apply, Aggregate, Transform, Map), Enhancement, Groupby

Comments

@rhshadrach
Member

xref #9772

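import numpy as np
import pandas as pd

df = pd.DataFrame({i: pd.Series(np.random.normal(size=10),
                                index=range(10)) for i in range(11)})
# Group the 11 columns into 'a' (first 6) and 'b' (last 5).
g = df.groupby(['a']*6 + ['b']*5, axis=1)
g.apply(lambda x: x.sum())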

now raises, but used to give the (perhaps) surprising output

           a         b
0  -0.381070       NaN
1  -1.214075       NaN
2  -1.496252       NaN
3   3.392565       NaN
4  -0.782376       NaN
5   1.306043       NaN
6        NaN -1.772334
7        NaN  4.125280
8        NaN  1.992329
9        NaN  4.283854
10       NaN -4.791092

A fix for this (today and previously) would be to pass axis=1 into the call to sum, but again I think that is viewed as unintuitive. In #9772 (comment) I argued:

...when pandas feeds a group of values into the UDF, they are not transposed. It seems reasonable to me to argue that they should be, but one technical hurdle here is what happens with a frame where the columns are different dtypes. Upon transposing, you now have columns of mixed dtypes, which are coerced to object type. So upon transposing the result back you lose type information. Since the UDF can return anything, there is no way to reliably determine what the resulting dtypes should be.

Of course, an argument against transposing the group when passing it to the UDF is that this would be a rather large change for what seems to me to be of little value.
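The dtype concern in the quoted comment can be seen with a small frame. This is only a sketch on a hypothetical two-column frame (the column names `n` and `name` are made up for illustration), assuming a reasonably recent pandas:

```python
import pandas as pd

# Hypothetical frame with one integer column and one string column.
df = pd.DataFrame({"n": [1, 2], "name": ["x", "y"]})

# After transposing, every column holds a mix of ints and strings,
# so pandas falls back to object dtype.
transposed = df.T

# Transposing back does not restore the original integer dtype.
round_trip = transposed.T

print(df.dtypes)
print(transposed.dtypes)
print(round_trip.dtypes)
```

The integer column `n` comes back as object after the round trip, which is the loss of type information described above.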

A few counter points that I've realized in the meantime:

  • The case of multiple dtypes seems to me to be a very minor one, to the point of insignificance. Is there an example of a (somewhat natural) function that is applied to multiple dtypes where the resulting dtype does not get coerced correctly? I've played around with this in the code; the only such example in the tests is the identity function.
  • This is notably not how groupby(..., axis=1).transform works, nor is it how apply/transform/agg with no groupby and axis=1 work. These methods all feed in the row as a Series so that supplying axis=1 results in an error.
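As a rough illustration of the second point, the sketch below uses plain `DataFrame.apply` with `axis=1` (no groupby) on a made-up frame. The helper `row_sum` and the variable names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

seen_types = []

def row_sum(row):
    # Record what pandas hands to the UDF for each row.
    seen_types.append(type(row).__name__)
    return row.sum()

totals = df.apply(row_sum, axis=1)

# A Series has only axis 0, so passing axis=1 inside the UDF fails.
try:
    df.apply(lambda row: row.sum(axis=1), axis=1)
    axis_error = None
except ValueError as exc:
    axis_error = exc

print(set(seen_types))
print(totals.tolist())
print(axis_error)
```

Each row arrives as a Series, so the UDF never needs (and cannot use) `axis=1` itself.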

I'm now of the opinion that transposing the inputs and results is more maintainable and easier to grok for users.

@jbrockmendel
Member

Is there an example of a (somewhat natural) function that is applied to multiple dtypes where the resulting dtype does not get coerced correctly?

I'm more familiar with DataFrame reductions than with GroupBy reductions, but I imagine they behave similarly. Can you expand on what you mean by "does not get coerced correctly"? Could it mean "doesn't get coerced by numpy, so we have to do some special massaging"?

@rhshadrach
Member Author

Yes - that's it. Compare:

df = pd.DataFrame({'a': [1, 1, 2, 3], 'b': [4.0, 5.0, 6.0, 7.0], 'c': [7, 8, 9, 10]})
print(df.apply(lambda x: x**2, axis=1))
print(df.transform(lambda x: x**2, axis=1))
print(df.groupby([0, 0, 1], axis=1).apply(lambda x: x**2))
print(df.groupby([0, 0, 1], axis=1).transform(lambda x: x**2))

which give the results:

# apply
     a     b      c
0  1.0  16.0   49.0
1  1.0  25.0   64.0
2  4.0  36.0   81.0
3  9.0  49.0  100.0

# transform
     a     b      c
0  1.0  16.0   49.0
1  1.0  25.0   64.0
2  4.0  36.0   81.0
3  9.0  49.0  100.0

# groupby-apply
   a     b    c
0  1  16.0   49
1  1  25.0   64
2  4  36.0   81
3  9  49.0  100

# groupby-transform
     a     b    c
0  1.0  16.0   49
1  1.0  25.0   64
2  4.0  36.0   81
3  9.0  49.0  100

All except groupby-apply are doing transpose -> op -> transpose, which results in the float data type where it is not technically necessary. But of course, this is a silly example and shouldn't even be using apply/transform, let alone axis=1. What I don't know is whether there are any non-silly examples where the data type gets coerced but shouldn't be.
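One way to see where the float coercion comes from, sketched with the same frame as above and without groupby: a single row drawn from int and float columns has to share one dtype, so the row Series is already float before the function runs.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3],
                   'b': [4.0, 5.0, 6.0, 7.0],
                   'c': [7, 8, 9, 10]})

# A row mixes int and float values, so it is upcast to a float Series.
first_row = df.iloc[0]

# Row-wise apply therefore returns floats for every column.
row_wise = df.apply(lambda x: x**2, axis=1)

# Column-wise apply operates on each column in its own dtype.
col_wise = df.apply(lambda x: x**2)

print(first_row.dtype)
print(row_wise.dtypes)
print(col_wise.dtypes)
```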

@rhshadrach
Member Author

axis=1 has been deprecated.
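For readers who still need to group columns, a possible workaround is to transpose explicitly, group along the index, and transpose back. The sketch below is an assumption about how that could look, not something stated in this thread; the frame and the `keys` list are made up, and the dtype caveats discussed above still apply to mixed-dtype frames.

```python
import numpy as np
import pandas as pd

# Hypothetical all-integer frame with four columns.
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("wxyz"))

# One group label per column: w, x -> 'a' and y, z -> 'b'.
keys = ["a", "a", "b", "b"]

# Transpose so columns become rows, group, reduce, then transpose back.
result = df.T.groupby(keys).sum().T

print(result)
```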


2 participants