API: inconsistent return format of groupby apply #13056

edfall · 2016-05-02T14:42:09Z

related issues:

groupby(..).apply() has inconsistent behavior when there is only one group #13121
df.groupby().apply() with only one group returns wrong shape! #5839
It's difficult to predict what DataFrame.groupby().apply() will return: #9867
df.groupby(col).resample('x') returning both unstacked or multiindex result #13255
Dataframe group apply can fail on trivial cases. #20066
DataFrame.groupby.apply returns different results with copy lambda functions. #14927 (about detecting mutating functions)

Thank you very much for providing pandas; it is one of the best packages I have used!

Several times now, pandas has returned results in an inconsistent format, which eventually broke my code.

For example:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,1), columns = ['x'])
df['y'] = 1
df.groupby('y').apply(lambda x: x.x)

#x        0         1         2         3         4         5         6  \
#y                                                                        
#1 -1.12114  0.679616 -1.392863  0.032637 -0.051134  0.594201 -0.238833   

#x        7        8         9  
#y                              
#1  0.95173  1.07469 -0.062198  

df2 = df.copy()
df2['y'] = np.random.randint(1,3,10)
# same operation as above, but now with two groups
df2.groupby('y').apply(lambda x: x.x)
#y   
#1  3    0.032637
#   8    1.074690
#2  0   -1.121140
#   1    0.679616
#   2   -1.392863
#   4   -0.051134
#   5    0.594201
#   6   -0.238833
#   7    0.951730
#   9   -0.062198
#Name: x, dtype: float64

Even though I know how to avoid this issue, I still believe it would be better to return the same format every time. It is really annoying when the return format changes unexpectedly and forces me to go back and add awkward workarounds.

So, if possible, I suggest that the function above return only one kind of format.

jreback · 2016-05-02T14:47:19Z

You are trying to do too much inside an apply, which is completely non-performant. .apply has to guess your return output. I suppose this is slightly confusing; in any event, you should simply be doing this:

In [14]: df.groupby('y').x.apply(lambda x: x)
Out[14]: 
0    2.481148
1   -1.324170
2    0.783518
3    0.869827
4   -0.080157
5    0.071685
6    0.987246
7    0.099149
8   -0.159449
9   -0.383200
Name: x, dtype: float64

In [15]: df2.groupby('y').x.apply(lambda x: x)
Out[15]: 
0    2.481148
1   -1.324170
2    0.783518
3    0.869827
4   -0.080157
5    0.071685
6    0.987246
7    0.099149
8   -0.159449
9   -0.383200
Name: x, dtype: float64

I'll mark it as an issue if you'd like to investigate whether this could be made consistent w/o breaking anything else (might not be possible).
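
For readers who want a result aligned with the original rows, as in the output above, a minimal sketch using groupby().transform follows. The small frames and the `demean` helper are hypothetical, chosen only so the values are easy to check by hand; this illustrates the idea and is not a statement of how apply works internally.

```python
import pandas as pd

# Hypothetical data: the same rows, grouped into one group and into two groups.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [1, 1, 1, 1]})
df2 = df.assign(y=[1, 2, 1, 2])


def demean(frame):
    """Subtract the per-group mean of x; transform keeps the original index."""
    return frame.groupby("y")["x"].transform(lambda s: s - s.mean())


one = demean(df)   # single group
two = demean(df2)  # two groups

# Both results are indexed exactly like the input, whatever the group count.
print(one.index.equals(df.index), two.index.equals(df2.index))
```

The trade-off is the one raised later in the thread: the group key is not kept as an index level in the result.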

jreback · 2016-05-02T14:51:18Z

This was changed by #3239.

Sort of a special case was added, I think.

edfall · 2016-05-02T15:06:54Z

Yes, I think the approach you (@jreback) use does solve the problem of the inconsistent return format. However, sometimes it is important to keep the groupby variables as a label, which your method fails to do.

What's more, the function passed to .apply can become more complicated as the goal and the data change.

jreback · 2016-05-02T15:09:20Z

@edfall and so how would you reconcile this issue?

I don't think you can infer what format the user wants and meet all users expectations.

edfall · 2016-05-02T15:21:52Z

@jreback

Of course, it is really difficult to satisfy all users, but the return format of any specific operation should always stay the same. I think the result should at least be something like:

df.groupby('y').apply(lambda x: x.x)
#y
#1 0    2.481148
#  1   -1.324170
#  2    0.783518
#  3    0.869827
#  4   -0.080157
#  5    0.071685
#  6    0.987246
#  7    0.099149
#  8   -0.159449
#  9   -0.383200

Maybe this is because I have used the R package data.table, which can do things similar to .groupby().apply, but the results of all its operations are predictable.
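
The stacked layout requested in this comment can be produced without relying on what apply infers, by concatenating the per-group results explicitly. This is only a sketch of a possible workaround: `apply_stacked` is a hypothetical helper, not a pandas function, and the small frames are made up for illustration.

```python
import pandas as pd


def apply_stacked(frame, by, func):
    """Call func on each group and stack the pieces under the group key.

    Hypothetical helper: the result always has the group key as the
    outer index level, whether there is one group or many.
    """
    pieces = {key: func(group) for key, group in frame.groupby(by)}
    return pd.concat(pieces, names=[by])


df = pd.DataFrame({"x": [0.1, 0.2, 0.3, 0.4], "y": [1, 1, 1, 1]})  # one group
df2 = df.assign(y=[1, 2, 2, 1])                                    # two groups

one = apply_stacked(df, "y", lambda g: g["x"])
two = apply_stacked(df2, "y", lambda g: g["x"])

# Both results are Series with a two-level index: (y, original row label).
print(one.index.nlevels, two.index.nlevels)
```

This gives up whatever fast paths apply has, so it is a predictability-over-speed choice.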

jorisvandenbossche · 2018-07-07T20:27:54Z

Also related to this: #14927 (about detecting mutating functions)

jorisvandenbossche · 2018-07-08T16:43:25Z

Here is a short notebook with an overview of the behaviours: http://nbviewer.jupyter.org/gist/jorisvandenbossche/16f511fe111a8b9fa0eac74e920c5251

Some things to note:

We have different code paths depending on whether the original group (mutated or not) is returned (related to "DataFrame.groupby.apply returns different results with copy lambda functions", #14927).
- If the original object is returned, a "transform" behaviour is used (keep original index order, don't add groupby key to the index)
- I would personally deprecate this, and point users to groupby().transform(..)
Then for the "normal" apply behaviour, we have:
- In general, specification is:
  - The func gets the full group (including the key column)
  - The results of func(group) are stacked, the groupby key is added to the Index (creating a MultiIndex)
- Some specific things to note how things are stacked:
  - If it is a Series, we make a distinction between:
    - resulting Series that for each group has the same length as the index: stack vertically
    - resulting Series has fixed size (or different to group size): stack horizontally
  - scalars result in a final Series
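
Two of the stacking cases listed above can be illustrated with a small sketch. The data and the lambdas are made up for illustration; the example covers only the fixed-size Series and the scalar case, since the handling of same-length Series is the part under discussion.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [1, 2, 1, 2]})
grouped = df.groupby("y")

# func returns a Series with the same fixed index for every group:
# the pieces become rows, giving one row per group and one column per label.
summary = grouped.apply(
    lambda g: pd.Series({"min": g["x"].min(), "max": g["x"].max()})
)

# func returns a scalar: the result is a Series indexed by the group key.
totals = grouped.apply(lambda g: g["x"].sum())

print(type(summary).__name__, list(summary.columns))
print(type(totals).__name__)
```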

I think in general the second bullet point above makes sense. Only the Series differentiation is a bit strange. Do we want to keep this?

Additional note:

Why do we have both as_index=True/False and group_keys=True/False?
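
As background to this question, a small sketch of where each option is visible, based on my reading of the documented behaviour: as_index affects aggregation results, while group_keys affects what apply adds to the index. The data and the head(1) function are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [1, 2, 1, 2]})

# as_index: does the group key become the index of an aggregation result?
agg_indexed = df.groupby("y")["x"].sum()               # y is the index
agg_flat = df.groupby("y", as_index=False)["x"].sum()  # y is a regular column

# group_keys: does apply prepend the group key to the index of each piece?
first_keyed = df.groupby("y", group_keys=True).apply(lambda g: g[["x"]].head(1))
first_plain = df.groupby("y", group_keys=False).apply(lambda g: g[["x"]].head(1))

print(list(agg_flat.columns))
print(first_keyed.index.nlevels, first_plain.index.nlevels)
```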

cc @WillAyd

(the above is only for functions

fred-xue · 2018-11-09T13:57:27Z

I ran into the same issue. Need a consistent behavior for this.

ratulbhadury · 2019-02-27T12:14:19Z

Also hitting a problem because of the inconsistent shape of what apply returns. +1 for fix please!

liushapku · 2019-05-30T05:45:18Z

I have the same problem. It is very annoying that every time I call groupby(...).apply(), I need to check how many groups there are in order to process the result further. Please fix.

TomAugspurger · 2019-05-30T11:58:04Z

@liushapku are you interested in working on it?

TomAugspurger · 2020-07-16T13:46:46Z

This is being reverted in #35306 and will be re-fixed in #34998 in pandas 1.2.

MoritzLaurer · 2023-01-25T15:26:53Z

I'm still encountering this issue of inconsistent return formats, as mentioned in #5839, with pandas 1.4.4 and 1.5.

Polaris000 · 2023-01-26T13:54:24Z

Same here

edfall changed the title ~~inconsistent return format of Dataframe group apply function!!~~ BUG: inconsistent return format of Dataframe group apply function!! May 2, 2016

edfall changed the title ~~BUG: inconsistent return format of Dataframe group apply function!!~~ BUG: inconsistent return format of Dataframe group apply function May 2, 2016

jreback added Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode Difficulty Intermediate labels May 2, 2016

jreback added the API Design label May 2, 2016

jreback added this to the 0.18.2 milestone May 9, 2016

jreback added the Master Tracker High level tracker for similar issues label May 9, 2016

This was referenced May 9, 2016

groupby(..).apply() has inconsistent behavior when there is only one group #13121

Closed

df.groupby().apply() with only one group returns wrong shape! #5839

Closed

It's difficult to predict what DataFrame.groupby().apply() will return: #9867

Closed

jreback modified the milestones: 0.19.0, 0.18.2 May 9, 2016

jreback mentioned this issue May 23, 2016

df.groupby(col).resample('x') returning both unstacked or multiindex result #13255

Closed

jreback modified the milestones: 0.18.2, 0.19.0 May 23, 2016

jorisvandenbossche modified the milestones: 0.20.0, 0.19.0 Sep 1, 2016

jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017

jreback modified the milestones: Next Major Release, High Level Issue Tracking Sep 24, 2017

jorisvandenbossche added the Master Tracker High level tracker for similar issues label Jul 7, 2018

h-vetinari mentioned this issue Aug 30, 2018

API: groupby aggregation with apply does not drop groupby-column #22542

Closed

jreback modified the milestones: 0.24.0, Contributions Welcome Dec 2, 2018

WillAyd mentioned this issue Dec 25, 2018

Cythonized GroupBy Quantile #20405

Merged

4 tasks

jbrockmendel removed Effort Medium labels Oct 21, 2019

mroeschke added Apply Apply, Aggregate, Transform, Map Bug labels Apr 14, 2020

fjetter mentioned this issue May 27, 2020

BUG: Ensure same index is returned for slow and fast path in groupby.apply #31613

Merged

6 tasks

jreback closed this as completed in #31613 May 27, 2020

TomAugspurger mentioned this issue Jun 15, 2020

Inconsistent output in GroupBy.apply returning a DataFrame #34809

Closed

jorisvandenbossche mentioned this issue Jul 13, 2020

API: User-control of result keys in GroupBy.apply #34998

Merged

TomAugspurger reopened this Jul 16, 2020

jreback modified the milestones: Contributions Welcome, 1.3 Nov 29, 2020

mroeschke removed the API Design label Apr 24, 2021

simonjayhawkins removed this from the 1.3 milestone Jun 11, 2021

jreback added this to the 1.5 milestone Mar 30, 2022

jreback closed this as completed in #34998 Mar 30, 2022

MoritzLaurer mentioned this issue Mar 7, 2023

DataFrame.groupby.apply returns different results with copy lambda functions. #14927

Closed
