API: inconsistent return format of groupby apply #13056

Closed
5 of 6 tasks
edfall opened this issue May 2, 2016 · 16 comments · Fixed by #31613 or #34998
Labels
Apply Apply, Aggregate, Transform, Map Bug Groupby Master Tracker High level tracker for similar issues
Milestone

Comments

@edfall

edfall commented May 2, 2016

related issues:

Thank you very much for providing pandas; it is one of the best packages I have used!

Several times, though, I have used pandas and got an inconsistent return format that eventually broke my code.

  • for example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,1), columns = ['x'])
df['y'] = 1
df.groupby('y').apply(lambda x: x.x)

#x        0         1         2         3         4         5         6  \
#y                                                                        
#1 -1.12114  0.679616 -1.392863  0.032637 -0.051134  0.594201 -0.238833   

#x        7        8         9  
#y                              
#1  0.95173  1.07469 -0.062198  

df2 = df.copy()
df2['y'] = np.random.randint(1,3,10)
# do something similar like above
df2.groupby('y').apply(lambda x: x.x)
#y   
#1  3    0.032637
#   8    1.074690
#2  0   -1.121140
#   1    0.679616
#   2   -1.392863
#   4   -0.051134
#   5    0.594201
#   6   -0.238833
#   7    0.951730
#   9   -0.062198
#Name: x, dtype: float64

Even though I know how to avoid this issue, I still believe it would be better to return the same format. It is really annoying when an inconsistent return format forces me to go back and use awkward workarounds.

So, if possible, I suggest that the call above return only one kind of format.

@edfall edfall changed the title inconsistent return format of Dataframe group apply function!! BUG: inconsistent return format of Dataframe group apply function!! May 2, 2016
@edfall edfall changed the title BUG: inconsistent return format of Dataframe group apply function!! BUG: inconsistent return format of Dataframe group apply function May 2, 2016
@jreback
Contributor

jreback commented May 2, 2016

You are trying to do too much inside an apply, which is completely non-performant. .apply has to guess your return output. I suppose this is slightly confusing; in any event, you should simply be doing this:

In [14]: df.groupby('y').x.apply(lambda x: x)
Out[14]: 
0    2.481148
1   -1.324170
2    0.783518
3    0.869827
4   -0.080157
5    0.071685
6    0.987246
7    0.099149
8   -0.159449
9   -0.383200
Name: x, dtype: float64

In [15]: df2.groupby('y').x.apply(lambda x: x)
Out[15]: 
0    2.481148
1   -1.324170
2    0.783518
3    0.869827
4   -0.080157
5    0.071685
6    0.987246
7    0.099149
8   -0.159449
9   -0.383200
Name: x, dtype: float64

I'll mark it as an issue if you'd like to investigate whether this could be made consistent w/o breaking anything else (might not be possible).
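Editorial sketch, not part of the original comment: the index returned by `.apply(lambda x: x)` can depend on the pandas version and on `group_keys`, so the flat output shown above may be easier to reproduce with `transform`. The frames below are illustrative stand-ins for `df` and `df2`, and the names `flat_one` and `flat_two` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.standard_normal(10), "y": 1})  # a single group
df2 = df.assign(y=[1, 2] * 5)                              # two groups

# transform returns an object indexed like its input, so the shape
# does not depend on how many groups there are.
flat_one = df.groupby("y")["x"].transform(lambda s: s)
flat_two = df2.groupby("y")["x"].transform(lambda s: s)

print(flat_one.index.equals(df.index))
print(flat_two.index.equals(df2.index))
```

Note that this drops the group key from the result, which is the limitation raised in the next comment.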

@jreback jreback added Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode Difficulty Intermediate labels May 2, 2016
@jreback
Contributor

jreback commented May 2, 2016

This was changed by #3239.

I think a sort of special case was added.

@edfall
Author

edfall commented May 2, 2016

Yes, I think your approach (@jreback) does solve the inconsistent return format. However, sometimes it is important to keep the groupby variables as a label, which your method fails to do.

What's more, the function passed to .apply can become more complicated as the goal and the data change.

@jreback
Contributor

jreback commented May 2, 2016

@edfall and so how would you reconcile this issue?

I don't think you can infer what format the user wants and meet all users expectations.

@edfall
Author

edfall commented May 2, 2016

@jreback

Of course, it is really difficult to satisfy all users, but the return format of any specific operation should always stay the same. I think the result should at least look something like this:

 df.groupby('y').apply(lambda x: x.x)
#y
#1 0    2.481148
#  1   -1.324170
#  2    0.783518
#  3    0.869827
#  4   -0.080157
#  5    0.071685
#  6    0.987246
#  7    0.099149
#  8   -0.159449
#  9   -0.383200

Maybe this is because I have used the R package data.table, which can do similar things to .groupby().apply, but whose results are always predictable.
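Editorial sketch, not part of the original comment: the shape requested above (group key as the outer index level, whatever the number of groups) can be built explicitly by iterating over the groups and concatenating. The helper name `stack_with_key` and the sample frames are hypothetical; this is a workaround under those assumptions, not how pandas implements `apply`.

```python
import pandas as pd

def stack_with_key(df, key, func):
    """Apply func to each group and always stack the results under the group key."""
    pieces = {k: func(g) for k, g in df.groupby(key)}
    # A dict passed to concat adds its keys as the outermost index level.
    return pd.concat(pieces, names=[key])

one = pd.DataFrame({"x": [0.1, 0.2, 0.3], "y": [1, 1, 1]})  # one group
two = pd.DataFrame({"x": [0.1, 0.2, 0.3], "y": [1, 2, 1]})  # two groups

r1 = stack_with_key(one, "y", lambda g: g["x"])
r2 = stack_with_key(two, "y", lambda g: g["x"])

print(r1)
print(r2)
```

Both results are Series with a two-level index (group key, original row label). The cost is a Python-level loop over groups, so it may be slow when there are many groups.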

@jreback jreback added this to the 0.18.2 milestone May 9, 2016
@jreback jreback added the Master Tracker High level tracker for similar issues label May 9, 2016
@jreback jreback modified the milestones: 0.19.0, 0.18.2 May 9, 2016
@jreback jreback modified the milestones: 0.18.2, 0.19.0 May 23, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.20.0, 0.19.0 Sep 1, 2016
@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@jreback jreback modified the milestones: Next Major Release, High Level Issue Tracking Sep 24, 2017
@jorisvandenbossche jorisvandenbossche added the Master Tracker High level tracker for similar issues label Jul 7, 2018
@jorisvandenbossche
Member

Also related to this: #14927 (about detecting mutating functions)

@jorisvandenbossche
Member

Here is a short notebook with an overview of the behaviours: http://nbviewer.jupyter.org/gist/jorisvandenbossche/16f511fe111a8b9fa0eac74e920c5251

Some things to note:

  • We have a different code path depending on whether the original group (mutated or not) is returned (related to #14927, "DataFrame.groupby.apply returns different results with copy lambda functions").
    • If the original object is returned, a "transform" behaviour is used (keep original index order, don't add groupby key to the index)
    • I would personally deprecate this, and point users to groupby().transform(..)
  • Then for the "normal" apply behaviour, we have:
    • In general, specification is:
      • The func gets the full group (including the key column)
      • The results of func(group) are stacked, the groupby key is added to the Index (creating a MultiIndex)
    • Some specifics on how the results are stacked:
      • If it is a Series, we make a distinction between:
        • resulting Series that for each group has the same length as the index: stack vertically
        • resulting Series has fixed size (or different to group size): stack horizontally
      • scalars result in a final Series

I think in general the second bullet point above makes sense. Only the Series differentiation is a bit strange. Do we want to keep this?
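Editorial sketch, not part of the original comment: two of the stacking rules listed above (scalar results, and Series results with a fixed index) can be illustrated on a small frame. The data and the names `totals` and `ranges` are hypothetical. The column is selected explicitly with `[["x"]]` so the example does not depend on whether the key column is passed to the function, which is an assumption made for portability and differs from the "full group" wording above.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [1, 1, 2, 2]})
grouped = df.groupby("y")[["x"]]  # func receives a one-column DataFrame per group

# func returns a scalar: one value per group, collected into a Series.
totals = grouped.apply(lambda g: g["x"].sum())

# func returns a Series with the same fixed index for every group:
# the pieces become rows, and the Series labels become columns.
ranges = grouped.apply(
    lambda g: pd.Series({"min": g["x"].min(), "max": g["x"].max()})
)

print(totals)
print(ranges)
```

With a single group, a Series that merely echoes the group's rows also has "the same index for every group", which seems to be how the wide, one-row result in the original report arises.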

Additional note:

  • Why do we have both as_index=True/False and group_keys=True/False?
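Editorial sketch, not part of the original comment: as far as the two parameters are documented, `as_index` governs where the key ends up in aggregated output, while `group_keys` governs whether `apply` prepends the key to the result index. The frame and variable names below are hypothetical, and the exact behaviour for functions that return the group unchanged has varied between pandas versions, so the example uses a function that returns only part of each group.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [1, 1, 2, 2]})

# as_index: placement of the key in aggregation output.
agg_indexed = df.groupby("y", as_index=True)["x"].sum()   # key in the index
agg_column = df.groupby("y", as_index=False)["x"].sum()   # key as a regular column

def first_row(g):
    """Return the first row of each group (not a transform-like result)."""
    return g.head(1)

# group_keys: whether apply adds the key as an outer index level.
with_keys = df.groupby("y", group_keys=True)[["x"]].apply(first_row)
without_keys = df.groupby("y", group_keys=False)[["x"]].apply(first_row)

print(agg_column)
print(with_keys)
print(without_keys)
```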

cc @WillAyd

(the above is only for functions

@fred-xue

fred-xue commented Nov 9, 2018

I ran into the same issue. Need a consistent behavior for this.

@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Dec 2, 2018
@ratulbhadury

Also hitting a problem because of the inconsistent shape of what apply returns. +1 for fix please!

@liushapku

I get the same problem. It is very annoying that every time I call groupby(...).apply(), I need to check how many groups there are in order to process the result further. Please fix.

@TomAugspurger
Contributor

@liushapku are you interested in working on it?

@TomAugspurger
Contributor

This is being reverted in #35306 and will be re-fixed in #34998 in pandas 1.2.

@TomAugspurger TomAugspurger reopened this Jul 16, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.3 Nov 29, 2020
@simonjayhawkins simonjayhawkins removed this from the 1.3 milestone Jun 11, 2021
@jreback jreback added this to the 1.5 milestone Mar 30, 2022
@MoritzLaurer

MoritzLaurer commented Jan 25, 2023

I'm still encountering this issue of inconsistent return formats as mentioned in #5839 with pandas 1.4.4 and 1.5

@Polaris000

Same here
