Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistency with groupby as_index #5755

Closed
3 of 8 tasks
hayd opened this issue Dec 20, 2013 · 19 comments
Closed
3 of 8 tasks

Consistency with groupby as_index #5755

hayd opened this issue Dec 20, 2013 · 19 comments
Labels
API - Consistency Internal Consistency of API/Behavior Groupby Master Tracker High level tracker for similar issues

Comments

@hayd
Copy link
Contributor

hayd commented Dec 20, 2013

Checklist:

I've had this half-written for a while, but I wasn't sure actually explain it. I'm still not sure, but here goes.

Would be intersting to discuss what peoples views on the correctness of these was.

I think there's scope for cleaning up some possible inconsistencies (unfortunately this would mean breaking the API):

In [2]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

In [3]: g = df.groupby('A', )

In [4]: g1 = df.groupby('A', as_index=False)

In [5]: g.head(1)   # I did a dirty hack to get the index like this!
Out[5]: 
     A  B
A        
1 0  1  2
5 2  5  6

In [6]: g1.head(1)
Out[6]: 
   A  B
0  1  2
2  5  6

In [7]: g.nth(0)
Out[7]: 
   B
A   
1  2
5  6

In [8]: g1.nth(0)  # the index doesn't match the original
Out[8]: 
   A  B
0  1  2
1  5  6

In [9]: g.filter(lambda x: x.B>0)
Out[9]: 
   A  B
0  1  2
1  1  4
2  5  6

In [10]: g1.filter(lambda x: x.B>0)
Out[10]: 
   A  B
0  1  2
1  1  4
2  5  6

In [13]: g.apply(lambda x: x.sum()) 
Out[13]: 
   A  B
A      
1  2  6
5  5  6

In [14]: g1.apply(lambda x: x.sum())  # but as_index=False
Out[14]: 
   A  B
A      
1  2  6
5  5  6

In [15]: g.apply(lambda x: x)  # no index ?
Out[15]: 
   A  B
0  1  2
1  1  4
2  5  6

In [16]: g1.apply(lambda x: x) 
Out[16]: 
   A  B
0  1  2
1  1  4
2  5  6

In [18]: g.agg(lambda x: x.sum())
Out[18]: 
   B
A   
1  6
5  6

In [19]: g1.agg(lambda x: x.sum())  # Note: A is not summed
Out[19]: 
   A  B
0  1  6
1  5  6

My opinion is that head/tail/nth should act like filter (i.e. both ignore the as_index).

@ghost
Copy link

ghost commented Dec 25, 2013

The docstring says:

        as_index : boolean, default True
            For aggregated output, return object with group labels as the
            index. Only relevant for DataFrame input. as_index=False is
            effectively "SQL-style" grouped output

So the contract on the index depends on the definition of what constitutes "aggregated output".
I'm guessing it means the ufuncs like sum. in which case nth should ignore as_index.

But that's a pretty strange design choice, I'd expect the result of a groupby to be a
single definite thing and all operations on it should behave as if they operated on that result.
In other words, if you use sum() and it behaves as if the groups had some set of indices
and when you use head() it behaves as if the groups had different indices, that's weird, and
I can't think of a reason why it makes sense.

By either definition though, nth seems baroquely broken. no idea. Since t's undocumented
I guess it's correct for some definition out there.

This looks like >=3 seperate issues to me.

@hayd
Copy link
Contributor Author

hayd commented Jan 27, 2014

Just to link #5552 on nth groupby.

Not sure what a good way to fix/break the current behaviour is (for 0.14)...

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 29, 2014
@jreback
Copy link
Contributor

jreback commented Apr 29, 2014

bumping to 0.15 / this is a more complete #7002 so going to add my example at the top

@hayd
Copy link
Contributor Author

hayd commented Apr 29, 2014

a couple of PRs doing atm tick off some of this as well as previous (checked) stuff

I think I'll tweak the top description later to reflect the change (this was not changed in 0.13.1) so maybe will see some people confused here

@jreback
Copy link
Contributor

jreback commented Apr 29, 2014

gr8!

@jreback
Copy link
Contributor

jreback commented Apr 29, 2014

as an aside,

I think we need to make clear, maybe in the filter section of groupby that

head/tail/nth are REALLY filtration methods (and so return the grouper)

as compared to first/last/sum/mean/describe etc (which DON't return the grouper)

yes?

@jreback
Copy link
Contributor

jreback commented Apr 29, 2014

I added a bullet point at the end of here: #6944

@jorisvandenbossche
Copy link
Member

Maybe more appropriate here than in #7000, so repeating:

Looking at this discussion with a lot of Aaarghs (and also seeing some of the other issues with groupby), I was thinking: should we try to write down some 'design document' where we try to describe the 'rules'.

Eg the "we regard head as a filtering-like function (no aggregation), and those always ignore as_index/return the grouped values in their original column"-rule as used above.

This could maybe clarify some things for ourselves, and can be used as a reference for future PRs. (Or could it also turn out as a rabbit hole ..)

And I mean a rather detailed description for devs (and maybe simpler version in the docs, as added now to #6944)

@jreback
Copy link
Contributor

jreback commented Apr 29, 2014

just going to do that, I think a nice Note at the beginning of the aggregation/transform/filter secition should do it?

clear, reminding you that these functions return or don't return the grouped indices

@jreback
Copy link
Contributor

jreback commented Apr 29, 2014

@jreback
Copy link
Contributor

jreback commented May 1, 2014

@hayd alright I added the cumsum groupby consisncy here; will see if we get to before release
but ok for now I guess

@jseabold
Copy link
Contributor

jseabold commented Aug 6, 2015

Parking this here, because I think (?) it's the same issue.

In [20]: paste
df = pd.DataFrame()
y = np.random.randint(2, size=25)
labels = np.repeat(['a', 'b', 'c', 'd', 'e'], 5)
df['dv'] = y
df['label'] = labels
gb = df.groupby(['label','dv'], as_index=False).size()

## -- End pasted text --

In [21]: gb
Out[21]:
label  dv
a      0     3
       1     2
b      0     3
       1     2
c      1     5
d      0     3
       1     2
e      0     2
       1     3
dtype: int64

@hayd
Copy link
Contributor Author

hayd commented Aug 6, 2015

@jseabold Is that: size doesn't respect as_index ?

@jseabold
Copy link
Contributor

jseabold commented Aug 6, 2015

Yeah. It's an aggregation.

@jreback
Copy link
Contributor

jreback commented Jun 27, 2016

@jorisvandenbossche should update this, and include #13519

@jreback jreback added the Master Tracker High level tracker for similar issues label Jun 27, 2016
@jorisvandenbossche
Copy link
Member

@jreback Yes, this is one of the topics I planned to tackle

@dsaxton
Copy link
Member

dsaxton commented Aug 18, 2019

I wrote out a separate issue for size (using pandas 0.25) and then found that this has been open for a while. My expectation would be that this would create a separate column (especially given that one of the common uses of as_index=False is to facilitate subsequent joins), perhaps with the counts simply being labeled "size". Could it make sense to follow this naming pattern with some of the other aggregations as well (e.g., generic apply) to be consistent with column-naming in SQL (the documentation does say that the output is "SQL-style")?

@rhshadrach
Copy link
Member

rhshadrach commented Apr 27, 2020

Edit: This change was implemented in #34012

Currently nunique with as_index=False reports the number of unique elements for the grouping column, which is tautologically 1:

df = DataFrame({"A": list("abbacc"), "B": list("abxacc"), "C": list("abbacx")})
df.groupby("A", as_index=False).nunique()

df:

   A  B  C
0  a  a  a
1  b  b  b
2  b  x  b
3  a  a  a
4  c  c  c
5  c  c  x

result:

   A  B  C
0  1  1  1
1  1  2  1
2  1  1  2

It seems to me it'd be better to leave the grouping column as-is:

   A  B  C
0  a  1  1
1  b  2  1
2  c  1  2

Is there a reason the current behavior is a feature that we don't want to change?

@rhshadrach
Copy link
Member

rhshadrach commented Jun 23, 2020

As far as I can tell, all issues identified here have been solved. I think this issue can be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Groupby Master Tracker High level tracker for similar issues
Projects
None yet
Development

No branches or pull requests

8 participants