Consistency with groupby as_index #5755

hayd · 2013-12-20T06:42:07Z

Checklist:

head and tail act like filter
nth act like filter
nth on Series should dropna? (discuss?)
something subtle with some functions in agg/apply (again)
go through other methods to see if as_index acts correctly
cythonized vs with apply BUG/TST: verify that groupby apply with a column aggregation does not return the column #7002
get_group with as_index=False, BUG: Applying function on column of Groupby object with as_index=False does not select column #5764
returned group for a DataFrame.groupby.cumsum(), cumsum sums the groupby column #5614 / jreback@371a7bf / just need to test for transformers (that they ignore as_index)

I've had this half-written for a while, but I wasn't sure actually explain it. I'm still not sure, but here goes.

Would be intersting to discuss what peoples views on the correctness of these was.

I think there's scope for cleaning up some possible inconsistencies (unfortunately this would mean breaking the API):

In [2]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

In [3]: g = df.groupby('A', )

In [4]: g1 = df.groupby('A', as_index=False)

In [5]: g.head(1)   # I did a dirty hack to get the index like this!
Out[5]: 
     A  B
A        
1 0  1  2
5 2  5  6

In [6]: g1.head(1)
Out[6]: 
   A  B
0  1  2
2  5  6

In [7]: g.nth(0)
Out[7]: 
   B
A   
1  2
5  6

In [8]: g1.nth(0)  # the index doesn't match the original
Out[8]: 
   A  B
0  1  2
1  5  6

In [9]: g.filter(lambda x: x.B>0)
Out[9]: 
   A  B
0  1  2
1  1  4
2  5  6

In [10]: g1.filter(lambda x: x.B>0)
Out[10]: 
   A  B
0  1  2
1  1  4
2  5  6

In [13]: g.apply(lambda x: x.sum()) 
Out[13]: 
   A  B
A      
1  2  6
5  5  6

In [14]: g1.apply(lambda x: x.sum())  # but as_index=False
Out[14]: 
   A  B
A      
1  2  6
5  5  6

In [15]: g.apply(lambda x: x)  # no index ?
Out[15]: 
   A  B
0  1  2
1  1  4
2  5  6

In [16]: g1.apply(lambda x: x) 
Out[16]: 
   A  B
0  1  2
1  1  4
2  5  6

In [18]: g.agg(lambda x: x.sum())
Out[18]: 
   B
A   
1  6
5  6

In [19]: g1.agg(lambda x: x.sum())  # Note: A is not summed
Out[19]: 
   A  B
0  1  6
1  5  6

My opinion is that head/tail/nth should act like filter (i.e. both ignore the as_index).

The text was updated successfully, but these errors were encountered:

ghost · 2013-12-25T02:53:55Z

The docstring says:

        as_index : boolean, default True
            For aggregated output, return object with group labels as the
            index. Only relevant for DataFrame input. as_index=False is
            effectively "SQL-style" grouped output

So the contract on the index depends on the definition of what constitutes "aggregated output".
I'm guessing it means the ufuncs like sum. in which case nth should ignore as_index.

But that's a pretty strange design choice, I'd expect the result of a groupby to be a
single definite thing and all operations on it should behave as if they operated on that result.
In other words, if you use sum() and it behaves as if the groups had some set of indices
and when you use head() it behaves as if the groups had different indices, that's weird, and
I can't think of a reason why it makes sense.

By either definition though, nth seems baroquely broken. no idea. Since t's undocumented
I guess it's correct for some definition out there.

This looks like >=3 seperate issues to me.

hayd · 2014-01-27T21:40:57Z

Just to link #5552 on nth groupby.

Not sure what a good way to fix/break the current behaviour is (for 0.14)...

jreback · 2014-04-29T20:51:57Z

bumping to 0.15 / this is a more complete #7002 so going to add my example at the top

hayd · 2014-04-29T21:08:57Z

a couple of PRs doing atm tick off some of this as well as previous (checked) stuff

I think I'll tweak the top description later to reflect the change (this was not changed in 0.13.1) so maybe will see some people confused here

jreback · 2014-04-29T21:10:35Z

gr8!

jreback · 2014-04-29T21:11:45Z

as an aside,

I think we need to make clear, maybe in the filter section of groupby that

head/tail/nth are REALLY filtration methods (and so return the grouper)

as compared to first/last/sum/mean/describe etc (which DON't return the grouper)

yes?

jreback · 2014-04-29T21:16:00Z

I added a bullet point at the end of here: #6944

jorisvandenbossche · 2014-04-29T21:33:48Z

Maybe more appropriate here than in #7000, so repeating:

Looking at this discussion with a lot of Aaarghs (and also seeing some of the other issues with groupby), I was thinking: should we try to write down some 'design document' where we try to describe the 'rules'.

Eg the "we regard head as a filtering-like function (no aggregation), and those always ignore as_index/return the grouped values in their original column"-rule as used above.

This could maybe clarify some things for ourselves, and can be used as a reference for future PRs. (Or could it also turn out as a rabbit hole ..)

And I mean a rather detailed description for devs (and maybe simpler version in the docs, as added now to #6944)

jreback · 2014-04-29T21:39:52Z

just going to do that, I think a nice Note at the beginning of the aggregation/transform/filter secition should do it?

clear, reminding you that these functions return or don't return the grouped indices

jreback · 2014-04-29T21:43:54Z

or maybe here? http://pandas-docs.github.io/pandas-docs-travis/groupby.html#other-useful-features

jreback · 2014-05-01T17:46:46Z

@hayd alright I added the cumsum groupby consisncy here; will see if we get to before release
but ok for now I guess

jseabold · 2015-08-06T18:35:36Z

Parking this here, because I think (?) it's the same issue.

In [20]: paste
df = pd.DataFrame()
y = np.random.randint(2, size=25)
labels = np.repeat(['a', 'b', 'c', 'd', 'e'], 5)
df['dv'] = y
df['label'] = labels
gb = df.groupby(['label','dv'], as_index=False).size()

## -- End pasted text --

In [21]: gb
Out[21]:
label  dv
a      0     3
       1     2
b      0     3
       1     2
c      1     5
d      0     3
       1     2
e      0     2
       1     3
dtype: int64

hayd · 2015-08-06T18:49:01Z

@jseabold Is that: size doesn't respect as_index ?

jseabold · 2015-08-06T18:53:59Z

Yeah. It's an aggregation.

jreback · 2016-06-27T13:01:56Z

@jorisvandenbossche should update this, and include #13519

jorisvandenbossche · 2016-06-30T10:31:22Z

@jreback Yes, this is one of the topics I planned to tackle

dsaxton · 2019-08-18T13:16:05Z

I wrote out a separate issue for size (using pandas 0.25) and then found that this has been open for a while. My expectation would be that this would create a separate column (especially given that one of the common uses of as_index=False is to facilitate subsequent joins), perhaps with the counts simply being labeled "size". Could it make sense to follow this naming pattern with some of the other aggregations as well (e.g., generic apply) to be consistent with column-naming in SQL (the documentation does say that the output is "SQL-style")?

rhshadrach · 2020-04-27T23:30:33Z

Edit: This change was implemented in #34012

Currently nunique with as_index=False reports the number of unique elements for the grouping column, which is tautologically 1:

df = DataFrame({"A": list("abbacc"), "B": list("abxacc"), "C": list("abbacx")})
df.groupby("A", as_index=False).nunique()

df:

   A  B  C
0  a  a  a
1  b  b  b
2  b  x  b
3  a  a  a
4  c  c  c
5  c  c  x

result:

It seems to me it'd be better to leave the grouping column as-is:

Is there a reason the current behavior is a feature that we don't want to change?

rhshadrach · 2020-06-23T21:25:40Z

As far as I can tell, all issues identified here have been solved. I think this issue can be closed.

hayd mentioned this issue Jan 27, 2014

cumsum sums the groupby column #5614

Closed

This was referenced Mar 3, 2014

Grouby select multiple columns #6524

Closed

API change in groupby head and tail #6533

Merged

jreback modified the milestones: 0.15.0, 0.14.0 Apr 29, 2014

jorisvandenbossche mentioned this issue Apr 29, 2014

ENH/BUG: add count to grouper / ensure that grouper keys are not included in the returned #7000

Merged

jreback mentioned this issue May 1, 2014

BUG: Applying function on column of Groupby object with as_index=False does not select column #5764

Closed

hayd mentioned this issue May 30, 2014

v0.14 group.head() not really grouped #7287

Closed

jreback mentioned this issue Oct 28, 2014

BUG: column name conflict & as_index=False breaks groupby ops #8585

Merged

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jreback mentioned this issue Aug 18, 2015

Weird behaviour of groupby with quantile #10842

Closed

jreback mentioned this issue Sep 10, 2015

PERF: improves performance in GroupBy.cumcount #11039

Closed

jreback added the Master Tracker High level tracker for similar issues label Jun 27, 2016

chris-b1 mentioned this issue Aug 11, 2016

API: Expanded resample #13961

Closed

4 tasks

ghost mentioned this issue Jan 8, 2017

Unclear ValueError on core.groupby #15082

Closed

Kevin-McIsaac mentioned this issue Feb 2, 2017

as_index =True for groupby transform inconsistent with agg #15290

Closed

chris-b1 mentioned this issue Feb 27, 2018

Cythonized GroupBy pct_change #19919

Merged

4 tasks

jbrockmendel added API - Consistency Internal Consistency of API/Behavior and removed API Design labels Dec 18, 2019

rhshadrach mentioned this issue Apr 19, 2020

BUG: DataFrameGroupby std/sem modify grouped column when as_index=False #33630

Merged

5 tasks

rhshadrach mentioned this issue Jun 14, 2020

BUG: Groupby with as_index=False raises error when type is Category #34767

Merged

5 tasks

mroeschke closed this as completed Apr 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consistency with groupby as_index #5755

Consistency with groupby as_index #5755

hayd commented Dec 20, 2013

ghost commented Dec 25, 2013

hayd commented Jan 27, 2014

jreback commented Apr 29, 2014

hayd commented Apr 29, 2014

jreback commented Apr 29, 2014

jreback commented Apr 29, 2014

jreback commented Apr 29, 2014

jorisvandenbossche commented Apr 29, 2014

jreback commented Apr 29, 2014

jreback commented Apr 29, 2014

jreback commented May 1, 2014

jseabold commented Aug 6, 2015

hayd commented Aug 6, 2015

jseabold commented Aug 6, 2015

jreback commented Jun 27, 2016

jorisvandenbossche commented Jun 30, 2016

dsaxton commented Aug 18, 2019

rhshadrach commented Apr 27, 2020 •

edited

Loading

rhshadrach commented Jun 23, 2020 •

edited

Loading

Consistency with groupby as_index #5755

Consistency with groupby as_index #5755

Comments

hayd commented Dec 20, 2013

ghost commented Dec 25, 2013

hayd commented Jan 27, 2014

jreback commented Apr 29, 2014

hayd commented Apr 29, 2014

jreback commented Apr 29, 2014

jreback commented Apr 29, 2014

jreback commented Apr 29, 2014

jorisvandenbossche commented Apr 29, 2014

jreback commented Apr 29, 2014

jreback commented Apr 29, 2014

jreback commented May 1, 2014

jseabold commented Aug 6, 2015

hayd commented Aug 6, 2015

jseabold commented Aug 6, 2015

jreback commented Jun 27, 2016

jorisvandenbossche commented Jun 30, 2016

dsaxton commented Aug 18, 2019

rhshadrach commented Apr 27, 2020 • edited Loading

rhshadrach commented Jun 23, 2020 • edited Loading

rhshadrach commented Apr 27, 2020 •

edited

Loading

rhshadrach commented Jun 23, 2020 •

edited

Loading