API: User-control of result keys in GroupBy.apply #34998

TomAugspurger · 2020-06-25T21:10:36Z

Currently we determine whether to include the group keys as a level of the result's MultiIndex by looking at whether the result's index matches the original DataFrame. This provides a keyword to control that behavior so that users can consistently get the group keys or not, regardless of whether the udf happens to be a transform or not. The default behavior remains the same.

I called the keyword result_group_keys, but would welcome alternatives.
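To make the description above concrete, here is a toy model of the two modes, using plain lists in place of pandas indexes. It is only a sketch and not pandas' implementation: `result_index`, `groups`, `same_labels` and `shifted_labels` are hypothetical names, and `result_group_keys=None` stands in for "keep guessing from the udf's output".

```python
# Toy model only: plain lists stand in for pandas indexes; this is not pandas' code.

def result_index(groups, udf, result_group_keys=None):
    """Build the combined row labels of a grouped apply.

    groups: dict mapping group key -> list of row labels in that group
    udf: function mapping a list of row labels to the labels of its result
    result_group_keys: None -> guess from the udf's output (existing behaviour);
                       True/False -> always/never prepend the group key.
    """
    pieces = {key: udf(labels) for key, labels in groups.items()}
    if result_group_keys is None:
        # Heuristic: a "transform-like" udf hands back the labels it was given.
        transform_like = all(pieces[key] == groups[key] for key in groups)
        prepend = not transform_like
    else:
        prepend = result_group_keys
    if prepend:
        return [(key, label) for key, labels in pieces.items() for label in labels]
    return [label for labels in pieces.values() for label in labels]


def same_labels(labels):
    return list(labels)  # same index as the input


def shifted_labels(labels):
    return [label + 1 for label in labels]  # different index


groups = {"a": [0], "b": [1]}
print(result_index(groups, same_labels))            # [0, 1]
print(result_index(groups, shifted_labels))         # [('a', 1), ('b', 2)]
print(result_index(groups, same_labels, True))      # [('a', 0), ('b', 1)]
print(result_index(groups, shifted_labels, False))  # [1, 2]
```

With an explicit `True` or `False` the shape of the result no longer depends on what the udf happens to return.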

Closes #34809
Closes #31612
Closes #14927
Closes #13056
Closes #27212
Closes #9704

cc @WillAyd and @jorisvandenbossche from the issue.

WillAyd · 2020-06-25T21:15:03Z

Is it possible to leverage the existing as_index keyword instead of introducing a new one? I've never been quite clear on what the former should do, so maybe it should cover at least part of this

jreback

this is exactly what group_keys does, no? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html?highlight=groupby#pandas.DataFrame.groupby

jreback · 2020-06-25T22:12:50Z

adding more keywords is a huge -1 here. Ok if you deprecate and remove one or more, but adding more to an already really complex operation is sure to cause way more bugs.

TomAugspurger · 2020-06-26T11:22:11Z

The documentation of group_keys isn't clear. Is that adding the groups to the inputs before the apply or the outputs after the apply? If it's intended for the outputs I can reuse it, but currently it seemingly has no effect for this use case:

In [10]: def f(x):
    ...:     return x.copy()  # same index

In [11]: def g(x):
    ...:     return x.copy().rename(lambda x: x + 1)  # different index

In [12]: df = pd.DataFrame({"A": ['a', 'b'], "B": [1, 2]})


In [4]: df.groupby("A", group_keys=True).apply(f)
Out[4]:
   A  B
0  a  1
1  b  2

In [5]: df.groupby("A", group_keys=True).apply(g)
Out[5]:
     A  B
A
a 1  a  1
b 2  b  2

In [6]: df.groupby("A", group_keys=False).apply(f)
Out[6]:
   A  B
0  a  1
1  b  2

In [7]: df.groupby("A", group_keys=False).apply(g)
Out[7]:
   A  B
1  a  1
2  b  2

TomAugspurger · 2020-06-26T11:31:31Z

group_keys does kick in here, when the applied function returns a subset of the dataframe.

In [8]: df = pd.DataFrame({"key": [1, 1, 1, 2, 2, 2, 3, 3, 3], "value": range(9)})
   ...: df.groupby("key", group_keys=True).apply(lambda x: x[:2])
Out[8]:
       key  value
key
1   0    1      0
    1    1      1
2   3    2      3
    4    2      4
3   6    3      6
    7    3      7

In [9]: df = pd.DataFrame({"key": [1, 1, 1, 2, 2, 2, 3, 3, 3], "value": range(9)})
   ...: df.groupby("key", group_keys=False).apply(lambda x: x[:2])
Out[9]:
   key  value
0    1      0
1    1      1
3    2      3
4    2      4
6    3      6
7    3      7

So it seems the intent is similar to what we want. However, I don't know about the defaults. group_keys defaults to True. We'll need to see whether that's appropriate for this case.
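For reference, the same comparison as a standalone script rather than an IPython session. This is a minimal sketch: the helper name `first_two` is made up here, and only the result indexes are inspected, since that is what group_keys controls in this case.

```python
import pandas as pd

df = pd.DataFrame({"key": [1, 1, 1, 2, 2, 2, 3, 3, 3], "value": range(9)})


def first_two(group):
    # Returns a subset of each group's rows, so the result is not like-indexed.
    return group.iloc[:2]


with_keys = df.groupby("key", group_keys=True).apply(first_two)
without_keys = df.groupby("key", group_keys=False).apply(first_two)

print(with_keys.index.nlevels)       # 2
print(without_keys.index.tolist())   # [0, 1, 3, 4, 6, 7]
```

With group_keys=True the group key becomes the outer level of a two-level index; with group_keys=False only the original row labels remain.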

jreback · 2020-06-26T11:42:47Z

might be ok to add result_type keyword similar to what we use in apply but must deprecate group_keys then

TomAugspurger · 2020-06-26T11:43:03Z

Well, 50 failures in tests/groupby when I just treat group_keys like you're suggesting (adding the group keys to the output). I'll see what I can do.

TomAugspurger · 2020-06-26T13:58:06Z

It does seem like the intent of group_keys is to control this sort of behavior. We just didn't apply it consistently for a few reasons.

It just wasn't used in _GroupBy._python_apply_general.
It's dropped when slicing a DataFrameGroupBy object

In [2]: df = pd.DataFrame({"A": [0, 0, 1], "B": [1, 2, 3]})

In [3]: gb = df.groupby("A", group_keys=False)

In [4]: gb.group_keys
Out[4]: False

In [5]: gb[['B']].group_keys
Out[5]: True

In [6]: gb['B'].group_keys
Out[6]: True
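One common way a setting like this gets lost is a derived object being rebuilt without the option being forwarded. The toy class below illustrates that pattern only; `ToyGroupBy`, `select_buggy` and `select_fixed` are hypothetical names, and the thread does not show the actual pandas code path.

```python
class ToyGroupBy:
    """Hypothetical stand-in for a groupby object; not pandas' class."""

    def __init__(self, columns, group_keys=True):
        self.columns = list(columns)
        self.group_keys = group_keys

    def select_buggy(self, cols):
        # Rebuilds the object without forwarding group_keys, so the default wins.
        return ToyGroupBy(cols)

    def select_fixed(self, cols):
        # Forwards the user's choice to the derived object.
        return ToyGroupBy(cols, group_keys=self.group_keys)


gb = ToyGroupBy(["A", "B"], group_keys=False)
print(gb.select_buggy(["B"]).group_keys)  # True
print(gb.select_fixed(["B"]).group_keys)  # False
```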

I'll push a commit later with some updates to always honor group keys, but my initial reaction is that this is too large of a change. I'm sure there are people relying on the buggy behavior, so we'll need to do this with a deprecation.

TomAugspurger · 2020-06-29T13:54:35Z

OK, tests should hopefully pass, but this isn't mergeable yet because of the behavior change. Namely, honoring group_keys changes the output shape of "transform-like" .apply:

# 1.0.4
In [2]: df = pd.DataFrame({"A": [0, 0, 1], "B": [1, 2, 3]})

In [3]: df.groupby("A").apply(lambda x: x)
Out[3]:
   A  B
0  0  1
1  0  2
2  1  3

# PR
In [3]: df.groupby("A").apply(lambda x: x)
Out[3]:
     A  B
A
0 0  0  1
  1  0  2
1 2  1  3

My next commit will have a compatibility layer. We'll change the default group_keys to None and warn if we get to a case where group_keys=None and we've detected a transform-like apply.

Note that this doesn't help people who were already doing .groupby(..., group_keys=True).apply(lambda x: x), but I think that's unavoidable. I think we just call that buggy behavior that's been fixed.
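The compatibility layer described above can be sketched with a `None` sentinel default. This is an illustration under assumptions, not the code in the PR: `resolve_group_keys` and its `transform_like` argument are hypothetical, and the warning text is shortened.

```python
import warnings


def resolve_group_keys(group_keys, transform_like):
    """Return whether to prepend group keys, warning only in the ambiguous case."""
    if group_keys is None:
        if transform_like:
            # The user did not choose, and the old and new behaviours differ here.
            warnings.warn(
                "Not prepending group keys to the result index of "
                "transform-like apply. Pass group_keys=True or False explicitly.",
                FutureWarning,
                stacklevel=2,
            )
            return False  # keep the old output shape for now
        return True  # non-transform results keep the previous default silently
    return bool(group_keys)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    undecided = resolve_group_keys(None, transform_like=True)   # warns
    explicit = resolve_group_keys(True, transform_like=True)    # silent

print(undecided, explicit, len(caught))  # False True 1
```

Users who already pass group_keys=True explicitly never reach the warning branch, which is why the change in their output cannot be signalled this way.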

TomAugspurger · 2020-06-29T14:35:29Z

Here's the warning

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"A": [0, 0, 1], "B": [1, 2, 3]})

In [3]: df.groupby("A").apply(lambda x: x)
/Users/taugspurger/.virtualenvs/pandas-dev/bin/ipython:1: FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
To preserve the previous behavior, use

        >>> .groupby(..., group_keys=False)

To adopt the future behavior and silence this warning, use

        >>> .groupby(..., group_keys=True)
  #!/Users/taugspurger/Envs/pandas-dev/bin/python
Out[3]:
   A  B
0  0  1
1  0  2
2  1  3

WillAyd · 2020-06-29T21:15:50Z

On board with the direction here; I was initially thinking through whether the default should be True or False going forward, but I think the way you have it is nice if we can get it green

jreback · 2022-02-20T16:20:19Z

will look again soon

rhshadrach · 2022-03-15T10:28:42Z

@jreback ping

rhshadrach · 2022-03-29T21:32:14Z

@jreback - ping

jreback · 2022-03-30T15:44:09Z

thanks @rhshadrach and @TomAugspurger nice to have the consistency!

and of course let the bug reports roll in......

TomAugspurger · 2022-03-30T15:45:11Z

Really glad to see this in. Great work @rhshadrach!

jbrockmendel · 2022-08-25T22:14:41Z

@TomAugspurger this is causing warnings in test_multiindex_group_all_columns_when_empty, can you update the affected tests to avoid/catch these warnings?

rhshadrach · 2022-08-26T18:35:30Z

Thanks for finding these @jbrockmendel - this will be addressed in #48159

- [x] This PR adds support for `group_keys` in `groupby`. Starting pandas 1.5.0, issues around `group_keys` have been resolved: pandas-dev/pandas#34998 pandas-dev/pandas#47185
- [x] This PR defaults `group_keys` to `False` which is the same as what pandas is going to be defaulting to in the future version.
- [x] Required to unblock `pandas-1.5.0` upgrade in cudf: #11617

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Ashwin Srinath (https://github.com/shwina)

URL: #11659

jbrockmendel · 2022-10-28T16:14:51Z

@TomAugspurger want to make a PR enforcing this deprecation?

rhshadrach · 2022-10-28T19:04:11Z

@TomAugspurger - more than happy if you want to take this, otherwise I can take it.

TomAugspurger · 2022-10-29T13:49:07Z

Thanks Richard. It’s all yours :)

API: User-control of result keys

0fa2104

TomAugspurger added the Groupby label Jun 25, 2020

TomAugspurger added this to the 1.1 milestone Jun 25, 2020

jreback requested changes Jun 25, 2020

TomAugspurger added 4 commits June 26, 2020 08:30

wip

13a38a2

Merge remote-tracking branch 'upstream/master' into 34809-result-type

ebb2a2d

mmm

f8d646f

updates

623526c

TomAugspurger mentioned this pull request Jun 26, 2020

Inconsistent output in GroupBy.apply returning a DataFrame #34809

Closed

TomAugspurger added 5 commits June 26, 2020 09:49

fixups

6871ed0

test fixups

00ce5dc

Merge remote-tracking branch 'upstream/master' into 34809-result-type

c28f176

update doctests

c0b1140

resample

cfd4d73

TomAugspurger added 5 commits June 29, 2020 09:06

warning

c05b1ea

remove debug

919a4c7

warning

4a45ea0

warning

7f1478f

warning

a1d4da8

wip

9c229c6

rhshadrach mentioned this pull request Feb 24, 2022

BUG: dataframe returned by groupby.apply() is missing outer group indices if row indices of returned dataframe are identical to those of the input dataframe #46041

Closed

3 tasks

rhshadrach added 3 commits February 27, 2022 11:47

Merge branch 'main' of https://github.com/pandas-dev/pandas into 34809-result-type

cb4bed6

Merge branch 'main' of https://github.com/pandas-dev/pandas into 34809-result-type
Conflicts: pandas/core/groupby/groupby.py

4ea1550

Merge branch '34809-result-type' of https://github.com/TomAugspurger/pandas into 34809-result-type

eace964

rhshadrach added 4 commits March 22, 2022 17:18

Merge branch 'main' of https://github.com/pandas-dev/pandas into 34809-result-type
Conflicts: pandas/core/groupby/groupby.py

6407abd

Merge branch '34809-result-type' of https://github.com/TomAugspurger/pandas into 34809-result-type

deb5479

Merge cleanup

a0fd04c

Merge branch 'main' of https://github.com/pandas-dev/pandas into 34809-result-type

f9aa547

jreback added this to the 1.5 milestone Mar 30, 2022

jreback approved these changes Mar 30, 2022

jreback merged commit efb262f into pandas-dev:main Mar 30, 2022

TomAugspurger deleted the 34809-result-type branch March 30, 2022 15:45

ian-r-rose mentioned this pull request Apr 8, 2022

[DO NOT MERGE] Test upstream CI build dask/dask#8875

Closed

rhshadrach mentioned this pull request Jul 12, 2022

API: numeric_only consistency #46560

Closed

9 tasks

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022

API: User-control of result keys in GroupBy.apply (pandas-dev#34998)

6b29968

galipremsagar mentioned this pull request Sep 7, 2022

Add support for group_keys in groupby rapidsai/cudf#11659

Merged

6 tasks

jorisvandenbossche mentioned this pull request Nov 8, 2022

API: Consolidate groupby as_index and group_keys #49543

Open

j-bennet mentioned this pull request Feb 15, 2023

Fix test_groupby_unaligned_index for pandas 2.0 dask/dask#9963

Merged

2 tasks

rhshadrach mentioned this pull request Mar 19, 2023

DEPR: Properly enforce group_keys defaulting to False in resample #52071

Merged

5 tasks
