-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: User-control of result keys in GroupBy.apply #34998
Conversation
Is it possible to leverage the existing as_index keyword instead of introducing a new one? I've never been quite clear on what the former should do, so maybe it should cover at least part of this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is exactly what group_keys
keys does no? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html?highlight=groupby#pandas.DataFrame.groupby
adding more keywords is a huge -1 here. Ok if you deprecate and remove one or more, but adding more to an already really complex operation is sure to cause way more bugs. |
The documentation of In [10]: def f(x):
...: return x.copy() # same index
In [11]: def g(x):
...: return x.copy().rename(lambda x: x + 1) # different index
In [12]: df = pd.DataFrame({"A": ['a', 'b'], "B": [1, 2]})
In [4]: df.groupby("A", group_keys=True).apply(f)
Out[4]:
A B
0 a 1
1 b 2
In [5]: df.groupby("A", group_keys=True).apply(g)
Out[5]:
A B
A
a 1 a 1
b 2 b 2
In [6]: df.groupby("A", group_keys=False).apply(f)
Out[6]:
A B
0 a 1
1 b 2
In [7]: df.groupby("A", group_keys=False).apply(g)
Out[7]:
A B
1 a 1
2 b 2 |
In [8]: df = DataFrame({"key": [1, 1, 1, 2, 2, 2, 3, 3, 3], "value": range(9)})
...: df.groupby("key", group_keys=True).apply(lambda x: x[:2])
...:
...:
...:
Out[8]:
key value
key
1 0 1 0
1 1 1
2 3 2 3
4 2 4
3 6 3 6
7 3 7
In [9]: df = DataFrame({"key": [1, 1, 1, 2, 2, 2, 3, 3, 3], "value": range(9)})
...: df.groupby("key", group_keys=False).apply(lambda x: x[:2])
...:
...:
...:
...:
Out[9]:
key value
0 1 0
1 1 1
3 2 3
4 2 4
6 3 6
7 3 7 So it seems the intent is similar to what we want. However I don't know about the defaults. |
might be ok to add result_type keyword similar to what we use in apply but must depreciate group_keys then |
Well, 50 failures in |
It does seem like the intent of
In [2]: df = pd.DataFrame({"A": [0, 0, 1], "B": [1, 2, 3]})
In [3]: gb = df.groupby("A", group_keys=False)
In [4]: gb.group_keys
Out[4]: False
In [5]: gb[['B']].group_keys
Out[5]: True
In [6]: gb['B'].group_keys
Out[6]: True I'll push a commit later with some updates to always honor group keys, but my initial reaction is that this is too large of a change. I'm sure there are people relying on the buggy behavior, so we'll need to do this with a deprecation. |
OK, tests should hopefully pass, but this isn't mergeable yet because of the behavior change. Namely honoring # 1.0.4
In [2]: df = pd.DataFrame({"A": [0, 0, 1], "B": [1, 2, 3]})
In [3]: df.groupby("A").apply(lambda x: x)
Out[3]:
A B
0 0 1
1 0 2
2 1 3 # PR
In [3]: df.groupby("A").apply(lambda x: x)
Out[3]:
A B
A
0 0 0 1
1 0 2
1 2 1 3 My next commit will have compatibility layer. We'll change the default Note that this doesn't help people who were already doing |
Here's the warning In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"A": [0, 0, 1], "B": [1, 2, 3]})
In [3]: df.groupby("A").apply(lambda x: x)
/Users/taugspurger/.virtualenvs/pandas-dev/bin/ipython:1: FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
To preserve the previous behavior, use
>>> .groupby(..., group_keys=False)
To adopt the future behavior and silence this warning, use
>>> .groupby(..., group_keys=True)
#!/Users/taugspurger/Envs/pandas-dev/bin/python
Out[3]:
A B
0 0 1
1 0 2
2 1 3 |
On board with the direction here; I was initially thinking through if the default should be True or False going forward but I think the way you have it is nice if can get green |
will look again soon |
…9-result-type � Conflicts: � pandas/core/groupby/groupby.py
…pandas into 34809-result-type
@jreback ping |
…9-result-type � Conflicts: � pandas/core/groupby/groupby.py
…pandas into 34809-result-type
@jreback - ping |
thanks @rhshadrach and @TomAugspurger nice to have the consistency! and of course let the bug reports roll in...... |
Really glad to see this in. Great work @rhshadrach! |
@TomAugspurger this is causing warnings in test_multiindex_group_all_columns_when_empty, can you update the affected tests to avoid/catch these warnings? |
Thanks for finding these @jbrockmendel - this will be addressed in #48159 |
- [x] This PR adds support for `group_keys` in `groupby`. Starting pandas 1.5.0, issues around `group_keys` have been resolved: pandas-dev/pandas#34998 pandas-dev/pandas#47185 - [x] This PR defaults `group_keys` to `False` which is the same as what pandas is going to be defaulting to in the future version. - [x] Required to unblock `pandas-1.5.0` upgrade in cudf: #11617 Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) - Ashwin Srinath (https://github.com/shwina) URL: #11659
@TomAugspurger want to make a PR enforcing this deprecation? |
@TomAugspurger - more than happy if you want to take this, otherwise I can take it. |
Thanks Richard. It’s all yours :)
…On Oct 28, 2022 at 2:04:22 PM, Richard Shadrach ***@***.***> wrote:
@TomAugspurger <https://github.com/TomAugspurger> - more than happy if
you want to take this, otherwise I can take it.
—
Reply to this email directly, view it on GitHub
<#34998 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAKAOIUKBYS3GFUR6RWSG73WFQPTNANCNFSM4OIX2YCA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Currently we determine whether to include the group keys as a level of the result's MultiIndex by looking at whether the result's index matches the original DataFrame. This provides a keyword to control that behavior so that users can consistently get the group keys or not, regardless of whether the udf happens to be a transform or not. The default behavior remains the same.
I called the keyword
result_group_keys
, but would welcome alternatives.Closes #34809
Closes #31612
Closes #14927
Closes #13056
Closes #27212
Closes #9704
cc @WillAyd and @jorisvandenbossche from the issue.