API: Allow groupby's by to take column and index names [WIP] #7033
ambiguous should raise (e.g. if it's in both the index & a column). de factor this: https://github.com/pydata/pandas/blob/master/pandas/tests/test_groupby.py#L3187
The problem with raising with an ambiguous case is it breaks backwards compat, since before
where does it break backwards compat? (or did it just assume that index is the better choice?)
df = DataFrame([[1, 2, 'x', 'a', 'a'],
[1, 3, 'x', 'a', 'b'],
[1, 4, 'x', 'b', 'a'],
[1, 5, 'y', 'b', 'b']],
columns=['c1', 'c2', 'g1', 'i1', 'i2'])
df = df.set_index(['i1', 'i2'])
df.index.names = ['i1', 'g1']
In [6]: df
Out[6]:
       c1  c2 g1
i1 g1
a  a    1   2  x
   b    1   3  x
b  a    1   4  x
   b    1   5  y
[4 rows x 3 columns]
In [7]: df.groupby('g1').mean() # g1 not c1 like I had before. So before this is fine since
you prob mean yes, I think that you SHOULD use the index value (and maybe warn), that would be good. We can treat this like a deprecation and have 0.16 cause this to raise (as if there is a conflict they should be using
Yeah I meant
oh...that's the problem, I think the existing code uses the
Shoot, this implementation won't work since it modifies the object being grouped, which breaks the promise that
@TomAugspurger what's the status of this?
closing pls reopen if/when updated
@TomAugspurger I'm interested in working on #5677 and will likely use this PR as a starting point. Tom, if you have time could you elaborate on your comment from May 5, 2014 where you mention that this current implementation violates a property of
@jmmease not 100% sure what I meant by that last comment. If I had to guess, it's that something like
df = DataFrame([[1, 2, 'x', 'a', 'a'],
                [2, 3, 'x', 'a', 'b'],
                [3, 4, 'x', 'b', 'a'],
                [4, 5, 'y', 'b', 'b']],
               columns=['c1', 'c2', 'g1', 'i1', 'i2'])
df = df.set_index(['i1', 'i2'])
In [25]: df
Out[25]:
       c1  c2 g1
i1 i2
a  a    1   2  x
   b    2   3  x
b  a    3   4  x
   b    4   5  y
I don't know what the expected index of this operation would be:
df.groupby(['i1', 'g1']).transform('max')
Typically the output of
closes #5677
As a reminder, with a data frame like:
Before, if you wanted to group by a column and an index level, you'd have to
Now you can just do df.groupby(['g0', 'g1']). In ambiguous cases where there's a key in by that's in both the index names and columns, we warn and proceed with grouping by the columns (I haven't tested this part yet).
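To make the rule in the description concrete, here is a minimal pure-Python sketch of how a single key in by could be resolved against column names and index level names. This is not the code in this PR: resolve_group_key is a hypothetical helper, and the warning class and message are assumptions. It only mirrors the stated behaviour (warn on ambiguity and proceed with the column), using the column and index names from the example frame discussed above.

```python
import warnings


def resolve_group_key(key, columns, index_names):
    """Hypothetical helper (not pandas API): say whether `key` names a
    column or an index level, following the rule in the PR description."""
    in_columns = key in columns
    in_index = key in index_names
    if in_columns and in_index:
        # Ambiguous key: warn, then proceed with the column.
        # The warning class and wording here are assumptions.
        warnings.warn(
            f"{key!r} is both a column and an index level name; "
            "grouping by the column",
            UserWarning,
            stacklevel=2,
        )
        return "column"
    if in_columns:
        return "column"
    if in_index:
        return "index"
    raise KeyError(key)


# Names taken from the example frame in the conversation:
# columns c1, c2, g1 and index levels i1, g1 (so 'g1' is ambiguous).
columns = ["c1", "c2", "g1"]
index_names = ["i1", "g1"]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved = [
        resolve_group_key(k, columns, index_names) for k in ["c1", "i1", "g1"]
    ]
```

Swapping the ambiguous branch to return "index", or to raise, would model the alternatives raised in the conversation (prefer the index value, or treat the warning as a deprecation that later becomes an error).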