ENH/API: clarify groupby by to handle columns/index names #5677
Comments
I take it the idea is to provide sugar for grouping on a combination of a column and a MultiIndex level.
|
How about resolving level names in the |
I don't think you even need these arguments, just an enhancement to figure out what the user wants, e.g. if I see a label in the grouper, then follow a simple algo:
|
That's just what I propose, except that variation breaks backwards-compat. |
that's reasonable |
@TomAugspurger so maybe let's change the name of this issue to something like 'clarify the grouper'? That way it deals nicely with index/column names (and can raise/warn if there are duplicates, taking the columns in preference to the index names). If there STILL is ambiguity let's discuss (e.g. if you specify a label and it could possibly be misinterpreted somehow). |
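The resolution rule discussed above (columns take precedence over index level names, with a warning on collisions) could be sketched roughly as follows. `resolve_group_keys` is a hypothetical helper written only for illustration; it is not part of pandas and not necessarily how the feature was or would be implemented.

```python
import warnings

import pandas as pd


def resolve_group_keys(df, keys):
    """Hypothetical helper: map string labels to columns or index levels.

    Column names win over index level names; a collision triggers a warning.
    Non-string keys (functions, arrays, Groupers) are passed through untouched.
    """
    level_names = [name for name in df.index.names if name is not None]
    resolved = []
    for key in keys:
        if not isinstance(key, str):
            resolved.append(key)
            continue
        in_columns = key in df.columns
        in_index = key in level_names
        if in_columns and in_index:
            warnings.warn(
                f"{key!r} names both a column and an index level; using the column"
            )
            resolved.append(key)
        elif in_columns:
            resolved.append(key)
        elif in_index:
            # Group on the index level without resetting the index.
            resolved.append(pd.Grouper(level=key))
        else:
            raise KeyError(key)
    return resolved


# Illustrative frame: a two-level index plus two ordinary columns.
index = pd.MultiIndex.from_product([["a", "b"], [1, 2, 3]], names=["outer", "inner"])
df = pd.DataFrame(
    {"A": range(6), "B": ["one", "one", "two", "two", "one", "one"]}, index=index
)

keys = resolve_group_keys(df, ["B", "inner"])
result = df.groupby(keys)["A"].mean()
```

Here `"B"` stays a plain column label while `"inner"` becomes a `pd.Grouper(level="inner")`, so the caller never has to reset the index.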
Just as an example:
In [4]: df=mkdf(4,2,r_idx_nlevels=2)
   ...: df.columns = ["foo","bar"]
   ...: df.index.names = ["baz","foo"]
   ...: df
Out[4]:
                  foo   bar
baz     foo
R_l0_g0 R_l1_g0  R0C0  R0C1
R_l0_g1 R_l1_g1  R1C0  R1C1
R_l0_g2 R_l1_g2  R2C0  R2C1
R_l0_g3 R_l1_g3  R3C0  R3C1
[4 rows x 2 columns]
In [5]: df.groupby('foo') # not ambiguous now, but would be
Out[5]: <pandas.core.groupby.DataFrameGroupBy object at 0x39e9d50>
Since groupby accepts an
df=pd.DataFrame([[1,1,3],['a','b','c']],index=['a','b'])
   ...: df.groupby(['a'],1).groups
KeyError: u'no item named a'
to be equivalent to:
df=pd.DataFrame([[1,1,3],['a','b','c']],index=['a','b'])
   ...: df.T.groupby(['a']).groups # notice transpose
{1: [0, 1], 3: [2]} |
That all sounds reasonable. I can take a shot at implementing this in a few weeks. I'll probably have questions :) The other bit is that |
@TomAugspurger I don't think it accepts a list of mapping functions / series, only a single mapper or series (otherwise it should be an error) |
From the docstring.
I'll clarify that while I'm working on it. I just have to figure out what that actually does first. |
The docstring also says
Which is wrong. |
@TomAugspurger I think this is worthwhile.....and prob not too complex.... ? |
@jreback @TomAugspurger I'm interested in implementing this. Has anything changed since this discussion that I should be aware of? I'll likely use Tom's old PR (#7033) as a starting point. |
http://pandas.pydata.org/pandas-docs/stable/groupby.html#grouping-with-a-grouper-specification implements this, though you could add sugar for non-colliding names, which could be the name of an index or of a MultiIndex level |
Ok, thanks. So am I following that the way to accomplish the original example from this issue, without resetting the index, is the following?
In [75]: df.groupby(['B', pd.Grouper(level='inner')]).mean()
Out [75]:
             A
B   inner
one 1      0.0
    2      2.5
    3      5.0
two 1      3.0
    3      2.0
Should this approach also work when the frame has a single named index? e.g.
In [76]: df2 = df.reset_index('outer')
In [77]: df2
Out [77]:
      outer  A    B
inner
1         a  0  one
2         a  1  one
3         a  2  two
1         b  3  two
2         b  4  one
3         b  5  one
In [79]: df2.groupby(['B', pd.Grouper(level='inner')]).mean()
...
AttributeError: 'Int64Index' object has no attribute 'labels'
In this case I'm getting an AttributeError (I'm on pandas 0.18.1 and happy to file a bug if this is one). I would be interested in adding sugar in order to support df.groupby(['B', 'inner']).mean() and df2.groupby(['B', 'inner']).mean() where column names take precedence as discussed above. |
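For reference, the frames in this comment can be rebuilt and the two spellings compared as in the sketch below. The construction of `df` is inferred from the printed `df2` and is therefore an assumption. The last line also assumes a pandas release in which the single-index `AttributeError` reported above no longer occurs. Only column `A` is aggregated so that the string column `outer` does not interfere with `mean()`.

```python
import pandas as pd

# Frame reconstructed from the output shown above (assumed, not the original code).
index = pd.MultiIndex.from_product([["a", "b"], [1, 2, 3]], names=["outer", "inner"])
df = pd.DataFrame(
    {"A": range(6), "B": ["one", "one", "two", "two", "one", "one"]}, index=index
)
df2 = df.reset_index("outer")  # single named index: "inner"

# Grouping on a column plus an index level, without resetting the index.
via_grouper = df.groupby(["B", pd.Grouper(level="inner")])["A"].mean()

# The workaround the issue wants to avoid: reset the index first.
via_reset = df.reset_index().groupby(["B", "inner"])["A"].mean()

# Same Grouper spelling on the frame with a single named index.
single_index = df2.groupby(["B", pd.Grouper(level="inner")])["A"].mean()
```

All three results should hold the same group means; the proposed sugar `df.groupby(['B', 'inner'])` would be a shorter way to write the first form.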
Yes, I think it should work. For example, it works when only specifying the index in this way:
So using it in a list to group by multiple columns/indexes should also work. Do you want to open a separate bug report for this? |
Yes, I'll open a separate bug report for this issue in a few hours. |
Referenced briefly in the OP at #3275
So the idea is to be able to call
instead of
Currently this raises TypeError: 'numpy.ndarray' object is not callable. Mostly just syntactic sugar, but I've been having to do a lot of this lately and all the reset_indexes are getting annoying. Thoughts?