API: Allow groupby's by to take column and index names [WIP] #7033

TomAugspurger · 2014-05-04T21:46:30Z

As a reminder, with a data frame like:

from itertools import cycle, islice

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(20, 2))
df.columns = ["foo","bar"]
df['g0'] = list(islice(cycle('ab'), 20))
df['g1'] = ['a'] * 10 + ['b'] * 10
df['g2'] = ['c'] * 5 + ['d'] * 5 + ['c'] * 5 + ['d'] * 5

df = df.set_index(['g1', 'g2'])

Before, if you wanted to group by a column and an index level, you'd have to do

g = df.reset_index().groupby(['g0', 'g1'])

Now you can just do df.groupby(['g0', 'g1']). In ambiguous cases, where a key in by appears in both the index names and the columns, we warn and proceed with grouping by the columns (I haven't tested this part yet).
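For comparison, here is a small sketch of two ways to get the same grouping without the change proposed in this PR: the reset_index() workaround from the description, and naming the index level explicitly with pd.Grouper(level=...). The column selection and the sum() aggregation are only illustrative choices, not part of the PR.

```python
from itertools import cycle, islice

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(20, 2), columns=["foo", "bar"])
df["g0"] = list(islice(cycle("ab"), 20))
df["g1"] = ["a"] * 10 + ["b"] * 10
df["g2"] = ["c"] * 5 + ["d"] * 5 + ["c"] * 5 + ["d"] * 5
df = df.set_index(["g1", "g2"])

# Workaround from the description: turn the index levels into columns first.
via_reset = df.reset_index().groupby(["g0", "g1"])[["foo", "bar"]].sum()

# Alternative: keep the index and refer to the level through pd.Grouper.
via_grouper = df.groupby(["g0", pd.Grouper(level="g1")])[["foo", "bar"]].sum()
```

Both results are indexed by (g0, g1) and hold the same sums; the second form leaves the original index of df untouched.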

jreback · 2014-05-04T22:20:29Z

ambiguous should raise (e.g. if it's in both the index & a column). de factor this: https://github.com/pydata/pandas/blob/master/pandas/tests/test_groupby.py#L3187

TomAugspurger · 2014-05-04T23:27:39Z

The problem with raising with an ambiguous case is it breaks backwards compat, since before by didn't even look at the index names. Should we warn so that we are backwards compatible?

jreback · 2014-05-04T23:33:51Z

where does it break backwards compat? (or did it just assume that index is the better choice?)

TomAugspurger · 2014-05-04T23:38:13Z

        df = DataFrame([[1, 2, 'x', 'a', 'a'],
                        [1, 3, 'x', 'a', 'b'],
                        [1, 4, 'x', 'b', 'a'],
                        [1, 5, 'y', 'b', 'b']],
                       columns=['c1', 'c2', 'g1', 'i1', 'i2'])
        df = df.set_index(['i1', 'i2'])
        df.index.names = ['i1', 'g1']

In [6]: df
Out[6]: 
       c1  c2 g1
i1 g1           
a  a    1   2  x
   b    1   3  x
b  a    1   4  x
   b    1   5  y

[4 rows x 3 columns]

In [7]: df.groupby('g1').mean()   # g1 not c1 like I had before.

So before, this was fine since by never looked at the index names. But now we'll either warn that we're using the g1 from the columns and not the one from the index, or raise a ValueError (which would break compat).

jreback · 2014-05-04T23:59:41Z

you prob mean g1, right?

yes, I think that you SHOULD use the index value (and maybe warn), that would be good. We can treat this like a deprecation and have 0.16 cause this to raise (as if there is a conflict they should be using pd.Grouper)

TomAugspurger · 2014-05-05T00:04:50Z

Yeah I meant g1. And did you mean "should use the column value" (not index?). Because column is the current behavior.

jreback · 2014-05-05T00:06:16Z

oh...that's the problem, I think the existing code uses the index value first (but I could be wrong, just seem to remember it this way). whatever it uses now it should continue to use (then warn with a FutureWarning).
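As a rough illustration of the "keep the current behavior, but warn with a FutureWarning" idea discussed above, here is a minimal sketch. The helper name resolve_group_key is hypothetical and is not the code in this PR; it assumes that the column wins when a key is ambiguous, matching what Tom describes as the existing behavior.

```python
import warnings

import pandas as pd


def resolve_group_key(df, key):
    """Hypothetical helper: return grouping values for a column or index name."""
    in_columns = key in df.columns
    in_index = key in df.index.names
    if in_columns and in_index:
        # Ambiguous: keep the old behavior (use the column), but warn.
        warnings.warn(
            f"{key!r} is both a column and an index level name; "
            "grouping by the column. Use pd.Grouper to be explicit.",
            FutureWarning,
            stacklevel=2,
        )
        return df[key].to_numpy()
    if in_columns:
        return df[key].to_numpy()
    if in_index:
        return df.index.get_level_values(key).to_numpy()
    raise KeyError(key)


df = pd.DataFrame([[1, 2, 'x', 'a', 'a'],
                   [1, 3, 'x', 'a', 'b'],
                   [1, 4, 'x', 'b', 'a'],
                   [1, 5, 'y', 'b', 'b']],
                  columns=['c1', 'c2', 'g1', 'i1', 'i2'])
df = df.set_index(['i1', 'i2'])
df.index.names = ['i1', 'g1']

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    keys = resolve_group_key(df, 'g1')  # ambiguous: column and index level

# Group by the resolved values; the column values 'x', 'x', 'x', 'y' are used.
result = df.groupby(keys)['c2'].sum()
```

Passing plain arrays to groupby sidesteps any name lookup, so the sketch behaves the same regardless of how a given pandas version treats the ambiguous name.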

TomAugspurger · 2014-05-05T13:49:05Z

Shoot, this implementation won't work since it modifies the object being grouped, which breaks the promise that .transform will return an object with the same index. I'll think about a fix...

jreback · 2015-01-25T23:19:56Z

@TomAugspurger what's the status of this?

jreback · 2015-01-25T23:20:55Z

@TomAugspurger what's the status of this?

jreback · 2015-05-09T16:08:38Z

closing pls reopen if/when updated

jonmmease · 2016-09-29T23:46:10Z

@TomAugspurger I'm interested in working on #5677 and will likely use this PR as a starting point. Tom, if you have time, could you elaborate on your comment from May 5, 2014, where you mention that the current implementation violates a property of .transform?

TomAugspurger · 2016-09-30T01:40:27Z

@jmmease not 100% sure what I meant by that last comment. If I had to guess, it's that something like

df = DataFrame([[1, 2, 'x', 'a', 'a'],
                [2, 3, 'x', 'a', 'b'],
                [3, 4, 'x', 'b', 'a'],
                [4, 5, 'y', 'b', 'b']],
               columns=['c1', 'c2', 'g1', 'i1', 'i2'])
df = df.set_index(['i1', 'i2'])

In [25]: df
Out[25]:
       c1  c2 g1
i1 i2
a  a    1   2  x
   b    2   3  x
b  a    3   4  x
   b    4   5  y

I don't know what the expected index of this operation would be

df.groupby(['i1', 'g1']).transform('max')

Typically the output of transform retains the index, so it should have the same index as the input df. But in my implementation I was calling .reset_index(), so the output had a range(n) index.
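The index problem can be reproduced with a short sketch. The pd.Grouper(level='i1') variant is only one way to group without resetting the index and is an assumption for illustration, not what this PR implemented; the column selection is likewise illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 'x', 'a', 'a'],
                   [2, 3, 'x', 'a', 'b'],
                   [3, 4, 'x', 'b', 'a'],
                   [4, 5, 'y', 'b', 'b']],
                  columns=['c1', 'c2', 'g1', 'i1', 'i2'])
df = df.set_index(['i1', 'i2'])

# Grouping through reset_index(): the transform output gets a fresh
# integer index, so it no longer lines up with df.
flat = df.reset_index().groupby(['i1', 'g1'])[['c1', 'c2']].transform('max')

# Grouping on the untouched frame: the transform output keeps df's index.
kept = df.groupby([pd.Grouper(level='i1'), 'g1'])[['c1', 'c2']].transform('max')

print(flat.index)  # integer positions
print(kept.index)  # the original (i1, i2) MultiIndex
```

Here kept.index.equals(df.index) holds while flat carries a RangeIndex, which is the mismatch the comment above describes.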

Tom Augspurger added 2 commits May 4, 2014 16:24

add the grouping code

2800fa0

add tests

45dc31b

jreback added API Design labels May 4, 2014

jreback added this to the 0.14.0 milestone May 4, 2014

refactor index resetting

8a79ea1

add test [ci skip]

6be446b

jreback modified the milestones: 0.14.1, 0.14.0 May 10, 2014

jreback modified the milestones: 0.15.0, 0.14.1 Jun 13, 2014

TomAugspurger mentioned this pull request Sep 5, 2014

Allowing the index to be referenced by name, like a column #8162

Closed

3 tasks

jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015

jreback closed this May 9, 2015

jonmmease mentioned this pull request Sep 29, 2016

ENH/API: clarify groupby by to handle columns/index names #5677

Closed

TomAugspurger deleted the groupby-level-names branch April 5, 2017 02:08

TomAugspurger restored the groupby-level-names branch April 5, 2017 02:08

TomAugspurger deleted the groupby-level-names branch April 5, 2017 02:08
