API: Allow groupby's by to take column and index names [WIP] #7033

Closed

Conversation

TomAugspurger
Contributor

closes #5677

As a reminder, with a data frame like:

from itertools import cycle, islice

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(20, 2))
df.columns = ["foo","bar"]
df['g0'] = list(islice(cycle('ab'), 20))
df['g1'] = ['a'] * 10 + ['b'] * 10
df['g2'] = ['c'] * 5 + ['d'] * 5 + ['c'] * 5 + ['d'] * 5

df = df.set_index(['g1', 'g2'])

Before, if you wanted to group by a column and an index level, you'd have to do

g = df.reset_index().groupby(['g0', 'g1'])

Now you can just do df.groupby(['g0', 'g1']). In ambiguous cases, where a key in by appears in both the index names and the columns, we warn and proceed with grouping by the columns (I haven't tested this part yet).
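A minimal sketch comparing the two spellings on the frame above. It assumes a pandas release in which `by` accepts index level names alongside column names, which is the behaviour proposed here; on older releases the direct call fails because `g1` is not a column.

```python
from itertools import cycle, islice

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(20, 2), columns=["foo", "bar"])
df["g0"] = list(islice(cycle("ab"), 20))
df["g1"] = ["a"] * 10 + ["b"] * 10
df["g2"] = ["c"] * 5 + ["d"] * 5 + ["c"] * 5 + ["d"] * 5
df = df.set_index(["g1", "g2"])

# Old workaround: move the index levels into columns before grouping.
via_reset = df.reset_index().groupby(["g0", "g1"])[["foo", "bar"]].sum()

# Proposed spelling: "g0" is a column, "g1" is an index level name.
# Assumption: the installed pandas resolves level names passed to `by`.
direct = df.groupby(["g0", "g1"])[["foo", "bar"]].sum()

print(direct)
```

Both results should carry the same (g0, g1) group index and the same sums; the only difference is that the direct call leaves the original frame's index untouched.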

@jreback jreback added this to the 0.14.0 milestone May 4, 2014
@jreback
Contributor

jreback commented May 4, 2014

ambiguous should raise (e.g. if it's in both the index & a column). de factor this: https://github.com/pydata/pandas/blob/master/pandas/tests/test_groupby.py#L3187

@TomAugspurger
Contributor Author

The problem with raising in an ambiguous case is that it breaks backwards compat, since before, by didn't even look at the index names. Should we warn instead so that we stay backwards compatible?

@jreback
Contributor

jreback commented May 4, 2014

where does it break backwards compat? (or did it just assume that index is the better choice?)

@TomAugspurger
Contributor Author

        df = DataFrame([[1, 2, 'x', 'a', 'a'],
                        [1, 3, 'x', 'a', 'b'],
                        [1, 4, 'x', 'b', 'a'],
                        [1, 5, 'y', 'b', 'b']],
                       columns=['c1', 'c2', 'g1', 'i1', 'i2'])
        df = df.set_index(['i1', 'i2'])
        df.index.names = ['i1', 'g1']

In [6]: df
Out[6]: 
       c1  c2 g1
i1 g1           
a  a    1   2  x
   b    1   3  x
b  a    1   4  x
   b    1   5  y

[4 rows x 3 columns]

In [7]: df.groupby('g1').mean()   # g1 not c1 like I had before.

So before, this was fine, since by never looked at the index names. But now we'll either

  1. warn that we're using the g1 from the columns, and not the one from the index.
  2. raise a ValueError (which would break compat)
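To make the two options concrete, here is a small sketch that rebuilds the ambiguous frame and groups by each candidate `g1` explicitly. The final call passes the bare name and simply records what the installed pandas does with it; which outcome you see (a column-based result or a `ValueError`) depends on the version, so the sketch makes no claim about it.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, "x", "a", "a"],
     [1, 3, "x", "a", "b"],
     [1, 4, "x", "b", "a"],
     [1, 5, "y", "b", "b"]],
    columns=["c1", "c2", "g1", "i1", "i2"],
).set_index(["i1", "i2"])
df.index.names = ["i1", "g1"]  # "g1" now names a column AND an index level

# Unambiguous: group by the index level called "g1".
by_level = df.groupby(level="g1")["c2"].sum()

# Unambiguous: group by the values of the column called "g1".
by_column = df.groupby(df["g1"].to_numpy())["c2"].sum()

# Ambiguous: the bare name. Record the outcome instead of assuming it.
try:
    outcome = df.groupby("g1")["c2"].sum().to_dict()
except ValueError as exc:
    outcome = f"ValueError: {exc}"

print(by_level.to_dict(), by_column.to_dict(), outcome)
```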

@jreback
Contributor

jreback commented May 4, 2014

you prob mean g1, right?

yes, I think that you SHOULD use the index value (and maybe warn), that would be good. We can treat this like a deprecation and in 0.16 cause this to raise (as if there is a conflict they should be using pd.Grouper)
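For reference, a short sketch of the explicit `pd.Grouper` spelling on the conflicting frame. `pd.Grouper(level=...)` pins the key to the index level, so it can be combined with a plain column name without any guessing; the choice of `c1` as the second key is only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, "x", "a", "a"],
     [1, 3, "x", "a", "b"],
     [1, 4, "x", "b", "a"],
     [1, 5, "y", "b", "b"]],
    columns=["c1", "c2", "g1", "i1", "i2"],
).set_index(["i1", "i2"])
df.index.names = ["i1", "g1"]  # same conflict as above

# pd.Grouper(level="g1") always refers to the index level,
# "c1" is an ordinary, unambiguous column key.
explicit = df.groupby([pd.Grouper(level="g1"), "c1"])["c2"].sum()

print(explicit)
```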

@TomAugspurger
Contributor Author

Yeah I meant g1. And did you mean "should use the column value" (not index?). Because column is the current behavior.

@jreback
Contributor

jreback commented May 5, 2014

oh...that's the problem, I think the existing code uses the index value first (but I could be wrong, just seem to remember it this way). whatever it uses now it should continue to use (then warn with a FutureWarning).
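A rough sketch of the "keep the current behaviour, then warn" approach. `resolve_group_key` is a hypothetical helper written for this illustration, not part of pandas and not the code in this PR; it assumes the current behaviour is to use the column, as stated earlier in the thread.

```python
import warnings

import pandas as pd


def resolve_group_key(df, key):
    """Hypothetical helper: return the array of grouping values for `key`."""
    in_columns = key in df.columns
    in_index = key in df.index.names
    if in_columns and in_index:
        # Assumed current behaviour: the column wins, but the user is told
        # that the ambiguity will become an error later.
        warnings.warn(
            f"{key!r} is both a column and an index level name; "
            "using the column for now. Use pd.Grouper to be explicit.",
            FutureWarning,
            stacklevel=2,
        )
        return df[key].to_numpy()
    if in_columns:
        return df[key].to_numpy()
    if in_index:
        return df.index.get_level_values(key).to_numpy()
    raise KeyError(key)


df = pd.DataFrame(
    [[1, 2, "x", "a", "a"],
     [1, 3, "x", "a", "b"],
     [1, 4, "x", "b", "a"],
     [1, 5, "y", "b", "b"]],
    columns=["c1", "c2", "g1", "i1", "i2"],
).set_index(["i1", "i2"])
df.index.names = ["i1", "g1"]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    keys = resolve_group_key(df, "g1")   # ambiguous: warns, uses the column
    level_keys = resolve_group_key(df, "i1")  # index level only: no warning

result = df.groupby(keys)["c2"].sum()
print(result.to_dict(), [str(w.message) for w in caught])
```

Grouping by the returned array rather than by a modified frame also leaves the original index in place.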

@TomAugspurger
Contributor Author

Shoot, this implementation won't work since it modifies the object being grouped, which breaks the promise that .transform will return an object with the same index. I'll think about a fix...

@jreback jreback modified the milestones: 0.14.1, 0.14.0 May 10, 2014
@jreback jreback modified the milestones: 0.15.0, 0.14.1 Jun 13, 2014
@jreback
Contributor

jreback commented Jan 25, 2015

@TomAugspurger what's the status of this?

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@jreback
Contributor

jreback commented May 9, 2015

closing pls reopen if/when updated

@jreback jreback closed this May 9, 2015
@jonmmease
Contributor

@TomAugspurger I'm interested in working on #5677 and will likely use this PR as a starting point. Tom, if you have time, could you elaborate on your comment from May 5, 2014 where you mention that this current implementation violates a property of .transform?

@TomAugspurger
Contributor Author

TomAugspurger commented Sep 30, 2016

@jmmease not 100% sure what I meant by that last comment. If I had to guess, it's that something like

df = DataFrame([[1, 2, 'x', 'a', 'a'],
                [2, 3, 'x', 'a', 'b'],
                [3, 4, 'x', 'b', 'a'],
                [4, 5, 'y', 'b', 'b']],
               columns=['c1', 'c2', 'g1', 'i1', 'i2'])
df = df.set_index(['i1', 'i2'])

In [25]: df
Out[25]:
       c1  c2 g1
i1 i2
a  a    1   2  x
   b    2   3  x
b  a    3   4  x
   b    4   5  y

I don't know what the expected index of this operation would be

df.groupby(['i1', 'g1']).transform('max')

Typically the output of transform retains the index, so it should have the same index as the input df. But in my implementation I was calling .reset_index(), so the output had a range(n) index.
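A small sketch of the difference, under the assumption that the installed pandas accepts index level names in `by`. The third variant builds the level key with `get_level_values`, which avoids touching the frame at all and is one possible way around the problem, not necessarily the fix that was intended here.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, "x", "a", "a"],
     [2, 3, "x", "a", "b"],
     [3, 4, "x", "b", "a"],
     [4, 5, "y", "b", "b"]],
    columns=["c1", "c2", "g1", "i1", "i2"],
).set_index(["i1", "i2"])

cols = ["c1", "c2"]

# reset_index() first: the (i1, i2) index is gone from the result.
lost = df.reset_index().groupby(["i1", "g1"])[cols].transform("max")

# Level name resolved by groupby itself (assumes a pandas that supports it).
kept = df.groupby(["i1", "g1"])[cols].transform("max")

# Version-independent alternative: pass the level values as an array-like key.
level_key = df.index.get_level_values("i1")
kept_alt = df.groupby([level_key, "g1"])[cols].transform("max")

print(lost.index.tolist())
print(kept.index.tolist())
```

Only the variants that leave the frame unmodified return a result aligned with the input's (i1, i2) index.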

@TomAugspurger TomAugspurger deleted the groupby-level-names branch April 5, 2017 02:08
@TomAugspurger TomAugspurger restored the groupby-level-names branch April 5, 2017 02:08
@TomAugspurger TomAugspurger deleted the groupby-level-names branch April 5, 2017 02:08
Successfully merging this pull request may close these issues.

ENH/API: clarify groupby by to handle columns/index names
3 participants