BUG/API: allow TimeGrouper with other columns in a groupby (GH3794) #6516

Merged
merged 4 commits into from
Mar 15, 2014

jreback
Contributor

@jreback jreback commented Mar 1, 2014

closes #3794

In [3]: df
Out[3]: 
  Branch Buyer                Date  Quantity
0      A  Carl 2013-01-01 13:00:00         1
1      A  Mark 2013-01-01 13:05:00         3
2      A  Carl 2013-10-01 20:00:00         5
3      A  Carl 2013-10-02 10:00:00         1
4      A   Joe 2013-10-01 20:00:00         8
5      A   Joe 2013-10-02 10:00:00         1
6      A   Joe 2013-12-02 12:00:00         9
7      B  Carl 2013-12-02 14:00:00         3

[8 rows x 4 columns]

In [4]:    df.groupby([pd.Grouper(freq='1M',key='Date'),'Buyer']).sum()
Out[4]: 
                  Quantity
Date       Buyer          
2013-01-31 Carl          1
           Mark          3
2013-10-31 Carl          6
           Joe           9
2013-12-31 Carl          3
           Joe           9

[6 rows x 1 columns]

In [5]:    df = df.set_index('Date')

In [6]:    df['Date'] = df.index + pd.offsets.MonthEnd(2)

In [9]:    df.groupby([pd.Grouper(freq='6M',key='Date'),'Buyer']).sum()
Out[9]: 
                  Quantity
Date       Buyer          
2013-02-28 Carl          1
           Mark          3
2014-02-28 Carl          9
           Joe          18

[4 rows x 1 columns]

In [10]:    df.groupby([pd.Grouper(freq='6M',level='Date'),'Buyer']).sum()
Out[10]: 
                  Quantity
Date       Buyer          
2013-01-31 Carl          1
           Mark          3
2014-01-31 Carl          9
           Joe          18

[4 rows x 1 columns]
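
For readers who want to reproduce the first example, here is a minimal, self-contained sketch. It assumes a pandas release that ships `pd.Grouper`; it passes a `pd.offsets.MonthEnd()` object instead of the `'1M'` string because the spelling of the month-end alias has changed across pandas versions, and it selects `Quantity` explicitly so the string column `Branch` is not summed. The variable names are illustrative only.

```python
import pandas as pd

# Same rows as the frame shown in the description above.
df = pd.DataFrame(
    {
        "Branch": list("AAAAAAAB"),
        "Buyer": ["Carl", "Mark", "Carl", "Carl", "Joe", "Joe", "Joe", "Carl"],
        "Date": pd.to_datetime(
            [
                "2013-01-01 13:00", "2013-01-01 13:05",
                "2013-10-01 20:00", "2013-10-02 10:00",
                "2013-10-01 20:00", "2013-10-02 10:00",
                "2013-12-02 12:00", "2013-12-02 14:00",
            ]
        ),
        "Quantity": [1, 3, 5, 1, 8, 1, 9, 3],
    }
)

month_end = pd.offsets.MonthEnd()

# Time-group on a column (key=...) together with a plain column name.
by_column = df.groupby(
    [pd.Grouper(key="Date", freq=month_end), "Buyer"]
)["Quantity"].sum()

# Time-group on the index (level=...) together with a plain column name.
indexed = df.set_index("Date")
by_level = indexed.groupby(
    [pd.Grouper(level="Date", freq=month_end), "Buyer"]
)["Quantity"].sum()

print(by_column)
```

Both spellings should bin the same timestamps, so `by_column` and `by_level` are expected to hold the same monthly totals per buyer.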

@jreback jreback added this to the 0.14.0 milestone Mar 1, 2014
@jreback
Contributor Author

jreback commented Mar 1, 2014

@cpcloud @hayd @jorisvandenbossche
cc @TomAugspurger

lmk what you think about the way to specify what the TimeGrouper applies to (as maybe we should use this elsewhere), e.g. #5677

@jreback
Contributor Author

jreback commented Mar 1, 2014

maybe what would resolve both of these is to allow something like a G (which is essentially a generalization of TimeGrouper)

where this example could be written as:
df.reset_index().groupby([pd.G('Date',freq='1M'),'Buyer']).sum()

so if you want to specify something:

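class G(object):
    def __init__(self, name=None, level=None, freq=None):
        # allow name=None only if freq is not None or level is not None
        if name is None and freq is None and level is None:
            raise ValueError('must specify a name, level or freq')
        self.name, self.level, self.freq = name, level, freq

df.groupby([G(level='foo'), G('bar'), 'bah', G('Date', freq='6M')]).sum()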

you would groupby: a level named foo, a column bar, a column bah, and a column named Date with 6M freq

so column_name is equiv to G(column_name)

  • I think this wouldn't break back compat
  • allows us to 'hide' TimeGrouper
  • makes ability to groupby with an arbitrary object a bit easier
  • this would not preclude the ENH/API: clarify groupby by to handle columns/index names #5677 modification (and specifying level may not even be
    necessary if you search first the index then the columns)
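
The proposed `G` ended up being merged as `pd.Grouper`, with `key` in place of `name`. A rough sketch of the mixed index-level / column / time grouping described above, assuming a pandas release with `pd.Grouper` (the frame, its column names and the values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "bar": ["x", "x", "y", "y"],
        "Date": pd.to_datetime(
            ["2013-01-05", "2013-01-20", "2013-01-07", "2013-02-11"]
        ),
        "value": [1, 2, 3, 4],
    },
    index=pd.Index(["a", "a", "b", "b"], name="foo"),
)

keys = [
    pd.Grouper(level="foo"),  # an index level named foo
    "bar",                    # a plain column name, same as Grouper(key="bar")
    pd.Grouper(key="Date", freq=pd.offsets.MonthEnd()),  # a column, binned by month
]
totals = df.groupby(keys)["value"].sum()
print(totals)
```

Only the entries that need disambiguation (index level versus column, or a frequency) have to be wrapped; plain column names can stay as strings.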

@TomAugspurger
Contributor

This looks very useful. Just to make sure, you've only got the grouping by index level and column name working with the special case of a time grouper right?

I made some progress on #5677, but it's really messy. IIRC, the code for determining the grouping is really tightly tied to building the groups. Maybe I'll look at that again today and push something up. Then we can think about how to handle this also.

@jreback
Contributor Author

jreback commented Mar 2, 2014

tom that's right

a construct like G solves the problem and/or can try to figure it out

but for TimeGrouper it's necessary (you can figure it out too but it's more ambiguous and should be explicitly specified)

@jreback
Contributor Author

jreback commented Mar 11, 2014

@TomAugspurger @hayd anybody like/dislike the G syntax?

@hayd
Contributor

hayd commented Mar 11, 2014

I really like the syntax. Only issue I see is if G is too short/not descriptive...

@jreback
Contributor Author

jreback commented Mar 11, 2014

Grouper? Grouping (can alias G too)

@hayd
Contributor

hayd commented Mar 11, 2014

annoyingly both these already exist in groupby

@jreback
Contributor Author

jreback commented Mar 11, 2014

I know, but they are internal, so could rename

how about just pd.Group(...) (remember this is only needed in some limited circumstances)

@TomAugspurger
Contributor

I was thinking pd.Group().

@jorisvandenbossche
Member

The functionality here would really be a big improvement!

  • Isn't Grouper a little bit more correct in meaning than Group? As it specifies the grouper, i.e. by which column/index level/... the groups are made? If I am understanding it correctly, I would be in favor of Grouper.
  • For the docs, we could maybe recommend using from pandas import Group(er) as G?
  • The use of the name arg: any reason not to use column=? Or is there some similar functionality that uses name?

I was wondering, are there other places users would at the moment use TimeGrouper? Or only in groupby?

@jreback
Contributor Author

jreback commented Mar 12, 2014

This is going to become the base class of TimeGrouper; it's not really any different, but would allow one to disambiguate things.

Grouper is correct, but exists as a 'private' class in groupby.py

@jreback
Contributor Author

jreback commented Mar 13, 2014

updated....

@TomAugspurger @jorisvandenbossche @hayd @cpcloud

take a look when you have a chance (top section is updated)

This is actually not a very big change in implementation (e.g. determining how to group is still very complicated). But a bit cleaner I think. Could be cleaned up a bit more, e.g. a Categorical could be integrated into the Grouper hierarchy now (I'll open an issue for that).

@TomAugspurger
Contributor

Looks good.

When I group by just a pd.Grouper (not wrapped in a list), I get a TypeError: axis must be a DatetimeIndex, but got an instance of 'Int64Index'.

In [19]: df
Out[19]: 
                 Date Branch Buyer  Quantity
0 2013-10-01 13:00:00      A  Carl         1
1 2013-10-01 13:05:00      A  Mark         3
2 2013-10-01 20:00:00      A  Carl         5
3 2013-10-02 10:00:00      A  Carl         1
4 2013-10-01 20:00:00      A   Joe         8
5 2013-10-02 10:00:00      A   Joe         1
6 2013-10-02 12:00:00      A   Joe         9
7 2013-10-02 14:00:00      B  Carl         3

[8 rows x 4 columns]

In [20]: df.groupby(pd.Grouper(freq='1M', key='Date'))

I think we want to support this case, right?
Passing the Grouper inside a list works correctly. Does it get sent down a Series path with just the one grouper?
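
A small check of the case reported here, assuming a pandas release that includes the fix (at the time of this comment the bare form raised the `TypeError` quoted above). The helper name `monthly` is made up for this sketch; a fresh `Grouper` is built per call so no state is shared between the two groupbys.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            [
                "2013-10-01 13:00", "2013-10-01 13:05",
                "2013-10-01 20:00", "2013-10-02 10:00",
                "2013-10-01 20:00", "2013-10-02 10:00",
                "2013-10-02 12:00", "2013-10-02 14:00",
            ]
        ),
        "Quantity": [1, 3, 5, 1, 8, 1, 9, 3],
    }
)


def monthly():
    # Hypothetical helper: month-end bins on the Date column.
    return pd.Grouper(key="Date", freq=pd.offsets.MonthEnd())


bare = df.groupby(monthly())["Quantity"].sum()       # Grouper passed directly
in_list = df.groupby([monthly()])["Quantity"].sum()  # Grouper wrapped in a list

print(bare)
print(in_list)
```

All eight rows fall in October 2013, so both forms should report a single monthly total.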

@jreback
Contributor Author

jreback commented Mar 13, 2014

that's a bug

@hayd
Contributor

hayd commented Mar 13, 2014

I really like this change.

@jreback
Contributor Author

jreback commented Mar 13, 2014

well....turns out I had to refactor a bit to get a nice interface, whoosh. Upside is I actually understand grouping now. I put in lots of comments to explain, so should be easy to build on this (hopefully).

@TomAugspurger I think #5677 should be very straightforward on top of this and/or may not be necessary (as you can do it now, maybe just expand the docs section a tiny bit).

     in a more elegant / cleaner way by keeping internal
     groupby state inside the Grouper rather than passing
     around lots of results

DOC: minor doc edits for groupby.rst / v0.14.0
PEP8: minor pep changes
jreback added a commit that referenced this pull request Mar 15, 2014
BUG/API: allow TimeGrouper with other columns in a groupby (GH3794)
@jreback jreback merged commit 361f703 into pandas-dev:master Mar 15, 2014
if obj.index.name != level:
raise ValueError('level name %s is not the name of the '
'index' % level)
elif level > 0:
Contributor

@jreback is this elif supposed to be shifted left to be under the if at line 226?

@jreback
Contributor Author

jreback commented Mar 16, 2014

@TomAugspurger it prob could, but it's just catching an invalid level that is passed as a str for a multi-index
the next case catches an invalid level for an index

prob could be combined
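
To make the two cases concrete, here is one possible reading of the quoted hunk, written as a standalone function. This is only a sketch: the function name is hypothetical, the outer `isinstance` test is an assumption (it is not visible in the quoted lines), and the second error message is illustrative.

```python
def validate_single_index_level(index_name, level):
    """Hypothetical helper: reject a `level` that cannot refer to a flat index."""
    if isinstance(level, str):
        # Case 1: a level given by name must match the index name.
        if index_name != level:
            raise ValueError(
                "level name %s is not the name of the index" % level
            )
    elif level > 0:
        # Case 2: a positional level other than 0 needs a MultiIndex.
        raise ValueError("level > 0 only valid with MultiIndex")


# Accepted: the index name itself, or position 0.
validate_single_index_level("Date", "Date")
validate_single_index_level("Date", 0)

# Rejected: a name that is not the index name.
try:
    validate_single_index_level("Date", "Buyer")
except ValueError as err:
    print(err)
```

In this reading the `elif` pairs with the type check rather than with the name comparison, which is the placement the review question asks about.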

@@ -125,6 +125,8 @@ API Changes
``DataFrame.stack`` operations where the name of the column index is used as
the name of the inserted column containing the pivoted data.

- Allow specification of a more complex groupby, via ``pd.Groupby`` (:issue:`3794`)
Member

pd.Groupby -> pd.Grouper?

@jorisvandenbossche
Member

Some after merge comments:

  • Since pd.Grouper is used now in the docs and so a public object, the docstring of it should also be included in the api docs I think (in groupby section)
  • However that will mean, as it is a class and not a function, that all methods will also be added to the docs, which is maybe not really desirable. Are there any methods of pd.Grouper that a user would ever have to use? If not, maybe make them private?
  • I think the docstring could also be a bit clearer (eg more explanation of what arguments key and level exactly mean)

@jreback
Contributor Author

jreback commented Mar 17, 2014

i'll fix the docs....regarding your 2nd point.

none of the methods are meant to be public, but this is a 'private' class except for the actual constructor, so not sure what to do about that.

any suggestions?

@jorisvandenbossche
Member

just make all methods _...? So they don't appear in the docs/are user visible. Or is this tedious to always have to use _.. methods internally?

@jreback
Contributor Author

jreback commented Mar 17, 2014

object?   -> Details about 'object', use 'object??' for extra details.
pd.Grouper
In [1]: pd.Grouper?
Type:       type
String Form:<class 'pandas.core.groupby.Grouper'>
File:       /mnt/home/jreback/pandas/pandas/core/groupby.py
Docstring:
A Grouper allows the user to specify a groupby instruction for a target object,
e.g. the DataFrame that is being grouped.

This specification will select a column via the key parameter, or if the level and/or
axis parameters are given, a level of the index of the target object.

These are local specifications and will override 'global' settings, that is the parameters
axis and level which are passed to the groupby itself.

Parameters
----------
key : groupby key, which selects the grouping column of the target,
      defaults to None
level : name/number of the level for the target index, defaults to None
freq : string / frequency object, defaults to None
       This will groupby the specified frequency if the target selection (via key or level) is
       a datetime-like object
axis : number/name of the axis, defaults to None
sort : boolean, whether to sort the resulting labels, defaults to False

Returns
-------
A specification for a groupby instruction

Examples
--------
df.groupby(Grouper(key='A')) : syntactic sugar for df.groupby('A')
df.groupby(Grouper(key='date',freq='60s')) : specify a resample on the column 'date'
df.groupby(Grouper(level='date',freq='60s',axis=1)) :
   specify a resample on the level 'date' on the columns axis with a frequency of 60s
Constructor information:
 Definition:pd.Grouper(self, key=None, level=None, freq=None, axis=None, sort=False)

@jorisvandenbossche
Member

Can you try to keep to the numpydoc format?

arg : type, default ... .
    Explanation.

The examples will also be rendered nicely (on the html pages) if you do eg:

Specify a resample on the column 'date':

>>> df.groupby(Grouper(key='date',freq='60s'))

@jreback
Contributor Author

jreback commented Mar 17, 2014

don't want to make any methods internal. They are all 'internal' for the most part. I guess this could be changed but it's a bit of a project in itself.

@jreback
Contributor Author

jreback commented Mar 17, 2014

A Grouper allows the user to specify a groupby instruction for a target object,
e.g. the DataFrame that is being grouped.

This specification will select a column via the key parameter, or if the level and/or
axis parameters are given, a level of the index of the target object.

These are local specifications and will override 'global' settings, that is the parameters
axis and level which are passed to the groupby itself.

Parameters
----------
key : string, defaults to None
      groupby key, which selects the grouping column of the target
level : name/number, defaults to None
        the level for the target index
freq : string / frequency object, defaults to None
       This will groupby the specified frequency if the target selection (via key or level) is
       a datetime-like object
axis : number/name of the axis, defaults to None
sort : boolean, defaults to False
       whether to sort the resulting labels

Returns
-------
A specification for a groupby instruction

Examples
--------
>>> df.groupby(Grouper(key='A')) : syntactic sugar for df.groupby('A')
>>> df.groupby(Grouper(key='date',freq='60s')) : specify a resample on the column 'date'
>>> df.groupby(Grouper(level='date',freq='60s',axis=1)) :
    specify a resample on the level 'date' on the columns axis with a frequency of 60s
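
The first two docstring examples can be tried out with a runnable sketch like the one below. The frame and its values are made up for illustration, and it assumes a pandas release with `pd.Grouper`; the `axis=1` example is left out here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["a", "a", "b"],
        "date": pd.to_datetime(
            ["2014-03-17 10:00:05", "2014-03-17 10:00:40", "2014-03-17 10:01:10"]
        ),
        "value": [1, 2, 4],
    }
)

# Grouper(key='A') selects the same grouping column as the plain name 'A'.
via_grouper = df.groupby(pd.Grouper(key="A"))["value"].sum()
via_name = df.groupby("A")["value"].sum()

# Grouper(key='date', freq='60s') bins the 'date' column into 60-second buckets.
per_minute = df.groupby(pd.Grouper(key="date", freq="60s"))["value"].sum()

print(via_grouper)
print(per_minute)
```

The first two results should contain the same totals per value of `A`; the third has one row per 60-second bin of the `date` column.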

@jorisvandenbossche
Member

What do you mean with 'make' them internal? What I meant was just simply renaming them eg Grouper.get_grouper() to Grouper._get_grouper(). But I don't know how much they are used/to what extent they are new, so can't judge how much work this is / worth the trouble.

@jorisvandenbossche
Member

Another formatting comment: the explanation in

arg : type, default ... .
    Explanation.

does not have to be aligned with type, it is just indented 4 spaces from arg.

Something else, in A Grouper allows the user to specify a groupby instruction for a target object, e.g. the DataFrame that is being grouped., that last part eg the DataFrame ... is not really clear to me. Maybe just leave that out? (first part of the sentence is clear to me)

@jreback
Contributor Author

jreback commented Mar 17, 2014

addressed in #6655

@jorisvandenbossche
Member

@jreback Something else, TimeGrouper has still some more functionality than the new Grouper I think (eg the closed='left' as in http://stackoverflow.com/questions/14569223/timegrouper-pandas). So Grouper does not fully replace TimeGrouper usage (for users)? So should we still add TimeGrouper to the docs? or add this to Grouper?

@jreback
Contributor Author

jreback commented Mar 26, 2014

no

Grouper will create a TimeGrouper if passed a freq

and TimeGrouper is a subclass of Grouper

so any kw args are passed thru

there are some more args that are mainly used by resample but I am not sure that you care about for grouping

eg base,loffset don't matter when grouping

closed might though - if so I will add to Grouping docs
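
As an illustration of why `closed` can matter when grouping, here is a hedged sketch. It assumes a pandas release in which `pd.Grouper` accepts `closed` and `label` alongside `freq` (this thread only establishes that extra keyword arguments are passed through to `TimeGrouper`); the data is made up.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2014-01-01 00:00", "2014-01-01 12:00", "2014-01-02 00:00"]
        ),
        "Quantity": [1, 2, 4],
    }
)

# Daily bins closed on the left: [Jan 1, Jan 2) and [Jan 2, Jan 3).
left = df.groupby(
    pd.Grouper(key="Date", freq="D", closed="left", label="left")
)["Quantity"].sum()

# Daily bins closed on the right: (Dec 31, Jan 1] and (Jan 1, Jan 2].
right = df.groupby(
    pd.Grouper(key="Date", freq="D", closed="right", label="right")
)["Quantity"].sum()

print(left)   # Jan 1 -> 3, Jan 2 -> 4
print(right)  # Jan 1 -> 1, Jan 2 -> 6
```

The rows that sit exactly on a bin edge (midnight) move between bins depending on `closed`, which is the kind of difference raised in the question above.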

@jreback
Contributor Author

jreback commented Mar 26, 2014

c59bf0b

@jorisvandenbossche
Member

OK, Thanks!

Successfully merging this pull request may close these issues.

BUG/ENH: groupby with a list of customgroup and string should work
4 participants