BUG/API: allow TimeGrouper with other columns in a groupby (GH3794) #6516

jreback · 2014-03-01T22:37:45Z

In [3]: df
Out[3]: 
  Branch Buyer                Date  Quantity
0      A  Carl 2013-01-01 13:00:00         1
1      A  Mark 2013-01-01 13:05:00         3
2      A  Carl 2013-10-01 20:00:00         5
3      A  Carl 2013-10-02 10:00:00         1
4      A   Joe 2013-10-01 20:00:00         8
5      A   Joe 2013-10-02 10:00:00         1
6      A   Joe 2013-12-02 12:00:00         9
7      B  Carl 2013-12-02 14:00:00         3

[8 rows x 4 columns]

In [4]:    df.groupby([pd.Grouper(freq='1M',key='Date'),'Buyer']).sum()
Out[4]: 
                  Quantity
Date       Buyer          
2013-01-31 Carl          1
           Mark          3
2013-10-31 Carl          6
           Joe           9
2013-12-31 Carl          3
           Joe           9

[6 rows x 1 columns]

In [5]:    df = df.set_index('Date')

In [6]:    df['Date'] = df.index + pd.offsets.MonthEnd(2)

In [9]:    df.groupby([pd.Grouper(freq='6M',key='Date'),'Buyer']).sum()
Out[9]: 
                  Quantity
Date       Buyer          
2013-02-28 Carl          1
           Mark          3
2014-02-28 Carl          9
           Joe          18

[4 rows x 1 columns]

In [10]:    df.groupby([pd.Grouper(freq='6M',level='Date'),'Buyer']).sum()
Out[10]: 
                  Quantity
Date       Buyer          
2013-01-31 Carl          1
           Mark          3
2014-01-31 Carl          9
           Joe          18

[4 rows x 1 columns]
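For readers who want to reproduce the first example, here is a minimal sketch that rebuilds the frame shown above and groups by month end plus `Buyer`, once through a column (`key`) and once through an index level (`level`). It assumes a pandas release that ships `pd.Grouper`. Passing `pd.offsets.MonthEnd()` instead of the `'1M'` string and selecting `Quantity` explicitly are choices made here (newer pandas releases spell the month-end alias differently), not part of the original session.

```python
import pandas as pd

# Frame reconstructed from the In [3] output above.
df = pd.DataFrame({
    "Branch": list("AAAAAAAB"),
    "Buyer": ["Carl", "Mark", "Carl", "Carl", "Joe", "Joe", "Joe", "Carl"],
    "Date": pd.to_datetime([
        "2013-01-01 13:00", "2013-01-01 13:05", "2013-10-01 20:00",
        "2013-10-02 10:00", "2013-10-01 20:00", "2013-10-02 10:00",
        "2013-12-02 12:00", "2013-12-02 14:00",
    ]),
    "Quantity": [1, 3, 5, 1, 8, 1, 9, 3],
})

# Offset object used in place of the '1M' alias (an assumption made for
# portability across pandas versions).
month_end = pd.offsets.MonthEnd()

# Group on the 'Date' column (key=) together with the 'Buyer' column.
by_column = df.groupby(
    [pd.Grouper(key="Date", freq=month_end), "Buyer"]
)["Quantity"].sum()

# Same grouping, but with 'Date' living in the index (level=).
indexed = df.set_index("Date")
by_level = indexed.groupby(
    [pd.Grouper(level="Date", freq=month_end), "Buyer"]
)["Quantity"].sum()

print(by_column)
print(by_level)
```

Both calls should give the same monthly totals per buyer; the only difference is where the datetime values are looked up.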

jreback · 2014-03-01T22:40:15Z

@cpcloud @hayd @jorisvandenbossche
cc @TomAugspurger

lmk what you think about the way to specify what the TimeGrouper applies to (as maybe we should use this elsewhere), e.g. #5677

jreback · 2014-03-01T22:49:58Z

maybe what would resolve both of these is to allow something like a G (which is essentially a generalization of TimeGrouper)

where this example could be written as:
df.reset_index().groupby([pd.G('Date',freq='1M'),'Buyer']).sum()

so if you want to specify something:

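class G(object):
    def __init__(self, name=None, level=None, freq=None):
        # allow name=None only if freq is not None or level is not None
        if name is None and freq is None and level is None:
            raise ValueError('must specify a name, a level or a freq')
        self.name, self.level, self.freq = name, level, freq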

df.groupby([G(level='foo'),G('bar'),'bah',G('Date',freq='6M')]).sum()

you would groupby: a level named foo, a column bar, a column bah, and a column named Date with 6M freq

so column_name is equiv to G(column_name)

I think this wouldn't break back compat
allows us to 'hide' TimeGrouper
makes ability to groupby with an arbitrary object a bit easier
this would not preclude the #5677 modification (ENH/API: clarify groupby by to handle columns/index names), and specifying level may not even be necessary if you search first the index then the columns
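The equivalence described above (a bare column name behaving like a wrapped one) can be sketched with the name the thread eventually settles on, `pd.Grouper`, rather than the proposed `G`. This is an illustration under that assumption; the small frame and the `Quantity` selection are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Buyer": ["Carl", "Mark", "Carl", "Joe"],
    "Quantity": [1, 3, 5, 8],
})

# A plain column name ...
plain = df.groupby("Buyer")["Quantity"].sum()

# ... and the same column wrapped in a Grouper via key=.
wrapped = df.groupby(pd.Grouper(key="Buyer"))["Quantity"].sum()

# Grouping on an index level instead of a column uses level=.
indexed = df.set_index("Buyer")
by_level = indexed.groupby(pd.Grouper(level="Buyer"))["Quantity"].sum()

print(plain.sort_index())
print(wrapped.sort_index())
print(by_level.sort_index())
```

The results are sorted before printing because the ordering of group labels is not the point of the comparison.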

TomAugspurger · 2014-03-02T14:57:35Z

This looks very useful. Just to make sure, you've only got the grouping by index level and column name working with the special case of a time grouper right?

I made some progress on #5677, but it's really messy. IIRC, the code for determining the grouping is really tightly tied to building the groups. Maybe I'll look at that again today and push something up. Then we can think about how to handle this also.

jreback · 2014-03-02T15:32:32Z

tom that's right

a construct like G solves the problem and/or can try to figure it out

but for TimeGrouper it's necessary (you can figure it out too but it's more ambiguous and should be explicitly specified)

jreback · 2014-03-11T19:04:26Z

@TomAugspurger @hayd anybody like/dislike the G syntax?

hayd · 2014-03-11T19:18:42Z

I really like the syntax. Only issue I see is if G is too short/not descriptive...

jreback · 2014-03-11T19:20:29Z

Grouper? Grouping (can alias G too)

hayd · 2014-03-11T20:14:05Z

annoyingly both these already exist in groupby

jreback · 2014-03-11T20:19:10Z

I know, but they are internal, so could rename

how about just pd.Group(...) (remember this is only needed in some limited circumstances)

TomAugspurger · 2014-03-12T12:46:49Z

I was thinking pd.Group().

jorisvandenbossche · 2014-03-12T13:37:14Z

The functionality here would really be a big improvement!

Isn't Grouper a little bit more correct in meaning than Group? As it specifies the grouper, i.e. by which column/index level/... the groups are made? If I am understanding it correctly, I would be in favor of Grouper.
For the docs, we could maybe recommend using from pandas import Group(er) as G?
The use of the name arg: any reason not to use column=? Or is there some similar functionality that uses name?

I was wondering, are there other places users would at the moment use TimeGrouper? Or only in groupby?

jreback · 2014-03-12T13:51:17Z

This is going to become the base class of TimeGrouper; it's not really any different, but would allow one to disambiguate things.

Grouper is correct, but exists as a 'private' class in groupby.py

jreback · 2014-03-13T12:19:54Z

updated....

@TomAugspurger @jorisvandenbossche @hayd @cpcloud

take a look when you have a chance (top section is updated)

This is actually not a very big change in implementation (e.g. determining how to group is still very complicated), but it is a bit cleaner I think. It could be cleaned up a bit more, e.g. a Categorical could be integrated into the Grouper hierarchy now (I'll open an issue for that).

TomAugspurger · 2014-03-13T16:42:09Z

Looks good.

When I groupby by just a pd.Grouper, I get a TypeError: axis must be a DatetimeIndex, but got an instance of 'Int64Index'.

In [19]: df
Out[19]: 
                 Date Branch Buyer  Quantity
0 2013-10-01 13:00:00      A  Carl         1
1 2013-10-01 13:05:00      A  Mark         3
2 2013-10-01 20:00:00      A  Carl         5
3 2013-10-02 10:00:00      A  Carl         1
4 2013-10-01 20:00:00      A   Joe         8
5 2013-10-02 10:00:00      A   Joe         1
6 2013-10-02 12:00:00      A   Joe         9
7 2013-10-02 14:00:00      B  Carl         3

[8 rows x 4 columns]

In [20]: df.groupby(pd.Grouper(freq='1M', key='Date'))

I think we want to support this case, right?
Passing the Grouper inside a list works correctly. Does it get sent down a Series path with just the one grouper?
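The two call styles being compared can be written out as below. This is a sketch assuming a pandas release in which the bare-Grouper case works; the frame is rebuilt from the In [19] output, and `pd.offsets.MonthEnd()` stands in for the `'1M'` alias as a portability choice made here.

```python
import pandas as pd

# Frame reconstructed from the In [19] output above.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2013-10-01 13:00", "2013-10-01 13:05", "2013-10-01 20:00",
        "2013-10-02 10:00", "2013-10-01 20:00", "2013-10-02 10:00",
        "2013-10-02 12:00", "2013-10-02 14:00",
    ]),
    "Branch": list("AAAAAAAB"),
    "Buyer": ["Carl", "Mark", "Carl", "Carl", "Joe", "Joe", "Joe", "Carl"],
    "Quantity": [1, 3, 5, 1, 8, 1, 9, 3],
})

# Case reported as failing: the Grouper passed on its own.
single = df.groupby(
    pd.Grouper(key="Date", freq=pd.offsets.MonthEnd())
)["Quantity"].sum()

# Case reported as working: the same Grouper wrapped in a list.
listed = df.groupby(
    [pd.Grouper(key="Date", freq=pd.offsets.MonthEnd())]
)["Quantity"].sum()

print(single)
print(listed)
```

All eight rows fall in October 2013, so each result should hold a single monthly total.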

jreback · 2014-03-13T16:54:46Z

that's a bug

hayd · 2014-03-13T17:09:30Z

I really like this change.

jreback · 2014-03-13T22:22:04Z

well....turns out I had to refactor a bit to get a nice interface, whoosh. Upside is I actually understand grouping now. I put in lots of comments to explain, so should be easy to build on this (hopefully).

@TomAugspurger I think #5677 should be very straightforward on top of this and/or may not be necessary (as you can do it now, maybe just expand the docs section a tiny bit).

TomAugspurger · 2014-03-16T15:07:01Z

pandas/core/groupby.py

+                        if obj.index.name != level:
+                            raise ValueError('level name %s is not the name of the '
+                                             'index' % level)
+                    elif level > 0:


@jreback is this elif supposed to be shifted left to be under the if at line 226?

jreback · 2014-03-16T15:15:51Z

@TomAugspurger it prob could, but it's just catching an invalid level that is passed str a multi-index
the next case catches an invalid level for an index

prob could be combined
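To make the two branches under discussion easier to follow, here is a standalone sketch of that kind of level check. `check_level` is a hypothetical helper written for this note, not the code in `pandas/core/groupby.py`; the outer `isinstance` test and the wording of the second error message are assumptions, and only the first message is taken from the diff above.

```python
import pandas as pd


def check_level(index, level):
    """Hypothetical helper: validate `level` against a flat (non-Multi) index."""
    if isinstance(index, pd.MultiIndex):
        # MultiIndex levels are resolved elsewhere; not covered by this sketch.
        return
    if isinstance(level, str):
        # Branch from the diff: a level given by name must match the index name.
        if index.name != level:
            raise ValueError('level name %s is not the name of the '
                             'index' % level)
    elif level > 0:
        # Branch from the diff: a flat index only has level 0.
        # The message text here is assumed, not quoted from the PR.
        raise ValueError('level > 0 is only valid with a MultiIndex')


idx = pd.Index([1, 2, 3], name='foo')
check_level(idx, 'foo')   # matching name: passes silently
check_level(idx, 0)       # level 0: passes silently

for bad in ('bar', 1):
    try:
        check_level(idx, bad)
    except ValueError as err:
        print('rejected %r: %s' % (bad, err))
```

Written this way, the `elif` pairs with the type check rather than with the name comparison, which is one reading of the question about indentation.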

jorisvandenbossche · 2014-03-17T12:41:40Z

doc/source/release.rst

@@ -125,6 +125,8 @@ API Changes
    ``DataFrame.stack`` operations where the name of the column index is used as
    the name of the inserted column containing the pivoted data.

+- Allow specification of a more complex groupby, via ``pd.Groupby`` (:issue:`3794`)


pd.Groupby -> pd.Grouper?

jorisvandenbossche · 2014-03-17T12:45:15Z

Some after merge comments:

Since pd.Grouper is used now in the docs and so a public object, the docstring of it should also be included in the api docs I think (in groupby section)
However that will mean, as it is a class and not a function, that all methods will also be added to the docs, which is maybe not really desirable. Are there any methods of pd.Grouper that a user would ever have to use? If not, maybe make them private?
I think the docstring could also be a bit clearer (eg more explanation of what arguments key and level exactly mean)

jreback · 2014-03-17T12:53:30Z

i'll fix the docs....regarding your 2nd point.

none of the methods are meant to be public, but this is a 'private' class except for the actual constructor, so not sure what to do about that.

any suggestions?

jorisvandenbossche · 2014-03-17T12:56:31Z

just make all methods _...? So they don't appear in the docs/are user visible. Or is this tedious to always have to use _.. methods internally?

jreback · 2014-03-17T13:02:03Z

In [1]: pd.Grouper?
Type:       type
String Form:<class 'pandas.core.groupby.Grouper'>
File:       /mnt/home/jreback/pandas/pandas/core/groupby.py
Docstring:
A Grouper allows the user to specify a groupby instruction for a target object,
e.g. the DataFrame that is being grouped.

This specification will select a column via the key parameter, or if the level and/or
axis parameters are given, a level of the index of the target object.

These are local specifications and will override 'global' settings, that is the parameters
axis and level which are passed to the groupby itself.

Parameters
----------
key : groupby key, which selects the grouping column of the target,
      defaults to None
level : name/number of the level for the target index, defaults to None
freq : string / frequency object, defaults to None
       This will groupby the specified frequency if the target selection (via key or level) is
       a datetime-like object
axis : number/name of the axis, defaults to None
sort : boolean, whether to sort the resulting labels, defaults to False

Returns
-------
A specification for a groupby instruction

Examples
--------
df.groupby(Grouper(key='A')) : syntactic sugar for df.groupby('A')
df.groupby(Grouper(key='date',freq='60s')) : specify a resample on the column 'date'
df.groupby(Grouper(level='date',freq='60s',axis=1)) :
   specify a resample on the level 'date' on the columns axis with a frequency of 60s
Constructor information:
 Definition:pd.Grouper(self, key=None, level=None, freq=None, axis=None, sort=False)

jorisvandenbossche · 2014-03-17T13:06:05Z

Can you try to keep to the numpydoc format?

arg : type, default ... .
    Explanation.

The examples will also be rendered nicely (on the html pages) if you do eg:

Specify a resample on the column 'date':

>>> df.groupby(Grouper(key='date',freq='60s'))
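As an illustration of the layout being requested, here is a small function documented in numpydoc style. `sum_by` is a hypothetical helper invented for this sketch (it is not part of pandas); it only wraps `pd.Grouper` so that the docstring has something real to describe.

```python
import pandas as pd


def sum_by(df, key, freq=None):
    """
    Sum the numeric columns of a frame, grouped on one column.

    Parameters
    ----------
    df : DataFrame
        Frame to group.
    key : str
        Name of the column to group on.
    freq : str or offset object, default None
        If given, group the datetime-like column `key` at this frequency.

    Returns
    -------
    DataFrame
        One row per group.

    Examples
    --------
    Group on a plain column:

    >>> sum_by(df, 'Buyer')  # doctest: +SKIP
    """
    grouper = pd.Grouper(key=key, freq=freq)
    return df.groupby(grouper).sum(numeric_only=True)


df = pd.DataFrame({"Buyer": ["Carl", "Mark", "Carl"], "Quantity": [1, 3, 5]})
print(sum_by(df, "Buyer"))
```

Each parameter gets a `name : type, default ...` line followed by an indented explanation, and each example is introduced by a sentence before the `>>>` prompt.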

jreback · 2014-03-17T13:06:11Z

don't want to make any methods internal. They are all 'internal' for the most part. I guess this could be changed but it's a bit of a project in itself.

jreback · 2014-03-17T13:08:43Z

A Grouper allows the user to specify a groupby instruction for a target object,
e.g. the DataFrame that is being grouped.

This specification will select a column via the key parameter, or if the level and/or
axis parameters are given, a level of the index of the target object.

These are local specifications and will override 'global' settings, that is the parameters
axis and level which are passed to the groupby itself.

Parameters
----------
key : string, defaults to None
      groupby key, which selects the grouping column of the target
level : name/number, defaults to None
        the level for the target index
freq : string / frequency object, defaults to None
       This will groupby the specified frequency if the target selection (via key or level) is
       a datetime-like object
axis : number/name of the axis, defaults to None
sort : boolean, defaults to False
       whether to sort the resulting labels

Returns
-------
A specification for a groupby instruction

Examples
--------
>>> df.groupby(Grouper(key='A')) : syntactic sugar for df.groupby('A')
>>> df.groupby(Grouper(key='date',freq='60s')) : specify a resample on the column 'date'
>>> df.groupby(Grouper(level='date',freq='60s',axis=1)) :
    specify a resample on the level 'date' on the columns axis with a frequency of 60s

jorisvandenbossche · 2014-03-17T13:10:27Z

What do you mean with 'make' them internal? What I meant was just simply renaming them eg Grouper.get_grouper() to Grouper._get_grouper(). But I don't know how much they are used/to what extent they are new, so can't judge how much work this is / worth the trouble.

jorisvandenbossche · 2014-03-17T13:14:45Z

Another formatting comment: the explanation in

arg : type, default ... .
    Explanation.

does not have to be aligned with type, it is just indented 4 spaces from arg.

Something else: in "A Grouper allows the user to specify a groupby instruction for a target object, e.g. the DataFrame that is being grouped.", that last part ("e.g. the DataFrame ...") is not really clear to me. Maybe just leave that out? (the first part of the sentence is clear to me)

jreback · 2014-03-17T13:24:14Z

addressed in #6655

jorisvandenbossche · 2014-03-26T10:05:44Z

@jreback Something else, TimeGrouper has still some more functionality than the new Grouper I think (eg the closed='left' as in http://stackoverflow.com/questions/14569223/timegrouper-pandas). So Grouper does not fully replace TimeGrouper usage (for users)? So should we still add TimeGrouper to the docs? or add this to Grouper?

jreback · 2014-03-26T10:55:28Z

no

Grouper will create a TimeGrouper if passed a freq

and TimeGrouper is a subclass of Grouper

so any kw args are passed thru

there are some more args that are mainly used by resample but I am not sure that you care about for grouping

eg base,loffset don't matter when grouping

closed might though - if so I will add to Grouping docs
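The pass-through described above can be checked with a short sketch. It assumes a pandas release where `pd.Grouper` accepts `closed` alongside `freq`; the three-row frame is made up so that timestamps land exactly on a daily bin edge, which is where `closed` changes the grouping.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2013-10-01 00:00", "2013-10-01 12:00", "2013-10-02 00:00"]
    ),
    "Quantity": [1, 2, 4],
})

# Passing freq yields a time-based grouper that is still a Grouper.
grouper = pd.Grouper(key="Date", freq="D", closed="left")
print(type(grouper).__name__, isinstance(grouper, pd.Grouper))

# closed='left': each daily bin includes its left edge (midnight).
left = df.groupby(
    pd.Grouper(key="Date", freq="D", closed="left")
)["Quantity"].sum()

# closed='right': each daily bin includes its right edge instead,
# so rows stamped exactly at midnight move to the earlier bin.
right = df.groupby(
    pd.Grouper(key="Date", freq="D", closed="right")
)["Quantity"].sum()

print(left)
print(right)
```

Comparing the two printed results shows how the midnight rows are assigned differently depending on which side of the interval is closed.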

jreback · 2014-03-26T12:18:31Z

c59bf0b

jorisvandenbossche · 2014-03-26T15:33:35Z

OK, Thanks!

jreback added Enhancement labels Mar 1, 2014

jreback added this to the 0.14.0 milestone Mar 1, 2014

jreback mentioned this pull request Mar 13, 2014

ENH/CLN: add CategoricalGrouper #6626

Closed

jreback added 3 commits March 13, 2014 13:17

BUG/API: allow TimeGrouper with other columns in a groupby (GH3794)

a316f2f

CLN/API: replace groupby.CustomGrouper with Grouper

a7b19f9

rename internally Grouper to BaseGrouper to avoid conflict
TimeGrouper to now inherit from Grouper

DOC: update groupby docs for using pd.Grouper

2f667db

CLN: refactor of groupby/resample to handle Grouper

5e965e9

in a more elegant / cleaner way by keeping internal groupby state inside the Grouper rather than passing around lots of results
DOC: minor doc edits for groupby.rst / v0.14.0
PEP8: minor pep changes

jreback added a commit that referenced this pull request Mar 15, 2014

Merge pull request #6516 from jreback/time_grouper

361f703

BUG/API: allow TimeGrouper with other columns in a groupby (GH3794)

jreback merged commit 361f703 into pandas-dev:master Mar 15, 2014

jreback mentioned this pull request Mar 15, 2014

Make simpler API for using resampling infrastructure to do general gropuby #2450

Closed

TomAugspurger reviewed Mar 16, 2014

jorisvandenbossche reviewed Mar 17, 2014

jreback mentioned this pull request Mar 17, 2014

DOC/API: pd.Grouper docs / api #6655

Merged

jreback mentioned this pull request Apr 2, 2014

BUG: multiple grouping with a TimeGrouper requires sort #6764

Closed
