API: Expanded resample #13961

chris-b1 · 2016-08-11T00:52:53Z

closes API: Expanded resample #13500
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

My only question here is whether on= is the right keyword name for picking a column; in a TimeGrouper that's called key. But on is what was used in the rolling selection (and elsewhere), so it seems consistent.

In [63]: df = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5),
   ....:                    'a': np.arange(5)},
   ....:                   index=pd.MultiIndex.from_arrays([
   ....:                            [1,2,3,4,5],
   ....:                            pd.date_range('2015-01-01', freq='W', periods=5)],
   ....:                        names=['v','d']))
   ....: 

In [64]: df
Out[64]: 
              a       date
v d                       
1 2015-01-04  0 2015-01-04
2 2015-01-11  1 2015-01-11
3 2015-01-18  2 2015-01-18
4 2015-01-25  3 2015-01-25
5 2015-02-01  4 2015-02-01

In [65]: df.resample('M', on='date').sum()
Out[65]: 
            a
date         
2015-01-31  6
2015-02-28  4

In [66]: df.resample('M', level='d').sum()
Out[66]: 
            a
d            
2015-01-31  6
2015-02-28  4
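For readers who want to reproduce the session above outside IPython, here is a minimal self-contained sketch. It assumes a pandas version that includes this PR (0.19 or later). Two things differ from the session and are my own choices, not part of the PR: a `pd.offsets.MonthEnd()` object is passed instead of the `'M'` string, because the spelling of that alias has changed across pandas releases, and the `level=` call selects column `a` explicitly so that the datetime column is not summed.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2015-01-01', freq='W', periods=5)
df = pd.DataFrame({'date': dates, 'a': np.arange(5)},
                  index=pd.MultiIndex.from_arrays([[1, 2, 3, 4, 5], dates],
                                                  names=['v', 'd']))

# Offset object used in place of the 'M' alias (assumption, see lead-in)
month_end = pd.offsets.MonthEnd()

# Resample on a column: the 'date' column drives the monthly bins
by_column = df.resample(month_end, on='date').sum()

# Resample on a MultiIndex level: level 'd' drives the monthly bins
by_level = df.resample(month_end, level='d')['a'].sum()

print(by_column)
print(by_level)
```

Both calls bin the five weekly rows into the same two month-end buckets; only the source of the datetime values differs.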

jreback · 2016-08-11T01:03:29Z

pandas/core/generic.py

@@ -4164,12 +4169,16 @@ def resample(self, rule, how=None, axis=0, fill_method=None, closed=None,
        """
        from pandas.tseries.resample import (resample,
                                             _maybe_process_deprecations)
+        if is_list_like(on):
+            raise ValueError("Only a single column may be passed to on")
+        if is_list_like(level):


I would move these inside resample

Actually, I think we might be able to remove these entirely. When TimeGrouper._set_grouper gets called, these are validated (same as in groupby). E.g. other fixes such as #13907 should work for this as well.

jreback · 2016-08-11T01:04:25Z

would be ok with deprecating 'key' in favor of 'on' for TimeGrouper as well

jreback · 2016-08-11T01:20:29Z

doc/source/whatsnew/v0.19.0.txt

+     df = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5),
+                        'a': np.arange(5)},
+                       index=pd.MultiIndex.from_arrays([
+                                [1,2,3,4,5],


would add to the main docs a similar example

codecov-io · 2016-08-11T07:02:16Z

Current coverage is 85.27% (diff: 83.33%)

Merging #13961 into master will decrease coverage by 0.02%

@@             master     #13961   diff @@
==========================================
  Files           139        139          
  Lines         50164      50510   +346   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42788      43071   +283   
- Misses         7376       7439    +63   
  Partials          0          0

Powered by Codecov. Last update 257ac88...10c7280

jorisvandenbossche · 2016-08-11T07:42:26Z

doc/source/whatsnew/v0.19.0.txt

@@ -377,6 +377,20 @@ Other enhancements

    pd.Timestamp(year=2012, month=1, day=1, hour=8, minute=30)

+- the ``.resample()`` function now accepts a ``on=`` or ``key=`` parameter for resampling on a column or ``MultiIndex`` level (:issue:`13500`)


key -> level

jorisvandenbossche · 2016-08-11T07:56:25Z

API design question: when using on, should that column be set as the index, or left as a column? (For rolling, we kept it as a column, but of course there the original index is left intact, which is not the case for resample.)

jreback · 2016-08-11T10:19:27Z

@jorisvandenbossche I don't think we actually set the index for anything that accepts key/on (.groupby / .merge) ATM, nor do I think we should. That's exactly the point, the index setting needs to be explicit.

chris-b1 · 2016-08-11T10:24:49Z

But this will have the index set in the results. I do think that's the right thing to do, since it's consistent with how groupby works elsewhere. It'd be nice if this was an option, but it doesn't seem to be implemented:

In [11]: df.groupby(pd.TimeGrouper(key='date', freq='M'), as_index=False).sum()
---------------------------------------------------------------------------

AttributeError: 'BinGrouper' object has no attribute 'compressed'

chris-b1 · 2016-08-11T10:28:42Z

I actually don't have a big problem with the key / on inconsistency. They do sort of represent different things - key is part of an abstract mapping, where on is a selection out of a concrete object?

jreback · 2016-08-11T10:44:59Z

@chris-b1 I don't think we set the index when on is indicated (in merging), and on doesn't exist in .groupby / .resample ATM (except via key). So can you show an example?

chris-b1 · 2016-08-11T10:49:07Z

I'm probably not being clear. By "setting" what I mean is that we return this:

In [19]: df.resample('M', on='date').sum()
Out[19]: 
            a
date         
2015-01-31  6
2015-02-28  4

Instead of this - i.e., the on column becomes an index.

In [20]: df.resample('M', on='date').sum().reset_index()
Out[20]: 
        date  a
0 2015-01-31  6
1 2015-02-28  4

jreback · 2016-08-11T10:53:54Z

@chris-b1 I understand. And I think the point IS to return [20]. we had this same discussion in #13358.

Certainly there is a case for this. E.g. time is the grouper here, so we should set the index. And maybe this is different than the merging case rationale. The counterpoint is that the point of on is to indicate a grouper that we don't want to set in the first place (e.g. this defeats the purpose of the .set_index(on_column).resample(...).reset_index(on_column) idiom).

Just want to clearly delineate cases where we should and should not set the resulting index.
I think having a table (in the dev docs?) that lays out the philosophy / rationale would be good (or just laying it out here is fine too).
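To make the two shapes under discussion concrete, here is a small sketch comparing the explicit set_index / resample / reset_index idiom with the `on=` keyword. It assumes a pandas version that includes this PR; the frame is the index-free variant of the example used in this thread, and `pd.offsets.MonthEnd()` stands in for the `'M'` alias (my substitution, since the alias spelling varies across pandas releases).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5),
                   'a': np.arange(5)})
month_end = pd.offsets.MonthEnd()

# Explicit idiom: set the index, resample, then move it back to a column
explicit = df.set_index('date').resample(month_end).sum().reset_index()

# With on=, the reduction puts 'date' in the result index ...
via_on = df.resample(month_end, on='date').sum()

# ... and reset_index() recovers the column form
as_column = via_on.reset_index()

print(via_on)
print(as_column)
```

The design question in the thread is whether `via_on` or `as_column` should be the default return value; the PR as written returns the former.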

chris-b1 · 2016-08-11T11:34:02Z

xref #5755 - whether or not this is exactly what should be done, the basic rules now seem to be:

- Transformations (sort, rolling, groupby.transform) - never set index
- Reductions (df.sum, df.groupby.sum) - set index to the reducing grouping. In some places (groupby is all I can think of) you have an option to keep the grouping as a column rather than an index
- Relational Ops
  - index based (concat, join, align) - set index to resulting set
  - column based (merge_*) - discard existing index
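The rules listed above can be checked with a few lines of pandas. This is only an illustrative sketch on a made-up frame (the names `df`, `other`, `k`, `v`, `w` are hypothetical and not from the PR), covering one transformation, one reduction, and one column-based relational op.

```python
import pandas as pd

df = pd.DataFrame({'k': [1, 1, 2], 'v': [10, 20, 30]},
                  index=['x', 'y', 'z'])

# Transformation: the result keeps the original index
transformed = df.groupby('k')['v'].transform('sum')

# Reduction: the grouping key becomes the index ...
reduced = df.groupby('k').sum()

# ... unless groupby is asked to keep it as a column
flat = df.groupby('k', as_index=False).sum()

# Column-based relational op: the existing indexes are discarded
other = pd.DataFrame({'k': [1, 2], 'w': [0.5, 1.5]}, index=['p', 'q'])
merged = df.merge(other, on='k')

print(transformed.index.tolist())
print(reduced.index.name)
print(flat.columns.tolist())
print(merged.index.tolist())
```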

chris-b1 · 2016-08-11T23:40:40Z

Small updates pushed for the comments. I'd propose that setting the on column as an index (current impl) would be the most consistent API design, but I can certainly change that if you feel otherwise.

jorisvandenbossche · 2016-08-15T18:36:46Z

@chris-b1 With the current PR, if you have a non-reducing method, you get the following:

In [4]: df.resample('M', on='date').transform(lambda x: x)
Out[4]: 
   a
0  0
1  1
2  2
3  3
4  4

Which I think is not what we want? (the 'date' column should still be there?)

I think we should either follow the logic from groupby (reducing -> set as index (current PR for reducing methods), transforming -> keep as column) or keep it as a column in all cases (to distinguish it as a use case from set_index('date').resample(..)).

chris-b1 · 2016-08-15T18:55:17Z

You must not be using the same frame as my example? I show the index always being preserved, which is expected. I could see an argument that 'date' should be preserved as a column in [3]:

In [1]: pd.__version__
Out[1]: '0.18.1+345.gc4db0e7'

In [2]: df
Out[2]: 
              a       date
v d                       
1 2015-01-04  0 2015-01-04
2 2015-01-11  1 2015-01-11
3 2015-01-18  2 2015-01-18
4 2015-01-25  3 2015-01-25
5 2015-02-01  4 2015-02-01

In [3]: df.resample('M', on='date').transform(lambda x: x)
Out[3]: 
              a
v d            
1 2015-01-04  0
2 2015-01-11  1
3 2015-01-18  2
4 2015-01-25  3
5 2015-02-01  4

In [4]: df.resample('M', level='d').transform(lambda x: x)
Out[4]: 
              a       date
v d                       
1 2015-01-04  0 2015-01-04
2 2015-01-11  1 2015-01-11
3 2015-01-18  2 2015-01-18
4 2015-01-25  3 2015-01-25
5 2015-02-01  4 2015-02-01

jorisvandenbossche · 2016-08-15T18:57:16Z

@chris-b1 ah yes sorry, I used the same example dataframe, but without the index. It's because the index is identical to the column in your example that the result 'looks' good, I think (because the index, which includes the same values as the dropped column, is still there):

In [63]: df2 = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5),
   ....:                     'a': np.arange(5)})

In [64]: df2.resample('M', on='date').transform(lambda x: x)
Out[64]: 
   a
0  0
1  1
2  2
3  3
4  4

chris-b1 · 2016-08-15T19:01:54Z

I actually still think the current approach may be right. Is it any different than 'a' being excluded from the results here?

In [11]: df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4]})

In [12]: df.groupby('a').transform(lambda x: x)
Out[12]: 
   b
0  1
1  2
2  3
3  4

jorisvandenbossche · 2016-08-15T19:08:33Z

Hmm, OK, I was under the impression that that preserved the 'a' column, apparently not ... Oh, what a jumble this grouping API is :-)
Other 'transforming' methods keep the grouping column (e.g. ffill or head)

EDIT: after some thought, it is indeed useful that transform does not preserve the grouping column; this way, assigning the result to a new column of the original dataframe is much easier.
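A short sketch of the assignment pattern mentioned in the EDIT, using the same frame as the groupby example above. The column name `b_group_mean` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4]})

# transform returns values aligned to the original index and without the
# grouping column, so the result can be assigned straight back
df['b_group_mean'] = df.groupby('a')['b'].transform('mean')

print(df)
```

Because the result carries the original row index, no merge or reindex step is needed before the assignment.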

jreback · 2016-08-15T22:31:39Z

doc/source/timeseries.rst

@@ -1473,6 +1473,28 @@ Furthermore, you can also specify multiple aggregation functions for each column
   r.agg({'A' : ['sum','std'], 'B' : ['mean','std'] })


+If a ``DataFrame`` does not have a ``DatetimeIndex``, but instead you want
+to resample based on column in the frame, it can passed to the ``on`` keyword.


make it clear that on (currently) still must be datetimelike (so we of course accept PeriodIndex/TimedeltaIndex here as well; add tests for those if we don't have them)

use datetimelike rather than DatetimeIndex
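The datetime-like requirement can be illustrated with a small sketch. It assumes a pandas version that includes this PR and that a non-datetime-like `on` column is rejected with a `TypeError`, which is the behavior I expect from pandas' resample validation rather than something stated in this thread. `pd.offsets.MonthEnd()` again stands in for the `'M'` alias.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5),
                   'a': np.arange(5)})
month_end = pd.offsets.MonthEnd()

# A datetime column is accepted as the resampling axis
ok = df.resample(month_end, on='date').sum()

# An integer column is not datetime-like, so it is expected to be rejected
try:
    df.resample(month_end, on='a').sum()
    rejected = False
except TypeError:
    rejected = True

print(ok)
print(rejected)
```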

jreback · 2016-08-15T22:44:26Z

I think this is a fine extension of the API. It is in-line with other methods of how it produces output.

chris-b1 · 2016-08-16T16:59:56Z

Ok, one (hopefully) last api question.

Right now this blows up in various ways if an upsample is attempted, e.g. df.resample('D', level='d').ffill(). Maybe that's a valid use case, but it's a lot less compelling to me, and doesn't map to groupby semantics, so what I'm proposing is to raise NotImplementedError for now.
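For upsampling, the explicit route that does not go through an on= / level= selection is to set the datetime column as the index first. A minimal sketch on the index-free variant of the example frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5),
                   'a': np.arange(5)})

# Set the datetime column as the index explicitly, then upsample to daily
# frequency and forward-fill the gaps between the weekly observations
daily = df.set_index('date').resample('D').ffill()

print(daily.head(8))
```

The weekly rows span 2015-01-04 through 2015-02-01, so the daily result has one row per calendar day in that range, each carrying the most recent weekly value of `a`.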

chris-b1 · 2016-08-18T00:13:48Z

@jreback - expanded tests like you suggested, moving as much of the API to the Base class as possible, so all three index types are hit.

This also now closes #14008, it's kind of a hack, but not sure there's a much better way without rethinking the whole approach, xref #12884.

I'm using the NotImplementedError approach for up-sampling suggested above.

jreback · 2016-08-18T22:50:00Z

pandas/core/generic.py

-
+        on : string, optional
+            For a DataFrame, column to use instead of index for resampling.
+            Column must be datetime-like.


add versionadded tags

jreback · 2016-08-18T22:51:54Z

pls remove the 'hack'. make this PR simpler.

chris-b1 · 2016-08-18T23:21:49Z

I'm very open to suggestions, but this is as clean an approach as I can think of. I can always back out changes and just leave #14008 open if the fix is worse than the bug.

jreback · 2016-08-18T23:24:29Z

yes pls back out

the period resampling needs to be fixed in a more core way

chris-b1 · 2016-08-20T15:12:16Z

If you'd like to have another look, I've removed the PeriodIndex changes. I did leave in the from_selection state (there may be a better name for this, though it's internal only), but it's only used to raise sensible errors for not-implemented things.

jreback · 2016-08-26T20:29:49Z

pandas/core/groupby.py

@@ -255,7 +255,8 @@ def _set_grouper(self, obj, sort=False):
        Parameters
        ----------
        obj : the subject object
-
+        sort : bool, default False
+            whether the resulting grouper should be sorted


this was missing I guess?

jreback · 2016-08-27T17:13:30Z

pandas/tseries/resample.py

+            # upsampling and PeriodIndex resampling do not work
+            # if resampling on a column or mi level
+            # this state used to catch and raise an error
+            self._from_selection = (self.groupby.key is not None or


I meant make this a property

    @property
    def _from_selection(self):
        return (self.groupby is not None and
                (self.groupby.key is not None or
                 self.groupby.level is not None))
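A standalone sketch of the property approach suggested here. `FakeGrouper` and `FakeResampler` are hypothetical stand-ins written for this illustration, not the actual pandas classes; only the shape of `_from_selection` follows the review comment.

```python
class FakeGrouper:
    """Hypothetical stand-in for a TimeGrouper-like object."""

    def __init__(self, key=None, level=None):
        self.key = key
        self.level = level


class FakeResampler:
    """Hypothetical stand-in for a resampler holding an optional grouper."""

    def __init__(self, groupby=None):
        self.groupby = groupby

    @property
    def _from_selection(self):
        # Derived on each access instead of stored in __init__, so it
        # always reflects the current grouper
        return (self.groupby is not None and
                (self.groupby.key is not None or
                 self.groupby.level is not None))


print(FakeResampler()._from_selection)
print(FakeResampler(FakeGrouper(key='date'))._from_selection)
print(FakeResampler(FakeGrouper(level='d'))._from_selection)
```

Computing the flag from the grouper on access avoids a separate piece of state that would have to be kept in sync.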

chris-b1 · 2016-08-30T01:28:58Z

@jreback - updated for that last comment, let me know if you see anything else.

jreback · 2016-08-31T13:14:29Z

thanks @chris-b1

API: Expanded resample

def74de

jreback reviewed Aug 11, 2016

jorisvandenbossche reviewed Aug 11, 2016

jorisvandenbossche added the Enhancement and Resample labels Aug 11, 2016

jorisvandenbossche added this to the 0.19.0 milestone Aug 11, 2016

move error handling; doc fixups

c4db0e7

jreback reviewed Aug 15, 2016

wip

b55309a

chris-b1 mentioned this pull request Aug 16, 2016

BUG: PeriodIndexGrouper fails with Grouper selection #14008

Open

more wip

7f9add4

add from_selection bookkeeping

5fd97d9

cleanup debugging

c7b299e

jreback reviewed Aug 18, 2016

chris-b1 added 2 commits August 20, 2016 08:41

remove PeriodIndex workaround

384026b

doc updates

e203fcf

jreback reviewed Aug 26, 2016

NotImp -> ValueError

10c7280

jreback reviewed Aug 27, 2016

make _from_selection a property

b8dd114

jreback closed this in 8654a9e Aug 31, 2016

jreback mentioned this pull request Sep 20, 2016

[0.19dev] Time-series aware rolling window with MultiIndex with a time-frequency index fails #14259

Closed

chris-b1 deleted the resample-api branch September 24, 2016 14:39
