-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: Expanded resample #13961
API: Expanded resample #13961
Conversation
@@ -4164,12 +4169,16 @@ def resample(self, rule, how=None, axis=0, fill_method=None, closed=None, | |||
""" | |||
from pandas.tseries.resample import (resample, | |||
_maybe_process_deprecations) | |||
if is_list_like(on): | |||
raise ValueError("Only a single column may be passed to on") | |||
if is_list_like(level): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move these inside resample
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually I think might be able to remove these entirely. When TimeGrouper._set_grouper
get's called, these are validated (same as in groupby). e.g. other fixes, #13907 should work for this as well.
would be ok with deprecating 'key' in favor of 'on' for Timegrouper as well |
df = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5), | ||
'a': np.arange(5)}, | ||
index=pd.MultiIndex.from_arrays([ | ||
[1,2,3,4,5], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would add to the main docs a similar example
Current coverage is 85.27% (diff: 83.33%)@@ master #13961 diff @@
==========================================
Files 139 139
Lines 50164 50510 +346
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 42788 43071 +283
- Misses 7376 7439 +63
Partials 0 0
|
@@ -377,6 +377,20 @@ Other enhancements | |||
|
|||
pd.Timestamp(year=2012, month=1, day=1, hour=8, minute=30) | |||
|
|||
- the ``.resample()`` function now accepts a ``on=`` or ``key=`` parameter for resampling on a column or ``MultiIndex`` level (:issue:`13500`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
key -> level
API design question: when using |
@jorisvandenbossche I don't think we actually set the index for anything that accepts |
But this will have the index set in the results. I do think that's the right thing to do - consistent with how
|
I actually don't have a big problem with the |
@chris-b1 I don't think we set when |
I'm probably not being clear. By "setting" what I mean is that we return this:
Instead of this - i.e., the
|
@chris-b1 I understand. And I think the point IS to return Certainly there is a case for this. E.g. Time is the grouper here, so we should set the index. And maybe this is different than the merging case rationale. Counterpoint is that the point of Just want to clearly delineate cases where we should and should not set the resulting index. |
xref #5755 - whether this is exactly what should be done, the basic rules now seem to be:
|
Small updates pushed for the comments. I'd propose the that setting the |
@chris-b1 With the current PR, if you have a non-reducing method, you get the following:
Which is I think not what we want? (the 'date' column should still be there?) I think either we should follow the logic from groupby (reducing -> set as index (current PR for reducing methods), transforming -> keep as column) or either keep it as column in all cases (to distinguish it as use case from |
You must not be using the same frame as my example? I show the index always being preserved, which is expected. I could see an argument that
|
@chris-b1 ah yes sorry, I used the same example dataframe, but without the index. It's because the index is identical to the column in your example that the results 'looks' good I think (because the index including the same values as the dropped column is still there):
|
I actually still think the current approach may be right. Is it any different than
|
Hmm, OK, I was in the belief that that preserved the 'a' column, apparantly not ... Oh, what a jumble this grouping api :-) EDIT: after some thought, it is indeed useful |
@@ -1473,6 +1473,28 @@ Furthermore, you can also specify multiple aggregation functions for each column | |||
r.agg({'A' : ['sum','std'], 'B' : ['mean','std'] }) | |||
|
|||
|
|||
If a ``DataFrame`` does not have a ``DatetimeIndex``, but instead you want | |||
to resample based on column in the frame, it can passed to the ``on`` keyword. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make it clear that the on
(currently) still must be a datetimelike (so we of course accept PeriodIndex/TimedeltaIndex
here as well (add tests if we don't have them for those as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use datetimelike
rather than DatetimeIndex
I think this is a fine extension of the API. It is in-line with other methods of how it produces output. |
Ok, one (hopefully) last api question. Right now this blows up in various ways if an upsample is attempted, e.g. |
@jreback - expanded tests like you suggested, moving as much the api to the This also now closes #14008, it's kind of a hack, but not sure there's a much better way without rethinking the whole approach, xref #12884. I'm using the |
|
||
on : string, optional | ||
For a DataFrame, column to use instead of index for resampling. | ||
Column must be datetime-like. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add versionadded tags
pls remove the 'hack'. make this PR simpler. |
I'm very open to suggestions, but this is as clean of approach as I can think of. I can always back out changes and just leave #14008 open if the fix is worse than the bug. |
yes pls back out the period resampling needs to be fixed in s more core way |
If you'd like to have another look, I've removed the |
@@ -255,7 +255,8 @@ def _set_grouper(self, obj, sort=False): | |||
Parameters | |||
---------- | |||
obj : the subject object | |||
|
|||
sort : bool, default False | |||
whether the resulting grouper should be sorted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was missing I guess?
# upsampling and PeriodIndex resampling do not work | ||
# if resampling on a column or mi level | ||
# this state used to catch and raise an error | ||
self._from_selection = (self.groupby.key is not None or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant make this a property
@property
def _from_selection(self):
return self.groupby is not None and (self.group.key is not None or self.groupy.level is not None)
@jreback - updated for that last comment, let me know if you see anything else. |
thanks @chris-b1 |
git diff upstream/master | flake8 --diff
My only question here is if
on=
is the keyword name for picking a column - in aTimeGrouper
that's calledkey
. Buton
is what was used in the rolling selection (and elsewhere) so seems consistent.