ENH: add 'origin' and 'offset' arguments to 'resample' and 'pd.Grouper' #31809

hasB4K · 2020-02-08T19:11:11Z

EDIT: this PR has changed, now instead of adding adjust_timestamp we are adding origin and offset arguments to resample and pd.Grouper (see #31809 (comment))

Hello,

This enhancement is an alternative to the base argument present in pd.Grouper or in the method resample. It adds the adjust_timestamp argument to change the current behavior of: https://github.com/pandas-dev/pandas/blob/master/pandas/core/resample.py#L1728

adjust_timestamp is the timestamp on which to adjust the grouping. If None is passed, the first day of the time series at midnight is used.

Currently the bins of the grouping are adjusted based on the beginning of the day of the time series starting point. This works well with frequencies that are multiples of a day (like 30D) or that divide a day evenly (like 90s or 1min). But it can create inconsistencies with frequencies that do not meet these criteria.

Here is a simple snippet from a test that I added. It shows that the current behavior can lead to inconsistencies, and that they can be fixed if we use adjust_timestamp:

import pandas as pd
import numpy as np
import pandas._testing as tm
import pytest


freq = "1399min"  # prime number that is smaller than 24h
start, end = "1/1/2000 00:00:00", "1/31/2000 00:00"
middle = "1/15/2000 00:00:00"

rng = pd.date_range(start, end, freq="1231min")  # prime number
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts2 = ts[middle:end]

# proves that grouper without a fixed adjust_timestamp does not work
# when dealing with unusual frequencies
simple_grouper = pd.Grouper(freq=freq)
count_ts = ts.groupby(simple_grouper).agg("count")
count_ts = count_ts[middle:end]
count_ts2 = ts2.groupby(simple_grouper).agg("count")
with pytest.raises(AssertionError):
    tm.assert_index_equal(count_ts.index, count_ts2.index)

# test adjust_timestamp anchored on 1970-01-01 00:00:00
adjust_timestamp = pd.Timestamp(0)
adjusted_grouper = pd.Grouper(freq=freq, adjust_timestamp=adjust_timestamp)
adjusted_count_ts = ts.groupby(adjusted_grouper).agg("count")
adjusted_count_ts = adjusted_count_ts[middle:end]
adjusted_count_ts2 = ts2.groupby(adjusted_grouper).agg("count")
tm.assert_series_equal(adjusted_count_ts, adjusted_count_ts2)
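
The mechanism behind the failing comparison above can be sketched with the standard library alone. This is only an illustration of the arithmetic, not the code in pandas/core/resample.py: `bin_left_edge` and `start_of_day` are hypothetical helpers, and the sliced start of 12:47 is the first point of the 1231min range on or after 2000-01-15.

```python
from datetime import datetime, timedelta

FREQ = timedelta(minutes=1399)  # same bin width as in the snippet above
EPOCH = datetime(1970, 1, 1)    # fixed anchor that does not depend on the data


def start_of_day(ts):
    """Midnight of the day containing ts (the current, data-dependent anchor)."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def bin_left_edge(ts, origin, freq=FREQ):
    """Hypothetical helper: left edge of the bin containing ts,
    for bins of width freq aligned on origin."""
    n_bins = (ts - origin) // freq  # timedelta // timedelta floors to an int
    return origin + n_bins * freq


full_start = datetime(2000, 1, 1, 0, 0)       # first point of ts
sliced_start = datetime(2000, 1, 15, 12, 47)  # first point of ts[middle:end]
sample = datetime(2000, 1, 20, 12, 0)         # an instant covered by both series

# Anchoring on each series' own first day puts the same instant in different bins.
edge_full = bin_left_edge(sample, start_of_day(full_start))
edge_sliced = bin_left_edge(sample, start_of_day(sliced_start))

# Anchoring on a fixed timestamp gives one answer whatever the series start is.
edge_epoch = bin_left_edge(sample, EPOCH)
```

Here `edge_full` and `edge_sliced` disagree because 14 days is not a multiple of 1399 minutes, while `edge_epoch` never looks at the series start at all.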

I think this PR is ready to be merged, but I am of course open to any suggestions or criticism. 😉
For instance, I am not sure if the naming of adjust_timestamp is correct. An alternative could be base_timestamp or ref_timestamp 🤔?

Cheers,

closes Extending the grouper base argument #25226
closes groupby(pd.Grouper) ignores loffset #28302
closes resample becomes non-deterministic, depending on DateTimeIndex values #28675
closes BUG: resample closed='left' not binning correctly. #4197
closes ENH: resample(..., base='start') for automaticly determining base. #8521
Add 'origin' and 'offset' arguments to 'resample' and 'pd.Grouper'
tests added / passed
Add deprecation warning for loffset and base in the code
Add deprecation warning for loffset and base in the doc
Add examples in the doc for origin and offset
whatsnew entry (add deprecation notice with offset example)
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff

pep8speaks · 2020-02-08T19:11:18Z

Hello @hasB4K! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-05-09 21:51:23 UTC

mroeschke · 2020-02-08T23:35:07Z

Instead of adding a new keyword, might be nice if base could take a Timestamp instead since they are both relevant when a frequency is passed.

hasB4K · 2020-02-09T01:18:33Z

I would rather suggest the following:

I always thought that the base argument has kind of an ambiguous name. And the current behavior is quite confusing. It needs to be an integer (or a floating point) that matches the unit of the frequency:

So base=1 with a frequency of 5D is equal to 1D that we add as an offset.
So base=2 with a frequency of 5min is equal to 2min that we add as an offset.
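
To make the two cases above concrete, here is a minimal sketch of the conversion from base to an explicit duration. `base_to_offset` and the `UNIT` table are hypothetical names used only for illustration; they are not part of pandas.

```python
from datetime import timedelta

# Duration of one unit for a few frequency aliases (illustrative subset only).
UNIT = {
    "D": timedelta(days=1),
    "min": timedelta(minutes=1),
    "s": timedelta(seconds=1),
}


def base_to_offset(base, unit):
    """Hypothetical helper: the duration that `base` stands for when the
    frequency is expressed in `unit` (base counts units, not whole bins)."""
    return base * UNIT[unit]


offset_for_5d = base_to_offset(1, "D")      # base=1 with freq "5D"
offset_for_5min = base_to_offset(2, "min")  # base=2 with freq "5min"
```

The point of the sketch is that the meaning of base changes with the unit of the frequency, which is what makes it hard to read at a call site.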

This behavior is very confusing for the users (myself included), but it also creates bugs: see #25161, #25226

Instead of relying on base I would rather deprecate this argument. The argument loffset (currently broken for pd.Grouper as shown in #28302, but fixable in the current PR) is kind of equivalent to what base is doing (especially since it is a Timedelta).

Example of the current use of loffset with resample:

>>> start, end = "1/1/2000 00:00:00", "1/31/2000 00:00"
>>> rng = pd.date_range(start, end, freq="1231min")
>>> ts = pd.Series(np.arange(len(rng)), index=rng)
>>> ts.resample("1min", loffset=-pd.Timedelta("1min")).count()

1999-12-31 23:59:00    1
2000-01-01 00:00:00    0
2000-01-01 00:01:00    0
2000-01-01 00:02:00    0
2000-01-01 00:03:00    0
                      ..
2000-01-30 22:00:00    0
2000-01-30 22:01:00    0
2000-01-30 22:02:00    0
2000-01-30 22:03:00    0
2000-01-30 22:04:00    1
Freq: T, Length: 43086, dtype: int64

Example of the current broken loffset argument:

>>> ts.groupby(pd.Grouper(freq="1min", loffset=-pd.Timedelta("1min"))).count()

2000-01-01 00:00:00    1
2000-01-01 00:01:00    0
2000-01-01 00:02:00    0
2000-01-01 00:03:00    0
2000-01-01 00:04:00    0
                      ..
2000-01-30 22:01:00    0
2000-01-30 22:02:00    0
2000-01-30 22:03:00    0
2000-01-30 22:04:00    0
2000-01-30 22:05:00    1
Freq: T, Length: 43086, dtype: int64

That being said, I agree that the naming of adjust_timestamp is not ideal. I would rename it into: origin or base_timestamp.

The line https://github.com/pandas-dev/pandas/blob/master/pandas/core/resample.py#L1728 would be replaced by something roughly equivalent to:

origin = start_of_day if origin is None else origin
origin = origin.value + loffset.value

TL;DR:

I would fix in this PR loffset for pd.Grouper and deprecate the confusing base argument
I would rename the added argument adjust_timestamp into origin
We would have origin = origin.value + loffset.value
I would add more tests to check the behavior of loffset and origin

What do you think?

hasB4K · 2020-02-09T10:33:23Z

I just realised that loffset and base are not equivalent at all since this works:

>>> start, end = "1/1/2000 00:00:00", "1/31/2000 00:00"
>>> rng = pd.date_range(start, end, freq="1231min")
>>> ts = pd.Series(np.arange(len(rng)), index=rng)
>>> ts.resample("1min", loffset=-pd.Timedelta("365D")).count()

1999-01-01 00:00:00    1
1999-01-01 00:01:00    0
1999-01-01 00:02:00    0
1999-01-01 00:03:00    0
1999-01-01 00:04:00    0
                      ..
1999-01-30 22:01:00    0
1999-01-30 22:02:00    0
1999-01-30 22:03:00    0
1999-01-30 22:04:00    0
1999-01-30 22:05:00    1
Freq: T, Length: 43086, dtype: int64
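
The difference can be sketched in plain Python, under the assumption (consistent with the output above) that loffset only relabels the result after binning, whereas a base-style shift moves the bin edges themselves. `count_per_bin` and `shift_labels` are hypothetical helpers, not pandas functions.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_per_bin(stamps, freq, origin, offset=timedelta(0)):
    """Count timestamps per bin; bins have width freq and start at origin + offset."""
    anchor = origin + offset
    return dict(Counter(anchor + ((ts - anchor) // freq) * freq for ts in stamps))


def shift_labels(counts, loffset):
    """loffset-style relabelling: same groups, labels moved by loffset."""
    return {label + loffset: n for label, n in counts.items()}


origin = datetime(2000, 1, 1)
freq = timedelta(minutes=5)
stamps = [
    datetime(2000, 1, 1, 0, 1),
    datetime(2000, 1, 1, 0, 4),
    datetime(2000, 1, 1, 0, 6),
]

plain = count_per_bin(stamps, freq, origin)
# Relabelling keeps the grouping (00:01 and 00:04 together) and only moves labels.
relabelled = shift_labels(plain, timedelta(minutes=-2))
# Shifting the edges by 2 minutes regroups the data (00:04 and 00:06 together).
rebinned = count_per_bin(stamps, freq, origin, offset=timedelta(minutes=2))
```

Because relabelling never changes which points share a bin, an arbitrarily large shift such as 365D is harmless there, while the same value used as an edge shift would just wrap around modulo the frequency.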

So I would suggest the following instead:

Deprecate the confusing base argument
Add an origin_offset argument that would be a Timedelta
Rename the added argument adjust_timestamp into origin_timestamp
We would have origin_nanos = origin_timestamp.value + origin_offset.value
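
As a rough sketch of that last point, the sum can be written directly in integer nanoseconds, which is what `.value` returns for both Timestamp and Timedelta. `bin_left_edge_ns` is a hypothetical helper and the floor-based binning is a simplification of what resample does.

```python
NS_PER_MINUTE = 60 * 10**9


def bin_left_edge_ns(ts_ns, freq_ns, origin_ns=0, offset_ns=0):
    """Hypothetical helper: left bin edge, everything in integer
    nanoseconds since the epoch."""
    anchor_ns = origin_ns + offset_ns  # the sum proposed above
    return anchor_ns + ((ts_ns - anchor_ns) // freq_ns) * freq_ns


ts_ns = 11 * NS_PER_MINUTE
freq_ns = 5 * NS_PER_MINUTE

# Epoch origin shifted by a 2-minute offset ...
edge_a = bin_left_edge_ns(ts_ns, freq_ns, origin_ns=0, offset_ns=2 * NS_PER_MINUTE)
# ... yields the same bins as an origin already sitting 2 minutes after the epoch.
edge_b = bin_left_edge_ns(ts_ns, freq_ns, origin_ns=2 * NS_PER_MINUTE, offset_ns=0)
```

Only the sum matters for the bin edges, so the two arguments are a convenience for the caller: one names a reference instant, the other a shift relative to it.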

I will not fix loffset in this PR since I am not sure of the behavior with pd.Grouper and how to fix it.

What do you think?

jreback

I think base and loffset actually are pretty useful. However for non-evenly divisible freq the issue is that you likely simply want to use the first (or maybe the last) timestamp as the base. So how about we just add that ability in base to accept the string first or last rather than adding another keyword?

hasB4K · 2020-02-09T19:16:14Z

@jreback this won't fix the issue that I'm trying to tackle. The idea is to be able to have a fixed timestamp as an "origin" that does not depend on the time series. So neither the base argument with the first string (which is the current behavior) nor with the last string will fix the issue.

I could reuse the base argument as the "origin" argument that I want to add when base is not a number, as @mroeschke suggested. But I think this could create some confusion in the API (I still believe that base is useful but can be quite confusing to use).

jreback · 2020-02-09T19:22:33Z

> @jreback this won't fix the issue that I'm trying to tackle. The idea is to be able to have a fixed timestamp as an "origin" that does not depend on the time series. So neither the base argument with the first string (which is the current behavior) nor with the last string will fix the issue.
>
> I could reuse the base argument as the "origin" argument that I want to add when base is not a number, as @mroeschke suggested. But I think this could create some confusion in the API (I still believe that base is useful but can be quite confusing to use).

@hasB4K not averse to changing things. But we currently have base and loffset, so I don't really like the idea of yet another pretty opaque option.

I would be on board with deprecating both of these and replacing them with 2 options, e.g. origin and offset come to mind.

hasB4K · 2020-02-09T19:43:28Z

> I would be on board with deprecating both of these and replacing them with 2 options, e.g. origin and offset come to mind.

So would this signature be ok with you @jreback?

pd.Grouper(
    # ... arguments untouched with this enhancement like `freq`, `sort`, ...

    # ADDED arguments
    origin: pd.Timestamp, default None
        Only when freq is passed.
        The timestamp used as reference for the grouping bins.
        If None is passed, the first day of the time series at midnight is used.
    offset: pd.Timedelta, default None
        Only when freq is passed.
        An offset timedelta added to the origin.

    # DEPRECATED arguments
    base: int, default 0 (DEPRECATED)
    loffset: str, DateOffset, timedelta object (DEPRECATED)
)

jreback · 2020-02-09T20:44:06Z

@hasB4K

sure that looks reasonable.

we would need to have a pretty nice deprecation message that shows one how to convert base and/or loffset to the new args (as well as a whatsnew and warning box in the docs); they can basically be the same though. it's how we want folks to migrate.

hasB4K · 2020-02-09T21:32:33Z

> sure that looks reasonable.

Perfect, I will implement that in this PR then 🙂

> we would need to have a pretty nice deprecation message that shows one how to convert base and/or loffset to the new args (as well as a whatsnew and warning box in the docs); they can basically be the same though. it's how we want folks to migrate.

Yep, it seems quite necessary! Is there an example of a nice deprecation message in the current (or in the old) code that I could look into?
For now, I was thinking of adding to the documentation of resample and pd.Grouper examples of "how to migrate". And in the code something like this argument is deprecated, please see: <url>.
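
A minimal sketch of what such a message could look like, using only the standard `warnings` module. The function names, the wording and the `resample_sketch` signature are assumptions made for illustration; they are not the messages or signatures that ended up in pandas.

```python
import warnings


def warn_deprecated_arg(old, new, example):
    """Hypothetical helper: emit a FutureWarning that names the replacement
    argument and shows a concrete migration example."""
    warnings.warn(
        f"'{old}' is deprecated; use '{new}' instead, e.g. {example}",
        FutureWarning,
        stacklevel=2,  # point at the caller, not at this helper
    )


def resample_sketch(freq, base=None, offset=None):
    """Hypothetical stand-in for a resample-like entry point."""
    if base is not None:
        warn_deprecated_arg(
            "base", "offset", 'offset="2min" in place of base=2 with freq="5min"'
        )
    return freq, offset
```

Putting the example conversion in the message itself means users see the migration path at the point where the warning fires, in addition to the documentation link.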

jreback · 2020-02-09T22:25:56Z

> Yep, it seems quite necessary! Is there an example of a nice deprecation message in the current (or in the old) code that I could look into?
> For now, I was thinking of adding to the documentation of resample and pd.Grouper examples of "how to migrate". And in the code something like this argument is deprecated, please see: <url>.

there are some (recently removed in 1.0.0) deprecation messages in resample on how to handle the freq arg. maybe not great but ok :->

hasB4K · 2020-02-14T16:07:06Z

@jreback I still need to add more examples for 'origin' and 'offset' and update the "what's new" part of the doc, but otherwise, it's ready for review 🙂

hasB4K · 2020-05-09T23:46:34Z

@jreback Thank you for the merge of #33498! I rebased the current PR with master, let me know if you need anything else 🙂

jreback · 2020-05-10T15:54:06Z

very nice @hasB4K this was quite some PR!

please have a read thru the built docs (https://dev.pandas.io/), it will take a little before they are there.

and if needed issue a followup to clarify.

and keep em coming!

hasB4K · 2020-05-10T16:45:31Z

Thank you @jreback! 😃

The input and guidance from @mroeschke, @WillAyd and you were really interesting and challenging in a good way! I am really glad about the current state of this new functionality. Thank you all! 🎉

hasB4K force-pushed the grouper-adjust-timestamp branch 6 times, most recently from 8c49bb6 to 0cc6149 Compare February 8, 2020 22:26

jreback requested changes Feb 9, 2020

jreback added API - Consistency Internal Consistency of API/Behavior Resample resample method labels Feb 9, 2020

hasB4K changed the title ~~ENH: add 'adjust_timestamp' argument to 'resample' and 'pd.Grouper'~~ ENH: add 'origin and 'offset' arguments to 'resample' and 'pd.Grouper' Feb 10, 2020

hasB4K force-pushed the grouper-adjust-timestamp branch 2 times, most recently from 2d2beba to 3097767 Compare February 13, 2020 23:10

hasB4K changed the title ~~ENH: add 'origin and 'offset' arguments to 'resample' and 'pd.Grouper'~~ ENH: add 'origin' and 'offset' arguments to 'resample' and 'pd.Grouper' Feb 13, 2020

hasB4K force-pushed the grouper-adjust-timestamp branch 6 times, most recently from 3cbeac4 to bbcbf7c Compare February 14, 2020 15:02

hasB4K and others added 14 commits May 9, 2020 22:37

CLN: fix lint issue with isort

7d4de49

Update pandas/core/generic.py

b83c5bf

Co-Authored-By: William Ayd <william.ayd@icloud.com>

CLN: add TimestampCompatibleTypes and TimedeltaCompatibleTypes in pandas._typing

3e24d53

CLN: fix lint issue with isort

c2ee661

ENH: support 'epoch', 'start_day' and 'start' for origin

a6e94c0

DOC: add doc for origin that uses 'epoch', 'start' or 'start_day'

53802e5

TST: add test for origin that uses 'epoch', 'start' or 'start_day'

3fc2bf6

BUG: fix a timezone bug between origin and index on df.resample

4ad979a

DOC: change doc after review

343a30a

CLN: change typing for TimestampConvertibleTypes

efb572e

CLN: add nice message for ValueError of 'origin' and 'offset' in resample

fcdde91

BUG: fix a bug when resampling in DST context

1fec946

TST: fix deprecation test

5695ffb

TST: using pytz instead of datetutil in test of test_resample_origin_with_day_freq_on_dst

de6b477

hasB4K force-pushed the grouper-adjust-timestamp branch from 16a6831 to de6b477 Compare May 9, 2020 20:37

CLN: remove unused import

05ddd9b

jorisvandenbossche mentioned this pull request May 10, 2020

DEPR: log of deprecations in 1.x (to be removed in 2.0) #30228

Closed

jreback approved these changes May 10, 2020

jreback merged commit 4a267c6 into pandas-dev:master May 10, 2020

hasB4K deleted the grouper-adjust-timestamp branch May 10, 2020 16:39

mroeschke mentioned this pull request May 11, 2020

Snap to convention in resampling #2058

Closed

hasB4K mentioned this pull request May 30, 2020

BUG: fix origin epoch when freq is Day and harmonize epoch between timezones #34474

Merged

4 tasks

dsandeep0138 mentioned this pull request Jun 17, 2020

BUG: resample seems to convert hours to 00:00 #34833

Closed

3 tasks

leohazy mentioned this pull request Mar 11, 2021

ENH:Resume the 'loffset' arguments in Grouper and resample #40367

Closed

vamsi-verma-s mentioned this pull request Oct 18, 2022

DEP: remove deprecated loffset and base args for resample and Grouper #49101

Merged

3 tasks

rhshadrach mentioned this pull request Jan 13, 2024

BUG: Grouper: Origin param has no effect #47653

Open

3 tasks

mps01060 added a commit to mps01060/intense-qc that referenced this pull request Feb 5, 2024

Fix pandas resample API change for base kwarg

82aafad

The "base" kwarg is no longer valid for resample in pandas. See pandas-dev/pandas#31809
