Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Window closed-ness defaults are inconsistent for temporal rolling functions #9193

Closed
2 tasks done
MarcoGorelli opened this issue Jun 2, 2023 · 22 comments
Closed
2 tasks done
Labels
A-temporal Area: date/time functionality bug Something isn't working python Related to Python Polars

Comments

@MarcoGorelli
Copy link
Collaborator

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

Noticed this while trying to write some docs in #9192

Seems like the defaults are:

  • closed on the right: groupby_rolling

  • closed on the left: rolling_mean, rolling_sum, rolling_min, rolling_max, rolling_std, rolling_var

Is there a reason which I'm missing for this? If not, any objections on consistently setting the default to be closed='right'?

Reproducible example

In [140]: df = pl.DataFrame({'ts': pl.date_range(dt.datetime(2020, 1, 1), dt.datetime(2020, 1, 4), eager=True), 'a': [1,2,3,4]})

In [141]: df
Out[141]: 
shape: (4, 2)
┌─────────────────────┬─────┐
│ tsa   │
│ ------ │
│ datetime[μs]        ┆ i64 │
╞═════════════════════╪═════╡
│ 2020-01-01 00:00:001   │
│ 2020-01-02 00:00:002   │
│ 2020-01-03 00:00:003   │
│ 2020-01-04 00:00:004   │
└─────────────────────┴─────┘

# closed on the right (includes current row)
In [144]: df.groupby_rolling('ts', period='3d').agg(pl.col('a'))
Out[144]: 
shape: (4, 2)
┌─────────────────────┬───────────┐
│ tsa         │
│ ------       │
│ datetime[μs]        ┆ list[i64] │
╞═════════════════════╪═══════════╡
│ 2020-01-01 00:00:00 ┆ [1]       │
│ 2020-01-02 00:00:00 ┆ [1, 2]    │
│ 2020-01-03 00:00:00 ┆ [1, 2, 3] │
│ 2020-01-04 00:00:00 ┆ [2, 3, 4] │
└─────────────────────┴───────────┘

# This one is closed on the left (only includes elements strictly before the given row)
In [146]: df.select(pl.col('a').rolling_sum('3d', by='ts'))
Out[146]: 
shape: (4, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ null │
│ 1    │
│ 3    │
│ 6    │
└──────┘

Expected behavior

I think,

In [146]: df.select(pl.col('a').rolling_sum('3d', by='ts'))
Out[146]: 
shape: (4, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 3    │
│ 6    │
│ 9    │
└──────┘

Installed versions

--------Version info---------
Polars:      0.18.0
Index type:  UInt32
Platform:    Linux-5.4.0-149-generic-x86_64-with-glibc2.31
Python:      3.10.11 (main, Apr  5 2023, 14:15:10) [GCC 9.4.0]

----Optional dependencies----
numpy:       1.24.3
pandas:      2.0.2
pyarrow:     12.0.0
connectorx:  <not installed>
deltalake:   <not installed>
fsspec:      <not installed>
matplotlib:  <not installed>
xlsx2csv:    <not installed>
xlsxwriter:  <not installed>


@MarcoGorelli MarcoGorelli added bug Something isn't working python Related to Python Polars labels Jun 2, 2023
@stinodego
Copy link
Contributor

stinodego commented Jun 3, 2023

I think @ghuls or @zundertj explained this to me once, but I forgot the reason. Something to do with how people from the finance world expect their data operations to work.

@zundertj
Copy link
Collaborator

zundertj commented Jun 3, 2023

In finance you would typically want closed="right" to be the default, i.e. the rolling mean/sum/etc includes the data point on the row where it is being stored. The reason is that often:

  1. data is sorted time ascending order (like in the example dataframe)
  2. you want to only use information up to and including that point in time

Pandas.rolling has closed="right" as the default.

I wasn't aware there are inconsistencies in the defaults, we should definitely fix that. Happy to help.

@stinodego : I have tried to search for an issue on this, as I remember discussing this, but can't find it. Is your point that closed="right" is a sensible default across the API, or that the inconsistency in defaults actually makes sense?

@stinodego
Copy link
Contributor

If I look at these examples, closed=right indeed makes for the most sensible default. I'd support making that consistent across methods. Unless there is something I am missing here...

@zundertj
Copy link
Collaborator

zundertj commented Jun 4, 2023

Grepping for our type annotation ClosedInterval, this is what we currently do:

Rolling:

  • DataFrame.groupby_rolling: right
  • DataFrame.groupby_dynamic: left
  • Expr.rolling_x : min/max/mean/sum/var/std/median/quantile: all set to left
  • LazyFrame.groupby_rolling: right
  • LazyFrame.groupby_dynamic: left

I can create a PR for where we currently default to left, to default to None in function signature, set to left internally but raise a warning that we are changing the default in the future. Making a hard change is not nice here. Ok?

Ranges:

  • Series.is_between: both
  • pl.date_range: both
  • pl.time_range: both
  • Expr.is_between: both

For ranges, both is intuitive imo.

@MarcoGorelli
Copy link
Collaborator Author

Great, thanks!

Groupby_dynamic is ok to be closed on the left IMO as it does something different, where left is more natural (e.g. resampling by day, you expect 2020-01-01T00:00 and 2020-01-01T00:01 to be part of the same group by default)

But the rolling functions all do the same kind of thing, so I'd make those consistent

So, I'd suggest only changing the default for the expr rolling functions

@zundertj
Copy link
Collaborator

zundertj commented Jun 4, 2023

The behaviour and docstring of rolling_sum is confusing me:

If you pass a ``by`` column ``<t_0, t_1, ..., t_2>``, then by default the
        windows will be:

            - [t_0 - window_size, t_0)
            - [t_1 - window_size, t_1)
            - ...
            - [t_n - window_size, t_n)

        Otherwise, the window at a given row will include the row itself, and the
        `window_size - 1` elements before it.

this basically says: we use "left" when using a by-column, and "right" if no by column.

The odd thing is,closed=left or right does not seem to make a difference at all, to me this is "right" behaviour only:

>>> pl.Series([1,2,3,4]).to_frame().select(pl.all().rolling_sum(2, closed="left"))
shape: (4, 1)
┌──────┐
│      │
│ ---  │
│ i64  │
╞══════╡
│ null │
│ 3    │
│ 5    │
│ 7    │
└──────┘
>>> pl.Series([1,2,3,4]).to_frame().select(pl.all().rolling_sum(2, closed="right"))
shape: (4, 1)
┌──────┐
│      │
│ ---  │
│ i64  │
╞══════╡
│ null │
│ 3    │
│ 5    │
│ 7    │
└──────┘

Same for both and none values, and for other functions. Am I missing something here, this should be different, per the docstring?

@zundertj
Copy link
Collaborator

zundertj commented Jun 4, 2023

Work in progress here: #9215. Not touching the groupby_dynamic.

@MarcoGorelli
Copy link
Collaborator Author

Perhaps the docstring could be clarified, but closed only takes effect if you have a by column

@zundertj
Copy link
Collaborator

zundertj commented Jun 4, 2023

Ok, so that is to accommodate similar behaviour as groupby_dynamic?

Groupby_dynamic is ok to be closed on the left IMO as it does something different, where left is more natural (e.g. resampling by day, you expect 2020-01-01T00:00 and 2020-01-01T00:01 to be part of the same group by default)

So do we want to change anything at all then, apart from docstrings? Maybe we should even raise an error if by is not used but the user does set closed)?

@MarcoGorelli
Copy link
Collaborator Author

Work in progress here: #9215.

Nice! ⭐

I don't know what the original reasoning was, but I think it makes sense - if you say window_size=3 and closed='none', say, then what does that mean? It doesn't seem so well-defined to me. So yeah, raising if the user explicitly passes a closed value and no by might be nice

@zundertj
Copy link
Collaborator

zundertj commented Jun 4, 2023

I have updated #9215 to purely update the docstrings. Reason being that when by is not set, I would call the behaviour most similar to closed="right". But if by is set, we want closed="left" as the default. If we set the defaults in the function signature, which I strongly prefer, we cannot support both. So right now it may be slightly confusing still that closed="left" is the default, but at least the docstring clarifies how the window is constructed.

@MarcoGorelli
Copy link
Collaborator Author

MarcoGorelli commented Jun 4, 2023

Sure - is the idea to then follow up with a PR to warn that the future default for closed will change from 'left' to 'right', for consistency with groupby_rolling?

@zundertj
Copy link
Collaborator

zundertj commented Jun 4, 2023

Ok, so that is to accommodate similar behaviour as groupby_dynamic?

Groupby_dynamic is ok to be closed on the left IMO as it does something different, where left is more natural (e.g. resampling by day, you expect 2020-01-01T00:00 and 2020-01-01T00:01 to be part of the same group by default)

So do we want to change anything at all then, apart from docstrings? Maybe we should even raise an error if by is not used but the user does set closed)?

Per ^, I don't know if we want to change at all? As you said groupby_dynamic is more natural with closed="left", doesnt the same argument apply for the rolling + by functions?

@MarcoGorelli
Copy link
Collaborator Author

MarcoGorelli commented Jun 4, 2023

I'd say that for groupby_dynamic it's natural to have closed='left', and for groupby_rolling it's natural to have closed='right'.

These are both already the case, all that I'm suggesting is that rolling_mean (and other rolling_ convenience functions) follow the default of groupby_rolling

It'd be natural to expect

  • df.groupby_rolling('ts', period='3d').agg(pl.col('a').mean()), and
  • df.with_columns(pl.col('a').rolling_mean('3d', by='ts'))
    to be the same.

Currently however, they're not - unless, that is, you change closed in one of them, e.g.:

In [128]: df.groupby_rolling('ts', period='3d', closed='left').agg(pl.col('a').mean())
Out[128]: 
shape: (5, 2)
┌─────────────────────┬──────┐
│ tsa    │
│ ------  │
│ datetime[μs]        ┆ f64  │
╞═════════════════════╪══════╡
│ 2020-01-01 00:00:00null │
│ 2020-01-02 00:00:001.0  │
│ 2020-01-03 00:00:001.5  │
│ 2020-01-04 00:00:002.0  │
│ 2020-01-05 00:00:003.0  │
└─────────────────────┴──────┘

In [129]: df.with_columns(pl.col('a').rolling_mean('3d', by='ts'))
Out[129]: 
shape: (5, 2)
┌─────────────────────┬──────┐
│ tsa    │
│ ------  │
│ datetime[μs]        ┆ f64  │
╞═════════════════════╪══════╡
│ 2020-01-01 00:00:00null │
│ 2020-01-02 00:00:001.0  │
│ 2020-01-03 00:00:001.5  │
│ 2020-01-04 00:00:002.0  │
│ 2020-01-05 00:00:003.0  │
└─────────────────────┴──────┘

@ritchie46
Copy link
Member

I am not an expert on the ergonomics of time series, but I just want to say we shouldn't use the same closedness for the sake of consistency between different functions per se. Especially groupby_dynamic is someting different than all rolling_x functions.

I believe, I chose the closedness at the time by looking what other tools did or what I found made sense.

Let's use this opportunity to determine what makes sense for them in the following groups:

  • dynamic
  • rolling
  • expanding (still have to add those)

and then document that, so we later understand our own rationale and have a place where we can follow the logic.

@MarcoGorelli
Copy link
Collaborator Author

Yeah agree - here's what I'd suggest:

  • groupby_dynamic: typical use case is "find the average rainfall per day", so the groups would be something like [2020-01-01, 2020-01-02), [2020-01-02, 2020-01-03), .... So, closed='left' as default seems natural
  • groupby_rolling: typical use case is something like "for each point, find the average value for the last day". You'd include the point itself, and the window goes backwards (as you can't take values from the future). So closed='right' seems natural
  • rolling_*: these are convenience functions for groupby_rolling operations, so it would be natural to follow what groupby_rolling does
  • expanding: this would take all the points up until the current one, so I don't think there would need to be a closed argument at all

If this sounds sensible, then I'd back the original approach by @zundertj in #9215

@zundertj
Copy link
Collaborator

Yep, that is updating the docstrings in #9215 indeed.

@MarcoGorelli
Copy link
Collaborator Author

Thanks @zundertj ! Do you want to pick up the original approach from #9215 again, i.e. warning about changing the default closed from 'left' to 'right'? Or I can do it, no worries, just checking we don't duplicate work

@zundertj
Copy link
Collaborator

I will pick that up for the rolling_<x> functions, leaving groupby_dynamic as "left" by default.

@MarcoGorelli
Copy link
Collaborator Author

legend 🙌

@zundertj
Copy link
Collaborator

See #9470

@MarcoGorelli MarcoGorelli added the A-temporal Area: date/time functionality label Jul 14, 2023
@MarcoGorelli
Copy link
Collaborator Author

This has been addressed, it just needs enforcing - but that'll happen as part of the usual version upgrade procedure

closing then, thanks all

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-temporal Area: date/time functionality bug Something isn't working python Related to Python Polars
Projects
None yet
Development

No branches or pull requests

4 participants