fix(rust, python): groupby rolling with negative offset #9428
Merged: MarcoGorelli merged 8 commits into pola-rs:main from MarcoGorelli:fix-groupby-rolling-with-offset on Jun 20, 2023
Commits
- a0ba32c fix(rust, python) groupby rolling was producing wrong windows with no… (MarcoGorelli)
- 9f119eb add parametric test (MarcoGorelli)
- a7e73c0 overload to fix typing (MarcoGorelli)
- a3cf672 3.7 compat (MarcoGorelli)
- 0236574 Merge remote-tracking branch 'upstream/main' into fix-groupby-rolling… (MarcoGorelli)
- 59dd870 use .map (MarcoGorelli)
- 471ceda Merge remote-tracking branch 'upstream/main' into fix-groupby-rolling… (MarcoGorelli)
- a0c0ad3 rename (MarcoGorelli)
@@ -0,0 +1,76 @@

```python
from __future__ import annotations

from datetime import timedelta
from typing import TYPE_CHECKING

import hypothesis.strategies as st
from hypothesis import given, reject

import polars as pl
from polars.testing import assert_frame_equal
from polars.testing.parametric.primitives import column, dataframes
from polars.testing.parametric.strategies import strategy_closed, strategy_time_unit
from polars.utils.convert import _timedelta_to_pl_duration

if TYPE_CHECKING:
    from polars.type_aliases import ClosedInterval, TimeUnit


@given(
    period=st.timedeltas(min_value=timedelta(microseconds=0)).map(
        _timedelta_to_pl_duration
    ),
    offset=st.timedeltas().map(_timedelta_to_pl_duration),
    closed=strategy_closed,
    data=st.data(),
    time_unit=strategy_time_unit,
)
def test_groupby_rolling(
    period: str,
    offset: str,
    closed: ClosedInterval,
    data: st.DataObject,
    time_unit: TimeUnit,
) -> None:
    dataframe = data.draw(
        dataframes(
            [
                column("ts", dtype=pl.Datetime(time_unit)),
                column("value", dtype=pl.Int64),
            ],
        )
    )
    df = dataframe.sort("ts").unique("ts")
    try:
        result = df.groupby_rolling(
            "ts", period=period, offset=offset, closed=closed
        ).agg(pl.col("value"))
    except pl.exceptions.PolarsPanicError as exc:
        assert any(  # noqa: PT017
            msg in str(exc)
            for msg in (
                "attempt to multiply with overflow",
                "attempt to add with overflow",
            )
        )
        reject()

    expected_dict: dict[str, list[object]] = {"ts": [], "value": []}
    for ts, _ in df.iter_rows():
        window = df.filter(
            pl.col("ts").is_between(
                pl.lit(ts, dtype=pl.Datetime(time_unit)).dt.offset_by(offset),
                pl.lit(ts, dtype=pl.Datetime(time_unit))
                .dt.offset_by(offset)
                .dt.offset_by(period),
                closed=closed,
            )
        )
        value = window["value"].to_list()
        expected_dict["ts"].append(ts)
        expected_dict["value"].append(value)
    expected = pl.DataFrame(expected_dict).select(
        pl.col("ts").cast(pl.Datetime(time_unit)),
        pl.col("value").cast(pl.List(pl.Int64)),
    )
    assert_frame_equal(result, expected)
```
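The expected-value loop in this test acts as a brute-force oracle: for every `ts` it keeps the rows whose timestamp lies between `ts + offset` and `ts + offset + period` under the chosen `closed` setting. The same idea can be sketched without polars on plain integer timestamps. This is only an illustration: `in_window` and `brute_force_rolling` are hypothetical helper names, and integers stand in for datetimes.

```python
from typing import List, Sequence


def in_window(x: int, lower: int, upper: int, closed: str) -> bool:
    """Check whether x lies between lower and upper for a given `closed` setting."""
    if closed == "both":
        return lower <= x <= upper
    if closed == "left":
        return lower <= x < upper
    if closed == "right":
        return lower < x <= upper
    if closed == "none":
        return lower < x < upper
    raise ValueError(f"unknown closed value: {closed!r}")


def brute_force_rolling(
    ts: Sequence[int],
    values: Sequence[int],
    period: int,
    offset: int,
    closed: str,
) -> List[List[int]]:
    """For each t, collect the values whose timestamp falls in the window
    that starts at t + offset and ends at t + offset + period.

    Assumption for this sketch: `offset` is signed (negative means the
    window starts before t), mirroring how the test passes it.
    """
    out: List[List[int]] = []
    for t in ts:
        lower = t + offset
        upper = lower + period
        out.append(
            [v for x, v in zip(ts, values) if in_window(x, lower, upper, closed)]
        )
    return out


ts = [1, 2, 3, 4]
values = [10, 20, 30, 40]

# Window ends exactly at t: (t - 2, t]
print(brute_force_rolling(ts, values, period=2, offset=-2, closed="right"))
# Window lies strictly before t: (t - 3, t - 2]
print(brute_force_rolling(ts, values, period=1, offset=-3, closed="right"))
```

The second call is the kind of case this PR targets: the negative offset is larger in magnitude than the period, so each window ends before its own `t`.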
Here, previously there were three paths:
- `offset >= period`, `offset < period * 2`: `groupby_values_iter_full_lookbehind`
- `offset >= period`, `offset >= period * 2`: `groupby_values_iter_window_behind_t`
- `offset < period`: `groupby_values_iter_partial_lookbehind`

I don't get why there's the `< period * 2` check. Looks like it comes from https://github.com/pola-rs/polars/pull/4010/files, but I don't see why.

Anyway, `groupby_values_iter_full_lookbehind` assumes `t` is at the end of the window (i.e. `period == offset`), so changing the logic to
- `offset == period`: `groupby_values_iter_full_lookbehind`
- `offset > period`: `groupby_values_iter_window_behind_t` (slower, but this is quite unusual anyway?)
- `offset < period`: `groupby_values_iter_partial_lookbehind`

fixes all the test cases.
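The old and proposed branch conditions above can be compared side by side in a small sketch. Only the three iterator names come from the comment. The Python functions `old_path` and `new_path` are hypothetical, and the sketch assumes `offset` is the non-negative magnitude of the lookbehind offset, with `offset` and `period` as plain integers in the same unit.

```python
FULL = "groupby_values_iter_full_lookbehind"
BEHIND = "groupby_values_iter_window_behind_t"
PARTIAL = "groupby_values_iter_partial_lookbehind"


def old_path(offset: int, period: int) -> str:
    """Branch selection as described for the code before this PR."""
    if offset >= period:
        if offset < period * 2:
            return FULL
        return BEHIND
    return PARTIAL


def new_path(offset: int, period: int) -> str:
    """Branch selection as proposed in the review comment."""
    if offset == period:
        return FULL
    if offset > period:
        return BEHIND
    return PARTIAL


# List the (offset, period) pairs on a small grid where the choice changes.
changed = [
    (offset, period)
    for period in range(1, 4)
    for offset in range(0, 8)
    if old_path(offset, period) != new_path(offset, period)
]
print(changed)
```

On this grid the two functions disagree only where `period < offset < period * 2`, which is the range the comment argues the full-lookbehind iterator was handling incorrectly.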
I believe `groupby_values_iter_full_lookbehind` assumes that `t` is completely behind the window. So there are more cases where we have that besides `period == offset`.

I will have to dive into which cases those were again. Do you have on top of mind which predicate would include all cases where `t` is full lookbehind?

This is beneficial as in that case we can parallelize over `t` and then look from that point backwards in the slice to find the window.
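A rough sketch of the "look backwards from `t`" idea, to show why each `t` can be handled independently. This is not the polars implementation. It is a hypothetical `lookbehind_slice` helper that assumes sorted, unique integer timestamps, a window of `(t - offset, t - offset + period]` (that is, `closed='right'`), and `offset >= period` so the window never extends past `t`.

```python
from typing import List, Sequence, Tuple


def lookbehind_slice(
    times: Sequence[int], i: int, period: int, offset: int
) -> Tuple[int, int]:
    """Return (start, stop) such that times[start:stop] are the entries in
    (t - offset, t - offset + period], where t = times[i].

    Assumptions of this sketch: `times` is sorted ascending with no
    duplicates, and offset >= period, so the upper bound is at most t.
    """
    t = times[i]
    lower = t - offset
    upper = lower + period

    # Walk backwards from position i until we are at or below the upper bound.
    stop = i + 1
    while stop > 0 and times[stop - 1] > upper:
        stop -= 1

    # Keep walking backwards while entries are still above the lower bound.
    start = stop
    while start > 0 and times[start - 1] > lower:
        start -= 1
    return start, stop


def all_windows(times: Sequence[int], period: int, offset: int) -> List[List[int]]:
    # Each index is processed without reference to any other index,
    # so this loop could be split across workers.
    return [
        list(times[slice(*lookbehind_slice(times, i, period, offset))])
        for i in range(len(times))
    ]


print(all_windows([1, 2, 3, 4], period=2, offset=2))
print(all_windows([1, 2, 3, 4], period=1, offset=3))
```

Because the search for row `i` only reads `times[: i + 1]`, no state is shared between rows, which is the property the comment describes as enabling parallelism over `t`.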
If `period == offset` and `closed == 'right'`, then `t` is indeed included in the window (it's the right endpoint). For example the window could be `(2020-01-01, 2020-01-02]` and `t` could be `2020-01-02`.

From testing, that function only works if `offset == period`. There's an explicit check for when `closed == 'right'`, i.e. when it's not a full lookbehind:

polars/polars/polars-time/src/windows/groupby.rs
Lines 275 to 277 in 12c4d9a

For `offset > period`, it's incorrect for any value of `closed`: #9250

It may be possible to change it so it handles the case when `offset > period`. But for now, I'm suggesting to:
- keep in mind that if `t` is the right endpoint and `closed='right'`, then it's not a full lookbehind
- restrict this path to `offset == period` (so at least the results are correct)
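The endpoint example above can be checked with a few lines of standard-library Python. This is only a sketch: `t_in_window` is a hypothetical helper, and `offset` is taken as the magnitude of the lookbehind offset, so the window runs from `t - offset` to `t - offset + period`.

```python
from datetime import datetime, timedelta


def t_in_window(t: datetime, period: timedelta, offset: timedelta, closed: str) -> bool:
    """Report whether t itself falls inside its own window.

    Assumption of this sketch: the window spans t - offset to
    t - offset + period, with `closed` deciding which endpoints count.
    """
    lower = t - offset
    upper = lower + period
    include_lower = closed in ("left", "both")
    include_upper = closed in ("right", "both")
    above_lower = lower <= t if include_lower else lower < t
    below_upper = t <= upper if include_upper else t < upper
    return above_lower and below_upper


t = datetime(2020, 1, 2)
one_day = timedelta(days=1)

# offset == period: the window is 2020-01-01 .. 2020-01-02 and t is its right endpoint.
for closed in ("left", "right", "both", "none"):
    print(closed, t_in_window(t, period=one_day, offset=one_day, closed=closed))

# offset > period: the window ends before t, whatever `closed` is.
for closed in ("left", "right", "both", "none"):
    print(closed, t_in_window(t, period=one_day, offset=2 * one_day, closed=closed))
```

With `offset == period`, `t` is inside the window only when the right endpoint is closed (`'right'` or `'both'`), which is why that case cannot be treated as a full lookbehind.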
Right, let's first make it correct. We can try to find fast paths later if needed. 👍
Thanks!