Non-determinism in rolling code (unsure of exact cause) #14047

Closed
2 tasks done
kszlim opened this issue Jan 28, 2024 · 1 comment · Fixed by #14070
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

kszlim commented Jan 28, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
from polars.testing import assert_frame_equal
from datetime import timedelta, datetime

ldf = pl.LazyFrame(
    {
        "timestamp": pl.datetime_range(datetime(2024, 1, 12), datetime(2024, 1, 12, 0, 0, 0, 150_000), '10ms', eager=True, closed='left'),
        "price": list(range(15))
    }
)

first_result = None

for i in range(100):
    print(f"Iter: {i}")

    def count_diff(price: pl.Expr, upper_bound: float = 0.1, lower_bound: float = 0.001):
        span_end_to_curr = price.count().rolling("timestamp", period=timedelta(seconds=lower_bound))
        span_start_to_curr = price.count().rolling("timestamp", period=timedelta(seconds=upper_bound))
        return (span_start_to_curr - span_end_to_curr).alias(f"count_diff_{upper_bound}_{lower_bound}")
    
    def s_per_count(count_diff: pl.Expr, span: tuple[float, float]) -> pl.Expr:
        return (span[1]*1000 - span[0]*1000) / count_diff

    spans = [(0.001, 0.1), (1, 10)]
    count_diff_exprs = [count_diff(pl.col("price"), span[0], span[1]) for span in spans]
    s_per_count_exprs = [s_per_count(count_diff, span).alias(f"zz_{span}") for count_diff, span in zip(count_diff_exprs, spans)]
    exprs = count_diff_exprs + s_per_count_exprs
    df = ldf.with_columns(*exprs).collect()
    
    if first_result is not None:
        assert_frame_equal(df, first_result)
    else:
        first_result = df

Log output

Iter: 0
Iter: 1
Iter: 2
Iter: 3
Iter: 4
Iter: 5
Iter: 6
Iter: 7
Traceback (most recent call last):
  File "REDACTED/default/lib/python3.11/site-packages/polars/testing/asserts/frame.py", line 114, in assert_frame_equal
    _assert_series_values_equal(
  File "REDACTED/default/lib/python3.11/site-packages/polars/testing/asserts/series.py", line 180, in _assert_series_values_equal
    raise_assertion_error(
  File "REDACTED/default/lib/python3.11/site-packages/polars/testing/asserts/utils.py", line 12, in raise_assertion_error
    raise AssertionError(msg) from cause
AssertionError: Series are different (exact value mismatch)
[left]:  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[right]: [0, 4294967295, 4294967294, 4294967293, 4294967292, 4294967291, 4294967290, 4294967289, 4294967288, 4294967287, 4294967287, 4294967287, 4294967287, 4294967287, 4294967287]

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "determinism.py", line 32, in <module>
    assert_frame_equal(df, first_result)
  File "REDACTED/default/lib/python3.11/site-packages/polars/testing/asserts/frame.py", line 123, in assert_frame_equal
    raise_assertion_error(
  File "REDACTED/default/lib/python3.11/site-packages/polars/testing/asserts/utils.py", line 12, in raise_assertion_error
    raise AssertionError(msg) from cause
AssertionError: DataFrames are different (value mismatch for column 'count_diff_0.001_0.1')
[left]:  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[right]: [0, 4294967295, 4294967294, 4294967293, 4294967292, 4294967291, 4294967290, 4294967289, 4294967288, 4294967287, 4294967287, 4294967287, 4294967287, 4294967287, 4294967287]
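A note on the [right] values: they are consistent with a negative difference wrapping around in an unsigned 32-bit integer. The sketch below reproduces them in plain Python, assuming (not confirmed in this thread) that both rolling counts are UInt32, that the 0.001 s window always holds 1 row, and that the 0.1 s window holds up to 10 rows at 10 ms spacing. The function name and parameters are hypothetical and only illustrate the arithmetic.

```python
U32_MODULUS = 2**32  # a UInt32 holds values 0 .. 2**32 - 1


def wrapped_count_diff(n_rows=15, short_count=1, long_window_rows=10):
    """Emulate (short-window count - long-window count) with UInt32 wraparound.

    Assumption: row i (0-based) sees min(i + 1, long_window_rows) rows in the
    long window and short_count rows in the short window.
    """
    out = []
    for i in range(n_rows):
        long_count = min(i + 1, long_window_rows)
        # Python ints do not overflow, so wrap manually with a modulo.
        out.append((short_count - long_count) % U32_MODULUS)
    return out


print(wrapped_count_diff())
```

Under these assumptions the output matches the [right] series above, so -1 appears as 4294967295 and -9 as 4294967287.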

Issue description

This code runs on a fixed piece of data and should be 100% deterministic. When looped, however, it eventually produces a result that differs from the first iteration, and assert_frame_equal fails. Sorry about the long repro code; I couldn't get it to reproduce with anything more minimal.

Expected behavior

Should iterate through the entire loop without running into an assertion error.

Installed versions

--------Version info---------
Polars:               0.20.6
Index type:           UInt32
Platform:             Linux-6.2.6-76060206-generic-x86_64-with-glibc2.35
Python:               3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:43:09) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2023.6.0
gevent:               <not installed>
hvplot:               0.9.1
matplotlib:           3.8.2
numpy:                1.26.3
openpyxl:             <not installed>
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           3.1.9
kszlim added the bug, needs triage, and python labels Jan 28, 2024

cmdlineluser commented

It seems to be related to common subexpression elimination (CSE).

It behaves deterministically if I disable comm_subexpr_elim:

df = ldf.with_columns(*exprs).collect(comm_subexpr_elim=False)
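
To compare the two settings side by side, a small harness like the one below could help. It is a minimal sketch in plain Python, not part of Polars: `first_divergence` and its parameters are hypothetical names, and the stand-in callables at the bottom only demonstrate the behaviour.

```python
import operator
from typing import Any, Callable, Optional


def first_divergence(
    run: Callable[[], Any],
    n_iter: int = 100,
    equal: Callable[[Any, Any], bool] = operator.eq,
) -> Optional[int]:
    """Call run() n_iter times and return the index of the first call whose
    result differs from the first call's result, or None if all agree."""
    baseline = run()
    for i in range(1, n_iter):
        if not equal(run(), baseline):
            return i
    return None


# Stand-in callables (no Polars needed): one stable, one that changes on call 3.
calls = iter([0, 0, 0, 1, 0])
print(first_divergence(lambda: 42))                   # None
print(first_divergence(lambda: next(calls), n_iter=5))  # 3
```

For the repro above, `run` could be `lambda: ldf.with_columns(*exprs).collect()` for one pass and the same call with `comm_subexpr_elim=False` for the other, with `equal` set to a frame comparator such as `lambda a, b: a.equals(b)`, since `==` on two DataFrames does not return a single boolean.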
