groupby_dynamic() on empty LazyFrame raises index out of bounds on collect() #6288

Closed
2 tasks done
markbaydoun opened this issue Jan 17, 2023 · 3 comments · Fixed by #6294
Labels
bug Something isn't working python Related to Python Polars

markbaydoun commented Jan 17, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

I want to use groupby_dynamic() on a LazyFrame that has previously been filtered. If the filter leaves the LazyFrame empty, running groupby_dynamic() on it and then calling collect() raises the following error:

File /usr/local/lib/python3.9/site-packages/polars/internals/lazyframe/frame.py:1165, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, common_subplan_elimination, streaming)
   1154     common_subplan_elimination = False
   1156 ldf = self._ldf.optimization_toggle(
   1157     type_coercion,
   1158     predicate_pushdown,
   (...)
   1163     streaming,
   1164 )
-> 1165 return pli.wrap_df(ldf.collect())

PanicException: index out of bounds: the len is 0 but the index is 0

I have confirmed this does not happen with the normal groupby() method: it works as expected and returns an empty DataFrame when calling collect().

I was also able to work around the problem with the following check, but I'm not sure it's ideal:

if lf.fetch(1).is_empty():
    ...  # Don't groupby_dynamic
else:
    ...  # Use groupby_dynamic

Reproducible example

# Example copied from
# https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.groupby_dynamic.html#polars-dataframe-groupby-dynamic

from datetime import datetime
import polars as pl

df = pl.DataFrame(
    {
        "time": pl.date_range(
            low=datetime(2021, 12, 16),
            high=datetime(2021, 12, 16, 3),
            interval="30m",
        ),
        "n": range(7),
    }
).lazy()

# Works correctly
df.groupby_dynamic("time", every="1h", closed="right").agg(pl.col("n")).collect()

filtered = df.filter(pl.col("n") > 8)
assert filtered.collect().is_empty()

# This one fails
filtered.groupby_dynamic("time", every="1h", closed="right").agg(
    pl.col("n")
).collect()

Expected behavior

Calling collect() should return an empty DataFrame, since groupby_dynamic() had no data to work with in the LazyFrame, just as groupby() does.
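
For illustration only, the expected behavior can be modeled with a toy pure-Python windowing function. This is not Polars' implementation: `hourly_windows` is a hypothetical name, and the idea that the panic comes from reading the first timestamp of an empty input is an assumption based solely on the error message. The toy also uses left-closed windows anchored at the first timestamp, which differs from the closed="right" call in the example above.

```python
from datetime import datetime, timedelta


def hourly_windows(times, every=timedelta(hours=1)):
    """Group sorted timestamps into fixed-width windows (toy model, not Polars code)."""
    # Guard: without this check, times[0] below raises IndexError on empty
    # input, the Python analogue of an out-of-bounds index on a zero-length slice.
    if not times:
        return []

    start = times[0]  # assumption: the first timestamp anchors the first window
    windows = {}
    for t in times:
        # Number of whole windows between the anchor and t.
        idx = (t - start) // every
        windows.setdefault(start + idx * every, []).append(t)
    # Return (window_start, members) pairs in chronological order.
    return sorted(windows.items())


# Seven half-hourly timestamps, matching the shape of the reproducible example.
times = [datetime(2021, 12, 16) + i * timedelta(minutes=30) for i in range(7)]
non_empty = hourly_windows(times)
empty = hourly_windows([])
```

With the guard in place, an empty input produces an empty result rather than an error, which mirrors what groupby() already does on an empty LazyFrame.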

Installed versions

---Version info---
Polars: 0.15.15
Index type: UInt32
Platform: Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python: 3.9.14 (main, Oct  5 2022, 15:22:07)
[GCC 10.2.1 20210110]
---Optional dependencies---
pyarrow: 9.0.0
pandas: 1.5.2
numpy: 1.24.1
fsspec: <not installed>
connectorx: <not installed>
xlsx2csv: <not installed>
matplotlib: <not installed>
@markbaydoun markbaydoun added bug Something isn't working python Related to Python Polars labels Jan 17, 2023
@markbaydoun
Author

If it helps, this is the full backtrace:

thread '<unnamed>' panicked at 'index out of bounds: the len is 0 but the index is 0', /home/runner/work/polars/polars/polars/polars-time/src/windows/groupby.rs:58:17
stack backtrace:
   0:     0x7f11fa62db67 - <unknown>
   1:     0x7f11f92c496e - <unknown>
   2:     0x7f11fa601412 - <unknown>
   3:     0x7f11fa62f2ef - <unknown>
   4:     0x7f11fa62eebf - <unknown>
   5:     0x7f11fa62ff9b - <unknown>
   6:     0x7f11fa62fa34 - <unknown>
   7:     0x7f11fa62f99e - <unknown>
   8:     0x7f11fa62f971 - <unknown>
   9:     0x7f11f86dc2e2 - <unknown>
  10:     0x7f11f86dc361 - <unknown>
  11:     0x7f11fa490288 - <unknown>
  12:     0x7f11fa48d543 - <unknown>
  13:     0x7f11fa1061d8 - <unknown>
  14:     0x7f11fa105304 - <unknown>
  15:     0x7f11fa0a1f2f - <unknown>
  16:     0x7f11f8de4319 - <unknown>
  17:     0x7f11fe8e1b53 - <unknown>
  18:     0x7f11fe93467d - _PyEval_EvalFrameDefault
  19:     0x7f11fe9334d0 - <unknown>
  20:     0x7f11fe8dd9e7 - _PyFunction_Vectorcall
  21:     0x7f11fe9379b6 - _PyEval_EvalFrameDefault
  22:     0x7f11fe9334d0 - <unknown>
  23:     0x7f11fe8dd9e7 - _PyFunction_Vectorcall
  24:     0x7f11fe93467d - _PyEval_EvalFrameDefault
  25:     0x7f11fe9334d0 - <unknown>
  26:     0x7f11fe933201 - _PyEval_EvalCodeWithName
  27:     0x7f11fe9331a3 - PyEval_EvalCodeEx
  28:     0x7f11fe9a861b - PyEval_EvalCode
  29:     0x7f11fe9a7615 - <unknown>
  30:     0x7f11fe900264 - <unknown>
  31:     0x7f11fe9343d0 - _PyEval_EvalFrameDefault
  32:     0x7f11fe8e33ac - <unknown>
  33:     0x7f11fe93a3ef - _PyEval_EvalFrameDefault
  34:     0x7f11fe8e33ac - <unknown>
  35:     0x7f11fe93a3ef - _PyEval_EvalFrameDefault
  36:     0x7f11fe8e33ac - <unknown>
  37:     0x7f11fe8e20a4 - <unknown>
  38:     0x7f11fe93467d - _PyEval_EvalFrameDefault
  39:     0x7f11fe8ddc13 - <unknown>
  40:     0x7f11fe9343d0 - _PyEval_EvalFrameDefault
  41:     0x7f11fe8ddc13 - <unknown>
  42:     0x7f11fe93467d - _PyEval_EvalFrameDefault
  43:     0x7f11fe9334d0 - <unknown>
  44:     0x7f11fe8dd9e7 - _PyFunction_Vectorcall
  45:     0x7f11fe8df4f8 - <unknown>
  46:     0x7f11fe935152 - _PyEval_EvalFrameDefault
  47:     0x7f11fe8ddc13 - <unknown>
  48:     0x7f11fe93467d - _PyEval_EvalFrameDefault
  49:     0x7f11fe8ddc13 - <unknown>
  50:     0x7f11fe93467d - _PyEval_EvalFrameDefault
  51:     0x7f11fe8ddc13 - <unknown>
  52:     0x7f11fe93467d - _PyEval_EvalFrameDefault
  53:     0x7f11fe9334d0 - <unknown>
  54:     0x7f11fe8dd9e7 - _PyFunction_Vectorcall
  55:     0x7f11fe8df4f8 - <unknown>
  56:     0x7f11fe8de912 - PyVectorcall_Call
  57:     0x7f11fe9379b6 - _PyEval_EvalFrameDefault
  58:     0x7f11fe9334d0 - <unknown>
  59:     0x7f11fe8dd9e7 - _PyFunction_Vectorcall
  60:     0x7f11fe9343d0 - _PyEval_EvalFrameDefault
  61:     0x7f11fe9334d0 - <unknown>
  62:     0x7f11fe933201 - _PyEval_EvalCodeWithName
  63:     0x7f11fe9331a3 - PyEval_EvalCodeEx
  64:     0x7f11fe9a861b - PyEval_EvalCode
  65:     0x7f11fe9b9ced - <unknown>
  66:     0x7f11fe9b9c7b - <unknown>
  67:     0x7f11fe883f91 - <unknown>
  68:     0x7f11fe883d32 - PyRun_SimpleFileExFlags
  69:     0x7f11fe9c1600 - Py_RunMain
  70:     0x7f11fe9c1189 - Py_BytesMain
  71:     0x7f11fe5e9d0a - __libc_start_main
  72:     0x557b9760008a - _start
  73:                0x0 - <unknown>
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)

@markbaydoun
Author

Another example, which gives a different error when calling groupby_dynamic() with a value for the by argument:

from datetime import datetime
import polars as pl

df = pl.DataFrame(
    {
        "time": pl.date_range(
            low=datetime(2021, 12, 16),
            high=datetime(2021, 12, 16, 3),
            interval="30m",
        ),
        "n": range(7),
        "t": [1,1,2,2,3,3,4]
    }
).lazy()

# Works correctly
df.groupby_dynamic("time", every="1h", closed="right").agg(pl.col("n")).collect()

filtered = df.filter(pl.col("n") > 8)
assert filtered.collect().is_empty()

# This one fails
filtered.groupby_dynamic(
    "time", every="1h", closed="right", by=[pl.col("t")]
).agg(pl.col("n")).collect()

Error:

File /usr/local/lib/python3.9/site-packages/polars/internals/lazyframe/frame.py:1168, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, common_subplan_elimination, streaming)
   1157     common_subplan_elimination = False
   1159 ldf = self._ldf.optimization_toggle(
   1160     type_coercion,
   1161     predicate_pushdown,
   (...)
   1166     streaming,
   1167 )
-> 1168 return pli.wrap_df(ldf.collect())

PanicException: called `Option::unwrap()` on a `None` value

Backtrace:

thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', /home/runner/work/polars/polars/polars/polars-time/src/groupby/dynamic.rs:338:29
stack backtrace:
   0:     0x7f09fb48cb67 - <unknown>
   1:     0x7f09fa12396e - <unknown>
   2:     0x7f09fb460412 - <unknown>
   3:     0x7f09fb48e2ef - <unknown>
   4:     0x7f09fb48debf - <unknown>
   5:     0x7f09fb48ef9b - <unknown>
   6:     0x7f09fb48ea04 - <unknown>
   7:     0x7f09fb48e99e - <unknown>
   8:     0x7f09fb48e971 - <unknown>
   9:     0x7f09f953b2e2 - <unknown>
  10:     0x7f09f953b3dc - <unknown>
  11:     0x7f09fb2eda73 - <unknown>
  12:     0x7f09faf651d8 - <unknown>
  13:     0x7f09faf64304 - <unknown>
  14:     0x7f09faf00f2f - <unknown>
  15:     0x7f09f9c43319 - <unknown>
  16:     0x7f0a05910b53 - <unknown>
  17:     0x7f0a0596367d - _PyEval_EvalFrameDefault
  18:     0x7f0a059624d0 - <unknown>
  19:     0x7f0a0590c9e7 - _PyFunction_Vectorcall
  20:     0x7f0a059669b6 - _PyEval_EvalFrameDefault
  21:     0x7f0a059624d0 - <unknown>
  22:     0x7f0a0590c9e7 - _PyFunction_Vectorcall
  23:     0x7f0a0596367d - _PyEval_EvalFrameDefault
  24:     0x7f0a059624d0 - <unknown>
  25:     0x7f0a05962201 - _PyEval_EvalCodeWithName
  26:     0x7f0a059621a3 - PyEval_EvalCodeEx
  27:     0x7f0a059d761b - PyEval_EvalCode
  28:     0x7f0a059d6615 - <unknown>
  29:     0x7f0a0592f264 - <unknown>
  30:     0x7f0a059633d0 - _PyEval_EvalFrameDefault
  31:     0x7f0a059123ac - <unknown>
  32:     0x7f0a059693ef - _PyEval_EvalFrameDefault
  33:     0x7f0a059123ac - <unknown>
  34:     0x7f0a059693ef - _PyEval_EvalFrameDefault
  35:     0x7f0a059123ac - <unknown>
  36:     0x7f0a059110a4 - <unknown>
  37:     0x7f0a0596367d - _PyEval_EvalFrameDefault
  38:     0x7f0a0590cc13 - <unknown>
  39:     0x7f0a059633d0 - _PyEval_EvalFrameDefault
  40:     0x7f0a0590cc13 - <unknown>
  41:     0x7f0a0596367d - _PyEval_EvalFrameDefault
  42:     0x7f0a059624d0 - <unknown>
  43:     0x7f0a0590c9e7 - _PyFunction_Vectorcall
  44:     0x7f0a0590e4f8 - <unknown>
  45:     0x7f0a05964152 - _PyEval_EvalFrameDefault
  46:     0x7f0a0590cc13 - <unknown>
  47:     0x7f0a0596367d - _PyEval_EvalFrameDefault
  48:     0x7f0a0590cc13 - <unknown>
  49:     0x7f0a0596367d - _PyEval_EvalFrameDefault
  50:     0x7f0a0590cc13 - <unknown>
  51:     0x7f0a0596367d - _PyEval_EvalFrameDefault
  52:     0x7f0a059624d0 - <unknown>
  53:     0x7f0a0590c9e7 - _PyFunction_Vectorcall
  54:     0x7f0a0590e4f8 - <unknown>
  55:     0x7f0a0590d912 - PyVectorcall_Call
  56:     0x7f0a059669b6 - _PyEval_EvalFrameDefault
  57:     0x7f0a059624d0 - <unknown>
  58:     0x7f0a0590c9e7 - _PyFunction_Vectorcall
  59:     0x7f0a059633d0 - _PyEval_EvalFrameDefault
  60:     0x7f0a059624d0 - <unknown>
  61:     0x7f0a05962201 - _PyEval_EvalCodeWithName
  62:     0x7f0a059621a3 - PyEval_EvalCodeEx
  63:     0x7f0a059d761b - PyEval_EvalCode
  64:     0x7f0a059e8ced - <unknown>
  65:     0x7f0a059e8c7b - <unknown>
  66:     0x7f0a058b2f91 - <unknown>
  67:     0x7f0a058b2d32 - PyRun_SimpleFileExFlags
  68:     0x7f0a059f0600 - Py_RunMain
  69:     0x7f0a059f0189 - Py_BytesMain
  70:     0x7f0a05618d0a - __libc_start_main
  71:     0x55c1c854108a - _start
  72:                0x0 - <unknown>

@markbaydoun
Author

Update:
The mitigation code doesn't work, as fetch() sometimes returns an empty result when it shouldn't.

if lf.fetch(1).is_empty():
    ...  # Don't groupby_dynamic
else:
    ...  # Use groupby_dynamic
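
Since fetch(1) is unreliable as an emptiness check, one possible alternative is to materialize the filtered frame and test it directly. The following is a minimal sketch, assuming eager collection is affordable at this point in the pipeline. `collect_or_empty` and its `apply_dynamic` callback are hypothetical names, not part of Polars; the only Polars calls relied on are collect(), is_empty(), and lazy().

```python
def collect_or_empty(lf, apply_dynamic):
    """Run apply_dynamic only when lf has rows (hypothetical helper, not a Polars API)."""
    # Materialize the filtered frame once. Unlike fetch(1), this reflects the
    # real row count, at the cost of giving up lazy evaluation for this step.
    df = lf.collect()
    if df.is_empty():
        # Nothing to group: hand back the empty frame unchanged.
        return df
    # Re-enter lazy mode so apply_dynamic can build on a LazyFrame as before.
    return apply_dynamic(df.lazy()).collect()


# Possible usage with the frames from the reproducible example:
# result = collect_or_empty(
#     filtered,
#     lambda q: q.groupby_dynamic("time", every="1h", closed="right").agg(pl.col("n")),
# )
```

Note that in the empty case the returned frame keeps the input schema, not the schema the aggregation would produce, so downstream code may need to handle both shapes.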
