Reading data spilled to disk may fail when multiple queries are ran in quick succession #13526

MarcoGorelli · 2024-01-08T10:48:22Z

Noticed here: https://github.com/pola-rs/polars/actions/runs/7446264538/job/20256144205?pr=13524

__________________________ test_out_of_core_sort_9503 __________________________
[gw0] linux -- Python 3.11.7 /home/runner/work/polars/polars/py-polars/.venv/bin/python
tests/unit/streaming/test_streaming_sort.py:137: in test_out_of_core_sort_9503
    df = q.collect(streaming=True)
polars/lazyframe/frame.py:1729: in collect
    return wrap_df(ldf.collect())
E   pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value
----------------------------- Captured stderr call -----------------------------
OOC sort forced
thread '<unnamed>' panicked at /home/runner/work/polars/polars/crates/polars-core/src/utils/mod.rs:559:34:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I can't reproduce the failure locally, though.


stinodego · 2024-01-08T13:36:16Z

Oh no they are back 😰 these were a nightmare to deal with. Probably some streaming test added in the wrong place. I can investigate.

EDIT: Another failure:
https://github.com/pola-rs/polars/actions/runs/7507685324/job/20441722432?pr=13689

tests/unit/streaming/test_streaming_sort.py:107: in test_streaming_sort
    )
polars/lazyframe/frame.py:1730: in collect
    return wrap_df(ldf.collect())
E   FileNotFoundError: No such file or directory (os error 2)

stinodego · 2024-01-18T09:24:24Z

Another unrelated failure I've seen more often:
https://github.com/pola-rs/polars/actions/runs/7567589915/job/20607064007

Opened an issue here: ts-graphviz/setup-graphviz#557

stinodego · 2024-01-25T22:09:44Z

Looks like we're not out of the woods yet:
https://github.com/pola-rs/polars/actions/runs/7660919156/job/20879257793?pr=13984

tests/unit/streaming/test_streaming_sort.py:98: in test_streaming_sort
    assert (
polars/lazyframe/frame.py:1935: in collect
    return wrap_df(ldf.collect())
E   FileNotFoundError: No such file or directory (os error 2)

EDIT: Another failure:
https://github.com/pola-rs/polars/actions/runs/7871260700/job/21474183375?pr=14428

tests/unit/streaming/test_streaming_group_by.py:287: in test_streaming_group_by_ooc_q3
    .collect(streaming=True)
polars/lazyframe/frame.py:1935: in collect
    return wrap_df(ldf.collect())
E   FileNotFoundError: No such file or directory (os error 2)

stinodego · 2024-02-12T13:23:44Z

I've done everything I could think of to make sure the tests are run correctly. All streaming tests run in the same process, and the out-of-core tests spill to disk in their own temporary directory. Still, the errors keep happening.

This leads me to believe there is an issue with the functionality itself. Some issue with the garbage collector perhaps, since it expects files to exist when they do not.

I am rebranding this as a bug (renamed the issue to what I think causes this), though it will be hard to fix without a proper MRE.

I'm going to have to skip the tests that cause the most failures for now - this issue causes our test suite to be too unreliable.
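For readers who want to try the per-test spill directory setup described above, here is a minimal sketch using only the standard library. The helper name `isolated_spill_dir` is hypothetical and not part of the Polars test suite. `POLARS_TEMP_DIR` is taken from the reproducers in this thread; whether Polars re-reads it for every query or only once at import time is an assumption you should check for your version.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def isolated_spill_dir():
    """Point POLARS_TEMP_DIR at a fresh directory for the duration of a block.

    Hypothetical helper: it only manages the environment variable and the
    directory; it does not import or configure Polars itself.
    """
    previous = os.environ.get("POLARS_TEMP_DIR")
    with tempfile.TemporaryDirectory(prefix="pl_spill_") as tmp:
        os.environ["POLARS_TEMP_DIR"] = tmp
        try:
            yield Path(tmp)
        finally:
            # Restore the previous value so one test cannot leak state
            # into the next one.
            if previous is None:
                os.environ.pop("POLARS_TEMP_DIR", None)
            else:
                os.environ["POLARS_TEMP_DIR"] = previous


if __name__ == "__main__":
    with isolated_spill_dir() as spill_dir:
        # An out-of-core query would be collected here.
        print("spilling to", spill_dir)
```

Note that isolating the directory per test only rules out interference between tests. It would not help if the race is inside a single query, which is what the rest of this thread points to.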

nameexhaustion · 2024-02-12T16:53:58Z

This may help to reproduce a panic locally

import os
from pathlib import Path

tmp_dir = Path("env/pl_spill/")

os.environ["POLARS_FORCE_OOC"] = "1"
os.environ["POLARS_MAX_THREADS"] = "100"
os.environ["POLARS_TEMP_DIR"] = str(tmp_dir)

import polars as pl

print(f"{pl.__version__=} {pl.thread_pool_size()=}")

n = 10_000_000
df = pl.select((pl.int_range(n) // 9_999_999).shuffle(3).alias("x"))

print(df)

lf = df.lazy()

for i in range(1, 21):
    print(i)
    (
        pl.concat(
            [
                lf.group_by("x").len().select("x"),
                lf.sort("x"),
                lf.group_by("x").len().select("x"),
                lf.sort("x"),
                lf.group_by("x").len().select("x").sort("x"),
            ]
        ).collect(streaming=True)
    )

Most of the time it will print this panic (but won't crash the process):

thread '<unnamed>' panicked at crates/polars-pipe/src/executors/sinks/io.rs:97:51:
called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }

But on my local machine, it usually crashes with this within 10 tries:

thread '<unnamed>' panicked at /home/runner/work/polars/polars/crates/polars-core/src/utils/mod.rs:572:34:
called `Option::unwrap()` on a `None` value
Traceback (most recent call last):
  File "/home/dev/y.py", line 33, in <module>
    ).collect(streaming=True)
      ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.local/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1940, in collect
    return wrap_df(ldf.collect())
                   ^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value

My guess is maybe multiple IO threads using the same spill dir?

polars/crates/polars-pipe/src/executors/sinks/io.rs, line 40 in 649c33a:

let uuid = SystemTime::now()
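To illustrate the guess above, here is a small Python sketch of why a spill directory named only after a clock reading can be shared by accident, and how a random identifier avoids that. It assumes the directory name is derived from the timestamp alone, which the one-line snippet suggests but does not confirm. The functions `timestamp_name`, `unique_spill_dir`, and `create_many` are hypothetical and are not Polars APIs.

```python
import tempfile
import uuid
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def timestamp_name(now_ns: int) -> str:
    """Name derived only from a clock reading (hypothetical scheme).

    Two threads that observe the same clock value get the same name.
    """
    return f"spill_{now_ns}"


def unique_spill_dir(base: Path) -> Path:
    """Create a spill directory with a random, per-call name."""
    path = base / f"spill_{uuid.uuid4().hex}"
    # exist_ok=False makes an accidental reuse fail loudly instead of
    # silently sharing the directory with another writer.
    path.mkdir(parents=True, exist_ok=False)
    return path


def create_many(base: Path, n_threads: int = 32) -> list[Path]:
    """Create one spill directory per worker thread, concurrently."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(lambda _: unique_spill_dir(base), range(n_threads)))


if __name__ == "__main__":
    # Same clock reading, same name: this is the suspected collision.
    print(timestamp_name(1_000) == timestamp_name(1_000))

    with tempfile.TemporaryDirectory() as tmp:
        dirs = create_many(Path(tmp))
        print(len(set(dirs)) == len(dirs))
```

If two sinks do end up in the same directory, one cleaning up while the other is still reading would be consistent with the "No such file or directory" errors seen here, though that link is a hypothesis rather than something this sketch proves.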

stinodego · 2024-02-12T17:09:35Z

Thanks @nameexhaustion, this is very helpful!

stinodego · 2024-02-17T00:39:23Z

The nice reproducer by @nameexhaustion now seems fixed, but unfortunately, it seems like we're still getting FileNotFoundErrors in the test suite:
https://github.com/pola-rs/polars/actions/runs/7937634446/job/21675190182?pr=14548

tests/unit/streaming/test_streaming_group_by.py:274: in test_streaming_group_by_ooc_q3
    lf.group_by("a", "b")
polars/lazyframe/frame.py:1938: in collect
    return wrap_df(ldf.collect())
E   FileNotFoundError: No such file or directory (os error 2)

So there's still something going wrong somewhere.

nameexhaustion · 2024-02-17T01:03:48Z

This should give something for streaming group by on 0.20.9

import os
from pathlib import Path

tmp_dir = Path("env/pl_spill/")

os.environ["POLARS_FORCE_OOC"] = "1"
os.environ["POLARS_MAX_THREADS"] = "100"
os.environ["POLARS_TEMP_DIR"] = str(tmp_dir)

import polars as pl

print(f"{pl.__version__=} {pl.thread_pool_size()=}")

n = 1_000_000
random_integers = (pl.int_range(n, eager=True) // 333).shuffle(3)
lf = pl.LazyFrame({"a": random_integers, "b": random_integers})

for i in range(1, 11):
    print(i)
    result = pl.collect_all(
        20
        * [
            lf.group_by("a", "b")
            .agg(
                pl.first("a").alias("a_first"),
            )
            .sort("a")
        ],
        streaming=True,
    )

You should be able to see two variants of crashes after running it a few times:

Traceback (most recent call last):
  File "", line 20, in <module>
    result = pl.collect_all(
             ^^^^^^^^^^^^^^^
  File "/home/dev/.local/lib/python3.11/site-packages/polars/functions/lazy.py", line 1687, in collect_all
    out = plr.collect_all(prepared)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: No such file or directory (os error 2)

thread '<unnamed>' panicked at crates/polars-pipe/src/executors/sinks/sort/sink.rs:60:28:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("could not create lockfile"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "", line 20, in <module>
    result = pl.collect_all(
             ^^^^^^^^^^^^^^^
  File "/home/dev/.local/lib/python3.11/site-packages/polars/functions/lazy.py", line 1687, in collect_all
    out = plr.collect_all(prepared)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("could not create lockfile"))

stinodego added accepted Ready for implementation test Related to the test suite internal An internal refactor or improvement labels Jan 8, 2024

stinodego self-assigned this Jan 8, 2024

stinodego mentioned this issue Jan 24, 2024

test(python): Fix spurious test failures #13961

Merged

stinodego closed this as completed in #13961 Jan 24, 2024

mcrumiller mentioned this issue Jan 24, 2024

CI failure due to test_streaming_unique.py::test_streaming_out_of_core_unique #13970

Closed

stinodego reopened this Jan 25, 2024

stinodego mentioned this issue Feb 12, 2024

test(python): Set specific temp dir for OOC tests #14420

Merged

stinodego changed the title ~~Spurious test failure test_out_of_core_sort_9503~~ Intermittent test failures for out-of-core tests Feb 12, 2024

stinodego added bug Something isn't working P-low Priority: low python Related to Python Polars and removed internal An internal refactor or improvement accepted Ready for implementation labels Feb 12, 2024

stinodego changed the title ~~Intermittent test failures for out-of-core tests~~ Reading data spilled to disk may fail when multiple queries are ran in quick succession Feb 12, 2024

stinodego mentioned this issue Feb 12, 2024

Out of core query panics with "expected BinaryOffset, got binary" #14430

Closed

2 tasks

stinodego added P-medium Priority: medium and removed P-low Priority: low labels Feb 12, 2024

stinodego mentioned this issue Feb 12, 2024

test(python): Skip some OOC tests that fail randomly in the CI #14434

Merged

stinodego added the A-streaming Related to the streaming engine label Feb 12, 2024

stinodego removed their assignment Feb 15, 2024

ritchie46 mentioned this issue Feb 15, 2024

fix: race conditions in OOC writing #14510

Merged

ritchie46 closed this as completed in #14510 Feb 15, 2024

stinodego reopened this Feb 17, 2024

This was referenced Feb 21, 2024

Revert "test(python): Re-enable streaming OOC tests" #14629

Merged

test(python): Skip more OOC tests #14677

Closed

ritchie46 mentioned this issue Feb 25, 2024

fix: Fix contention panics in file gc threads #14690

Merged

ritchie46 closed this as completed in #14690 Feb 25, 2024

c-peters added the accepted Ready for implementation label Feb 26, 2024

c-peters assigned ritchie46 Feb 26, 2024
