Reading data spilled to disk may fail when multiple queries are ran in quick succession #13526

Closed
MarcoGorelli opened this issue Jan 8, 2024 · 8 comments · Fixed by #13961, #14510 or #14690
Assignees
Labels
A-streaming Related to the streaming engine accepted Ready for implementation bug Something isn't working P-medium Priority: medium python Related to Python Polars test Related to the test suite

Comments

@MarcoGorelli
Collaborator

Noticed here: https://github.com/pola-rs/polars/actions/runs/7446264538/job/20256144205?pr=13524

__________________________ test_out_of_core_sort_9503 __________________________
[gw0] linux -- Python 3.11.7 /home/runner/work/polars/polars/py-polars/.venv/bin/python
tests/unit/streaming/test_streaming_sort.py:137: in test_out_of_core_sort_9503
    df = q.collect(streaming=True)
polars/lazyframe/frame.py:1729: in collect
    return wrap_df(ldf.collect())
E   pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value
----------------------------- Captured stderr call -----------------------------
OOC sort forced
thread '<unnamed>' panicked at /home/runner/work/polars/polars/crates/polars-core/src/utils/mod.rs:559:34:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I can't reproduce the failure locally though
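
For a failure that only shows up intermittently, one option is to rerun the failing command in a loop with RUST_BACKTRACE=1 set (as the panic note suggests) and stop at the first non-zero exit. This is a minimal sketch, not part of the Polars test suite; `run_until_failure` and its parameters are hypothetical names introduced here for illustration.

```python
import os
import subprocess
import sys


def run_until_failure(cmd, max_runs=20, extra_env=None):
    """Run `cmd` up to `max_runs` times.

    Returns (attempt_number, captured_stderr) for the first run that exits
    with a non-zero status, or None if every run succeeded.
    """
    # RUST_BACKTRACE=1 makes a Rust panic print a backtrace to stderr.
    env = {**os.environ, "RUST_BACKTRACE": "1", **(extra_env or {})}
    for attempt in range(1, max_runs + 1):
        proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
        if proc.returncode != 0:
            return attempt, proc.stderr
    return None


# Hypothetical usage, with the test id taken from the traceback above:
# run_until_failure(
#     [sys.executable, "-m", "pytest",
#      "tests/unit/streaming/test_streaming_sort.py::test_out_of_core_sort_9503"],
#     max_runs=50,
# )
```

Running each attempt in a fresh process keeps one crash from affecting the next attempt, at the cost of paying the import time on every run.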

@stinodego
Member

stinodego commented Jan 8, 2024

Oh no, they are back 😰 These were a nightmare to deal with. Probably some streaming test was added in the wrong place. I can investigate.

EDIT: Another failure:
https://github.com/pola-rs/polars/actions/runs/7507685324/job/20441722432?pr=13689

tests/unit/streaming/test_streaming_sort.py:107: in test_streaming_sort
    )
polars/lazyframe/frame.py:1730: in collect
    return wrap_df(ldf.collect())
E   FileNotFoundError: No such file or directory (os error 2)

@stinodego stinodego added accepted Ready for implementation test Related to the test suite internal An internal refactor or improvement labels Jan 8, 2024
@stinodego stinodego self-assigned this Jan 8, 2024
@stinodego
Member

stinodego commented Jan 18, 2024

Another unrelated failure I've seen more often:
https://github.com/pola-rs/polars/actions/runs/7567589915/job/20607064007

Opened an issue here: ts-graphviz/setup-graphviz#557

@stinodego
Member

stinodego commented Jan 25, 2024

Looks like we're not out of the woods yet:
https://github.com/pola-rs/polars/actions/runs/7660919156/job/20879257793?pr=13984

tests/unit/streaming/test_streaming_sort.py:98: in test_streaming_sort
    assert (
polars/lazyframe/frame.py:1935: in collect
    return wrap_df(ldf.collect())
E   FileNotFoundError: No such file or directory (os error 2)

EDIT: Another failure:
https://github.com/pola-rs/polars/actions/runs/7871260700/job/21474183375?pr=14428

tests/unit/streaming/test_streaming_group_by.py:287: in test_streaming_group_by_ooc_q3
    .collect(streaming=True)
polars/lazyframe/frame.py:1935: in collect
    return wrap_df(ldf.collect())
E   FileNotFoundError: No such file or directory (os error 2)

@stinodego stinodego reopened this Jan 25, 2024
@stinodego stinodego changed the title Spurious test failure test_out_of_core_sort_9503 Intermittent test failures for out-of-core tests Feb 12, 2024
@stinodego
Member

stinodego commented Feb 12, 2024

I've done everything I could think of to make sure the tests are run correctly. All streaming tests run in the same process, and the out-of-core tests each spill to disk in their own temporary directory. Still, the errors keep happening.

This leads me to believe there is an issue with the functionality itself. Perhaps an issue with the garbage collector, since it expects files to exist when they do not.

I am rebranding this as a bug (renamed the issue to what I think causes this), though it will be hard to fix without a proper MRE.

I'm going to have to skip the tests that cause the most failures for now - this issue causes our test suite to be too unreliable.
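
For reference, the isolation described above (each out-of-core test spilling into its own temporary directory) could look roughly like the sketch below. It only assumes the POLARS_TEMP_DIR environment variable that appears elsewhere in this thread; `isolated_spill_dir` is a hypothetical helper, not the fixture actually used in the test suite, and whether Polars re-reads the variable for every query is an assumption.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def isolated_spill_dir(var="POLARS_TEMP_DIR"):
    """Point `var` at a fresh temporary directory for the duration of the block."""
    previous = os.environ.get(var)
    with tempfile.TemporaryDirectory(prefix="pl_spill_") as tmp:
        os.environ[var] = tmp
        try:
            yield Path(tmp)
        finally:
            # Restore the previous value so later tests are unaffected.
            if previous is None:
                os.environ.pop(var, None)
            else:
                os.environ[var] = previous


# Hypothetical usage inside a test:
# with isolated_spill_dir():
#     q.collect(streaming=True)
```

If failures persist even with this kind of isolation, as reported here, the directory shared between tests is unlikely to be the only cause.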

@stinodego stinodego added bug Something isn't working P-low Priority: low python Related to Python Polars and removed internal An internal refactor or improvement accepted Ready for implementation labels Feb 12, 2024
@stinodego stinodego changed the title Intermittent test failures for out-of-core tests Reading data spilled to disk may fail when multiple queries are ran in quick succession Feb 12, 2024
@stinodego stinodego added P-medium Priority: medium and removed P-low Priority: low labels Feb 12, 2024
@stinodego stinodego added the A-streaming Related to the streaming engine label Feb 12, 2024
@nameexhaustion
Collaborator

nameexhaustion commented Feb 12, 2024

This may help to reproduce a panic locally:

import os
from pathlib import Path

tmp_dir = Path("env/pl_spill/")

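# Set these before importing polars: POLARS_MAX_THREADS is read when the
# thread pool is initialized.
os.environ["POLARS_FORCE_OOC"] = "1"
os.environ["POLARS_MAX_THREADS"] = "100"
os.environ["POLARS_TEMP_DIR"] = str(tmp_dir)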

import polars as pl

print(f"{pl.__version__=} {pl.thread_pool_size()=}")

n = 10_000_000
df = pl.select((pl.int_range(n) // 9_999_999).shuffle(3).alias("x"))

print(df)

lf = df.lazy()

for i in range(1, 21):
    print(i)
    (
        pl.concat(
            [
                lf.group_by("x").len().select("x"),
                lf.sort("x"),
                lf.group_by("x").len().select("x"),
                lf.sort("x"),
                lf.group_by("x").len().select("x").sort("x"),
            ]
        ).collect(streaming=True)
    )

Most of the time it will print this panic (but won't crash the process):

thread '<unnamed>' panicked at crates/polars-pipe/src/executors/sinks/io.rs:97:51:
called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }

But on my local machine, it usually crashes with this within 10 tries:

thread '<unnamed>' panicked at /home/runner/work/polars/polars/crates/polars-core/src/utils/mod.rs:572:34:
called `Option::unwrap()` on a `None` value
Traceback (most recent call last):
  File "/home/dev/y.py", line 33, in <module>
    ).collect(streaming=True)
      ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.local/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1940, in collect
    return wrap_df(ldf.collect())
                   ^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value

My guess: maybe multiple IO threads are using the same spill dir?

let uuid = SystemTime::now()
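
To illustrate the guess above: if a spill directory is named after a clock reading, two callers that observe the same reading end up with the same name, while a random identifier avoids that. The sketch below is a Python stand-in, not the Rust code in question. `timestamp_name`, `random_name`, and `frozen_clock` are hypothetical, and the frozen clock is a deliberate simplification of two threads reading the time at the same instant.

```python
import uuid


def timestamp_name(clock):
    """Name a spill directory after the clock reading (nanoseconds)."""
    return f"spill-{clock()}"


def random_name():
    """Name a spill directory after a random UUID instead."""
    return f"spill-{uuid.uuid4().hex}"


def frozen_clock():
    # Stand-in for two threads that read the clock at the same instant.
    return 1_707_700_000_000_000_000


a = timestamp_name(frozen_clock)
b = timestamp_name(frozen_clock)
print(a == b)  # True: both callers would share one directory

# Two random names are distinct for all practical purposes.
print(random_name() == random_name())
```

Whether the real code can actually observe identical timestamps depends on clock resolution and scheduling, so this only shows that the naming scheme permits a collision, not that one occurs.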

@stinodego
Member

Thanks @nameexhaustion , this is very helpful!

@stinodego
Member

The nice reproducer by @nameexhaustion now seems fixed, but unfortunately, it seems like we're still getting FileNotFoundErrors in the test suite:
https://github.com/pola-rs/polars/actions/runs/7937634446/job/21675190182?pr=14548

tests/unit/streaming/test_streaming_group_by.py:274: in test_streaming_group_by_ooc_q3
    lf.group_by("a", "b")
polars/lazyframe/frame.py:1938: in collect
    return wrap_df(ldf.collect())
E   FileNotFoundError: No such file or directory (os error 2)

So there's still something going wrong somewhere.

@stinodego stinodego reopened this Feb 17, 2024
@nameexhaustion
Collaborator

This should trigger a failure for streaming group by on 0.20.9:

import os
from pathlib import Path

tmp_dir = Path("env/pl_spill/")

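# Set these before importing polars: POLARS_MAX_THREADS is read when the
# thread pool is initialized.
os.environ["POLARS_FORCE_OOC"] = "1"
os.environ["POLARS_MAX_THREADS"] = "100"
os.environ["POLARS_TEMP_DIR"] = str(tmp_dir)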

import polars as pl

print(f"{pl.__version__=} {pl.thread_pool_size()=}")

n = 1_000_000
random_integers = (pl.int_range(n, eager=True) // 333).shuffle(3)
lf = pl.LazyFrame({"a": random_integers, "b": random_integers})

for i in range(1, 11):
    print(i)
    result = pl.collect_all(
        20
        * [
            lf.group_by("a", "b")
            .agg(
                pl.first("a").alias("a_first"),
            )
            .sort("a")
        ],
        streaming=True,
    )

You should be able to see two variants of the crash after running it a few times:

Traceback (most recent call last):
  File "", line 20, in <module>
    result = pl.collect_all(
             ^^^^^^^^^^^^^^^
  File "/home/dev/.local/lib/python3.11/site-packages/polars/functions/lazy.py", line 1687, in collect_all
    out = plr.collect_all(prepared)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: No such file or directory (os error 2)

thread '<unnamed>' panicked at crates/polars-pipe/src/executors/sinks/sort/sink.rs:60:28:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("could not create lockfile"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "", line 20, in <module>
    result = pl.collect_all(
             ^^^^^^^^^^^^^^^
  File "/home/dev/.local/lib/python3.11/site-packages/polars/functions/lazy.py", line 1687, in collect_all
    out = plr.collect_all(prepared)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("could not create lockfile"))
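
Both variants are consistent with a spill directory disappearing between the moment a query creates it and the moment it writes into it. The sketch below replays that ordering deterministically in a single thread. It illustrates the hypothesis only and is not how Polars manages its spill files; the `.lock` file name and the `simulate_cleanup_race` helper are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def simulate_cleanup_race():
    """Replay 'create dir, cleanup removes dir, create lockfile' in order."""
    base = Path(tempfile.mkdtemp(prefix="pl_spill_demo_"))
    spill_dir = base / "query-1"
    try:
        spill_dir.mkdir()          # step 1: the query creates its spill directory
        shutil.rmtree(spill_dir)   # step 2: a cleanup pass removes it as stale
        try:
            (spill_dir / ".lock").touch()  # step 3: the query creates a lockfile
        except FileNotFoundError as exc:
            return type(exc).__name__
        return "no error"
    finally:
        shutil.rmtree(base, ignore_errors=True)


print(simulate_cleanup_race())  # FileNotFoundError
```

In a real race the three steps would run on different threads or for different queries, which would explain why the failure only appears when many queries run in quick succession.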
