Reading data spilled to disk may fail when multiple queries are run in quick succession #13526
Comments
Oh no they are back 😰 these were a nightmare to deal with. Probably some streaming test added in the wrong place. I can investigate. EDIT: Another failure:
|
Another unrelated failure I've seen more often: Opened an issue here: ts-graphviz/setup-graphviz#557 |
Looks like we're not out of the woods yet:
EDIT: Another failure:
|
I've done everything I could think of to make sure the tests are run correctly. All streaming tests run in the same process, and the out-of-core tests spill to disk in their own temporary directory. Still, the errors keep happening. This leads me to believe there is an issue with the functionality itself. Perhaps some issue with the garbage collector, since it expects files to exist when they do not. I am rebranding this as a bug (I renamed the issue to what I think causes this), though it will be hard to fix without a proper MRE. I'm going to have to skip the tests that cause the most failures for now, since this issue makes our test suite too unreliable. |
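For readers who want to try the same isolation outside the test suite, here is a minimal sketch of giving each run its own spill directory. The helper name `isolated_spill_dir` is hypothetical, and the sketch only manipulates the `POLARS_TEMP_DIR` environment variable used elsewhere in this thread. It assumes, without verification, that Polars picks up the variable when a query runs; if the value is cached at import time, it would have to be set before `import polars`.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def isolated_spill_dir(env_var: str = "POLARS_TEMP_DIR"):
    """Point `env_var` at a fresh temporary directory for the duration of the block.

    Hypothetical helper: it only touches the environment and the filesystem,
    and never calls into Polars itself.
    """
    previous = os.environ.get(env_var)
    with tempfile.TemporaryDirectory(prefix="pl_spill_") as tmp:
        os.environ[env_var] = tmp
        try:
            yield Path(tmp)
        finally:
            # Restore the previous value so later runs are unaffected.
            if previous is None:
                os.environ.pop(env_var, None)
            else:
                os.environ[env_var] = previous


if __name__ == "__main__":
    with isolated_spill_dir() as spill_dir:
        # An out-of-core query would be collected here.
        print(f"spilling to {spill_dir}")
```

The temporary directory is deleted when the block exits, so anything that still expects spill files afterwards would fail loudly rather than read stale data.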
This may help to reproduce a panic locally:

import os
from pathlib import Path

tmp_dir = Path("env/pl_spill/")
os.environ["POLARS_FORCE_OOC"] = "1"
os.environ["POLARS_MAX_THREADS"] = "100"
os.environ["POLARS_TEMP_DIR"] = str(tmp_dir)

import polars as pl

print(f"{pl.__version__=} {pl.thread_pool_size()=}")

n = 10_000_000
df = pl.select((pl.int_range(n) // 9_999_999).shuffle(3).alias("x"))
print(df)
lf = df.lazy()

for i in range(1, 21):
    print(i)
    (
        pl.concat(
            [
                lf.group_by("x").len().select("x"),
                lf.sort("x"),
                lf.group_by("x").len().select("x"),
                lf.sort("x"),
                lf.group_by("x").len().select("x").sort("x"),
            ]
        ).collect(streaming=True)
    )

Most of the time it will print this panic (but won't crash the process)
But on my local machine, it usually crashes with this within 10 tries:
My guess is maybe multiple IO threads using the same spill dir?
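To make that guess concrete, here is a toy sketch of how two queries sharing one spill directory could produce a missing-file error. Everything in it (`ToyQuery`, `run_interleaved`, the file names) is hypothetical and only uses the Python standard library. It does not reflect how Polars actually manages spill files; it only illustrates the failure mode being guessed at.

```python
import shutil
import tempfile
from pathlib import Path


class ToyQuery:
    """Toy stand-in for a query that spills one file and cleans up afterwards.

    Hypothetical: this is NOT how Polars manages its spill files.
    """

    def __init__(self, name: str, spill_dir: Path) -> None:
        self.spill_dir = spill_dir
        self.spill_file = spill_dir / f"{name}.bin"

    def spill(self) -> None:
        self.spill_dir.mkdir(parents=True, exist_ok=True)
        self.spill_file.write_bytes(b"partial result")

    def read_back(self) -> bytes:
        return self.spill_file.read_bytes()

    def cleanup(self) -> None:
        # Removes the whole directory, including files this query does not own.
        shutil.rmtree(self.spill_dir, ignore_errors=True)


def run_interleaved(dir_a: Path, dir_b: Path) -> str:
    """Interleave two toy queries and report whether B can still read its file."""
    a, b = ToyQuery("a", dir_a), ToyQuery("b", dir_b)
    a.spill()
    b.spill()
    a.cleanup()  # A finishes first and removes "its" directory
    try:
        b.read_back()
        return "ok"
    except FileNotFoundError:
        return "FileNotFoundError"
    finally:
        b.cleanup()


root = Path(tempfile.mkdtemp(prefix="spill_demo_"))
shared = run_interleaved(root / "shared", root / "shared")
separate = run_interleaved(root / "a", root / "b")
shutil.rmtree(root, ignore_errors=True)
print(f"{shared=} {separate=}")  # shared='FileNotFoundError' separate='ok'
```

With a shared directory, the first query's cleanup deletes the second query's spill file before it is read back. With one directory per query, the same interleaving succeeds.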
|
Thanks @nameexhaustion , this is very helpful! |
The nice reproducer by @nameexhaustion now seems fixed, but unfortunately, it seems like we're still getting FileNotFoundErrors in the test suite:
So there's still something going wrong somewhere. |
This should give something for streaming group by on 0.20.9:

import os
from pathlib import Path

tmp_dir = Path("env/pl_spill/")
os.environ["POLARS_FORCE_OOC"] = "1"
os.environ["POLARS_MAX_THREADS"] = "100"
os.environ["POLARS_TEMP_DIR"] = str(tmp_dir)

import polars as pl

print(f"{pl.__version__=} {pl.thread_pool_size()=}")

n = 1_000_000
random_integers = (pl.int_range(n, eager=True) // 333).shuffle(3)
lf = pl.LazyFrame({"a": random_integers, "b": random_integers})

for i in range(1, 11):
    print(i)
    result = pl.collect_all(
        20
        * [
            lf.group_by("a", "b")
            .agg(
                pl.first("a").alias("a_first"),
            )
            .sort("a")
        ],
        streaming=True,
    )

Should be able to see 2 variants of crashes after running it a few times:
|
Noticed here: https://github.com/pola-rs/polars/actions/runs/7446264538/job/20256144205?pr=13524
I can't reproduce the failure locally, though.