GIL deadlock in to_numpy and list.eval() #10970
Comments
Hi guys, thanks for accepting this bug report. |
For now, I've found a workaround:
# instead of
df1.to_numpy()
# use
np.stack([df1[c].to_numpy() for c in df1.columns], axis=1)
It seems that Series.to_numpy() runs in a single thread and does not cause the GIL deadlock. It is a bit slower than df1.to_numpy(), but it works. |
for my 2d array, I've been able to get |
Yes, that will also work, but it will be much slower, because all the data is converted through Python.
from random import randint
import numpy as np
import polars as pl
df1 = pl.DataFrame({
'A': [randint(1, 100) for _ in range(1000000)],
'B': [randint(1, 100) for _ in range(1000000)],
'C': [randint(1, 100) for _ in range(1000000)],
'D': [randint(1, 100) for _ in range(1000000)],
})
%%timeit
np.array(df1.rows())
471 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
np.stack([df1[c].to_numpy() for c in df1.columns], axis=1)
10.4 ms ± 646 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df1.to_numpy()
3.15 ms ± 655 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) |
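The %%timeit cells above are IPython magic, so they only run in a notebook or IPython shell. As a rough sketch for plain Python scripts, the same comparison could be made with the standard-library timeit module. The helper name best_ms is hypothetical, and it reports the best run rather than the mean ± std. dev. that %%timeit prints, so the numbers are not directly comparable to the ones above.

```python
import timeit


def best_ms(fn, repeat=7, number=10):
    """Return the best average time per call, in milliseconds.

    `fn` is called `number` times per run, for `repeat` runs; the fastest
    run is used, which is less sensitive to background noise than the mean.
    """
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    return min(totals) / number * 1000.0


if __name__ == "__main__":
    # Stand-in workload; replace with the conversion being measured.
    print(f"{best_ms(lambda: sum(range(100_000))):.3f} ms per call")
```

With the df1 frame from the comment above, the three variants could be timed as best_ms(lambda: df1.to_numpy()), best_ms(lambda: np.array(df1.rows())), and so on.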
Thank you! I'll use your version instead! |
@orlp I've found that the same GIL deadlock also appears when read_parquet() is combined with list.eval():
from random import randint
from threading import Thread
import polars as pl
df1 = pl.DataFrame({
'A': [randint(1, 100) for _ in range(10)],
'B': [randint(1, 100) for _ in range(10)],
'C': [randint(1, 100) for _ in range(10)],
'D': [randint(1, 100) for _ in range(10)],
})
df1.write_parquet('test.pq', compression='zstd')
df2 = pl.DataFrame({
'reg_A': [[6.2 for _ in range(2)]],
'reg_B': [[0.1 for _ in range(2)]],
'C': [randint(1, 100) / 1000],
'C_1': ['text'],
'C_2': [1],
'C_3': [2],
'C_4': [3],
'C_5': [4],
'C_6': ['text'],
})
def read_pq():
    for i in range(500_000):
        pl.read_parquet('test.pq')
def run():
    for i in range(100_000):
        print(i)
        df2.select([pl.col('^reg.*$').list.eval(pl.element().round(3))])
Thread(target=read_pq).start()
Thread(target=run).start()
Should I create a separate issue for this? And, more importantly, is this a bug at all? |
I cannot reproduce this on the current Polars version, so I'm closing this as unable to reproduce. Please open a new issue with a reproducible example if you're still encountering a related issue. |
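For anyone preparing a new report, here is a minimal sketch of one way to capture where each thread is stuck when a reproduction hangs. It uses the standard-library faulthandler module, whose dump_traceback_later() is documented as being implemented with a watchdog thread; as far as I know that thread can still write the stack dump while the Python threads are blocked, but treat that as an assumption. The helper name run_with_hang_dump and its parameters are hypothetical and not part of Polars.

```python
import faulthandler
import sys
from threading import Thread


def run_with_hang_dump(workers, timeout_s=30.0, file=None):
    """Run each callable in its own thread and wait for all of them.

    While the workers are running, the stacks of all threads are written
    to `file` (default: sys.stderr) every `timeout_s` seconds, so a hang
    leaves a record of where each thread was blocked.
    Returns the number of worker threads that were run.
    """
    out = sys.stderr if file is None else file
    threads = [Thread(target=w) for w in workers]
    # Arm the watchdog before starting the workers.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)
    try:
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        # Disarm the watchdog once the workers have finished.
        faulthandler.cancel_dump_traceback_later()
    return len(threads)
```

With the functions from the example above, this would be called as run_with_hang_dump([read_pq, run], timeout_s=60). If print(i) stops advancing, the dumped tracebacks can be attached to the new issue.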
Checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Log output
No response
Issue description
When all conditions in the reproducible example are met, a GIL deadlock occurs. The print(i) call will freeze.
Expected behavior
No GIL deadlock occurring.
Installed versions