GIL deadlock in to_numpy and list.eval() #10970

Closed
2 tasks done
s-b90 opened this issue Sep 7, 2023 · 7 comments
Labels
bug (Something isn't working), P-medium (Priority: medium), python (Related to Python Polars)

Comments

@s-b90

s-b90 commented Sep 7, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from random import randint
from threading import Thread
import polars as pl

df1 = pl.DataFrame({
    'A': [randint(1, 100) for _ in range(100)],
    'B': [randint(1, 100) for _ in range(100)],
    'C': [randint(1, 100) for _ in range(100)],
    'D': [randint(1, 100) for _ in range(100)],
})

df2 = pl.DataFrame({
    'reg_A': [[6.2 for _ in range(2)]],
    'reg_B': [[0.1 for _ in range(2)]],
    'C': [randint(1, 100) / 1000],
    'C_1': ['text'],
    'C_2': [1],
    'C_3': [2],
    'C_4': [3],
    'C_5': [4],
    'C_6': ['text'],
})


def pq_to_np():
    for i in range(100_000):
        df1.to_numpy()


def run():
    for i in range(10_000):
        print(i)
        df2.select([pl.col('^reg.*$').list.eval(pl.element().round(3))])


Thread(target=pq_to_np).start()
Thread(target=run).start()

Log output

No response

Issue description

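Running the reproducible example above causes a GIL deadlock: one thread calls df1.to_numpy() in a loop while a second thread applies list.eval() to several columns in a single select.
The print(i) output freezes once the deadlock occurs.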
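
Since the report has no log output, one way to capture where each thread is stuck is the standard library's faulthandler module. The sketch below is an illustration and not part of the original report; the helper name run_with_watchdog and its parameters are hypothetical. It starts each callable in its own thread and asks faulthandler to dump every thread's stack if the workers have not finished within a timeout. As far as I understand, the faulthandler watchdog thread does not need the GIL, so the dump should still be written even if the caller itself hangs in join().

```python
import faulthandler
import sys
import threading
import time


def run_with_watchdog(workers, timeout_s=30.0, log=None):
    """Run each callable in its own thread (hypothetical helper).

    If the workers have not all finished within `timeout_s` seconds,
    faulthandler writes the stack of every thread to `log`.
    Returns True if every worker finished in time, False otherwise.
    """
    if log is None:
        log = sys.stderr  # must be a real file with a fileno()

    # Arm the watchdog first so it fires even if this thread gets stuck.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=log)

    threads = [threading.Thread(target=w, daemon=True) for w in workers]
    for t in threads:
        t.start()

    # Wait for the workers, but never longer than the overall deadline.
    deadline = time.monotonic() + timeout_s
    for t in threads:
        t.join(max(0.0, deadline - time.monotonic()))

    finished = not any(t.is_alive() for t in threads)
    if finished:
        # Everything completed: disarm the watchdog, no dump is needed.
        faulthandler.cancel_dump_traceback_later()
    return finished
```

With the functions from the example, this could be called as run_with_watchdog([pq_to_np, run], timeout_s=60) in place of the two Thread(...).start() lines, and the resulting dump attached under "Log output".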

Expected behavior

No GIL deadlock occurring.

Installed versions

Polars:              0.19.2
Index type:          UInt32
Platform:            Linux-6.3.7-060307-generic-x86_64-with-glibc2.35
Python:              3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
matplotlib:          3.7.2
numpy:               1.24.2
pandas:              2.0.3
pyarrow:             12.0.1
pydantic:            <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>
None
@s-b90 added the bug and python labels Sep 7, 2023
@orlp changed the title from "GIL in to_numpy and list.eval()" to "GIL deadlock in to_numpy and list.eval()" Sep 7, 2023
@orlp added the accepted label Sep 8, 2023
@s-b90
Author

s-b90 commented Oct 2, 2023

Hi all, thanks for accepting this bug report.
Is there any chance this will be fixed in the near future?

@s-b90
Author

s-b90 commented Oct 4, 2023

For now, I've found a workaround.

import numpy as np

# instead of
df1.to_numpy()
# use
np.stack([df1[c].to_numpy() for c in df1.columns], axis=1)

It seems that Series.to_numpy() executes in a single thread and does not cause the GIL deadlock. It is a bit slower than df.to_numpy(), but it works.

@qci-amos

qci-amos commented Oct 4, 2023

For my 2D array, I've been able to use np.array(arr2d.rows()) as an alternative to arr2d.to_numpy().

@s-b90
Author

s-b90 commented Oct 4, 2023

Yes, that will also work, but it will be much slower because all the data is converted through Python objects.

from random import randint

import numpy as np
import polars as pl

df1 = pl.DataFrame({
    'A': [randint(1, 100) for _ in range(1000000)],
    'B': [randint(1, 100) for _ in range(1000000)],
    'C': [randint(1, 100) for _ in range(1000000)],
    'D': [randint(1, 100) for _ in range(1000000)],
})
%%timeit
np.array(df1.rows())
471 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
np.stack([df1[c].to_numpy() for c in df1.columns], axis=1)
10.4 ms ± 646 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df1.to_numpy()
3.15 ms ± 655 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@qci-amos

qci-amos commented Oct 4, 2023

Thank you! I'll use your version instead!

@s-b90
Author

s-b90 commented Nov 13, 2023

@orlp I've found that the same GIL deadlock also appears when read_parquet() is combined with list.eval().
The problem arises when several list.eval expressions are used in one select or with_columns.

from random import randint
from threading import Thread
import polars as pl

df1 = pl.DataFrame({
    'A': [randint(1, 100) for _ in range(10)],
    'B': [randint(1, 100) for _ in range(10)],
    'C': [randint(1, 100) for _ in range(10)],
    'D': [randint(1, 100) for _ in range(10)],
})
df1.write_parquet('test.pq', compression='zstd')

df2 = pl.DataFrame({
    'reg_A': [[6.2 for _ in range(2)]],
    'reg_B': [[0.1 for _ in range(2)]],
    'C': [randint(1, 100) / 1000],
    'C_1': ['text'],
    'C_2': [1],
    'C_3': [2],
    'C_4': [3],
    'C_5': [4],
    'C_6': ['text'],
})


def read_pq():
    for i in range(500_000):
        pl.read_parquet('test.pq')


def run():
    for i in range(100_000):
        print(i)
        df2.select([pl.col('^reg.*$').list.eval(pl.element().round(3))])


Thread(target=read_pq).start()
Thread(target=run).start()

Should I create a separate issue for this?

A more important question: is this a bug at all?
I saw #5347, and I've read the multiprocessing page in the user guide.
But how should Polars behave with the threading library? Are there any restrictions?
Is it unexpected that a GIL deadlock appears when several threads use Polars?
Maybe you or @ritchie46 can provide some details here?
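
While that question is open, one conservative option would be to let only one thread at a time enter Polars. The sketch below is an untested assumption, not a confirmed fix and not something the maintainers have recommended; the names serialized and _polars_lock are hypothetical. It wraps a callable so that every call goes through one shared lock, which gives up parallelism between the Python threads in exchange for never having two of them inside the library at once.

```python
import threading
from functools import wraps

# Hypothetical process-wide lock shared by every wrapped call.
# An RLock lets a wrapped function call another wrapped function
# from the same thread without blocking on itself.
_polars_lock = threading.RLock()


def serialized(fn):
    """Return a wrapper that runs `fn` while holding the shared lock."""

    @wraps(fn)
    def wrapper(*args, **kwargs):
        with _polars_lock:
            return fn(*args, **kwargs)

    return wrapper
```

In the examples above this would mean calling serialized(df1.to_numpy)() in one thread and wrapping the df2.select(...) call in a small function decorated with @serialized in the other. Whether this actually avoids the deadlock has not been verified here.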

@stinodego added the P-medium label and removed the accepted label Jan 12, 2024
@stinodego
Member

I cannot reproduce this on the current Polars version, so I'm closing this as unable to reproduce. Please open a new issue with a reproducible example if you're still encountering a related issue.

@stinodego closed this as not planned May 22, 2024

4 participants