GIL deadlock in to_numpy and list.eval() #10970

s-b90 · 2023-09-07T11:54:38Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from random import randint
from threading import Thread
import polars as pl

df1 = pl.DataFrame({
    'A': [randint(1, 100) for _ in range(100)],
    'B': [randint(1, 100) for _ in range(100)],
    'C': [randint(1, 100) for _ in range(100)],
    'D': [randint(1, 100) for _ in range(100)],
})

df2 = pl.DataFrame({
    'reg_A': [[6.2 for _ in range(2)]],
    'reg_B': [[0.1 for _ in range(2)]],
    'C': [randint(1, 100) / 1000],
    'C_1': ['text'],
    'C_2': [1],
    'C_3': [2],
    'C_4': [3],
    'C_5': [4],
    'C_6': ['text'],
})


def pq_to_np():
    for i in range(100_000):
        df1.to_numpy()


def run():
    for i in range(10_000):
        print(i)
        df2.select([pl.col('^reg.*$').list.eval(pl.element().round(3))])


Thread(target=pq_to_np).start()
Thread(target=run).start()

Log output

No response

Issue description

When all conditions in the reproducible example are met, a GIL deadlock occurs.
The print(i) output freezes.

Expected behavior

No GIL deadlock occurs.

Installed versions

Polars:              0.19.2
Index type:          UInt32
Platform:            Linux-6.3.7-060307-generic-x86_64-with-glibc2.35
Python:              3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
matplotlib:          3.7.2
numpy:               1.24.2
pandas:              2.0.3
pyarrow:             12.0.1
pydantic:            <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>

s-b90 · 2023-10-02T07:19:10Z

Hi guys, thanks for accepting this bug report.
Is there any chance this will be fixed in the near future?

s-b90 · 2023-10-04T14:17:45Z

For now, I've found a workaround.

import numpy as np

# instead of
df1.to_numpy()
# use
np.stack([df1[c].to_numpy() for c in df1.columns], axis=1)

It seems that series.to_numpy() executes in a single thread and does not cause the GIL deadlock. It is a bit slower than df.to_numpy(), but it works.
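The same workaround can be wrapped in a small function so it is easy to swap in for df.to_numpy(). This is only a sketch of the per-column approach above; the name `frame_to_numpy_by_column` is made up, and it assumes the frame has at least one column and that all columns can be stacked into one common dtype.

```python
import numpy as np


def frame_to_numpy_by_column(df):
    """Convert a DataFrame to a 2D array one column at a time.

    Hypothetical helper: `df` is expected to expose `.columns` and
    `df[name].to_numpy()`, as a Polars DataFrame does.
    """
    # Each column becomes a 1D array of length n_rows.
    columns = [df[name].to_numpy() for name in df.columns]
    # Stacking along axis=1 gives shape (n_rows, n_columns).
    # np.stack raises ValueError if there are no columns at all.
    return np.stack(columns, axis=1)
```

In the reproducible example, `df1.to_numpy()` inside `pq_to_np` would then become `frame_to_numpy_by_column(df1)`.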

qci-amos · 2023-10-04T14:23:36Z

For my 2D array, I've been able to get np.array(arr2d.rows()) to work as an alternative to arr2d.to_numpy().

s-b90 · 2023-10-04T14:36:55Z

Yep, that will also work, but it will be much slower because all the data is converted through Python.

from random import randint

import numpy as np
import polars as pl

df1 = pl.DataFrame({
    'A': [randint(1, 100) for _ in range(1000000)],
    'B': [randint(1, 100) for _ in range(1000000)],
    'C': [randint(1, 100) for _ in range(1000000)],
    'D': [randint(1, 100) for _ in range(1000000)],
})

%%timeit
np.array(df1.rows())
471 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
np.stack([df1[c].to_numpy() for c in df1.columns], axis=1)
10.4 ms ± 646 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df1.to_numpy()
3.15 ms ± 655 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
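The %%timeit cells above only run in IPython or Jupyter. Outside a notebook, a rough equivalent can be sketched with the standard timeit module. The helper name `best_time_ms` is made up for illustration, and it reports the best run rather than the mean ± std. dev. that %%timeit prints, so the numbers are not directly comparable.

```python
import timeit


def best_time_ms(func, number=10, repeat=5):
    """Return the best per-call time of func() in milliseconds.

    Hypothetical helper: calls func() `number` times per run,
    does `repeat` runs, and keeps the fastest run.
    """
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return min(totals) / number * 1000.0
```

With the df1 defined above, the three variants would be timed as `best_time_ms(lambda: np.array(df1.rows()))`, `best_time_ms(lambda: np.stack([df1[c].to_numpy() for c in df1.columns], axis=1))` and `best_time_ms(lambda: df1.to_numpy())`.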

qci-amos · 2023-10-04T16:55:44Z

Thank you! I'll use your version instead!

s-b90 · 2023-11-13T17:48:35Z

@orlp I've found that the same GIL deadlock also appears when combining read_parquet() and list.eval().
The problem occurs when you use several list.eval expressions in one select or with_columns.

from random import randint
from threading import Thread
import polars as pl

df1 = pl.DataFrame({
    'A': [randint(1, 100) for _ in range(10)],
    'B': [randint(1, 100) for _ in range(10)],
    'C': [randint(1, 100) for _ in range(10)],
    'D': [randint(1, 100) for _ in range(10)],
})
df1.write_parquet('test.pq', compression='zstd')

df2 = pl.DataFrame({
    'reg_A': [[6.2 for _ in range(2)]],
    'reg_B': [[0.1 for _ in range(2)]],
    'C': [randint(1, 100) / 1000],
    'C_1': ['text'],
    'C_2': [1],
    'C_3': [2],
    'C_4': [3],
    'C_5': [4],
    'C_6': ['text'],
})


def read_pq():
    for i in range(500_000):
        pl.read_parquet('test.pq')


def run():
    for i in range(100_000):
        print(i)
        df2.select([pl.col('^reg.*$').list.eval(pl.element().round(3))])


Thread(target=read_pq).start()
Thread(target=run).start()

Should I create a separate issue for this?

And a more important question: is it a bug at all?
I saw #5347, and I've read the multiprocessing page in the user guide.
But how should Polars behave with the threading library? Are there any restrictions?
Is it unexpected behavior that a GIL deadlock appears when using several threads with Polars?
Maybe you or @ritchie46 can provide some details here?

stinodego · 2024-05-22T18:03:15Z

I cannot reproduce this on the current Polars version, so I'm closing this as unable to reproduce. Please open a new issue with a reproducible example if you're still encountering a related issue.

s-b90 added the bug (Something isn't working) and python (Related to Python Polars) labels Sep 7, 2023

orlp changed the title ~~GIL in to_numpy and list.eval()~~ GIL deadlock in to_numpy and list.eval() Sep 7, 2023

orlp added the accepted (Ready for implementation) label Sep 8, 2023

qci-amos mentioned this issue Oct 3, 2023

Deadlocking with mulitple calls to concat in compute graph #9310

Open

2 tasks

stinodego added the P-medium (Priority: medium) label and removed the accepted (Ready for implementation) label Jan 12, 2024

stinodego closed this as not planned (won't fix, can't repro, duplicate, stale) May 22, 2024
