Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Polars hangs for certain math ops inside a when expression #15975

Closed
2 tasks done
mcrumiller opened this issue Apr 30, 2024 · 6 comments · Fixed by #15995
Closed
2 tasks done

Polars hangs for certain math ops inside a when expression #15975

mcrumiller opened this issue Apr 30, 2024 · 6 comments · Fixed by #15995
Assignees
Labels
accepted Ready for implementation bug Something isn't working P-high Priority: high python Related to Python Polars

Comments

@mcrumiller
Copy link
Contributor

mcrumiller commented Apr 30, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Issue description

When the following expression is run, polars hangs indefinitely. This only occurs when int & float are used in conjunction as in the example, and only when inside a when/then clause. If the when is removed, the expected result is shown. There is probably a simpler example but I haven't found one so far.

import polars as pl
from polars import col, when

df = pl.DataFrame({"a": [1, 2, 3]})

df.with_columns(
    # hangs indefinitely
    when(True).then(1**col("a") + 1.0*col("a"))

    # succeeds
    (1**col("a") + 1.0*col("a"))
)

Installed versions

--------Version info---------
Polars:               0.20.23
Index type:           UInt32 
Platform:             Windows-10-10.0.19045-SP0
Python:               3.11.7 (tags/v3.11.7:fa7a6f2, Dec  4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  0.11.0
cloudpickle:          <not installed>
connectorx:           0.3.2
deltalake:            <not installed>
fastexcel:            0.10.4
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
nest_asyncio:         1.5.8
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.1.4
pyarrow:              16.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.23
xlsx2csv:             0.8.2
xlsxwriter:           3.1.9
@mcrumiller mcrumiller added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Apr 30, 2024
@orlp
Copy link
Collaborator

orlp commented Apr 30, 2024

Can not reproduce on Apple M2, Polars 0.20.21, can reproduce on 0.20.23. So a recent bug.

@orlp orlp added P-high Priority: high and removed needs triage Awaiting prioritization by a maintainer labels Apr 30, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Apr 30, 2024
@orlp orlp self-assigned this Apr 30, 2024
@orlp
Copy link
Collaborator

orlp commented Apr 30, 2024

It hangs in the optimizer, likely due to the new Python integer schema resolving steps.

@orlp
Copy link
Collaborator

orlp commented Apr 30, 2024

I will leave this one to @ritchie46 , he did those resolving steps and a simple fix causes other tests to fail. My research so far is below.

There is an infinite loop because in crates/polars-plan/src/logical_plan/optimizer/type_coercion/binary.rs we compute a supertype, and then insert casts when necessary:

let new_node_left = if type_left != st {
    expr_arena.add(AExpr::Cast {
        expr: node_left,
        data_type: st.clone(),
        strict: false,
    })
} else {
    node_left
};

However, the PartialEq implementation of DataType in crates/polars-core/src/datatypes/dtype.rs is incorrect, it needs the following match added:

    // Only consider unknown types equal if their inner types match.
    (Unknown(l), Unknown(r)) => l == r,
}

Right now any comparison between two Unknown types will succeed, even if one is integer and the other is float. This results in no substitution and thus an infinite loop.

Also, the supertype we compute is likely not correct, as in crates/polars-core/src/utils/supertype.rs we have the following match which makes no sense to me:

            (dt, Unknown(kind)) => {
                match kind {
                    UnknownKind::Float | UnknownKind::Int(_) if dt.is_float() | dt.is_string() => Some(dt.clone()),

This results in the supertype of UnknownKind::Int(1) and UnknownKind::Float being UnknownKind::Int(1).

@orlp orlp removed their assignment Apr 30, 2024
@mcrumiller
Copy link
Contributor Author

Thanks for the detailed analysis @orlp. I can easily get around this for now by simply using "1.0" instead of "1" so on my side it's not pressing.

@bralvarenga
Copy link

I have a similar problem with when/then clauses mixing integers and floats. Example:

df = pl.DataFrame({"foo": [1, 3, 4], "bar": [3, 4, 0]})

df = df.with_columns(
    pl.when(pl.col("foo") == 1)
    .then(1)
    .when(pl.col("foo") == 2)
    .then(4)
    .when(pl.col("foo") == 3)
    .then(1.5)
    .when(pl.col("foo") == 4)
    .then(16)
    .otherwise(0)
    .alias("val")
)

returns:

foo	bar	val
i64	i64	i32
1	3	1
3	4	1
4	0	16

Instead of:

foo	bar	val
i64	i64	i32
1	3	1
3	4	1.5
4	0	16

@ritchie46
Copy link
Member

main issue fixed by #15992

@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog May 2, 2024
@c-peters c-peters added the accepted Ready for implementation label May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation bug Something isn't working P-high Priority: high python Related to Python Polars
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants