Quirk in polars expression division when numpy float in numerator #6666

Closed
2 tasks done
dylanhmorris opened this issue Feb 3, 2023 · 1 comment · Fixed by #6675
Labels
bug Something isn't working python Related to Python Polars

Comments


dylanhmorris commented Feb 3, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

When a numpy array or numpy float appears in a polars expression that creates a new column, division misbehaves if the numpy value is the numerator and a polars column is the denominator: the reciprocal of the correct answer is returned. Wrapping the numerator in pl.lit or casting it to a Python float avoids the problem. Multiplication (which is commutative) is unaffected, as is division with the polars column in the numerator. This suggests that the quirk has something to do with how numpy floats are treated as numerators in polars expression division.
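As a stopgap, the "cast the numerator" workaround can be wrapped in a small helper. This is only a sketch: `to_python_scalar` is a hypothetical name, not a polars or numpy function, and it relies on the numpy scalar method `.item()` to obtain the equivalent built-in Python value. Arrays and plain Python values are passed through untouched.

```python
import numpy as np


def to_python_scalar(value):
    """Hypothetical helper: turn a numpy scalar into a built-in Python scalar.

    Anything that is not a numpy scalar (Python numbers, arrays, ...) is
    returned unchanged.
    """
    if isinstance(value, np.generic):
        # np.generic is the base class of all numpy scalar types;
        # .item() returns the matching built-in Python object.
        return value.item()
    return value


numerator = to_python_scalar(np.float64(2.0))  # built-in float
int_numerator = to_python_scalar(np.int64(2))  # built-in int
```

The result can then be used as the numerator, for example `to_python_scalar(np.float64(2.0)) / pl.col("a")`, which is the same as the "works if cast numpy to float" case in the example below.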

Reproducible example

import polars as pl
import numpy as np

data = pl.DataFrame({
    "a": [0.5, 1.0, 2.0]
})

# examples of failures (compared to similar
# approaches that yield the desired result)
data.with_columns(
    [
        (np.float64(2.0) / pl.col("a")).alias(
            "Fails with float, yields reciprocal"),
        (np.array([2, 2, 2]) / pl.col("a")).alias(
            "Fails with array of same size "
            "as polars column"),
        (2.0 / pl.col("a")).alias(
            "works with regular float"),
        (float(np.float64(2.0)) / pl.col("a")).alias(
            "works if cast numpy to float"),
        (pl.lit(np.float64(2.0)) / pl.col("a")).alias(
            "works with polars literal")
    ])

# numpy floats work as expected in multiplication
data.with_columns(
    [
        (np.float64(2.0) * pl.col("a")).alias(
            "Works with multiplication"),
    ])

# numpy floats work as expected as denominators
data.with_columns(
    [
        (pl.col("a") / np.float64(2.0)).alias(
            "Works with division by "
            "numpy float"),
    ])


# numpy int in numerator throws error
data.with_columns(
    [
        (np.int64(2) / pl.col("a")).alias(
            "This throws an error"),
    ])


# python int in numerator behaves as expected
data.with_columns(
    [
        (int(2) / pl.col("a")).alias(
            "works with python int in numerator"),
    ])


# numpy array of ints gets coerced to float,
# without error, and then has same reciprocal
# issue
data.with_columns(
    [
        (np.array([2, 2, 2]).astype('int') / pl.col("a")).alias(
            "same float division quirk"),
    ])

Output:

>>> # examples of failures (compared to similar
>>> # approaches that yield the desired result)
>>> data.with_columns(
...     [
...         (np.float64(2.0) / pl.col("a")).alias(
...             "Fails with float, yields reciprocal"),
...         (np.array([2, 2, 2]) / pl.col("a")).alias(
...             "Fails with array of same size "
...             "as polars column"),
...         (2.0 / pl.col("a")).alias(
...             "works with regular float"),
...         (float(np.float64(2.0)) / pl.col("a")).alias(
...             "works if cast numpy to float"),
...         (pl.lit(np.float64(2.0)) / pl.col("a")).alias(
...             "works with polars literal")
...     ])
shape: (3, 6)
┌─────┬──────────────────────┬──────────────────────┬─────────────────────┬─────────────────────┬─────────────────────┐
│ a   ┆ Fails with float,    ┆ Fails with array of  ┆ works with regular  ┆ works if cast numpy ┆ works with polars   │
│ --- ┆ yields recipro...    ┆ same size as...      ┆ float               ┆ to float            ┆ literal             │
│ f64 ┆ ---                  ┆ ---                  ┆ ---                 ┆ ---                 ┆ ---                 │
│     ┆ f64                  ┆ f64                  ┆ f64                 ┆ f64                 ┆ f64                 │
╞═════╪══════════════════════╪══════════════════════╪═════════════════════╪═════════════════════╪═════════════════════╡
│ 0.5 ┆ 0.25                 ┆ 0.25                 ┆ 4.0                 ┆ 4.0                 ┆ 4.0                 │
│ 1.0 ┆ 0.5                  ┆ 0.5                  ┆ 2.0                 ┆ 2.0                 ┆ 2.0                 │
│ 2.0 ┆ 1.0                  ┆ 1.0                  ┆ 1.0                 ┆ 1.0                 ┆ 1.0                 │
└─────┴──────────────────────┴──────────────────────┴─────────────────────┴─────────────────────┴─────────────────────┘
>>> 
>>> # numpy floats work as expected in multiplication
>>> data.with_columns(
...     [
...         (np.float64(2.0) * pl.col("a")).alias(
...             "Works with multiplication"),
...     ])
shape: (3, 2)
┌─────┬───────────────────────────┐
│ a   ┆ Works with multiplication │
│ --- ┆ ---                       │
│ f64 ┆ f64                       │
╞═════╪═══════════════════════════╡
│ 0.5 ┆ 1.0                       │
│ 1.0 ┆ 2.0                       │
│ 2.0 ┆ 4.0                       │
└─────┴───────────────────────────┘
>>> 
>>> # numpy floats work as expected as denominators
>>> data.with_columns(
...     [
...         (pl.col("a") / np.float64(2.0)).alias(
...             "Works with division by "
...             "numpy float"),
...     ])
shape: (3, 2)
┌─────┬─────────────────────────────────────┐
│ a   ┆ Works with division by numpy flo... │
│ --- ┆ ---                                 │
│ f64 ┆ f64                                 │
╞═════╪═════════════════════════════════════╡
│ 0.5 ┆ 0.25                                │
│ 1.0 ┆ 0.5                                 │
│ 2.0 ┆ 1.0                                 │
└─────┴─────────────────────────────────────┘
>>> 
>>> 
>>> # numpy int in numerator throws error
>>> data.with_columns(
...     [
...         (np.int64(2) / pl.col("a")).alias(
...             "This throws an error"),
...     ])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "***/frame.py", line 5792, in with_columns
    self.lazy().with_columns(exprs, **named_exprs).collect(no_optimization=True)
  File "***/frame.py", line 1146, in collect
    return pli.wrap_df(ldf.collect())
exceptions.ComputeError: ValueError: Unsupported type <class 'numpy.int64'> for 2.
>>> 
>>> 
>>> # python int in numerator behaves as expected
>>> data.with_columns(
...     [
...         (int(2) / pl.col("a")).alias(
...             "works with python int in numerator"),
...     ])
shape: (3, 2)
┌─────┬─────────────────────────────────────┐
│ a   ┆ works with python int in numerat... │
│ --- ┆ ---                                 │
│ f64 ┆ f64                                 │
╞═════╪═════════════════════════════════════╡
│ 0.5 ┆ 4.0                                 │
│ 1.0 ┆ 2.0                                 │
│ 2.0 ┆ 1.0                                 │
└─────┴─────────────────────────────────────┘
>>> 
>>> 
>>> # numpy array of ints gets coerced to float,
>>> # without error, and then has same reciprocal
>>> # issue
>>> data.with_columns(
...     [
...         (np.array([2, 2, 2]).astype('int') / pl.col("a")).alias(
...             "same float division quirk"),
...     ])
shape: (3, 2)
┌─────┬───────────────────────────┐
│ a   ┆ same float division quirk │
│ --- ┆ ---                       │
│ f64 ┆ f64                       │
╞═════╪═══════════════════════════╡
│ 0.5 ┆ 0.25                      │
│ 1.0 ┆ 0.5                       │
│ 2.0 ┆ 1.0                       │
└─────┴───────────────────────────┘

Expected behavior

These alternatives all yield the expected behavior:

data.with_columns(
    [
        (2.0 / pl.col("a")).alias(
            "works with regular float"),
        (float(np.float64(2.0)) / pl.col("a")).alias(
            "works if cast numpy to float"),
        (pl.lit(np.float64(2.0)) / pl.col("a")).alias(
            "works with polars literal")
    ])
shape: (3, 4)
┌─────┬──────────────────────────┬──────────────────────────────┬───────────────────────────┐
│ a   ┆ works with regular float ┆ works if cast numpy to float ┆ works with polars literal │
│ --- ┆ ---                      ┆ ---                          ┆ ---                       │
│ f64 ┆ f64                      ┆ f64                          ┆ f64                       │
╞═════╪══════════════════════════╪══════════════════════════════╪═══════════════════════════╡
│ 0.5 ┆ 4.0                      ┆ 4.0                          ┆ 4.0                       │
│ 1.0 ┆ 2.0                      ┆ 2.0                          ┆ 2.0                       │
│ 2.0 ┆ 1.0                      ┆ 1.0                          ┆ 1.0                       │
└─────┴──────────────────────────┴──────────────────────────────┴───────────────────────────┘

Installed versions

---Version info---
Polars: 0.16.2
Index type: UInt32
Platform: macOS-12.5.1-x86_64-i386-64bit
Python: 3.10.9 (main, Dec  7 2022, 02:03:23) [Clang 13.0.0 (clang-1300.0.29.30)]
---Optional dependencies---
pyarrow: 9.0.0
pandas: 1.4.4
numpy: 1.24.1
fsspec: <not installed>
connectorx: <not installed>
xlsx2csv: 0.8
deltalake: <not installed>
matplotlib: 3.6.1
@dylanhmorris dylanhmorris added bug Something isn't working python Related to Python Polars labels Feb 3, 2023
@dylanhmorris dylanhmorris changed the title Quirk in polars expression division when numpy scalar in numerator Quirk in polars expression division when numpy float in numerator Feb 3, 2023
Collaborator

zundertj commented Feb 4, 2023

This behaviour occurs because the division ends up calling Expr.__array_ufunc__. Taking the first example,

data.with_columns(np.float64(2.) / pl.col("a"))

this unrolls into:

s = pl.Series("a", [0.5, 1.0, 2.0])
pl.col("a").map(divide(s, 2))

That is, the operands end up the other way around. The ufunc takes priority over Expr.__rtruediv__; calling Expr.__rtruediv__ directly works but is obviously not ideal.
It also explains why wrapping in a literal works: that avoids the ufunc.

I will think of a way to fix this. It seems we could fix it by not ignoring the position of the expression in the argument list.
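The mechanism can be reproduced without polars. The sketch below uses two toy classes (`PositionBlindExpr` and `PositionAwareExpr` are hypothetical names, not polars code) to show how numpy hands `np.float64(2.0) / obj` to `obj.__array_ufunc__`, and how ignoring the object's position among the ufunc inputs produces the reciprocal. It is an illustration of the idea only, not the actual fix.

```python
import numpy as np


class PositionBlindExpr:
    """Toy stand-in for an expression (hypothetical, not polars code)."""

    def __init__(self, values):
        self.values = np.asarray(values, dtype=float)

    def __rtruediv__(self, other):
        # What Python would call for `other / self` if numpy deferred.
        return other / self.values

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Illustrates the bug: self always becomes the first argument,
        # regardless of where it appeared in `inputs`.
        others = [x for x in inputs if x is not self]
        return ufunc(self.values, *others, **kwargs)


class PositionAwareExpr(PositionBlindExpr):
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Substitute the underlying values in place, preserving order.
        args = [x.values if x is self else x for x in inputs]
        return ufunc(*args, **kwargs)


col = [0.5, 1.0, 2.0]

# numpy dispatches to __array_ufunc__, so __rtruediv__ is never reached.
blind = np.float64(2.0) / PositionBlindExpr(col)  # [0.25, 0.5, 1.0]
aware = np.float64(2.0) / PositionAwareExpr(col)  # [4.0, 2.0, 1.0]
```

Multiplication gives the same result with either class because swapping the operands does not change the product, which matches the observation in the report.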
