Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

100 * pl.col("a").mean() returns 0 when a is a boolean column #7651

Closed
2 tasks done
CameronBieganek opened this issue Mar 19, 2023 · 4 comments · Fixed by #13725
Closed
2 tasks done

100 * pl.col("a").mean() returns 0 when a is a boolean column #7651

CameronBieganek opened this issue Mar 19, 2023 · 4 comments · Fixed by #13725
Labels
bug Something isn't working P-high Priority: high python Related to Python Polars

Comments

@CameronBieganek
Copy link

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

Multiplying the output of pl.col("a").mean() by 100, where "a" is a boolean column, results in 0.

Reproducible example

In [1]: import polars as pl

In [2]: pl.__version__
Out[2]: '0.16.14'

In [3]: df = pl.DataFrame({"a": [False, True, False]}); df
Out[3]: 
shape: (3, 1)
┌───────┐
│ a     │
│ ---   │
│ bool  │
╞═══════╡
│ false │
│ true  │
│ false │
└───────┘

In [4]: df.select(pl.col("a").mean())
Out[4]: 
shape: (1, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ 0.333333 │
└──────────┘

In [5]: df.select(100.0 * pl.col("a").mean())
Out[5]: 
shape: (1, 1)
┌───────────┐
│ literal   │
│ ---       │
│ f64       │
╞═══════════╡
│ 33.333333 │
└───────────┘

In [6]: df.select(100 * pl.col("a").mean())
Out[6]: 
shape: (1, 1)
┌─────────┐
│ literal │
│ ---     │
│ i32     │
╞═════════╡
│ 0       │
└─────────┘

Expected behavior

Output should be 33.3333.

Installed versions

In [7]: pl.show_versions()
---Version info---
Polars: 0.16.14
Index type: UInt32
Platform: macOS-10.16-x86_64-i386-64bit
Python: 3.9.12 (main, Apr  5 2022, 01:53:17) 
[Clang 12.0.0 ]
---Optional dependencies---
numpy: 1.21.5
pandas: 1.4.2
pyarrow: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fsspec: 2022.02.0
matplotlib: 3.5.1
xlsx2csv: <not installed>
xlsxwriter: 3.0.3
@CameronBieganek CameronBieganek added bug Something isn't working python Related to Python Polars labels Mar 19, 2023
@slonik-az
Copy link
Contributor

Uneducated guess: pl.col("a").mean() which is 0.333_f64 was cast to i32 and became zero. Looks like a bug to me. In mixed-type arithmetic the operands should be cast to the type with largest dynamic range. In this case 100 should be promoted to f64 and the result should remain f64.

@advoet
Copy link
Contributor

advoet commented Mar 28, 2023

Seems like the casting is as you suspected, @slonik-az

ldf = pl.DataFrame({"a": [True, True, False]}).lazy()

ldf.select(pl.col("a").mean() * 100).describe_optimized_plan()
# ' SELECT [[(col("a").mean().cast(Int32)) * (100)]] FROM\n  DF ["a"]; PROJECT 1/1 COLUMNS; SELECTION: "None"'

ldf.select(pl.col("a").mean() * 100.).describe_optimized_plan()
# ' SELECT [[(col("a").mean().cast(Float64)) * (100.0)]] FROM\n  DF ["a"]; PROJECT 1/1 COLUMNS; SELECTION: "None"'

I'm probably not the best person for this fix, but may continue to poke at it as an excuse to read some more code.

Here's where I got:

In polars/polars-lazy/polars-plan/src/logical_plan/optimizer/type_coercion/binary.rs::process_binary

The function get_aexpr_and_type returns (Agg(Mean(Node(0))), bool) and (Literal(100), i32).

Seems like Agg(Mean(Node)) should (always?) be f64, rather than bool

@advoet
Copy link
Contributor

advoet commented Mar 31, 2023

Current:

// polars/polars/polars-lazy/polars-plan/src/logical_plan/aexpr/schema.rs
impl AExpr {
    pub fn to_field(
        match self {
            Agg(agg) => {
                Mean(expr) => {
                    let mut field = ... //comes in as Boolean type via Col
                    float_type(&mut field) // Remains Boolean: DataType::Boolean.is_numeric() == False
                    Ok(field)

To Fix:

Basic working code

// polars/polars/polars-lazy/polars-plan/src/logical_plan/aexpr/schema.rs
impl AExpr {
    pub fn to_field(
        match self {
            Agg(agg) => {
                Mean(expr) => {
                    let mut field = ... //comes in as Boolean type via Col
                    if matches!(&field.dtype, DataType::Boolean) {
                        field.coerce(DataType::Float64);
                    } else {
                        float_type(&mut field);
                    }
                    Ok(field)

Other Comment - Lazy Median(Boolean) cannot infer dtype

  • Without knowing the data, impossible to know ahead of time if it should be bool or float
pl.DataFrame({"a": [True, True, False, False]}).lazy().select(pl.col("a").median()).dtypes
# [Boolean]
pl.DataFrame({"a": [True, True, False, False]}).lazy().select(pl.col("a").median()).collect().dtypes
# [Float64]
  • Is the lazy.dtypes field an important guarantee?
  • Currently evaluates to float64, might as well be float32 (?)
  • Why? I guess you might use the median of bools to say "if more than half..."

@Wainberg
Copy link
Contributor

@ritchie46 could you accept this issue as well?

@stinodego stinodego added the needs triage Awaiting prioritization by a maintainer label Jan 13, 2024
@stinodego stinodego added P-high Priority: high and removed needs triage Awaiting prioritization by a maintainer labels Jan 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working P-high Priority: high python Related to Python Polars
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants