`100 * pl.col("a").mean()` returns `0` when `a` is a boolean column #7651

CameronBieganek · 2023-03-19T22:07:42Z

Polars version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Issue description

Multiplying the output of pl.col("a").mean() by 100, where "a" is a boolean column, results in 0.

Reproducible example

In [1]: import polars as pl

In [2]: pl.__version__
Out[2]: '0.16.14'

In [3]: df = pl.DataFrame({"a": [False, True, False]}); df
Out[3]: 
shape: (3, 1)
┌───────┐
│ a     │
│ ---   │
│ bool  │
╞═══════╡
│ false │
│ true  │
│ false │
└───────┘

In [4]: df.select(pl.col("a").mean())
Out[4]: 
shape: (1, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ 0.333333 │
└──────────┘

In [5]: df.select(100.0 * pl.col("a").mean())
Out[5]: 
shape: (1, 1)
┌───────────┐
│ literal   │
│ ---       │
│ f64       │
╞═══════════╡
│ 33.333333 │
└───────────┘

In [6]: df.select(100 * pl.col("a").mean())
Out[6]: 
shape: (1, 1)
┌─────────┐
│ literal │
│ ---     │
│ i32     │
╞═════════╡
│ 0       │
└─────────┘

Expected behavior

Output should be 33.3333.

Installed versions

In [7]: pl.show_versions()
---Version info---
Polars: 0.16.14
Index type: UInt32
Platform: macOS-10.16-x86_64-i386-64bit
Python: 3.9.12 (main, Apr  5 2022, 01:53:17) 
[Clang 12.0.0 ]
---Optional dependencies---
numpy: 1.21.5
pandas: 1.4.2
pyarrow: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fsspec: 2022.02.0
matplotlib: 3.5.1
xlsx2csv: <not installed>
xlsxwriter: 3.0.3

The text was updated successfully, but these errors were encountered:

slonik-az · 2023-03-20T14:10:43Z

Uneducated guess: pl.col("a").mean() which is 0.333_f64 was cast to i32 and became zero. Looks like a bug to me. In mixed-type arithmetic the operands should be cast to the type with largest dynamic range. In this case 100 should be promoted to f64 and the result should remain f64.

advoet · 2023-03-28T01:50:18Z

Seems like the casting is as you suspected, @slonik-az

ldf = pl.DataFrame({"a": [True, True, False]}).lazy()

ldf.select(pl.col("a").mean() * 100).describe_optimized_plan()
# ' SELECT [[(col("a").mean().cast(Int32)) * (100)]] FROM\n  DF ["a"]; PROJECT 1/1 COLUMNS; SELECTION: "None"'

ldf.select(pl.col("a").mean() * 100.).describe_optimized_plan()
# ' SELECT [[(col("a").mean().cast(Float64)) * (100.0)]] FROM\n  DF ["a"]; PROJECT 1/1 COLUMNS; SELECTION: "None"'

I'm probably not the best person for this fix, but may continue to poke at it as an excuse to read some more code.

Here's where I got:

In polars/polars-lazy/polars-plan/src/logical_plan/optimizer/type_coercion/binary.rs::process_binary

The function get_aexpr_and_type returns (Agg(Mean(Node(0))), bool) and (Literal(100), i32).

Seems like Agg(Mean(Node)) should (always?) be f64, rather than bool

advoet · 2023-03-31T06:47:54Z

Current:

// polars/polars/polars-lazy/polars-plan/src/logical_plan/aexpr/schema.rs
impl AExpr {
    pub fn to_field(
        match self {
            Agg(agg) => {
                Mean(expr) => {
                    let mut field = ... //comes in as Boolean type via Col
                    float_type(&mut field) // Remains Boolean: DataType::Boolean.is_numeric() == False
                    Ok(field)

To Fix:

Basic working code

// polars/polars/polars-lazy/polars-plan/src/logical_plan/aexpr/schema.rs
impl AExpr {
    pub fn to_field(
        match self {
            Agg(agg) => {
                Mean(expr) => {
                    let mut field = ... //comes in as Boolean type via Col
                    if matches!(&field.dtype, DataType::Boolean) {
                        field.coerce(DataType::Float64);
                    } else {
                        float_type(&mut field);
                    }
                    Ok(field)

Other Comment - Lazy Median(Boolean) cannot infer dtype

Without knowing the data, impossible to know ahead of time if it should be bool or float

pl.DataFrame({"a": [True, True, False, False]}).lazy().select(pl.col("a").median()).dtypes
# [Boolean]
pl.DataFrame({"a": [True, True, False, False]}).lazy().select(pl.col("a").median()).collect().dtypes
# [Float64]

Is the lazy.dtypes field an important guarantee?
Currently evaluates to float64, might as well be float32 (?)
Why? I guess you might use the median of bools to say "if more than half..."

Wainberg · 2023-12-31T19:05:50Z

@ritchie46 could you accept this issue as well?

CameronBieganek added bug Something isn't working python Related to Python Polars labels Mar 19, 2023

zundertj mentioned this issue Apr 9, 2023

Wrong result when multiplying an expression by a literal #7988

Closed

2 tasks

stinodego added the needs triage Awaiting prioritization by a maintainer label Jan 13, 2024

Wainberg mentioned this issue Jan 14, 2024

fix(rust): make 100 * pl.col(pl.Boolean).mean() work #13725

Merged

stinodego added P-high Priority: high and removed needs triage Awaiting prioritization by a maintainer labels Jan 18, 2024

ritchie46 closed this as completed in #13725 Jan 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`100 * pl.col("a").mean()` returns `0` when `a` is a boolean column #7651

`100 * pl.col("a").mean()` returns `0` when `a` is a boolean column #7651

CameronBieganek commented Mar 19, 2023

slonik-az commented Mar 20, 2023

advoet commented Mar 28, 2023

advoet commented Mar 31, 2023

Wainberg commented Dec 31, 2023

100 * pl.col("a").mean() returns 0 when a is a boolean column #7651

100 * pl.col("a").mean() returns 0 when a is a boolean column #7651

Comments

CameronBieganek commented Mar 19, 2023

Polars version checks

Issue description

Reproducible example

Expected behavior

Installed versions

slonik-az commented Mar 20, 2023

advoet commented Mar 28, 2023

advoet commented Mar 31, 2023

Current:

To Fix:

Other Comment - Lazy Median(Boolean) cannot infer dtype

Wainberg commented Dec 31, 2023

`100 * pl.col("a").mean()` returns `0` when `a` is a boolean column #7651

`100 * pl.col("a").mean()` returns `0` when `a` is a boolean column #7651