
Casting of boolean values #15102

Closed
2 tasks done
cbilot opened this issue Mar 17, 2024 · 4 comments · Fixed by #15114
Labels
bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments


cbilot commented Mar 17, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Let's start with a simple data frame, use sum_horizontal to count the null values in each row, and then tally how often each count occurs:

import polars as pl
import polars.selectors as cs

data = {
    "x": [None, 'two', None],
    "y": ['one', 'two', None],
}
pl.from_dict(data)

(
    pl.from_dict(data)
    .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
    .group_by("nbr_nulls")
    .len()
)

I get the results I expect:

>>> pl.from_dict(data)
shape: (3, 2)
┌──────┬──────┐
│ x    ┆ y    │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ null ┆ one  │
│ two  ┆ two  │
│ null ┆ null │
└──────┴──────┘
>>> (
...     pl.from_dict(data)
...     .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
...     .group_by("nbr_nulls")
...     .len()
... 
... )
DATAFRAME < 1000 rows: running default HASH AGGREGATION
shape: (3, 2)
┌───────────┬─────┐
│ nbr_nulls ┆ len │
│ ---       ┆ --- │
│ u32       ┆ u32 │
╞═══════════╪═════╡
│ 0         ┆ 1   │
│ 2         ┆ 1   │
│ 1         ┆ 1   │
└───────────┴─────┘

Now, let's eliminate column y, and re-run. Note how the dtype of the nbr_nulls column changes to boolean:

data = {
    "x": [None, 'two', None],
}
pl.from_dict(data)

(
    pl.from_dict(data)
    .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
    .group_by("nbr_nulls")
    .len()
)
>>> pl.from_dict(data)
shape: (3, 1)
┌──────┐
│ x    │
│ ---  │
│ str  │
╞══════╡
│ null │
│ two  │
│ null │
└──────┘
>>> (
...     pl.from_dict(data)
...     .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
...     .group_by("nbr_nulls")
...     .len()
... 
... )
DATAFRAME < 1000 rows: running default HASH AGGREGATION
shape: (2, 2)
┌───────────┬─────┐
│ nbr_nulls ┆ len │
│ ---       ┆ --- │
│ bool      ┆ u32 │
╞═══════════╪═════╡
│ true      ┆ 2   │
│ false     ┆ 1   │
└───────────┴─────┘

Log output

No response

Issue description

Things get somewhat weirder when we introduce lazy mode. Let's perform the same type of operation with a somewhat larger data frame. In eager mode, the following gives the results I expect:

nbr_records = 334
data = {
    "x": [None, 'two', None] * nbr_records,
    "y": ['one', 'two', None] * nbr_records,
    "z": [None, 'two', None] * nbr_records,
}
pl.from_dict(data)
(
    pl.from_dict(data)
    # .lazy()
    .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
    .group_by("nbr_nulls")
    .len()
    # .collect()
)
>>> pl.from_dict(data)
shape: (1_002, 3)
┌──────┬──────┬──────┐
│ x    ┆ y    ┆ z    │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ null ┆ one  ┆ null │
│ two  ┆ two  ┆ two  │
│ null ┆ null ┆ null │
│ null ┆ one  ┆ null │
│ two  ┆ two  ┆ two  │
│ …    ┆ …    ┆ …    │
│ two  ┆ two  ┆ two  │
│ null ┆ null ┆ null │
│ null ┆ one  ┆ null │
│ two  ┆ two  ┆ two  │
│ null ┆ null ┆ null │
└──────┴──────┴──────┘
>>> (
...     pl.from_dict(data)
...     # .lazy()
...     .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
...     .group_by("nbr_nulls")
...     .len()
...     # .collect()
... 
... )
estimated unique values: 3
run STREAMING HASH AGGREGATION
RUN STREAMING PIPELINE
df -> primitive_group_by -> ordered_sink
RefCell { value: [] }
shape: (3, 2)
┌───────────┬─────┐
│ nbr_nulls ┆ len │
│ ---       ┆ --- │
│ u32       ┆ u32 │
╞═══════════╪═════╡
│ 2         ┆ 334 │
│ 3         ┆ 334 │
│ 0         ┆ 334 │
└───────────┴─────┘

Now, let's put the data frame in lazy mode, and conduct the same operation. Note how the dtype of nbr_nulls becomes boolean.

nbr_records = 334
data = {
    "x": [None, 'two', None] * nbr_records,
    "y": ['one', 'two', None] * nbr_records,
    "z": [None, 'two', None] * nbr_records,
}
pl.from_dict(data)
(
    pl.from_dict(data)
    .lazy()
    .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
    .group_by("nbr_nulls")
    .len()
    .collect()
)
>>> pl.from_dict(data)
shape: (1_002, 3)
┌──────┬──────┬──────┐
│ x    ┆ y    ┆ z    │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ null ┆ one  ┆ null │
│ two  ┆ two  ┆ two  │
│ null ┆ null ┆ null │
│ null ┆ one  ┆ null │
│ two  ┆ two  ┆ two  │
│ …    ┆ …    ┆ …    │
│ two  ┆ two  ┆ two  │
│ null ┆ null ┆ null │
│ null ┆ one  ┆ null │
│ two  ┆ two  ┆ two  │
│ null ┆ null ┆ null │
└──────┴──────┴──────┘
>>> (
...     pl.from_dict(data)
...     .lazy()
...     .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
...     .group_by("nbr_nulls")
...     .len()
...     .collect()
... 
... )
estimated unique values: 3
run STREAMING HASH AGGREGATION
RUN STREAMING PIPELINE
df -> generic-group_by -> ordered_sink
RefCell { value: [] }
finish streaming aggregation with local in-memory table
shape: (3, 2)
┌───────────┬─────┐
│ nbr_nulls ┆ len │
│ ---       ┆ --- │
│ bool      ┆ u32 │
╞═══════════╪═════╡
│ false     ┆ 334 │
│ false     ┆ 334 │
│ false     ┆ 334 │
└───────────┴─────┘

Now let's change the nbr_records variable from 334 to 333, and re-run in lazy mode. The dtype of nbr_nulls switches back to u32.

nbr_records = 333
data = {
    "x": [None, 'two', None] * nbr_records,
    "y": ['one', 'two', None] * nbr_records,
    "z": [None, 'two', None] * nbr_records,
}
pl.from_dict(data)
(
    pl.from_dict(data)
    .lazy()
    .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
    .group_by("nbr_nulls")
    .len()
    .collect()
)
>>> nbr_records = 333
>>> data = {
...     "x": [None, 'two', None] * nbr_records,
...     "y": ['one', 'two', None] * nbr_records,
...     "z": [None, 'two', None] * nbr_records,
... 
... }
>>> pl.from_dict(data)
shape: (999, 3)
┌──────┬──────┬──────┐
│ x    ┆ y    ┆ z    │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ null ┆ one  ┆ null │
│ two  ┆ two  ┆ two  │
│ null ┆ null ┆ null │
│ null ┆ one  ┆ null │
│ two  ┆ two  ┆ two  │
│ …    ┆ …    ┆ …    │
│ two  ┆ two  ┆ two  │
│ null ┆ null ┆ null │
│ null ┆ one  ┆ null │
│ two  ┆ two  ┆ two  │
│ null ┆ null ┆ null │
└──────┴──────┴──────┘
>>> (
...     pl.from_dict(data)
...     .lazy()
...     .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
...     .group_by("nbr_nulls")
...     .len()
...     .collect()
... 
... )
DATAFRAME < 1000 rows: running default HASH AGGREGATION
shape: (3, 2)
┌───────────┬─────┐
│ nbr_nulls ┆ len │
│ ---       ┆ --- │
│ u32       ┆ u32 │
╞═══════════╪═════╡
│ 0         ┆ 333 │
│ 3         ┆ 333 │
│ 2         ┆ 333 │
└───────────┴─────┘

From the log, it seems that Polars' engine is using a different optimization, and arriving at a different dtype for nbr_nulls.

Expected behavior

I would expect the dtype of nbr_nulls to remain the same.

I realize that summing boolean values (without first casting) may not be best practice. But I didn't expect column dtypes to shift between eager and lazy mode, nor based on the number of columns involved. (Indeed, when I first discovered these shifting dtypes in a much larger query, it was rather cumbersome to replicate with simple examples.)

Installed versions

--------Version info---------
Polars:               0.20.15
Index type:           UInt32
Platform:             Linux-6.5.0-25-generic-x86_64-with-glibc2.35
Python:               3.11.8 (main, Feb 25 2024, 16:39:33) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  0.10.0
cloudpickle:          3.0.0
connectorx:           0.3.2
deltalake:            0.16.0
fastexcel:            0.9.1
fsspec:               2023.12.2
gevent:               24.2.1
hvplot:               0.9.2
matplotlib:           <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              15.0.1
pydantic:             2.6.4
pyiceberg:            0.6.0
pyxlsb:               <not installed>
sqlalchemy:           2.0.28
xlsx2csv:             0.8.2
xlsxwriter:           3.2.0
cbilot added the bug, needs triage, and python labels on Mar 17, 2024
@mcrumiller (Contributor)

Can you summarize the actual issue before diving into all of your examples, and perhaps improve your title to describe the issue in one phrase? It's very hard to follow a "let's start with X" followed by a long example when we don't even know what type of error to expect, and your issue description only says that "things get weirder when we move to lazy mode".

@mcrumiller (Contributor) commented Mar 17, 2024

OK, I went through your examples and I see the issue: grouping by a horizontally summed series of boolean columns returns the original dtype (Boolean) of the grouped columns when the number of records is large enough.

I can't replicate the 333 threshold: for me, the issue occurs once there are more than 4 records:

import polars as pl

nbr_records = 4 # does not cause issue
# nbr_records = 5 # causes issue
data = {
    "x": [None, "two", None] * nbr_records,
    "y": ["one", "two", None] * nbr_records,
    "z": [None, "two", None] * nbr_records,
}

df = pl.DataFrame(data)

lf = df.lazy().select(
    pl.sum_horizontal(pl.all().is_null()).alias("num_null")
).group_by("num_null").len()

print(lf.collect())
shape: (3, 2)
┌──────────┬─────┐
│ num_null ┆ len │
│ ---      ┆ --- │
│ bool     ┆ u32 │
╞══════════╪═════╡
│ false    ┆ 10  │
│ false    ┆ 10  │
│ false    ┆ 10  │
└──────────┴─────┘

@mcrumiller (Contributor) commented Mar 17, 2024

I think the root issue is that pl.sum_horizontal() on a boolean column returns bool if we only have one column:

import polars as pl

df = pl.DataFrame({
    "a": [True, False, True],
})

df.select(pl.sum_horizontal(pl.col("a")))
shape: (3, 1)
┌───────┐
│ a     │
│ ---   │
│ bool  │
╞═══════╡
│ true  │
│ false │
│ true  │
└───────┘

Most likely the output dtype for sum_horizontal is not being properly converted to an integer type.

Edit: also seeing this for mean_horizontal.

@mcrumiller (Contributor)

Seeing a few more issues, both in the schema plan and the execution:

import polars as pl

lf = pl.LazyFrame({"a": [True, False, True]})
a = pl.col("a")

lf.select(pl.mean_horizontal(a)).schema    # OrderedDict({'a': Float64})
lf.select(pl.mean_horizontal(a)).collect() # Boolean

lf.select(pl.sum_horizontal(a)).schema     # OrderedDict({'a': Boolean})
lf.select(pl.sum_horizontal(a)).collect()  # Boolean
