
Casting of boolean values #15102

Closed
2 tasks done
cbilot opened this issue Mar 17, 2024 · 4 comments · Fixed by #15114
Labels
bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments


cbilot commented Mar 17, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Let's start with a simple data frame, use sum_horizontal to count the null values in each row, and then tally how often each count occurs:

import polars as pl
import polars.selectors as cs

data = {
    "x": [None, 'two', None],
    "y": ['one', 'two', None],
}
pl.from_dict(data)

(
    pl.from_dict(data)
    .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
    .group_by("nbr_nulls")
    .len()
)

I get the results I expect:

>>> pl.from_dict(data)
shape: (3, 2)
┌──────┬──────┐
│ x    ┆ y    │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ null ┆ one  │
│ two  ┆ two  │
│ null ┆ null │
└──────┴──────┘
>>> (
...     pl.from_dict(data)
...     .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
...     .group_by("nbr_nulls")
...     .len()
... 
... )
DATAFRAME < 1000 rows: running default HASH AGGREGATION
shape: (3, 2)
┌───────────┬─────┐
│ nbr_nulls ┆ len │
│ ---       ┆ --- │
│ u32       ┆ u32 │
╞═══════════╪═════╡
│ 0         ┆ 1   │
│ 2         ┆ 1   │
│ 1         ┆ 1   │
└───────────┴─────┘

Now, let's eliminate column y, and re-run. Note how the dtype of the nbr_nulls column changes to boolean:

data = {
    "x": [None, 'two', None],
}
pl.from_dict(data)

(
    pl.from_dict(data)
    .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
    .group_by("nbr_nulls")
    .len()
)
>>> pl.from_dict(data)
shape: (3, 1)
┌──────┐
│ x    │
│ ---  │
│ str  │
╞══════╡
│ null │
│ two  │
│ null │
└──────┘
>>> (
...     pl.from_dict(data)
...     .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
...     .group_by("nbr_nulls")
...     .len()
... 
... )
DATAFRAME < 1000 rows: running default HASH AGGREGATION
shape: (2, 2)
┌───────────┬─────┐
│ nbr_nulls ┆ len │
│ ---       ┆ --- │
│ bool      ┆ u32 │
╞═══════════╪═════╡
│ true      ┆ 2   │
│ false     ┆ 1   │
└───────────┴─────┘

Log output

No response

Issue description

Things get somewhat weirder when we introduce lazy mode. Let's perform the same type of operation with a somewhat larger data frame. In eager mode, the following gives the results I expect:

nbr_records = 334
data = {
    "x": [None, 'two', None] * nbr_records,
    "y": ['one', 'two', None] * nbr_records,
    "z": [None, 'two', None] * nbr_records,
}
pl.from_dict(data)
(
    pl.from_dict(data)
    # .lazy()
    .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
    .group_by("nbr_nulls")
    .len()
    # .collect()
)
>>> pl.from_dict(data)
shape: (1_002, 3)
┌──────┬──────┬──────┐
│ x    ┆ y    ┆ z    │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ null ┆ one  ┆ null │
│ two  ┆ two  ┆ two  │
│ null ┆ null ┆ null │
│ null ┆ one  ┆ null │
│ two  ┆ two  ┆ two  │
│ …    ┆ …    ┆ …    │
│ two  ┆ two  ┆ two  │
│ null ┆ null ┆ null │
│ null ┆ one  ┆ null │
│ two  ┆ two  ┆ two  │
│ null ┆ null ┆ null │
└──────┴──────┴──────┘
>>> (
...     pl.from_dict(data)
...     # .lazy()
...     .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
...     .group_by("nbr_nulls")
...     .len()
...     # .collect()
... 
... )
estimated unique values: 3
run STREAMING HASH AGGREGATION
RUN STREAMING PIPELINE
df -> primitive_group_by -> ordered_sink
RefCell { value: [] }
shape: (3, 2)
┌───────────┬─────┐
│ nbr_nulls ┆ len │
│ ---       ┆ --- │
│ u32       ┆ u32 │
╞═══════════╪═════╡
│ 2         ┆ 334 │
│ 3         ┆ 334 │
│ 0         ┆ 334 │
└───────────┴─────┘

Now, let's put the data frame in lazy mode, and conduct the same operation. Note how the dtype of nbr_nulls becomes boolean.

nbr_records = 334
data = {
    "x": [None, 'two', None] * nbr_records,
    "y": ['one', 'two', None] * nbr_records,
    "z": [None, 'two', None] * nbr_records,
}
pl.from_dict(data)
(
    pl.from_dict(data)
    .lazy()
    .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
    .group_by("nbr_nulls")
    .len()
    .collect()
)
>>> pl.from_dict(data)
shape: (1_002, 3)
┌──────┬──────┬──────┐
│ x    ┆ y    ┆ z    │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ null ┆ one  ┆ null │
│ two  ┆ two  ┆ two  │
│ null ┆ null ┆ null │
│ null ┆ one  ┆ null │
│ two  ┆ two  ┆ two  │
│ …    ┆ …    ┆ …    │
│ two  ┆ two  ┆ two  │
│ null ┆ null ┆ null │
│ null ┆ one  ┆ null │
│ two  ┆ two  ┆ two  │
│ null ┆ null ┆ null │
└──────┴──────┴──────┘
>>> (
...     pl.from_dict(data)
...     .lazy()
...     .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
...     .group_by("nbr_nulls")
...     .len()
...     .collect()
... 
... )
estimated unique values: 3
run STREAMING HASH AGGREGATION
RUN STREAMING PIPELINE
df -> generic-group_by -> ordered_sink
RefCell { value: [] }
finish streaming aggregation with local in-memory table
shape: (3, 2)
┌───────────┬─────┐
│ nbr_nulls ┆ len │
│ ---       ┆ --- │
│ bool      ┆ u32 │
╞═══════════╪═════╡
│ false     ┆ 334 │
│ false     ┆ 334 │
│ false     ┆ 334 │
└───────────┴─────┘

Now let's change the nbr_records variable from 334 to 333, and re-run in lazy mode. The dtype of nbr_nulls switches back to u32.

nbr_records = 333
data = {
    "x": [None, 'two', None] * nbr_records,
    "y": ['one', 'two', None] * nbr_records,
    "z": [None, 'two', None] * nbr_records,
}
pl.from_dict(data)
(
    pl.from_dict(data)
    .lazy()
    .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
    .group_by("nbr_nulls")
    .len()
    .collect()
)
>>> nbr_records = 333
>>> data = {
...     "x": [None, 'two', None] * nbr_records,
...     "y": ['one', 'two', None] * nbr_records,
...     "z": [None, 'two', None] * nbr_records,
... 
... }
>>> pl.from_dict(data)
shape: (999, 3)
┌──────┬──────┬──────┐
│ x    ┆ y    ┆ z    │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ null ┆ one  ┆ null │
│ two  ┆ two  ┆ two  │
│ null ┆ null ┆ null │
│ null ┆ one  ┆ null │
│ two  ┆ two  ┆ two  │
│ …    ┆ …    ┆ …    │
│ two  ┆ two  ┆ two  │
│ null ┆ null ┆ null │
│ null ┆ one  ┆ null │
│ two  ┆ two  ┆ two  │
│ null ┆ null ┆ null │
└──────┴──────┴──────┘
>>> (
...     pl.from_dict(data)
...     .lazy()
...     .with_columns(nbr_nulls=pl.sum_horizontal(cs.all().is_null()))
...     .group_by("nbr_nulls")
...     .len()
...     .collect()
... 
... )
DATAFRAME < 1000 rows: running default HASH AGGREGATION
shape: (3, 2)
┌───────────┬─────┐
│ nbr_nulls ┆ len │
│ ---       ┆ --- │
│ u32       ┆ u32 │
╞═══════════╪═════╡
│ 0         ┆ 333 │
│ 3         ┆ 333 │
│ 2         ┆ 333 │
└───────────┴─────┘

From the log, it seems that Polars' engine is using a different optimization, and arriving at a different dtype for nbr_nulls.

Expected behavior

I would expect the dtype of nbr_nulls to remain the same.

I realize that summing boolean values (without first casting) may not be best practice. But I didn't expect column dtypes to shift between eager and lazy mode, nor based on the number of columns involved. (Indeed, when I first discovered these shifting dtypes in a much larger query, it was rather cumbersome to replicate with simple examples.)

Installed versions

--------Version info---------
Polars:               0.20.15
Index type:           UInt32
Platform:             Linux-6.5.0-25-generic-x86_64-with-glibc2.35
Python:               3.11.8 (main, Feb 25 2024, 16:39:33) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  0.10.0
cloudpickle:          3.0.0
connectorx:           0.3.2
deltalake:            0.16.0
fastexcel:            0.9.1
fsspec:               2023.12.2
gevent:               24.2.1
hvplot:               0.9.2
matplotlib:           <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              15.0.1
pydantic:             2.6.4
pyiceberg:            0.6.0
pyxlsb:               <not installed>
sqlalchemy:           2.0.28
xlsx2csv:             0.8.2
xlsxwriter:           3.2.0
cbilot added the bug, needs triage, and python labels on Mar 17, 2024
@mcrumiller (Contributor)

Can you summarize the actual issue before diving into all of your examples, and perhaps improve your title to describe the issue in one phrase? It's very hard to follow a "let's start with X" followed by a long example when we don't even know what type of error to expect, and your issue description only says that "things get weirder when we move to lazy mode".

@mcrumiller (Contributor) commented Mar 17, 2024

OK, I went through your examples and I see the issue: grouping by a horizontally summed series of boolean columns returns the original dtype (Boolean) of the grouped columns when the number of records is large enough.

I can't replicate the 333 threshold: for me, the issue occurs once there are more than 4 records:

import polars as pl

nbr_records = 4 # does not cause issue
# nbr_records = 5 # causes issue
data = {
    "x": [None, "two", None] * nbr_records,
    "y": ["one", "two", None] * nbr_records,
    "z": [None, "two", None] * nbr_records,
}

df = pl.DataFrame(data)

lf = df.lazy().select(
    pl.sum_horizontal(pl.all().is_null()).alias("num_null")
).group_by("num_null").len()

print(lf.collect())
shape: (3, 2)
┌──────────┬─────┐
│ num_null ┆ len │
│ ---      ┆ --- │
│ bool     ┆ u32 │
╞══════════╪═════╡
│ false    ┆ 10  │
│ false    ┆ 10  │
│ false    ┆ 10  │
└──────────┴─────┘

@mcrumiller (Contributor) commented Mar 17, 2024

I think the root issue is that pl.sum_horizontal() on a boolean column returns bool if we only have one column:

import polars as pl

df = pl.DataFrame({
    "a": [True, False, True],
})

df.select(pl.sum_horizontal(pl.col("a")))
shape: (3, 1)
┌───────┐
│ a     │
│ ---   │
│ bool  │
╞═══════╡
│ true  │
│ false │
│ true  │
└───────┘

Most likely the output dtype for sum_horizontal is not being properly converted to an integer type.

Edit: also seeing this for mean_horizontal.

@mcrumiller (Contributor)

Seeing a few more issues, both in the schema plan and the execution:

import polars as pl

lf = pl.LazyFrame({"a": [True, False, True]})
a = pl.col("a")

lf.select(pl.mean_horizontal(a)).schema    # OrderedDict({'a': Float64})
lf.select(pl.mean_horizontal(a)).collect() # Boolean

lf.select(pl.sum_horizontal(a)).schema     # OrderedDict({'a': Boolean})
lf.select(pl.sum_horizontal(a)).collect()  # Boolean
