Casting of boolean values #15102
Comments
Can you summarize the actual issue before diving into all of your examples, and perhaps improve your title to describe the issue in one phrase? It's very hard to follow a "let's start with X" followed by a long example when we don't even know what type of error to expect, and your issue description only says that "things get weirder when we move to lazy mode".
Ok, I went through your examples and I see your issue: grouping by a horizontally-summed series of boolean columns returns the original dtype (boolean) of the grouped columns when the number of records is large enough. I can't replicate the 333 issue: for me, this issue occurs when I have more than 4 records:
import polars as pl
nbr_records = 4 # does not cause issue
# nbr_records = 5 # causes issue
data = {
"x": [None, "two", None] * nbr_records,
"y": ["one", "two", None] * nbr_records,
"z": [None, "two", None] * nbr_records,
}
df = pl.DataFrame(data)
lf = df.lazy().select(
pl.sum_horizontal(pl.all().is_null()).alias("num_null")
).group_by("num_null").len()
print(lf.collect())
I think the root issue is that
import polars as pl
df = pl.DataFrame({
"a": [True, False, True],
})
df.select(pl.sum_horizontal(pl.col("a")))
Most likely the
Edit: also seeing this for
Seeing a few more issues, both in the schema plan and the execution:
import polars as pl
lf = pl.LazyFrame({"a": [True, False, True]})
a = pl.col("a")
lf.select(pl.mean_horizontal(a)).schema # OrderedDict({'a': Float64})
lf.select(pl.mean_horizontal(a)).collect() # Boolean
lf.select(pl.sum_horizontal(a)).schema # OrderedDict({'a': Boolean})
lf.select(pl.sum_horizontal(a)).collect() # Boolean
Checks
Reproducible example
Let's start with a simple data frame, and count the frequency of the number of null values in each row using sum_horizontal:
I get the results I expect:
Now, let's eliminate column y, and re-run. Note how the dtype of the nbr_nulls column changes to boolean:
Log output
No response
Issue description
Things get somewhat weirder when we introduce lazy mode. Let's perform the same type of operation with a somewhat larger data frame. In eager mode, the following gives the results I expect:
Now, let's put the data frame in lazy mode, and conduct the same operation. Note how the dtype of nbr_nulls becomes boolean.
Now let's change the nbr_records variable from 334 to 333, and re-run in lazy mode. The dtype of nbr_nulls switches back to u32.
From the log, it seems that Polars' engine is using a different optimization, and arriving at a different dtype for nbr_nulls.
Expected behavior
I would expect the dtype of nbr_nulls to remain the same.
I realize that summing boolean values (without first casting) may not be best practices. But I didn't expect to see dtypes of columns shifting between eager and lazy mode, nor based on the number of columns involved. (Indeed, this was a rather cumbersome problem to replicate with simple examples when I first discovered these shifting dtypes in a much larger query.)
Installed versions