`cast(pl.Enum, strict=False)` fails when casting from another enum/categorical #14900

mcrumiller · 2024-03-07T14:16:34Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

# succeeds
pl.Series(["a", "b"]).cast(pl.Enum(["a"]), strict=False)

# ComputeError: value 'b' is not present in Enum: Utf8ViewArray[a]
pl.Series(["a", "b"], dtype=pl.Categorical).cast(pl.Enum(["a"]), strict=False)

# ComputeError: value 'b' is not present in Enum: Utf8ViewArray[a]
pl.Series(["a", "b"], dtype=pl.Enum(["a", "b"])).cast(pl.Enum(["a"]), strict=False)

Issue description

#14728 enabled non-strict casting to an Enum, whereby elements not in the Enum set are set to null. However, this fails when the source series is already a categorical or enum.

Installed versions

--------Version info---------
Polars:               0.20.14
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.11.7 (tags/v3.11.7:fa7a6f2, Dec  4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           0.3.2
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
numpy:                1.26.2
openpyxl:             3.1.2
pandas:               2.1.4
pyarrow:              14.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.23
xlsx2csv:             0.8.2
xlsxwriter:           3.1.9


mcrumiller · 2024-03-07T14:17:39Z

@c-peters

deanm0000 · 2024-03-07T19:28:12Z

I put P-low since this seems like a corner case that wouldn't be too tough to work around. As a tangent: is it better (from the issue author's perspective) to be in "needs triage" or "P-low"?

mcrumiller · 2024-03-07T19:35:56Z

It's not super easy to work around without going back through String which is a pretty big performance hit. I need to do this quite a bit when I'm mapping one enum type to another that has, say, a subset of the original column's values.

mcrumiller · 2024-03-07T19:39:38Z

Re: your question, I'm not sure how to answer that. I would obviously personally prefer "needs triage" (haven't assigned a priority) to "low priority", since the latter is basically a worst-case of the former from a "this is going to be resolved quickly" perspective. But the priority itself should be based on the assessment of the maintainers regarding how important they feel this update would be to the general population, and not on my desire for it to be resolved.

deanm0000 · 2024-03-07T21:06:01Z

I'm assuming that your real goal isn't to get the null values and that you're more interested in getting them together without casting them to string and back.

Does this idea help?

s1=pl.Series(["a", "b","c"], dtype=pl.Enum(["a","b","c"]))
s2=pl.Series(["d", "e","f"], dtype=pl.Enum(["d", "e","f"]))

def enumconcat(ss):
    cats=set()
    for s in ss:
        for cat in s.cat.get_categories():
            cats.add(cat)
    super_enum=pl.Enum(list(cats))
    return pl.concat([s.cast(super_enum) for s in ss])
enumconcat([s1,s2])
shape: (6,)
Series: '' [enum]
[
	"a"
	"b"
	"c"
	"d"
	"e"
	"f"
]

As to the priority I'm rethinking it given that it should really respect the strict setting independent of anything else.

mcrumiller · 2024-03-07T21:32:06Z

My use case is building a new enum from an old list of values with slightly different categories:

import polars as pl
from polars import col, when

orig_enum = pl.Enum(["a", "b", "c"])
new_enum = pl.Enum(["a", "b", "d"])

df = pl.DataFrame({"a": pl.Series(["a", "b", "c"], dtype=orig_enum)})
df.with_columns(
    when(col("a") == "c").then(pl.lit("d", dtype=new_enum))
    .otherwise(col("a").cast(new_enum))
)

The when/then actually depends on a separate column, but that's basically the gist.

deanm0000 · 2024-03-07T22:35:18Z

alright I got carried away

def make_super(*args, cats=pl.Enum([])):
    # Collect the categories of every argument into one sorted "super" Enum.
    for arg in args:
        if isinstance(arg, list):
            cats = make_super(*arg, cats=cats)
        elif isinstance(arg, pl.Enum):
            # a bare Enum dtype, e.g. orig_enum or new_enum
            cats = make_super(arg.categories, cats=cats)
        elif isinstance(arg, (pl.DataFrame, pl.LazyFrame)):
            # take the categories of every Enum column in the schema
            cats = make_super(*[dtype.categories for dtype in arg.schema.values() if isinstance(dtype, pl.Enum)], cats=cats)
        elif isinstance(arg, pl.Series) and isinstance(arg.dtype, pl.Categorical):
            cats = make_super(arg.cat.get_categories(), cats=cats)
        elif isinstance(arg, pl.Series) and isinstance(arg.dtype, pl.Enum):
            cats = make_super(arg.dtype.categories, cats=cats)
        elif isinstance(arg, pl.Series) and arg.dtype == pl.String:
            # base case: merge a String series of category names
            cats = pl.Enum(pl.concat([cats.categories, arg]).unique())
    return pl.Enum(sorted(cats.categories))

so with that then you can do

super_enum=make_super(orig_enum, new_enum, df)
df.with_columns(
    when(col("a") == "c").then(pl.lit("d", dtype=super_enum))
    .otherwise(col("a").cast(super_enum))
)

Since the Enum is part of the schema, it works with both eager and lazy frames.

mcrumiller · 2024-03-07T22:37:03Z

Haha thanks @deanm0000. I actually have an incoming PR, but there might be discussion on it in which case I may resort to your carried-away solutions :D.

mcrumiller added the bug, needs triage, and python labels on Mar 7, 2024

deanm0000 added the A-dtype-categorical and P-low labels and removed the needs triage label on Mar 7, 2024

deanm0000 added the P-medium label and removed the P-low label on Mar 7, 2024

mcrumiller mentioned this issue Mar 7, 2024

fix(rust, python): allow nonstrict cast of categorical/enum to enum #14910

Merged

ritchie46 closed this as completed in #14910 Mar 10, 2024
