cast(pl.Enum, strict=False) fails when casting from another enum/categorical #14900

Closed
2 tasks done
mcrumiller opened this issue Mar 7, 2024 · 8 comments · Fixed by #14910
Labels
A-dtype-categorical Area: categorical data type bug Something isn't working P-medium Priority: medium python Related to Python Polars

Comments

@mcrumiller
Contributor

mcrumiller commented Mar 7, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

# succeeds
pl.Series(["a", "b"]).cast(pl.Enum(["a"]), strict=False)

# ComputeError: value 'b' is not present in Enum: Utf8ViewArray[a]
pl.Series(["a", "b"], dtype=pl.Categorical).cast(pl.Enum(["a"]), strict=False)

# ComputeError: value 'b' is not present in Enum: Utf8ViewArray[a]
pl.Series(["a", "b"], dtype=pl.Enum(["a", "b"])).cast(pl.Enum(["a"]), strict=False)

Issue description

#14728 enabled non-strict casting to an Enum, whereby elements not in the Enum set are set to null. However, this fails when the source series is already a categorical or enum.
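To make the expected semantics concrete, here is a rough pure-Python model of an enum-to-enum cast working on integer codes. It is only an illustration of the behaviour described above, not how Polars implements it; `remap_codes` and its parameters are hypothetical names.

```python
def remap_codes(codes, src_categories, dst_categories, strict=True):
    """Translate integer codes from one category list to another.

    Hypothetical helper: models the expected cast semantics only.
    """
    # Position of every target category, looked up by its string value.
    dst_index = {cat: i for i, cat in enumerate(dst_categories)}
    out = []
    for code in codes:
        if code is None:
            # Nulls stay null.
            out.append(None)
            continue
        cat = src_categories[code]
        new_code = dst_index.get(cat)
        if new_code is None and strict:
            # Strict mode: a value missing from the target is an error.
            raise ValueError(f"value {cat!r} is not present in Enum: {list(dst_categories)}")
        # Non-strict mode: a missing value becomes null (None).
        out.append(new_code)
    return out


# "a", "b" stored as codes 0, 1 under the source categories ["a", "b"]
result = remap_codes([0, 1], ["a", "b"], ["a"], strict=False)
```

With `strict=False` the value `"b"` maps to `None` instead of raising, which is what the two failing calls in the example above would be expected to do.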

Installed versions

--------Version info---------
Polars:               0.20.14
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.11.7 (tags/v3.11.7:fa7a6f2, Dec  4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           0.3.2
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
numpy:                1.26.2
openpyxl:             3.1.2
pandas:               2.1.4
pyarrow:              14.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.23
xlsx2csv:             0.8.2
xlsxwriter:           3.1.9
@mcrumiller mcrumiller added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Mar 7, 2024
@mcrumiller
Contributor Author

@c-peters

@deanm0000 deanm0000 added A-dtype-categorical Area: categorical data type P-low Priority: low and removed needs triage Awaiting prioritization by a maintainer labels Mar 7, 2024
@deanm0000
Collaborator

I put P-low since this seems like a corner case that wouldn't be too tough to work around. As a tangent: is it better (from the issue author's perspective) to be in "needs triage" or "P-low"?

@mcrumiller
Contributor Author

It's not super easy to work around without going back through String, which is a pretty big performance hit. I need to do this quite a bit when I'm mapping one enum type to another that has, say, a subset of the original column's values.

@mcrumiller
Contributor Author

Re: your question, I'm not sure how to answer that. I would obviously personally prefer "needs triage" (haven't assigned a priority) to "low priority", since the latter is basically a worst-case of the former from a "this is going to be resolved quickly" perspective. But the priority itself should be based on the assessment of the maintainers regarding how important they feel this update would be to the general population, and not on my desire for it to be resolved.

@deanm0000
Collaborator

I'm assuming that your real goal isn't to get the null values and that you're more interested in getting them together without casting them to string and back.

Does this idea help?

s1=pl.Series(["a", "b","c"], dtype=pl.Enum(["a","b","c"]))
s2=pl.Series(["d", "e","f"], dtype=pl.Enum(["d", "e","f"]))

def enumconcat(ss):
    cats=set()
    for s in ss:
        for cat in s.cat.get_categories():
            cats.add(cat)
    super_enum=pl.Enum(sorted(cats))  # sorted: set iteration order is arbitrary
    return pl.concat([s.cast(super_enum) for s in ss])
enumconcat([s1,s2])
shape: (6,)
Series: '' [enum]
[
	"a"
	"b"
	"c"
	"d"
	"e"
	"f"
]

As to the priority, I'm rethinking it, given that the cast should really respect the strict setting independent of anything else.

@deanm0000 deanm0000 added P-medium Priority: medium and removed P-low Priority: low labels Mar 7, 2024
@mcrumiller
Contributor Author

mcrumiller commented Mar 7, 2024

My use case is building a new enum from an old list of values with slightly different categories:

import polars as pl
from polars import col, when

orig_enum = pl.Enum(["a", "b", "c"])
new_enum = pl.Enum(["a", "b", "d"])

df = pl.DataFrame({"a": pl.Series(["a", "b", "c"], dtype=orig_enum)})
df.with_columns(
    when(col("a") == "c").then(pl.lit("d", dtype=new_enum))
    .otherwise(col("a").cast(new_enum))
)

The when/then depends on a separate column, but that's basically the gist.

@deanm0000
Collaborator

deanm0000 commented Mar 7, 2024

alright I got carried away

def make_super(*args, cats=pl.Enum([])):
    # Collect the categories of every argument into one sorted Enum.
    for arg in args:
        if isinstance(arg, list):
            cats=make_super(*arg, cats=cats)
        elif isinstance(arg, pl.Enum):
            # a bare Enum dtype such as orig_enum
            cats=make_super(arg.categories, cats=cats)
        elif isinstance(arg, (pl.DataFrame, pl.LazyFrame)):
            # every Enum column in the frame's schema
            cats=make_super(*[dtype.categories for dtype in arg.schema.values() if isinstance(dtype, pl.Enum)], cats=cats)
        elif isinstance(arg, pl.Series) and isinstance(arg.dtype, pl.Categorical):
            cats=make_super(arg.cat.get_categories(), cats=cats)
        elif isinstance(arg, pl.Series) and isinstance(arg.dtype, pl.Enum):
            cats=make_super(arg.dtype.categories, cats=cats)
        elif isinstance(arg, pl.Series) and arg.dtype==pl.String:
            cats=pl.Enum(pl.concat([cats.categories, arg]).unique())
    return pl.Enum(sorted(cats.categories))

so with that then you can do

super_enum=make_super(orig_enum, new_enum, df)
df.with_columns(
    when(col("a") == "c").then(pl.lit("d", dtype=super_enum))
    .otherwise(col("a").cast(super_enum))
)

Since the Enum is part of the schema it works with eager and lazy

@mcrumiller
Contributor Author

Haha thanks @deanm0000. I actually have an incoming PR, but there might be discussion on it in which case I may resort to your carried-away solutions :D.


2 participants