SchemaError: failed to determine supertype of enum and u32 #17618

Closed
2 tasks done
bzm3r opened this issue Jul 13, 2024 · 2 comments · Fixed by #17622
Labels: accepted, bug, needs triage, python

bzm3r commented Jul 13, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

%env POLARS_VERBOSE=1
%env RUST_BACKTRACE=full

import polars as pl

pl.show_versions()

df = pl.DataFrame()
dtype = pl.Enum(categories=["HBS"])
df = df.insert_column(0, pl.Series("category", [], dtype=dtype))
print(df)

relevant_keys = ["category"]
filter_dtype = pl.Struct({"category": df["category"].dtype})
filter_list = [{"category": "HBS"}]

print(f"{filter_dtype=}")
print(f"{filter_list=}")

filtered_df = df.filter(
    pl.struct(relevant_keys).is_in(
        pl.Series(
            filter_list,
            dtype=filter_dtype,
        )
    )
)
print(f"{filtered_df=}")

shape: (0, 1)
┌──────────┐
│ category │
│ ---      │
│ enum     │
╞══════════╡
└──────────┘
filter_dtype=Struct({'category': Enum(categories=['HBS'])})
filter_list=[{'category': 'HBS'}]
dataframe filtered

Log output

env: POLARS_VERBOSE=1
env: RUST_BACKTRACE=full

---------------------------------------------------------------------------
SchemaError                               Traceback (most recent call last)
Cell In[11], line 20
     17 print(f"{filter_dtype=}")
     18 print(f"{filter_list=}")
---> 20 filtered_df = df.filter(
     21     pl.struct(relevant_keys).is_in(
     22         pl.Series(
     23             filter_list,
     24             dtype=filter_dtype,
     25         )
     26     )
     27 )
     28 print(filtered_df)

File ~/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py:4521, in DataFrame.filter(self, *predicates, **constraints)
   4421 def filter(
   4422     self,
   4423     *predicates: (
   (...)
   4430     **constraints: Any,
   4431 ) -> DataFrame:
   4432     """
   4433     Filter the rows in the DataFrame based on one or more predicate expressions.
   4434 
   (...)
   4519     └─────┴─────┴─────┘
   4520     """
-> 4521     return self.lazy().filter(*predicates, **constraints).collect(_eager=True)

File ~/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py:1942, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1939 # Only for testing purposes atm.
   1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))

SchemaError: failed to determine supertype of enum and u32

Issue description

The example above was reduced from a larger case to the smallest one that still produces the error, so the empty (0, 1) shape of the DataFrame is not the cause. The same error occurs with a DataFrame that contains data:

%env POLARS_VERBOSE=1
%env RUST_BACKTRACE=full

import polars as pl

pl.show_versions()

df = pl.DataFrame()
dtype = pl.Enum(categories=["HBS", "XYZ"])
df = df.insert_column(
    0, pl.Series("category", ["HBS", "XYZ", "HBS"], dtype=dtype)
)
print(df)

relevant_keys = ["category"]
filter_dtype = pl.Struct({"category": df["category"].dtype})
filter_list = [{"category": "HBS"}]

print(f"{filter_dtype=}")
print(f"{filter_list=}")

filtered_df = df.filter(
    pl.struct(relevant_keys).is_in(
        pl.Series(
            filter_list,
            dtype=filter_dtype,
        )
    )
)
print(f"{filtered_df=}")

env: POLARS_VERBOSE=1
env: RUST_BACKTRACE=full
--------Version info---------
Polars:               1.1.0
Index type:           UInt32
Platform:             Linux-5.15.0-113-generic-x86_64-with-glibc2.35
Python:               3.12.3 (main, Apr 15 2024, 18:25:56) [Clang 17.0.6 ]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                2.0.0
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
shape: (3, 1)
┌──────────┐
│ category │
│ ---      │
│ enum     │
╞══════════╡
│ HBS      │
│ XYZ      │
│ HBS      │
└──────────┘
filter_dtype=Struct({'category': Enum(categories=['HBS', 'XYZ'])})
filter_list=[{'category': 'HBS'}]
dataframe filtered
---------------------------------------------------------------------------
SchemaError                               Traceback (most recent call last)
Cell In[3], line 22
     19 print(f"{filter_dtype=}")
     20 print(f"{filter_list=}")
---> 22 filtered_df = df.filter(
     23     pl.struct(relevant_keys).is_in(
     24         pl.Series(
     25             filter_list,
     26             dtype=filter_dtype,
     27         )
     28     )
     29 )
     30 print(f"{filtered_df=}")

File ~/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py:4521, in DataFrame.filter(self, *predicates, **constraints)
   4421 def filter(
   4422     self,
   4423     *predicates: (
   (...)
   4430     **constraints: Any,
   4431 ) -> DataFrame:
   4432     """
   4433     Filter the rows in the DataFrame based on one or more predicate expressions.
   4434 
   (...)
   4519     └─────┴─────┴─────┘
   4520     """
-> 4521     return self.lazy().filter(*predicates, **constraints).collect(_eager=True)

File ~/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py:1942, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1939 # Only for testing purposes atm.
   1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))

SchemaError: failed to determine supertype of enum and u32

Expected behavior

Successful filtering using structs.

Installed versions

env: POLARS_VERBOSE=1
env: RUST_BACKTRACE=full
--------Version info---------
Polars:               1.1.0
Index type:           UInt32
Platform:             Linux-5.15.0-113-generic-x86_64-with-glibc2.35
Python:               3.12.3 (main, Apr 15 2024, 18:25:56) [Clang 17.0.6 ]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                2.0.0
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

bzm3r added the bug, needs triage, and python labels on Jul 13, 2024

@dguidara

Change it to pl.struct(relevant_keys).is_in(filter_list) and it should work.

bzm3r commented Jul 14, 2024

@dguidara Is that the preferred solution?
