Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect result when filtering a Parquet file on categorical columns #17744

Closed
2 tasks done
agossard opened this issue Jul 19, 2024 · 7 comments · Fixed by #17950
Closed
2 tasks done

Incorrect result when filtering a Parquet file on categorical columns #17744

agossard opened this issue Jul 19, 2024 · 7 comments · Fixed by #17950
Assignees
Labels
A-io-parquet Area: reading/writing Parquet files A-optimizer Area: plan optimization accepted Ready for implementation P-high Priority: high python Related to Python Polars

Comments

@agossard
Copy link
Contributor

agossard commented Jul 19, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import numpy as np
import pandas as pd
import random
import string

Helper functions to generate random data

def random_string(length=5):
letters = string.ascii_uppercase
return ''.join(random.choice(letters) for i in range(length))

def random_category(categories):
return random.choice(categories)

def random_float():
return random.uniform(1e6, 1e11)

Define the categories for categorical columns

gics_industry_group = ["Materials", "Capital Goods"]
pricing_currency = ["USD", "CAD", "EUR", "AUD", "INR", "KRW"]
country = ["AU", "US", "CA", "JP", "IN", "KR"]
gics_sector_name = ["Materials", "Industrials"]
gics_industry_name = ["Metals & Mining", "Machinery", "Banks", "Retail"]

Number of rows

num_rows = 42265

Create the DataFrame

df = pl.DataFrame({
"dt": np.random.randint(20240101, 20240630, size=num_rows),
"global_id": np.random.randint(1, 11260000, size=num_rows),
"gics_industry_group": [random_category(gics_industry_group) for _ in range(num_rows)],
"pricing_currency": [random_category(pricing_currency) for _ in range(num_rows)],
"ticker": [random_string() for _ in range(num_rows)],
"country": [random_category(country) for _ in range(num_rows)],
"gics_sector_name": [random_category(gics_sector_name) for _ in range(num_rows)],
"market_cap_usd": [random_float() for _ in range(num_rows)],
"name": ["Company " + random_string(10) for _ in range(num_rows)],
"gics_industry_name": [random_category(gics_industry_name) for _ in range(num_rows)]
})

df = df.with_columns(pl.col('gics_industry_name').cast(pl.Categorical))

df.write_parquet('test_2.parquet')

Filtering before collect -> Returns 0 rows (incorrect)

print(len(pl.scan_parquet('test_2.parquet').filter(pl.col('gics_industry_name').is_in(['Metals & Mining', 'Machinery'])).collect()))

Filtering after collect -> Returns >0 rows (correct)

print(len(pl.scan_parquet('test_2.parquet').collect().filter(pl.col('gics_industry_name').is_in(['Metals & Mining', 'Machinery']))))

Log output

No response

Issue description

This seems to be a regression, as this behavior was working for me in 1.1 and is no longer working in 1.2. (Note, I observed this same bug many versions back, also a regression, which was subsequently corrected).

I have a set of parquet files on s3, which contain a column which is a categorical string. If I do this, I get the expected outcome:

pl.scan_parquet(....).collect().filter(pl.col('my_categorical_column').is_in(['a category']))

However, if I do this (attempting to get the filter applied during the reading process)

pl.scan_parquet(....).filter(pl.col('my_categorical_column').is_in(['a category'])).collect()

Then I get no rows returned in the data frame.

Expected behavior

The results should be the same regardless of where collect is called. The filter when run with lazy frame should retain rows that pass the filter.

Installed versions

--------Version info---------

Polars:               1.1.0

Index type:           UInt32

Platform:             Linux-4.14.281-212.502.amzn2.x86_64-x86_64-with-glibc2.26

Python:               3.11.5 (main, Sep 11 2023, 13:23:44) [GCC 11.2.0]

 

----Optional dependencies----

adbc_driver_manager:  <not installed>

cloudpickle:          2.2.1

connectorx:           <not installed>

deltalake:            <not installed>

fastexcel:            <not installed>

fsspec:               2023.10.0

gevent:               <not installed>

great_tables:         <not installed>

hvplot:               0.9.1

matplotlib:           3.8.4

nest_asyncio:         1.5.6

numpy:                1.26.3

openpyxl:             3.1.2

pandas:               2.0.3

pyarrow:              15.0.0

pydantic:             2.5.3

pyiceberg:            <not installed>

sqlalchemy:           1.4.52

torch:                2.3.0

xlsx2csv:             <not installed>

xlsxwriter:           3.1.1
@agossard agossard added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jul 19, 2024
@agossard
Copy link
Contributor Author

Sorry... just a correction on the pasted in versions. The bug is present in polars 1.2 and 1.2.1. In polars 1.1, the behavior is correct.

@alonme
Copy link
Contributor

alonme commented Jul 20, 2024

This issue is of course harder to solve with no example,
Does this still happen if you filter the parquet to a single line?
Does it happen if you remove the other columns?

@ritchie46
Copy link
Member

Have you got a repro? We cannot do anything if we cant reproduce it.

@ritchie46 ritchie46 added the needs repro Bug does not yet have a reproducible example label Jul 21, 2024
@agossard
Copy link
Contributor Author

Ok… I was able to create a reproducible example actually… apologies for not trying in the first place!

might consider looking at #17475

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jul 25, 2024

Have you got a repro? We cannot do anything if we cant reproduce it.

@ritchie46, @coastalwhite: looks like strings aren't being adapted to match categoricals with is_in filters any more. Here's a simple/streamlined repro 👌

import polars as pl

pl.DataFrame(
    data={
        "n": [1,2,3],
        "ccy": ["USD", "JPY", "EUR"],
    },
    schema_overrides={"ccy": pl.Categorical("lexical")},
).write_parquet(pq := "test.parquet")

This still works (==):

pl.scan_parquet(pq).filter(pl.col("ccy") == "USD").collect()
# shape: (1, 2)
# ┌─────┬─────┐
# │ n   ┆ ccy │
# │ --- ┆ --- │
# │ i64 ┆ cat │
# ╞═════╪═════╡
# │ 1   ┆ USD │
# └─────┴─────┘

This now fails (is_in):

pl.scan_parquet(pq).filter(pl.col("ccy").is_in(["USD"])).collect()
# shape: (0, 2)
# ┌─────┬─────┐
# │ n   ┆ ccy │
# │ --- ┆ --- │
# │ i64 ┆ cat │
# ╞═════╪═════╡
# └─────┴─────┘

@alexander-beedie alexander-beedie added A-io-parquet Area: reading/writing Parquet files and removed needs repro Bug does not yet have a reproducible example labels Jul 25, 2024
@coastalwhite
Copy link
Collaborator

I will take a look at this. I am currently throwing quite a bit refactor over the parquet code that effects Categoricals as well. So probably after that.

@nameexhaustion nameexhaustion added accepted Ready for implementation P-high Priority: high A-optimizer Area: plan optimization and removed bug Something isn't working needs triage Awaiting prioritization by a maintainer labels Jul 30, 2024
@nameexhaustion nameexhaustion changed the title Problem filtering categorical string columns with lazy frame and scan_parquet Incorrect result when filtering a Parquet file on categorical columns Jul 30, 2024
@nameexhaustion
Copy link
Collaborator

Thanks @alexander-beedie for the repro 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-io-parquet Area: reading/writing Parquet files A-optimizer Area: plan optimization accepted Ready for implementation P-high Priority: high python Related to Python Polars
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants