Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not skipping hive partitioned data when using is_in #13358

Closed
2 tasks done
bchalk101 opened this issue Jan 1, 2024 · 0 comments
Closed
2 tasks done

Not skipping hive partitioned data when using is_in #13358

bchalk101 opened this issue Jan 1, 2024 · 0 comments
Labels
bug Something isn't working python Related to Python Polars

Comments

@bchalk101
Copy link
Contributor

bchalk101 commented Jan 1, 2024

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame({"d": pl.arange(0, 10_000, eager=True)}).with_columns(
    a=pl.col("d") % 100
)
df.write_parquet(
    "./test_partitions",
    use_pyarrow=True,
    pyarrow_options={"partition_cols": ["a"]},
)

print(
    pl.scan_parquet("./test_partitions/**/*.parquet")
    .filter(pl.col("a").is_in([1, 99]))
    .explain()
)

Log output

parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
hive partitioning: skipped 1 files, first file : test_partitions/a=0/d96bb478e0d84b228a703739b7d1f143-0.parquet

  Parquet SCAN 99 files: first file: test_partitions/a=1/d96bb478e0d84b228a703739b7d1f143-0.parquet
  PROJECT */2 COLUMNS
  SELECTION: col("a").is_in([Series])

Issue description

When using is_in for the partition column, all the files between the first partition and the last partition in the is_in columns are labeled as parquet file must be read, statistics not sufficient for predicate.

Expected behavior

There is no reason to be reading so many files, only the first partition and the 100th partition should be read, and even with those, there should be no reason to read the actual file as there is no further filter than what is contained in the partition and the statistics of the parquet file.

Installed versions

--------Version info---------
Polars:               0.20.2
Index type:           UInt32
Platform:             macOS-13.6.2-arm64-arm-64bit
Python:               3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:09:17) [Clang 16.0.6 ]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
matplotlib:           3.8.2
numpy:                1.26.2
openpyxl:             <not installed>
pandas:               2.1.3
pyarrow:              14.0.2
pydantic:             2.5.2
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>```

</details>
@bchalk101 bchalk101 added bug Something isn't working python Related to Python Polars labels Jan 1, 2024
@alexander-beedie alexander-beedie changed the title Not skipping hive partitioned data when using isin Not skipping hive partitioned data when using is_in Jan 1, 2024
bchalk101 pushed a commit to bchalk101/polars that referenced this issue Jan 4, 2024
bchalk101 pushed a commit to bchalk101/polars that referenced this issue Jan 4, 2024
fix(rust): Fix hive partitioned files not being skipped (pola-rs#13358)
bchalk101 pushed a commit to bchalk101/polars that referenced this issue Jan 7, 2024
fix(rust): Fix hive partitioned files not being skipped (pola-rs#13358)
bchalk101 pushed a commit to bchalk101/polars that referenced this issue Jan 8, 2024
fix(rust): Fix hive partitioned files not being skipped (pola-rs#13358)
ritchie46 pushed a commit that referenced this issue Jan 8, 2024
Co-authored-by: Boruch Chalk <boruch.chalk@mobileye.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working python Related to Python Polars
Projects
None yet
Development

No branches or pull requests

1 participant