[Python] Scan_pyarrow_dataset() no longer pushes down partition filtering to pyarrow #14343

lmocsi · 2024-02-07T13:34:47Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import pyarrow.dataset as ds
tr = pl.scan_pyarrow_dataset(ds.dataset(parq_path+"my_table", partitioning='hive'))

df = (tr.filter((pl.col('CALENDAR_DATE').is_between(pl.lit('2023-07-21'), pl.lit('2024-01-22'))) &
                     (pl.col('CRE_FL') == 'I')
                     )
             .select('PARTY_ID').unique()
             .rename({'PARTY_ID':'PART_ID'})
             .with_columns(pl.lit(1).alias('NEXT_FL'))
             .collect()
         )

Log output

No log.

Issue description

As of polars==0.20.4 scan_pyarrow_dataset() no longer pushes down partition filtering to pyarrow. In polars==0.20.3 it was working fine, as it was described in this issue: #13908

Polars==0.20.3 + scan_pyarrow_dataset:

Explain:
 WITH_COLUMNS:
 [1.alias("NEXT_FL")]
  RENAME
    UNIQUE BY None
      FAST_PROJECT: [PARTY_ID]

          PYTHON SCAN 
          PROJECT 3/4 COLUMNS
          SELECTION: (((pa.compute.field(\'CALENDAR_DATE\') >= \'2023-07-21\') & (pa.compute.field(\'CALENDAR_DATE\') <= \'2024-01-22\')) & (pa.compute.field(\'CREDIT_FL\') == \'Y\'))

Polars==0.20.6 + scan_pyarrow_dataset:

Explain:
 WITH_COLUMNS:
 [1.alias("NEXT_FL")]
  RENAME
    UNIQUE BY None
      FAST_PROJECT: [PARTY_ID]
        FILTER [(col("CALENDAR_DATE").is_between([String(2023-07-21), String(2024-01-22)])) & ([(col("CREDIT_FL")) == (String(Y))])] FROM
          PYTHON SCAN 
          PROJECT 3/4 COLUMNS

Expected behavior

Push down partition filtering to pyarrow.

Installed versions

-------Version info---------
Polars:               0.20.7
Index type:           UInt32
Platform:             Linux-4.18.0-372.71.1.el8_6.x86_64-x86_64-with-glibc2.28
Python:               3.9.13 (main, Oct 13 2022, 21:15:33) 
[GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2022.02.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.0
numpy:                1.23.5
openpyxl:             3.0.9
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           1.4.27
xlsx2csv:             <not installed>
xlsxwriter:           3.1.3

The text was updated successfully, but these errors were encountered:

taki-mekhalfa · 2024-02-07T14:41:13Z

you can try this in the meantime:

>>> print(
        tr.filter((pl.col('CALENDAR_DATE') >= (pl.lit('2023-07-21'))) &
                  (pl.col('CALENDAR_DATE') <= (pl.lit('2024-01-22')))).explain(optimized=True)
    )

PYTHON SCAN 
  PROJECT */4 COLUMNS
  SELECTION: ((pa.compute.field('CALENDAR_DATE') >= '2023-07-21') & (pa.compute.field('CALENDAR_DATE') <= '2024-01-22'))

ritchie46 · 2024-02-07T16:05:45Z

Can we get a bisect on this one? Which commit introduced the regression?

deanm0000 · 2024-02-09T01:18:11Z

@imocsi did you see that 0.20.7 has the is_in fix for scan_parquet?

lmocsi · 2024-02-09T07:09:00Z

Yes, I did see it. Though, the codes I wrote use scan_pyarrow_dataset('myfolder'), so I'll wait for either:

scan_pyarrow_dataset for this regression to be fixed (and I won't touch my code, just upgrade polars)
scan_parquet to support folder as an input parameter (and add on the fly the '/**/*.parquet' file spec - I've submitted it as an enhancement request: [Python] Scan_parquet() should be able to accept directory as input #14342 )

lmocsi · 2024-02-20T09:16:18Z

Can we get a bisect on this one? Which commit introduced the regression?

My guess is this:
#11945

lmocsi · 2024-02-21T09:20:43Z

It seems, that in polars==0.20.3 the expressionpl.col('CALENDAR_DATE').is_between(a, b) was automatically rewritten to the pyarrow expression of pa.compute.field(\'CALENDAR_DATE\') >= \'a\') & (pa.compute.field(\'CALENDAR_DATE\') <= \'b\')).
Since polars==0.20.4 introduced .is_between() in Rust, chances are that this new .is_between() is not re-written to pyarrow like above, and since CALENDAR_DATE is the partitioning column, partition filtering is no longer used here.

ritchie46 · 2024-08-16T09:24:29Z

@lmocsi is this resolved?

lmocsi · 2024-08-16T10:45:21Z

@lmocsi is this resolved?

Yes. As of polars==1.5.0 it is working fine (maybe with earlier versions as well).

ritchie46 · 2024-08-16T11:01:28Z

Great!

lmocsi added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 7, 2024

lmocsi closed this as completed Aug 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Python] Scan_pyarrow_dataset() no longer pushes down partition filtering to pyarrow #14343

[Python] Scan_pyarrow_dataset() no longer pushes down partition filtering to pyarrow #14343

lmocsi commented Feb 7, 2024

taki-mekhalfa commented Feb 7, 2024

ritchie46 commented Feb 7, 2024

deanm0000 commented Feb 9, 2024

lmocsi commented Feb 9, 2024 •

edited

Loading

lmocsi commented Feb 20, 2024

lmocsi commented Feb 21, 2024

ritchie46 commented Aug 16, 2024

lmocsi commented Aug 16, 2024

ritchie46 commented Aug 16, 2024

[Python] Scan_pyarrow_dataset() no longer pushes down partition filtering to pyarrow #14343

[Python] Scan_pyarrow_dataset() no longer pushes down partition filtering to pyarrow #14343

Comments

lmocsi commented Feb 7, 2024

Checks

Reproducible example

Log output

Issue description

Expected behavior

Installed versions

taki-mekhalfa commented Feb 7, 2024

ritchie46 commented Feb 7, 2024

deanm0000 commented Feb 9, 2024

lmocsi commented Feb 9, 2024 • edited Loading

lmocsi commented Feb 20, 2024

lmocsi commented Feb 21, 2024

ritchie46 commented Aug 16, 2024

lmocsi commented Aug 16, 2024

ritchie46 commented Aug 16, 2024

lmocsi commented Feb 9, 2024 •

edited

Loading