Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Scan_pyarrow_dataset() no longer pushes down partition filtering to pyarrow #14343

Closed
2 tasks done
lmocsi opened this issue Feb 7, 2024 · 9 comments
Closed
2 tasks done
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@lmocsi
Copy link

lmocsi commented Feb 7, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import pyarrow.dataset as ds
tr = pl.scan_pyarrow_dataset(ds.dataset(parq_path+"my_table", partitioning='hive'))

df = (tr.filter((pl.col('CALENDAR_DATE').is_between(pl.lit('2023-07-21'), pl.lit('2024-01-22'))) &
                     (pl.col('CRE_FL') == 'I')
                     )
             .select('PARTY_ID').unique()
             .rename({'PARTY_ID':'PART_ID'})
             .with_columns(pl.lit(1).alias('NEXT_FL'))
             .collect()
         )

Log output

No log.

Issue description

As of polars==0.20.4 scan_pyarrow_dataset() no longer pushes down partition filtering to pyarrow. In polars==0.20.3 it was working fine, as it was described in this issue: #13908

Polars==0.20.3 + scan_pyarrow_dataset:

Explain:
 WITH_COLUMNS:
 [1.alias("NEXT_FL")]
  RENAME
    UNIQUE BY None
      FAST_PROJECT: [PARTY_ID]

          PYTHON SCAN 
          PROJECT 3/4 COLUMNS
          SELECTION: (((pa.compute.field(\'CALENDAR_DATE\') >= \'2023-07-21\') & (pa.compute.field(\'CALENDAR_DATE\') <= \'2024-01-22\')) & (pa.compute.field(\'CREDIT_FL\') == \'Y\'))

Polars==0.20.6 + scan_pyarrow_dataset:

Explain:
 WITH_COLUMNS:
 [1.alias("NEXT_FL")]
  RENAME
    UNIQUE BY None
      FAST_PROJECT: [PARTY_ID]
        FILTER [(col("CALENDAR_DATE").is_between([String(2023-07-21), String(2024-01-22)])) & ([(col("CREDIT_FL")) == (String(Y))])] FROM
          PYTHON SCAN 
          PROJECT 3/4 COLUMNS

Expected behavior

Push down partition filtering to pyarrow.

Installed versions

-------Version info---------
Polars:               0.20.7
Index type:           UInt32
Platform:             Linux-4.18.0-372.71.1.el8_6.x86_64-x86_64-with-glibc2.28
Python:               3.9.13 (main, Oct 13 2022, 21:15:33) 
[GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2022.02.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.0
numpy:                1.23.5
openpyxl:             3.0.9
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           1.4.27
xlsx2csv:             <not installed>
xlsxwriter:           3.1.3
@lmocsi lmocsi added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 7, 2024
@taki-mekhalfa
Copy link
Contributor

you can try this in the meantime:

>>> print(
        tr.filter((pl.col('CALENDAR_DATE') >= (pl.lit('2023-07-21'))) &
                  (pl.col('CALENDAR_DATE') <= (pl.lit('2024-01-22')))).explain(optimized=True)
    )

PYTHON SCAN 
  PROJECT */4 COLUMNS
  SELECTION: ((pa.compute.field('CALENDAR_DATE') >= '2023-07-21') & (pa.compute.field('CALENDAR_DATE') <= '2024-01-22'))

@ritchie46
Copy link
Member

Can we get a bisect on this one? Which commit introduced the regression?

@deanm0000
Copy link
Collaborator

@imocsi did you see that 0.20.7 has the is_in fix for scan_parquet?

@lmocsi
Copy link
Author

lmocsi commented Feb 9, 2024

Yes, I did see it. Though, the codes I wrote use scan_pyarrow_dataset('myfolder'), so I'll wait for either:

@lmocsi
Copy link
Author

lmocsi commented Feb 20, 2024

Can we get a bisect on this one? Which commit introduced the regression?

My guess is this:
#11945

@lmocsi
Copy link
Author

lmocsi commented Feb 21, 2024

It seems, that in polars==0.20.3 the expressionpl.col('CALENDAR_DATE').is_between(a, b) was automatically rewritten to the pyarrow expression of pa.compute.field(\'CALENDAR_DATE\') >= \'a\') & (pa.compute.field(\'CALENDAR_DATE\') <= \'b\')).
Since polars==0.20.4 introduced .is_between() in Rust, chances are that this new .is_between() is not re-written to pyarrow like above, and since CALENDAR_DATE is the partitioning column, partition filtering is no longer used here.

@lmocsi lmocsi closed this as completed Aug 16, 2024
@ritchie46
Copy link
Member

@lmocsi is this resolved?

@lmocsi
Copy link
Author

lmocsi commented Aug 16, 2024

@lmocsi is this resolved?

Yes. As of polars==1.5.0 it is working fine (maybe with earlier versions as well).

@ritchie46
Copy link
Member

Great!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars
Projects
None yet
Development

No branches or pull requests

4 participants