Performance degradation in scan_parquet() vs. scan_pyarrow_dataset() #13908
Can you make a reproducible example? There is nothing here that we can test. |
I guess I cannot create / upload several GBs of partitioned parquet files here... |
Run the examples with the trace log on please. Which column(s) are part of the hive partition? |
Partitioned only on calendar_date. |
You can create dummy data in your script. And you likely don't need several GB to you show the effect. Please make an effort to produce something we can reproduce if you want it fixed. |
Before your import polars line, run these, then do everything else.
You probably don't need the Rust one here, but just for reference. |
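The env-var lines were stripped from this comment; a plausible reconstruction, assuming the standard polars logging switches, is:

```python
import os

# These must be set before polars is imported to take effect.
os.environ["POLARS_VERBOSE"] = "1"  # polars' own trace/verbose output
os.environ["RUST_LOG"] = "debug"    # the Rust-side log; usually not needed

# ...then do everything else, starting with:
# import polars as pl
```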
With logging turned on, I cannot see more details: Traceback (most recent call last): |
Are you saying that before it was just slow but now it's generating an error? |
Scan_parquet() was slow in 0.20.3 and is slow in 0.20.4/0.20.5 |
Weird thing is that with synthetic data it is vice versa:
|
@lmocsi can you share the plan with .explain() with older and new Polars version? |
With a synthetic dataset (created for 100,000 customers, see here: apache/arrow#39768):
df.shape: (100_000, 2)
Polars==0.20.6 + scan_parquet: df.shape: (100_000, 2)
Polars==0.20.3 + scan_pyarrow_dataset: df.shape: (100_000, 2)
Polars==0.20.3 + scan_parquet: df.shape: (100_000, 2) |
With the real dataset:
df.shape: -
Polars==0.20.6 + scan_parquet: df.shape: (1_075_661, 2)
Polars==0.20.3 + scan_pyarrow_dataset: df.shape: (1_075_661, 2)
Polars==0.20.3 + scan_parquet: <no plan here, since df.explain() has been running for more than 1h 43 minutes!> df.shape: (1_075_661, 2)
Lots of warnings here: Then lots of lines like this: Then mixed lines like: And finally: |
@lmocsi it looks like the filter in scan_pyarrow_dataset is not pushed down to pyarrow, the selection should show the pyarrow.compute expressions in v0.20.6 but it doesn't. @ritchie46 seems like there is no more filter pushdown to pyarrow dataset and the filter only happens in polars after everything is loaded into memory |
Is your calendar date stored as a string and not date? |
Calendar date is returned as a string column, since that is the partition column. |
If you do scan_parquet on just the one file, does it give you a column with 2019-01-01 00%3A00%3A00 or 2019-01-01 00:00:00? Try doing something like:
|
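For context on what is being checked here: the `%3A` in that value is just a percent-encoded colon, so the two spellings are the same timestamp. Decoding with the standard library makes that visible (the directory name below is an illustrative example, not taken from the issue):

```python
from urllib.parse import unquote

# A hive partition directory name as it might appear on disk
raw = "CALENDAR_DATE=2019-01-01 00%3A00%3A00"
decoded = unquote(raw)
print(decoded)  # CALENDAR_DATE=2019-01-01 00:00:00
```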
<polars==0.20.6> This:
Gives me: But the actual directory looks like this: The above join statement kills the kernel somewhere above 17 GB of RAM usage (in a 32 GB environment). |
What about...
Ultimately, I think there's a good amount of potential improvement to be made in polars' hive-partitioning optimization. While the current state isn't as robust as pyarrow's hive dataset scanner, I think the effort is going to go into making those improvements rather than strengthening the pyarrow linkages. As such, assuming the last snippet doesn't work, I think you probably want to just do more of the filtering directly in pyarrow. So something like
|
The .to_series().to_list() version runs fast using scan_parquet():
Max memory occupation: 2.7 GB. But it is cumbersome to filter like that... :( |
Yeah, it only seems to know how to skip files with exact matches; otherwise it wants to read the file. It also only works if all the files are 00:00:00; if you've got some 00:12:00, etc., then it'd miss those entirely. Another thing you can do, which isn't any less code but is more resilient:
In this way you do a first scan just to get the file list, then you have to use |
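A stdlib-only sketch of that two-step idea: decode the hive partition value out of each path yourself, keep only the files that pass the filter, then scan just those. The directory names, column name, and cutoff below are illustrative assumptions:

```python
import glob
import os
import tempfile
from urllib.parse import unquote

# Fake a partitioned layout with empty files; only the paths matter here.
# Note one value is percent-encoded, as in the issue.
root = tempfile.mkdtemp()
for d in ["2023-06-30 00%3A00%3A00", "2023-07-01 00%3A00%3A00"]:
    part = os.path.join(root, f"CALENDAR_DATE={d}")
    os.makedirs(part)
    open(os.path.join(part, "0.parquet"), "w").close()

keep = []
for path in glob.glob(os.path.join(root, "CALENDAR_DATE=*", "*.parquet")):
    # decode "CALENDAR_DATE=..." from the parent directory name
    value = unquote(os.path.basename(os.path.dirname(path)).split("=", 1)[1])
    if value > "2023-06-30 23:59:59":  # apply the filter to decoded values
        keep.append(path)

print(len(keep))  # 1
# then e.g.: pl.scan_parquet(keep, hive_partitioning=True)
```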
@deanm0000 these are all workarounds though. Polars should be able to parse those characters properly |
agreed, I'm not saying otherwise. |
I think what it needs would go here: polars/crates/polars-lazy/src/physical_plan/expressions/apply.rs, lines 489 to 528 at 608d6ac
We need another match on |
@ritchie46 see my previous comment. I think that's the fix but I don't know how to implement it. |
Checks
Reproducible example
Log output
Issue description
The above code with scan_pyarrow_dataset() runs in about 6 seconds.
The same code with scan_parquet() runs in 1 min 52 secs!!!
But both of them do this in less than 1.5 GB of RAM.
If I upgrade polars to the current 0.20.5 version, then the scan_pyarrow_dataset() way runs out of 32 GB of memory, as if it were not able to filter on the partition column CALENDAR_DATE.
The scan_parquet() version runs for roughly the same time as it does on 0.20.3 polars.
Expected behavior
Scan_parquet() should complete in about the same time as scan_pyarrow_dataset() does.
Scan_pyarrow_dataset() should be able to filter on the partition column on 0.20.5 polars as well.
Installed versions