perf: prune parquet row groups when is_not_null is used #14260

Merged (1 commit) on Feb 4, 2024

Contributor

@taki-mekhalfa commented Feb 4, 2024

Filtering on non-null values is more common in the wild than filtering on null values.

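I changed `BatchStats` to include the number of rows in the row group and compare it with the null-count statistic, if available. When the two are equal, every value in the row group is null, so an `is_not_null` filter can skip it.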
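
As a rough illustration of the comparison described above, here is a minimal Python sketch. It is not the actual Rust implementation: `RowGroupStats`, `can_skip_is_not_null`, and `can_skip_is_null` are hypothetical names, and the only assumption is that each row group exposes a row count and an optional null count for the column.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for the per-column statistics of one row group."""

    num_rows: int
    # None models a file that does not carry a null-count statistic.
    null_count: Optional[int]


def can_skip_is_not_null(stats: RowGroupStats) -> bool:
    """Return True when `col.is_not_null()` cannot match any row."""
    if stats.null_count is None:
        return False  # no statistic available, so the row group must be read
    return stats.null_count == stats.num_rows


def can_skip_is_null(stats: RowGroupStats) -> bool:
    """Return True when `col.is_null()` cannot match any row."""
    if stats.null_count is None:
        return False
    return stats.null_count == 0


# A row group of 27 rows where every value is null, as in the example dataset.
all_null = RowGroupStats(num_rows=27, null_count=27)
print(can_skip_is_not_null(all_null))  # True
```

Without the row count, the null count alone cannot tell "all values are null" apart from "some values are null", which is why the row count has to travel with the statistics.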

I also simplify `not(is_null)` to `is_not_null`, since I see this pattern a lot; this lets the pruning apply in those cases too. (Likewise, `not(is_not_null)` is transformed to `is_null`.)
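
The rewrite rule can be sketched on a toy expression tree. This is only an illustration in Python: the real change lives in the Rust optimizer and operates on `AExpr` nodes, while the `Col`, `IsNull`, `IsNotNull`, and `Not` classes and the `simplify` function below are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class IsNull:
    arg: "Expr"


@dataclass(frozen=True)
class IsNotNull:
    arg: "Expr"


@dataclass(frozen=True)
class Not:
    arg: "Expr"


Expr = Union[Col, IsNull, IsNotNull, Not]


def simplify(expr: Expr) -> Expr:
    """Rewrite not(is_null) to is_not_null and not(is_not_null) to is_null."""
    if isinstance(expr, Not):
        inner = simplify(expr.arg)
        if isinstance(inner, IsNull):
            return IsNotNull(inner.arg)
        if isinstance(inner, IsNotNull):
            return IsNull(inner.arg)
        return Not(inner)
    if isinstance(expr, (IsNull, IsNotNull)):
        # Rebuild the same node type around the simplified child.
        return type(expr)(simplify(expr.arg))
    return expr


# Toy equivalent of `~pl.col('nutri_score').is_null()`.
print(simplify(Not(IsNull(Col("nutri_score")))))
```

After the rewrite, a predicate written with `~...is_null()` has the same shape as one written with `is_not_null()`, so a single pruning rule covers both spellings.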

>>> pl.scan_parquet('examples/datasets/null_nutriscore.parquet').collect()
shape: (27, 5)
┌──────────┬──────────┬────────┬──────────┬─────────────┐
│ category ┆ calories ┆ fats_g ┆ sugars_g ┆ nutri_score │
│ ---      ┆ ---      ┆ ---    ┆ ---      ┆ ---         │
│ str      ┆ i64      ┆ f64    ┆ i64      ┆ str         │
╞══════════╪══════════╪════════╪══════════╪═════════════╡
│ seafood  ┆ 117      ┆ 9.0    ┆ 0        ┆ null        │
│ seafood  ┆ 201      ┆ 6.0    ┆ 1        ┆ null        │
│ fruit    ┆ 59       ┆ 1.0    ┆ 14       ┆ null        │
│ meat     ┆ 97       ┆ 6.0    ┆ 0        ┆ null        │
│ meat     ┆ 124      ┆ 12.0   ┆ 1        ┆ null        │
│ …        ┆ …        ┆ …      ┆ …        ┆ …           │
│ seafood  ┆ 155      ┆ 5.0    ┆ 0        ┆ null        │
│ fruit    ┆ 133      ┆ 0.0    ┆ 27       ┆ null        │
│ seafood  ┆ 205      ┆ 9.0    ┆ 0        ┆ null        │
│ fruit    ┆ 72       ┆ 4.5    ┆ 7        ┆ null        │
│ fruit    ┆ 60       ┆ 1.0    ┆ 7        ┆ null        │
└──────────┴──────────┴────────┴──────────┴─────────────┘

>>> pl.scan_parquet('examples/datasets/null_nutriscore.parquet').filter(pl.col('nutri_score').is_not_null()).collect()
parquet file can be skipped, the statistics were sufficient to apply the predicate.
shape: (0, 5)
┌──────────┬──────────┬────────┬──────────┬─────────────┐
│ category ┆ calories ┆ fats_g ┆ sugars_g ┆ nutri_score │
│ ---      ┆ ---      ┆ ---    ┆ ---      ┆ ---         │
│ str      ┆ i64      ┆ f64    ┆ i64      ┆ str         │
╞══════════╪══════════╪════════╪══════════╪═════════════╡
└──────────┴──────────┴────────┴──────────┴─────────────┘

>>> pl.scan_parquet('examples/datasets/null_nutriscore.parquet').filter(~pl.col('nutri_score').is_null()).collect()
parquet file can be skipped, the statistics were sufficient to apply the predicate.
shape: (0, 5)
┌──────────┬──────────┬────────┬──────────┬─────────────┐
│ category ┆ calories ┆ fats_g ┆ sugars_g ┆ nutri_score │
│ ---      ┆ ---      ┆ ---    ┆ ---      ┆ ---         │
│ str      ┆ i64      ┆ f64    ┆ i64      ┆ str         │
╞══════════╪══════════╪════════╪══════════╪═════════════╡
└──────────┴──────────┴────────┴──────────┴─────────────┘

edit: typos and layout

@github-actions bot added the labels performance (Performance issues or improvements), python (Related to Python Polars), and rust (Related to Rust Polars) on Feb 4, 2024
Member

@ritchie46 left a comment

Good addition. Thank you @taki-mekhalfa

@@ -81,6 +81,26 @@ pub(super) fn optimize_functions(
                AExpr::Literal(LiteralValue::Boolean(b)) => {
                    Some(AExpr::Literal(LiteralValue::Boolean(!b)))
                },
                // not(x.is_null) => x.is_not_null
                AExpr::Function {

Yeap, good one. 👍

@ritchie46 merged commit 0f72634 into pola-rs:main Feb 4, 2024
18 checks passed
@taki-mekhalfa deleted the perf/prune_rg_is_not_null branch February 5, 2024 09:08