perf: Added optimizer rules for is_null().all() and similar expressions to use null_count() #18359
…count_optimization
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18359      +/-   ##
==========================================
+ Coverage   79.84%   79.86%   +0.01%
==========================================
  Files        1497     1497
  Lines      201828   201959     +131
  Branches     2867     2867
==========================================
+ Hits       161141   161286     +145
+ Misses      40141    40127      -14
  Partials      546      546
☔ View full report in Codecov by Sentry.
This PR looks good to me in general, although I am not the person who decides whether it gets merged. Two things that stuck out to me.
In general, very nice PR.
Nice work! I've left a few comments.
I addressed all of @nameexhaustion's comments, so I would be glad for another review. Regarding @coastalwhite comment about Thanks.
I also addressed @coastalwhite's comment; it does indeed look nicer now.
LGTM - will need final approval from @ritchie46 to merge. There are a few benchmarks in the comments from #17605 (comment) to take note of: in some edge cases this performed slightly slower, but in most cases this should lead to good speedups.
…count_optimization
Thanks, I rebased against main since I saw that the files I edited had changed.
Edit: Final performance test in release mode:
import polars as pl
import timeit
lf = pl.select(pl.when(pl.int_range(1_000_000) % 8 < 8).then(1)).lazy()
q = lf.select(pl.first().is_null().all()).collect
t = timeit.timeit(q, number=10_000)
print("is_null().all() :: ", t)
q = lf.select(pl.first().null_count() == pl.len()).collect
t = timeit.timeit(q, number=10_000)
print(f"null_count() == len() :: {t}\n")
q = lf.select(pl.first().is_null().any()).collect
t = timeit.timeit(q, number=10_000)
print("is_null().any() :: ", t)
q = lf.select(pl.first().null_count() > 0).collect
t = timeit.timeit(q, number=10_000)
print(f"null_count() > 0 :: {t}\n")
q = lf.select(pl.first().is_not_null().all()).collect
t = timeit.timeit(q, number=10_000)
print("is_not_null().all() :: ", t)
q = lf.select(pl.first().null_count() == 0).collect
t = timeit.timeit(q, number=10_000)
print(f"null_count() == 0 :: {t}\n")
q = lf.select(pl.first().is_not_null().any()).collect
t = timeit.timeit(q, number=10_000)
print("is_not_null().any() :: ", t)
q = lf.select(pl.len() > pl.first().null_count()).collect
t = timeit.timeit(q, number=10_000)
print(f"len() > null_count() :: {t}\n")
q = lf.select(pl.first().is_null().sum()).collect
t = timeit.timeit(q, number=10_000)
print("is_null().sum() :: ", t)
q = lf.select(pl.first().null_count()).collect
t = timeit.timeit(q, number=10_000)
print(f"null_count() :: {t}\n")
q = lf.select(pl.first().is_not_null().sum()).collect
t = timeit.timeit(q, number=10_000)
print("is_not_null().sum() :: ", t)
q = lf.select(pl.len() - pl.first().null_count()).collect
t = timeit.timeit(q, number=10_000)
print(f"len() - null_count() :: {t}\n")
q = lf.select(pl.first().drop_nulls().len()).collect
t = timeit.timeit(q, number=10_000)
print("drop_nulls().len() :: ", t)
q = lf.select(pl.len() - pl.first().null_count()).collect
t = timeit.timeit(q, number=10_000)
print(f"len() - null_count() :: {t}\n")
q = lf.select(pl.first().drop_nulls().count()).collect
t = timeit.timeit(q, number=10_000)
print("drop_nulls().count() :: ", t)
q = lf.select(pl.len() - pl.first().null_count()).collect
t = timeit.timeit(q, number=10_000)
print("len() - null_count() :: ", t)
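Each benchmark pair above compares an expression with the null_count()-based form it is timed against. As a rough sanity check that the two forms in each pair agree, here is a minimal pure-Python sketch that models a column as a list with None for nulls. The helper names `null_count` and `check_equivalences` are hypothetical and only mirror the expression names from the script; this does not call Polars.

```python
from typing import Dict, Optional, Sequence


def null_count(values: Sequence[Optional[int]]) -> int:
    """Count the None entries (stand-in for a column's null count)."""
    return sum(v is None for v in values)


def check_equivalences(values: Sequence[Optional[int]]) -> Dict[str, bool]:
    """For each benchmarked pair, report whether both forms give the same result."""
    n = len(values)
    nc = null_count(values)
    is_null = [v is None for v in values]
    is_not_null = [v is not None for v in values]
    non_null = [v for v in values if v is not None]
    return {
        "is_null().all() vs null_count() == len()": all(is_null) == (nc == n),
        "is_null().any() vs null_count() > 0": any(is_null) == (nc > 0),
        "is_not_null().all() vs null_count() == 0": all(is_not_null) == (nc == 0),
        "is_not_null().any() vs len() > null_count()": any(is_not_null) == (n > nc),
        "is_null().sum() vs null_count()": sum(is_null) == nc,
        "is_not_null().sum() vs len() - null_count()": sum(is_not_null) == n - nc,
        "drop_nulls().len() vs len() - null_count()": len(non_null) == n - nc,
    }


if __name__ == "__main__":
    # Hypothetical sample columns, including the empty and all-null edge cases.
    for column in ([], [None], [1, None, 3], [1, 2, 3], [None, None]):
        results = check_equivalences(column)
        print(column, all(results.values()))
```

The empty column is the case worth checking by hand: all() over nothing is true and any() over nothing is false, which matches null_count() == len() and null_count() > 0 when both counts are zero.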
Fixes #17605.
Added rules:
Thanks.
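For readers unfamiliar with expression rewriting, the following toy sketch shows the general shape of such a rule in Python. It is not the PR's implementation: the tuple-based expression encoding, the `RULES` table, and the `rewrite` function are all assumptions made for illustration, and the pairs are taken from the benchmark script in this thread rather than from the PR's rule list.

```python
# Toy expression trees as nested tuples, e.g. ("all", ("is_null", ("col", "a"))).
# This encoding is hypothetical and only serves the illustration.


def _len():
    return ("len",)


def _nc(col):
    return ("null_count", col)


# (outer op, inner op) -> builder of the replacement expression for `col`.
RULES = {
    ("all", "is_null"): lambda col: ("eq", _nc(col), _len()),
    ("any", "is_null"): lambda col: ("gt", _nc(col), ("lit", 0)),
    ("all", "is_not_null"): lambda col: ("eq", _nc(col), ("lit", 0)),
    ("any", "is_not_null"): lambda col: ("gt", _len(), _nc(col)),
    ("sum", "is_null"): lambda col: _nc(col),
    ("sum", "is_not_null"): lambda col: ("sub", _len(), _nc(col)),
    ("len", "drop_nulls"): lambda col: ("sub", _len(), _nc(col)),
    ("count", "drop_nulls"): lambda col: ("sub", _len(), _nc(col)),
}


def rewrite(expr):
    """Rewrite children first, then apply a matching rule at this node."""
    if not isinstance(expr, tuple):
        return expr
    expr = (expr[0],) + tuple(rewrite(child) for child in expr[1:])
    if len(expr) == 2 and isinstance(expr[1], tuple) and len(expr[1]) == 2:
        rule = RULES.get((expr[0], expr[1][0]))
        if rule is not None:
            return rule(expr[1][1])
    return expr


if __name__ == "__main__":
    print(rewrite(("all", ("is_null", ("col", "a")))))
```

Expressions that match no entry in the table are returned unchanged, so the rewrite only ever affects the specific aggregate-over-null-mask patterns it lists.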