feat(rust,py): add additional control to `write_parquet::statistics` parameter #16575
I think the CI is having a small problem.
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##             main   #16575      +/-   ##
==========================================
- Coverage   81.49%   81.46%   -0.03%
==========================================
  Files        1414     1416       +2
  Lines      185561   186837    +1276
  Branches     2997     3021      +24
==========================================
+ Hits       151219   152213     +994
- Misses      33826    34091     +265
- Partials      516      533      +17
CodSpeed Performance Report

Merging #16575 will not alter performance.
Adds additional control over which statistics are written into Parquet files through the `write_parquet` parameter `statistics`. It is now possible to specify `"full"` to also attempt to add the `distinct_count` statistic (currently only added for `Booleans`). It is also possible to give a `dict[str, bool]` to specify individual statistics `min`, `max`, `distinct_count` and `null_count`. Fixes pola-rs#16441
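The accepted forms of `statistics` described above can be sketched as a small normalization step. This is a minimal illustration, not the Polars implementation: `resolve_statistics` is a hypothetical helper, and two behaviors are assumptions not stated in the description: that `True` enables every statistic except `distinct_count`, and that keys missing from a dict are treated as disabled.

```python
from typing import Union

# Statistic names taken from the PR description.
STATISTIC_KEYS = ("min", "max", "distinct_count", "null_count")


def resolve_statistics(statistics: Union[bool, str, dict]) -> dict:
    """Hypothetical helper: expand a `statistics` argument into one flag per statistic."""
    if statistics is True:
        # Assumption: the default set leaves out distinct_count, since the
        # description says "full" is what additionally attempts it.
        return {"min": True, "max": True, "distinct_count": False, "null_count": True}
    if statistics is False:
        return dict.fromkeys(STATISTIC_KEYS, False)
    if statistics == "full":
        return dict.fromkeys(STATISTIC_KEYS, True)
    if isinstance(statistics, dict):
        unknown = set(statistics) - set(STATISTIC_KEYS)
        if unknown:
            raise ValueError(f"unknown statistics: {sorted(unknown)}")
        # Assumption: statistics missing from the dict are treated as disabled.
        return {key: bool(statistics.get(key, False)) for key in STATISTIC_KEYS}
    raise ValueError(f"unsupported statistics value: {statistics!r}")


# Per the description, the same values would be passed by a caller as, e.g.:
#   df.write_parquet("out.parquet", statistics="full")
#   df.write_parquet("out.parquet", statistics={"min": True, "max": True})
```

Resolving every input form to the same four flags up front keeps the writer itself free of type checks on the argument.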
.flatten()
.min_by(|x, y| ord_binary(x, y))
.map(|x| x.to_vec()),
max_value: options
Now that we depend on `polars-compute`, can we use the kernels directly here? They use SIMD :)
.map(|x| x.to_vec())
})
.flatten(),
min_value: options
idem
.flatten()
.min_by(|x, y| ord_binary(x, y))
.map(|x| x.to_vec()),
max_value: options
idem
.map(|x| x.to_vec()),
max_value: options
.max_value
.then(|| {
Not sure if we got a kernel for this one. 🤔
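For readers unfamiliar with the reviewed snippets, the reduction they perform (drop the missing values, then take the minimum or maximum under a byte-wise ordering) can be sketched as follows. This is an illustration only: `reduce_binary_min_max` is a hypothetical name, and it assumes `ord_binary` orders byte slices lexicographically by unsigned byte value, which is how Python compares `bytes`.

```python
from typing import Iterable, Optional, Tuple


def reduce_binary_min_max(
    values: Iterable[Optional[bytes]],
) -> Tuple[Optional[bytes], Optional[bytes]]:
    """Return (min, max) over the non-null values, or (None, None) if none exist."""
    # Dropping None mirrors `.flatten()` over optional values in the Rust snippet.
    present = [v for v in values if v is not None]
    if not present:
        return None, None
    # Python compares `bytes` lexicographically by unsigned byte value; this
    # stands in for `ord_binary` under the assumption stated above.
    return min(present), max(present)
```

A scalar reduction like this is what the review suggests replacing with the SIMD kernels from `polars-compute` where such kernels exist.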