perf: Fast decision for Parquet dictionary encoding #19256
Conversation
This PR adds two things:

1. A HyperLogLog sketch to estimate the cardinality of an array. If the estimated cardinality is too high, no group-by has to be done. This speeds up Parquet writing by ~2x for high-cardinality data.
2. An extension of the fast path for integers: if the min and the max are close enough, a bitmask is created to determine the cardinality exactly. If the bitmask shows the cardinality is low enough, or that it is too high, the HyperLogLog path can be skipped. This can also lead to more than 2x improvements while writing.
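As a rough illustration of the integer fast path in point 2, here is a minimal, self-contained sketch. The function name and the threshold constant are hypothetical and not the PR's actual code: when `max - min` is small enough, the distinct values can be counted exactly with a bitmask, and a `None` result signals that the caller should fall back to the HyperLogLog estimate.

```rust
// Assumed threshold for illustration only; the PR's real cutoff differs.
const MAX_BITMASK_RANGE: u64 = 1 << 16;

/// Illustrative sketch of the integer fast path (hypothetical name, not the
/// PR's actual code). When `max - min` is small enough, count the distinct
/// values exactly with a bitmask; return `None` when the range is too wide,
/// signalling that the caller should fall back to the HyperLogLog estimate.
fn bitmask_cardinality(values: &[i64]) -> Option<usize> {
    let first = *values.first()?;
    let (min, max) = values
        .iter()
        .fold((first, first), |(lo, hi), &v| (lo.min(v), hi.max(v)));
    if max.abs_diff(min) >= MAX_BITMASK_RANGE {
        return None; // range too wide: skip the bitmask, estimate instead
    }

    // One entry per possible value in [min, max]; `true` means "seen".
    let mut seen = vec![false; max.abs_diff(min) as usize + 1];
    let mut distinct = 0usize;
    for &v in values {
        let idx = v.abs_diff(min) as usize;
        if !seen[idx] {
            seen[idx] = true;
            distinct += 1;
        }
    }
    Some(distinct)
}
```

A caller would then only build the HyperLogLog sketch when this returns `None`; the exact thresholds and how the count feeds into the dictionary-encoding decision are the PR's own heuristics.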
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main   #19256      +/-   ##
==========================================
+ Coverage   79.68%   79.97%   +0.29%
==========================================
  Files        1532     1529       -3
  Lines      209211   209848     +637
  Branches     2416     2416
==========================================
+ Hits       166710   167828    +1118
+ Misses      41953    41471     -482
- Partials      548      549       +1
==========================================
```

☔ View full report in Codecov by Sentry.
```rust
for v in array.iter() {
    // Missing values fall back to the type's default, so nulls contribute
    // at most one extra entry to the cardinality estimate.
    let v = v.copied().unwrap_or_default();
    hll.add(&v.to_total_ord());
}
```
question: Not sure how big these arrays are, but if they are full column length, does it make sense to partition the work of computing the cardinality across rows, producing a partial estimate in each thread and then using HyperLogLog::merge to reduce across threads?
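For reference, a minimal sketch of what such a split-and-merge could look like, using plain `std::thread::scope` and a deliberately simplified HyperLogLog (fixed register count, raw estimator only, no bias or small-range corrections); none of the names here are Polars' actual `HyperLogLog` API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const B: u32 = 10;           // 2^10 = 1024 registers (illustrative choice)
const M: usize = 1 << B;

/// Simplified HyperLogLog: raw estimator only, so its estimates are rough.
struct Hll {
    registers: [u8; M],
}

impl Hll {
    fn new() -> Self {
        Self { registers: [0; M] }
    }

    fn add<T: Hash>(&mut self, value: &T) {
        let mut h = DefaultHasher::new();
        value.hash(&mut h);
        let hash = h.finish();
        // Top B bits select a register; the rank of the remaining bits
        // (position of the first set bit) updates it.
        let idx = (hash >> (64 - B)) as usize;
        let rank = ((hash << B).leading_zeros() + 1).min(64 - B + 1) as u8;
        self.registers[idx] = self.registers[idx].max(rank);
    }

    /// Merging two sketches is an elementwise max over the registers, which
    /// is what makes per-thread partial estimates composable.
    fn merge(&mut self, other: &Hll) {
        for (a, b) in self.registers.iter_mut().zip(&other.registers) {
            *a = (*a).max(*b);
        }
    }

    fn estimate(&self) -> f64 {
        let alpha = 0.7213 / (1.0 + 1.079 / M as f64);
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        alpha * (M as f64) * (M as f64) / sum
    }
}

/// Split `values` into chunks, build a partial sketch per thread, then
/// reduce the partial sketches into a single estimate.
fn parallel_estimate(values: &[u64], n_threads: usize) -> f64 {
    let chunk_len = values.len().div_ceil(n_threads.max(1)).max(1);

    let partials: Vec<Hll> = std::thread::scope(|scope| {
        let handles: Vec<_> = values
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || {
                    let mut hll = Hll::new();
                    for v in chunk {
                        hll.add(v);
                    }
                    hll
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    let mut combined = Hll::new();
    for partial in &partials {
        combined.merge(partial);
    }
    combined.estimate()
}

fn main() {
    let values: Vec<u64> = (0..1_000_000u64).map(|i| i % 50_000).collect();
    println!("estimated distinct values: {:.0}", parallel_estimate(&values, 8));
}
```

Because the merge is just an elementwise max over registers, the per-chunk sketches can be combined in any order.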
These arrays are as large as a Parquet row group.
I don't really see much benefit in parallelizing here, because we already parallelize over columns and row groups. So unless you have a short dataframe with few columns, you probably won't see any sleeping threads.
Ah, sure.