Utility trait for stats-based skipping logic (#357)
Parquet footer stats allow data skipping, very similar to Delta file
stats. Except parquet isn't quite as convenient to work with and
arrow-parquet doesn't even try to help (it can't, because arrow-compute
expressions are opaque, so there's no way to traverse and rewrite them
into stats-based skipping predicates).

We implement row group skipping support by traversing the same push-down
predicate that delta-kernel already uses to extract a Delta file
skipping predicate. But instead of rewriting the expression, we evaluate
it bottom-up (no-copy, O(n) work where n is the number of nodes in the
expression).
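
The bottom-up evaluation described above can be sketched roughly as
follows. This is only an illustration under simplifying assumptions, not
the code added by this commit: `Pred`, `Stats`, `eval` and `can_skip` are
hypothetical names, columns are limited to `i64`, and the stats are a plain
map of per-column (min, max) pairs rather than real parquet footer
metadata. Each node is visited once, and `None` means "stats missing,
cannot decide", which never allows skipping.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified predicate tree (not delta-kernel's expression type).
#[derive(Debug, Clone)]
enum Pred {
    Lt(String, i64), // column < literal
    Gt(String, i64), // column > literal
    Eq(String, i64), // column = literal
    And(Box<Pred>, Box<Pred>),
    Or(Box<Pred>, Box<Pred>),
}

/// Hypothetical row group stats: column name -> (min, max).
type Stats = HashMap<String, (i64, i64)>;

/// Evaluates the predicate directly against the stats, bottom-up.
/// Some(false): no row in the group can match.
/// Some(true):  some row might match.
/// None:        stats are missing, so nothing can be concluded.
fn eval(pred: &Pred, stats: &Stats) -> Option<bool> {
    match pred {
        // `col < v` can only match if the smallest value is below v.
        Pred::Lt(col, v) => stats.get(col).map(|(min, _)| min < v),
        // `col > v` can only match if the largest value is above v.
        Pred::Gt(col, v) => stats.get(col).map(|(_, max)| max > v),
        // `col = v` can only match if v lies within [min, max].
        Pred::Eq(col, v) => stats.get(col).map(|(min, max)| min <= v && v <= max),
        // AND is false as soon as either side is provably false.
        Pred::And(a, b) => match (eval(a, stats), eval(b, stats)) {
            (Some(false), _) | (_, Some(false)) => Some(false),
            (Some(true), Some(true)) => Some(true),
            _ => None,
        },
        // OR is false only if both sides are provably false.
        Pred::Or(a, b) => match (eval(a, stats), eval(b, stats)) {
            (Some(true), _) | (_, Some(true)) => Some(true),
            (Some(false), Some(false)) => Some(false),
            _ => None,
        },
    }
}

/// A row group may be skipped only when the predicate is provably false.
fn can_skip(pred: &Pred, stats: &Stats) -> bool {
    eval(pred, stats) == Some(false)
}
```

In this sketch, a row group with `x` in [10, 20] and no stats for `y` can
be skipped for `x > 25 AND y < 0`, because one provably false operand
settles the AND. It cannot be skipped for `x > 25 OR y < 0`, because the
missing `y` stats leave the result unknown.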

This PR does not attempt to actually incorporate the new skipping logic
into the default reader. That (plus testing the integration) should
be a follow-up PR.
scovich authored Oct 3, 2024
1 parent c81da02 commit 092ee67
Showing 9 changed files with 1,415 additions and 12 deletions.
3 changes: 3 additions & 0 deletions kernel/src/engine/mod.rs
@@ -11,6 +11,9 @@ pub mod arrow_expression;
#[cfg(any(feature = "default-engine", feature = "sync-engine"))]
pub mod arrow_data;

#[cfg(any(feature = "default-engine", feature = "sync-engine"))]
pub mod parquet_stats_skipping;

#[cfg(any(feature = "default-engine", feature = "sync-engine"))]
pub(crate) mod arrow_get_data;

406 changes: 406 additions & 0 deletions kernel/src/engine/parquet_stats_skipping.rs

Large diffs are not rendered by default.
