
Parquet: Support predicate pushdown using column chunk statistics, dictionaries, etc #968

Closed
rcaudy opened this issue Aug 4, 2021 · 1 comment
Labels: 2023_triagedNoMilestone; feature request (New feature or request); parquet (Related to the Parquet integration)

Comments

@rcaudy
Member

rcaudy commented Aug 4, 2021

We should add an interface to ColumnSource like:

    /**
     * Result of a {@link #prefilter(Index, SelectFilter)}.
     */
    final class PrefilteringResult implements SafeCloseable {

        /**
         * The index of keys that are known to match the filter and need no further filtering.
         */
        public final ReadOnlyIndex included;
        /**
         * The index of keys that <em>may</em> match the filter and need further filtering.
         */
        public final ReadOnlyIndex possible;

        public PrefilteringResult(@NotNull final Index included, @NotNull final Index possible) {
            this.included = included;
            this.possible = possible;
        }

        @Override
        public void close() {
            included.close();
            possible.close();
        }
    }

    /**
     * Perform optional coarse filtering of this ColumnSource for all rows included in {@code sourceIndex}.
     *
     * @param sourceIndex The {@link Index} to consider
     * @param filter The {@link SelectFilter} to apply
     * @return A {@link PrefilteringResult}, or {@code null} if the entire {@code sourceIndex} should be treated as
     *         possible
     */
    default PrefilteringResult prefilter(@NotNull final Index sourceIndex, @NotNull final SelectFilter filter) {
        return null;
    }
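
To make the included/possible split concrete, here is a minimal, self-contained sketch of how min/max column chunk statistics could classify row ranges for a predicate of the form `value > threshold`. Every name in it (`StatsPrefilterSketch`, `ChunkStats`, `Result`, `prefilterGreaterThan`) is hypothetical and stands in for the real Index and SelectFilter types; it is an illustration under those assumptions, not the proposed implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in types; none of these names come from Deephaven or a Parquet library. */
final class StatsPrefilterSketch {

    /** Statistics for one column chunk covering rows [firstRow, lastRow]; assumes min and max are present. */
    static final class ChunkStats {
        final long firstRow;
        final long lastRow;
        final long min;
        final long max;
        final long nullCount;

        ChunkStats(long firstRow, long lastRow, long min, long max, long nullCount) {
            this.firstRow = firstRow;
            this.lastRow = lastRow;
            this.min = min;
            this.max = max;
            this.nullCount = nullCount;
        }
    }

    /** Row ranges stored as {first, last} pairs, playing the role of the two indexes. */
    static final class Result {
        final List<long[]> included = new ArrayList<>();
        final List<long[]> possible = new ArrayList<>();
    }

    /** Classify each chunk for the predicate {@code value > threshold}; nulls are assumed never to match. */
    static Result prefilterGreaterThan(List<ChunkStats> chunks, long threshold) {
        Result result = new Result();
        for (ChunkStats chunk : chunks) {
            long[] range = {chunk.firstRow, chunk.lastRow};
            if (chunk.max <= threshold) {
                continue; // no non-null value can match, so the whole chunk is excluded
            }
            if (chunk.min > threshold && chunk.nullCount == 0) {
                result.included.add(range); // every row matches; no further filtering needed
            } else {
                result.possible.add(range); // some rows may match; the full filter must run
            }
        }
        return result;
    }
}
```

Note that a chunk whose minimum exceeds the threshold is still only "possible" when it contains nulls, because min/max statistics describe non-null values only.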

Implementation notes:

  • This should then be integrated into QueryTable.where(...), and supported by RegionedColumnSource and ColumnRegion implementations.
  • We can follow the same iterator advance-oriented recursive pattern as in the implementation of SymbolTableSource found in RegionedColumnSourceObjectWithDictionary.
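
As a rough illustration of the first note, the following sketch shows how a where-style operation might consume a prefiltering result: rows in `included` are accepted as-is, only rows in `possible` go through the full filter, and a `null` result means every source row must be tested. `TreeSet<Long>` stands in for Index, and `WherePrefilterSketch`, `Prefiltered`, and `where` are hypothetical names, not the actual QueryTable code.

```java
import java.util.TreeSet;
import java.util.function.LongPredicate;

/** Hypothetical sketch; TreeSet<Long> plays the role of Index. */
final class WherePrefilterSketch {

    /** Mirrors the shape of the proposed PrefilteringResult. */
    static final class Prefiltered {
        final TreeSet<Long> included;
        final TreeSet<Long> possible;

        Prefiltered(TreeSet<Long> included, TreeSet<Long> possible) {
            this.included = included;
            this.possible = possible;
        }
    }

    /**
     * @param sourceRows  all row keys under consideration
     * @param prefiltered coarse result, or null if every source row is only "possible"
     * @param rowMatches  the full (expensive) per-row filter
     * @return the row keys that match the filter
     */
    static TreeSet<Long> where(TreeSet<Long> sourceRows, Prefiltered prefiltered, LongPredicate rowMatches) {
        TreeSet<Long> result = new TreeSet<>();
        TreeSet<Long> toTest = sourceRows;
        if (prefiltered != null) {
            result.addAll(prefiltered.included); // accepted without evaluating the filter
            toTest = prefiltered.possible;       // only these rows need the full filter
        }
        for (long row : toTest) {
            if (rowMatches.test(row)) {
                result.add(row);
            }
        }
        return result;
    }
}
```

Rows that appear in neither `included` nor `possible` are never touched, which is where the savings would come from.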

Design notes:

  • Some Parquet files contain incorrect statistics, so any statistics-based filtering must be optional and controllable via ParquetInstructions. String statistics in particular may be wrong in many files, and their use should be enabled separately.
  • Also note that dictionary-based filtering only applies to dictionary-encoded pages within a column chunk, and that plain pages must be examined separately.
  • Look into column indices and bloom filters as well.
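
The dictionary caveat above can be sketched as follows. A column chunk can be ruled out from its dictionary only when every data page is dictionary-encoded; if any plain pages exist, the chunk has to stay "possible". The `useDictionaries` flag is a hypothetical stand-in for whatever opt-out ParquetInstructions would expose, and `DictionaryPrefilterSketch`, `ColumnChunk`, and `Verdict` are illustrative names only. The sketch also assumes the filter never matches null, since nulls do not appear in a dictionary.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of dictionary-based coarse filtering for one column chunk. */
final class DictionaryPrefilterSketch {

    enum Verdict { EXCLUDED, POSSIBLE }

    static final class ColumnChunk {
        final List<String> dictionary;           // distinct values from the dictionary page, or null if absent
        final boolean allPagesDictionaryEncoded; // false when some data pages are plain-encoded

        ColumnChunk(List<String> dictionary, boolean allPagesDictionaryEncoded) {
            this.dictionary = dictionary;
            this.allPagesDictionaryEncoded = allPagesDictionaryEncoded;
        }
    }

    /** Assumes {@code filter} never matches null values. */
    static Verdict classify(ColumnChunk chunk, Predicate<String> filter, boolean useDictionaries) {
        if (!useDictionaries || chunk.dictionary == null || !chunk.allPagesDictionaryEncoded) {
            return Verdict.POSSIBLE; // cannot prove anything; plain pages must be examined separately
        }
        for (String value : chunk.dictionary) {
            if (filter.test(value)) {
                return Verdict.POSSIBLE; // at least one row may match
            }
        }
        return Verdict.EXCLUDED; // no dictionary value matches, so no row in the chunk can
    }
}
```

Falling back to `POSSIBLE` whenever the feature is disabled or the evidence is incomplete keeps the coarse pass safe: it can only skip work, never change results.
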
@malhotrashivam
Contributor
