
Enable reading Parquet's bloomfilter statistics for hive connector #14428

Merged: 4 commits into trinodb:master on Jan 6, 2023

Conversation

@leetcode-1533 (Contributor) commented Oct 3, 2022

Description

Enable the Hive connector to read Parquet files' bloom filter statistics. Related RB: apache/parquet-java@806037c#diff-8da24c84aef62e6e836d073938f7843d289785baaeddf446f3afeae6d4ef4b10.

This feature can be controlled via the Hive catalog configuration property `parquet.use-bloom-filter`.
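As an illustration of the two knobs mentioned here and in the release note below (the `hive.` session prefix assumes a catalog named `hive`):

```sql
-- Catalog level: set parquet.use-bloom-filter=true in the Hive catalog
-- properties file (e.g. etc/catalog/hive.properties).
-- Session level: toggle per session, for example to disable bloom filter reads:
SET SESSION hive.parquet_use_bloom_filter = false;
```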

Implementation limitations:

  1. Limited type support: more Trino types will be added once this is merged.
  2. Only the Hive connector is supported; Hudi, Delta Lake, and Iceberg can be supported after this is merged.

Non-technical explanation

Enables the Hive connector to read Parquet files' bloom filter statistics, which can speed up queries with selective filters. For more details: https://github.com/apache/parquet-format/blob/master/BloomFilter.md.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Improve performance of queries with filters when bloom filter indexes are present in parquet files. Usage of bloom filters from parquet files can be disabled using the catalog configuration property `parquet.use-bloom-filter` or the catalog session property `parquet_use_bloom_filter`. ({issue}`9471`)

@cla-bot cla-bot bot added the cla-signed label Oct 3, 2022
@leetcode-1533 leetcode-1533 marked this pull request as ready for review October 3, 2022 05:29
@leetcode-1533 (Contributor, Author)

@raunaqmorarka can you help take a look? Please also review: https://gist.github.com/leetcode-1533/2fb1cf64d386c5bef4c26f4f37c9c714.

@leetcode-1533 leetcode-1533 force-pushed the ParquetBloomFilter branch 2 times, most recently from a00a0ee to ef04576 Compare October 3, 2022 07:33
@mosabua (Member) commented Oct 8, 2022

@claudiusli @cpard and @findepi should probably discuss this.

@findinpath (Contributor)

Can we eventually have a product test where Spark writes into a Hive table with a bloom filter enabled for a specific column and Trino reads a specific value from the bloom-filtered column?

(just as a sample - you can create a custom table for the purpose of the test)
https://github.com/apache/spark/blob/778acd411e31839429b3c6964b4082fcedf69342/docs/sql-ref-syntax-ddl-create-table-datasource.md?plain=1

CREATE TABLE student_parquet(id INT, name STRING, age INT) USING PARQUET
    OPTIONS (
        'parquet.bloom.filter.enabled'='true',
        'parquet.bloom.filter.enabled#age'='false'
        );

Use io.trino.tests.product.hive.TestHiveSparkCompatibility to get you started.

In io.trino.tests.product.launcher.env.environment.EnvSinglenodeSparkHive#createSpark, if you experience any issues, switch from spark3.0-iceberg to spark3-iceberg, because the image has been renamed in the meantime (https://github.com/trinodb/docker-images/blob/master/testing/spark3-iceberg/Dockerfile).
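For instance, the end-to-end flow could look roughly like this (a sketch building on the student_parquet table above; the data values are illustrative):

```sql
-- Spark side: populate the table so bloom filters are written for `id` and `name`
-- (`age` has the filter disabled in the CREATE TABLE above)
INSERT INTO student_parquet VALUES (1, 'Alice', 20), (2, 'Bob', 21);

-- Trino side: a point lookup on a bloom-filtered column, which the reader
-- can use the Parquet bloom filter to prune row groups for
SELECT name FROM student_parquet WHERE id = 2;
```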

@leetcode-1533 (Contributor, Author)

> Can we eventually have a product test where Spark writes into a Hive table with a bloom filter enabled for a specific column and Trino reads a specific value from the bloom-filtered column? [full suggestion, including the CREATE TABLE sample, quoted above]

Hi, I confirmed that the Spark release used in the test can generate Parquet bloom filter files. I will update the PR shortly.

@raunaqmorarka (Member) left a comment

Please check whether the code for the resolved comments was pushed.

@leetcode-1533 (Contributor, Author) commented Oct 25, 2022

Major changes:

I didn't implement the suggestion: "We can derive a length for the read based on whether the bloom filter offset is less than the start of the first row group or greater than the end of the last row group. In the former case we can read up to the start of the selected row groups; in the latter, up to the start of the page indexes or the footer (whichever comes first). So this is doable without resorting to streaming read APIs."

This is because the first case (bloom filters stored before the start of every row group) has no writer implementation, so the code cannot test the correctness of the offset calculation. Instead, the implementation makes no assumption about where the offset points; it relies on the relatively small size of the bloom filter header and reads the headers with a fixed amount of memory.

Please let me know if this is an issue.
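A minimal sketch of that fixed-memory approach (illustrative only, not the PR's actual code; `ParquetDataSource#readFully`, the header decoding placeholder, and the 1024-byte header bound are all assumptions):

```java
import com.google.common.io.CountingInputStream;

import java.io.ByteArrayInputStream;
import java.io.IOException;

class BloomFilterReadSketch
{
    // Assumed upper bound on a serialized bloom filter header
    private static final int MAX_HEADER_SIZE = 1024;

    interface ParquetDataSource
    {
        byte[] readFully(long offset, int length) throws IOException;
    }

    byte[] readBitset(ParquetDataSource source, long bloomFilterOffset) throws IOException
    {
        // Read a fixed-size prefix at the offset: large enough for any header,
        // with no assumption about where in the file the offset points
        byte[] prefix = source.readFully(bloomFilterOffset, MAX_HEADER_SIZE);
        CountingInputStream header = new CountingInputStream(new ByteArrayInputStream(prefix));
        int numBytes = decodeHeaderNumBytes(header);
        // The decoded header reports the exact bitset length, so the second
        // read is bounded as well
        return source.readFully(bloomFilterOffset + header.getCount(), numBytes);
    }

    private static int decodeHeaderNumBytes(CountingInputStream in)
    {
        // Placeholder for the Thrift decoding of parquet-format's BloomFilterHeader
        throw new UnsupportedOperationException("sketch only");
    }
}
```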

  • This change also deprecates HdfsParquetDataSource.java, since all of its functionality is covered by TrinoParquetDataSource.java. This is a standalone change; please let me know if I should move it into a separate PR.

  • I have addressed most of the comments; the only pending action item is implementing a product test that reads Spark-generated tables.

@osscm (Contributor) commented Nov 1, 2022

@leetcode-1533 this is great. If you want, I can also try to add the integration test with Spark writing and Trino reading. Thanks!

@leetcode-1533 (Contributor, Author)

> @leetcode-1533 this is great. If you want, I can also try to add the integration test with Spark writing and Trino reading. Thanks!

Hey, I can do it! I should be able to send it out by the end of today. Do you have any other comments on the implementation?

@leetcode-1533 (Contributor, Author)

Hi, sorry for the delay; I have updated the PR with a product test. The only remaining part of this PR, as far as I can see, is how to read the bloom filter using the offset index. I am okay with changing the implementation as @raunaqmorarka suggested, if he thinks it is better.

@leetcode-1533 (Contributor, Author)

Updated, please take a look!

@leetcode-1533 (Contributor, Author)

Details about domain compaction thresholds:

  • Hive: read from config.
  • Delta Lake: read from config.
  • Iceberg: uses the existing static global variable.
  • Hudi: hard-coded to a default value of 1000.

@leetcode-1533 leetcode-1533 force-pushed the ParquetBloomFilter branch 3 times, most recently from 8d3e2e3 to d2257bd Compare December 25, 2022 07:50
@findinpath (Contributor)

Maven checks are red:

Error:  /home/runner/work/trino/trino/lib/trino-parquet/src/main/java/io/trino/parquet/BloomFilterStore.java:61: Use buildOrThrow() instead, as it makes it clear that it will throw on duplicated values
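For reference, the pattern the check asks for (a minimal Guava illustration, not the actual BloomFilterStore.java code):

```java
import com.google.common.collect.ImmutableMap;

class BuildOrThrowExample
{
    static final ImmutableMap<String, Integer> MAP = ImmutableMap.<String, Integer>builder()
            .put("a", 1)
            .put("b", 2)
            // buildOrThrow() makes explicit that duplicate keys throw
            // IllegalArgumentException, which is why the check flags plain build()
            .buildOrThrow();
}
```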

@leetcode-1533 (Contributor, Author)

Hi, the unit tests have been updated and the Delta Lake Parquet bloom filter option has been disabled. Please take another look.

leetcode-1533 and others added 4 commits January 6, 2023 17:45
Delta Lake uses ParquetPageSourceFactory#createPageSource, therefore
we need to explicitly disable the bloom filter in ParquetReaderOptions
in the Delta Lake connector to avoid enabling this feature unintentionally.
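In other words, the Delta Lake page source construction would do something like the following (a sketch; the `withUseBloomFilter` setter name and no-arg constructor are assumptions, not verified against the merged code):

```java
// Sketch only: explicitly keep bloom filter reads disabled in Delta Lake,
// since ParquetPageSourceFactory#createPageSource is shared with Hive
ParquetReaderOptions options = new ParquetReaderOptions()
        .withUseBloomFilter(false); // assumed setter mirroring parquet.use-bloom-filter
```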
@github-actions github-actions bot added the docs label Jan 6, 2023
@raunaqmorarka raunaqmorarka merged commit 7b01054 into trinodb:master Jan 6, 2023
@github-actions github-actions bot added this to the 406 milestone Jan 6, 2023
@mosabua (Member) commented Jan 6, 2023

💥 Well done everyone!
