
Enable reading Parquet's bloomfilter statistics for hive connector #14428

Merged: 4 commits into trinodb:master on Jan 6, 2023

Conversation

@leetcode-1533 (Contributor) commented Oct 3, 2022

Description

Enable the Hive connector to read Parquet files' bloom filter statistics. Related RB: apache/parquet-java@806037c#diff-8da24c84aef62e6e836d073938f7843d289785baaeddf446f3afeae6d4ef4b10.

This feature can be controlled via the Hive catalog configuration property `parquet.use-bloom-filter`.
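As an illustration of the two knobs mentioned here and in the release note below (the `hive.` session prefix assumes a catalog named `hive`):

```sql
-- Catalog level: set parquet.use-bloom-filter=true in the Hive catalog
-- properties file (e.g. etc/catalog/hive.properties).
-- Session level: toggle per session, for example to disable bloom filter reads:
SET SESSION hive.parquet_use_bloom_filter = false;
```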

Implementation limitations:

  1. Limited type support: more Trino types will be added once this is merged.
  2. Only the Hive connector is supported; Hudi, Delta Lake, and Iceberg can be supported after this is merged.

Non-technical explanation

Enables the Hive connector to read Parquet files' bloom filter statistics, which can speed up queries with selective filters. For more details: https://github.com/apache/parquet-format/blob/master/BloomFilter.md.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Improve performance of queries with filters when bloom filter indexes are present in parquet files. Usage of bloom filters from parquet files can be disabled using the catalog configuration property `parquet.use-bloom-filter` or the catalog session property `parquet_use_bloom_filter`. ({issue}`9471`)

@cla-bot cla-bot bot added the cla-signed label Oct 3, 2022
@leetcode-1533 leetcode-1533 marked this pull request as ready for review October 3, 2022 05:29
@leetcode-1533 (Contributor, Author)

@raunaqmorarka can you help take a look? Please also review: https://gist.github.com/leetcode-1533/2fb1cf64d386c5bef4c26f4f37c9c714.

@leetcode-1533 leetcode-1533 force-pushed the ParquetBloomFilter branch 2 times, most recently from a00a0ee to ef04576 Compare October 3, 2022 07:33
@mosabua (Member) commented Oct 8, 2022

@claudiusli @cpard and @findepi should probably discuss this.

@findinpath (Contributor)

Can we eventually have a product test where Spark writes into a Hive table with a bloom filter enabled for a specific column and Trino reads a specific value from the bloom-filtered column?

(just as a sample - you can create a custom table for the purpose of the test)
https://github.com/apache/spark/blob/778acd411e31839429b3c6964b4082fcedf69342/docs/sql-ref-syntax-ddl-create-table-datasource.md?plain=1

CREATE TABLE student_parquet(id INT, name STRING, age INT) USING PARQUET
    OPTIONS (
        'parquet.bloom.filter.enabled'='true',
        'parquet.bloom.filter.enabled#age'='false'
        );

Use io.trino.tests.product.hive.TestHiveSparkCompatibility to get you started.

In io.trino.tests.product.launcher.env.environment.EnvSinglenodeSparkHive#createSpark, if you experience any issues, switch from spark3.0-iceberg to spark3-iceberg, because the image has been renamed in the meantime (https://github.com/trinodb/docker-images/blob/master/testing/spark3-iceberg/Dockerfile).
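For instance, the end-to-end flow could look roughly like this (a sketch building on the student_parquet table above; the data values are illustrative):

```sql
-- Spark side: populate the table so bloom filters are written for `id` and `name`
-- (`age` has the filter disabled in the CREATE TABLE above)
INSERT INTO student_parquet VALUES (1, 'Alice', 20), (2, 'Bob', 21);

-- Trino side: a point lookup on a bloom-filtered column, which the reader
-- can use the Parquet bloom filter to prune row groups for
SELECT name FROM student_parquet WHERE id = 2;
```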

@leetcode-1533 (Contributor, Author)

> Can we eventually have a product test where Spark writes into a Hive table with a bloom filter enabled for a specific column and Trino reads a specific value from the bloom-filtered column? [full suggestion, including the CREATE TABLE sample, quoted above]

Hi, I confirmed that the Spark release used in the test can generate Parquet bloom filter files. I will update the PR shortly.

@raunaqmorarka (Member) left a comment

Please check whether the code for the resolved comments was pushed.

@leetcode-1533 (Contributor, Author) commented Oct 25, 2022

Major changes:

I didn't implement the suggestion: "We can derive a length for the read based on whether the bloom filter offset is less than the start of the first row group or greater than the end of the last row group. In the former case we can read up to the start of the selected row groups; in the latter, up to the start of the page indexes or the footer (whichever comes first). So this is doable without resorting to streaming read APIs."

This is because the first case (bloom filters stored before the start of every row group) has no writer implementation, so the code cannot test the correctness of the offset calculation. Instead, the implementation makes no assumption about where the offset points; it relies on the relatively small size of the bloom filter header and reads the headers with a fixed amount of memory.

Please let me know if this is an issue.
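A minimal sketch of that fixed-memory approach (illustrative only, not the PR's actual code; `ParquetDataSource#readFully`, the header decoding placeholder, and the 1024-byte header bound are all assumptions):

```java
import com.google.common.io.CountingInputStream;

import java.io.ByteArrayInputStream;
import java.io.IOException;

class BloomFilterReadSketch
{
    // Assumed upper bound on a serialized bloom filter header
    private static final int MAX_HEADER_SIZE = 1024;

    interface ParquetDataSource
    {
        byte[] readFully(long offset, int length) throws IOException;
    }

    byte[] readBitset(ParquetDataSource source, long bloomFilterOffset) throws IOException
    {
        // Read a fixed-size prefix at the offset: large enough for any header,
        // with no assumption about where in the file the offset points
        byte[] prefix = source.readFully(bloomFilterOffset, MAX_HEADER_SIZE);
        CountingInputStream header = new CountingInputStream(new ByteArrayInputStream(prefix));
        int numBytes = decodeHeaderNumBytes(header);
        // The decoded header reports the exact bitset length, so the second
        // read is bounded as well
        return source.readFully(bloomFilterOffset + header.getCount(), numBytes);
    }

    private static int decodeHeaderNumBytes(CountingInputStream in)
    {
        // Placeholder for the Thrift decoding of parquet-format's BloomFilterHeader
        throw new UnsupportedOperationException("sketch only");
    }
}
```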

  • This change also deprecates HdfsParquetDataSource.java, since all of its functionality is covered by TrinoParquetDataSource.java. This is a standalone change; please let me know if I should move it into a separate PR.

  • I have addressed most of the comments; the only pending action item is implementing a product test that reads Spark-generated tables.

@osscm (Contributor) commented Nov 1, 2022

@leetcode-1533 this is great. If you want, I can also try to add the integration test with Spark writing and Trino reading. Thanks!

@leetcode-1533 (Contributor, Author)

> @leetcode-1533 this is great. If you want, I can also try to add the integration test with Spark writing and Trino reading. Thanks!

Hey, I can do it! I should be able to send it out by the end of today. Do you have any other comments on the implementation?

@leetcode-1533 (Contributor, Author)

Hi, sorry for the delay; I have updated the PR with a product test. The only remaining part of this PR, as far as I can see, is how to read the bloom filter using the offset index. I am okay with changing the implementation as @raunaqmorarka suggested, if he thinks it is better.

@leetcode-1533 (Contributor, Author)

Updated, please take a look!

@leetcode-1533 (Contributor, Author)

Details about domain compaction thresholds:

  • Hive: read from config.
  • Delta Lake: read from config.
  • Iceberg: uses the existing static global variable.
  • Hudi: hard-coded to a default value of 1000.

@leetcode-1533 leetcode-1533 force-pushed the ParquetBloomFilter branch 3 times, most recently from 8d3e2e3 to d2257bd Compare December 25, 2022 07:50
@findinpath (Contributor)

Maven checks are red:

Error:  /home/runner/work/trino/trino/lib/trino-parquet/src/main/java/io/trino/parquet/BloomFilterStore.java:61: Use buildOrThrow() instead, as it makes it clear that it will throw on duplicated values
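For reference, the pattern the check asks for (a minimal Guava illustration, not the actual BloomFilterStore.java code):

```java
import com.google.common.collect.ImmutableMap;

class BuildOrThrowExample
{
    static final ImmutableMap<String, Integer> MAP = ImmutableMap.<String, Integer>builder()
            .put("a", 1)
            .put("b", 2)
            // buildOrThrow() makes explicit that duplicate keys throw
            // IllegalArgumentException, which is why the check flags plain build()
            .buildOrThrow();
}
```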

@leetcode-1533 (Contributor, Author)

Hi, the unit tests have been updated and the Delta Lake Parquet bloom filter option has been disabled. Please take another look.

leetcode-1533 and others added 4 commits January 6, 2023 17:45
Delta Lake uses ParquetPageSourceFactory#createPageSource, therefore
we need to explicitly disable the bloom filter in ParquetReaderOptions
in the Delta Lake connector to avoid enabling this feature unintentionally.
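In other words, the Delta Lake page source construction would do something like the following (a sketch; the `withUseBloomFilter` setter name and no-arg constructor are assumptions, not verified against the merged code):

```java
// Sketch only: explicitly keep bloom filter reads disabled in Delta Lake,
// since ParquetPageSourceFactory#createPageSource is shared with Hive
ParquetReaderOptions options = new ParquetReaderOptions()
        .withUseBloomFilter(false); // assumed setter mirroring parquet.use-bloom-filter
```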
@github-actions github-actions bot added the docs label Jan 6, 2023
@raunaqmorarka raunaqmorarka merged commit 7b01054 into trinodb:master Jan 6, 2023
@github-actions github-actions bot added this to the 406 milestone Jan 6, 2023
@mosabua (Member) commented Jan 6, 2023

💥 Well done everyone!
