Enable reading Parquet's bloomfilter statistics for hive connector #14428
@raunaqmorarka can you help take a look, and also review https://gist.github.com/leetcode-1533/2fb1cf64d386c5bef4c26f4f37c9c714?
a00a0ee to ef04576
@claudiusli @cpard and @findepi should probably discuss this.
Can we eventually have a product test where Spark writes into a Hive table with a bloom filter enabled for a specific column, and Trino reads a specific value from that bloom-filtered column? (Just as a sample: you can create a custom table for the purpose of the test.)
Use In
Hi, I confirmed that the Spark release used in the test can generate Parquet files with bloom filters. I will update the PR shortly.
Please check if the code for the resolved comments was pushed
ef04576 to acb5e69
Major changes:
I didn't implement the suggestion "We can derive a length for the read based on whether the bloom filter offset is less than start of 1st row group or greater than end of last row group." The first case (bloom filter stored before the start of every row group) does not have an implementation, so the code can't test the correctness of the offset calculation. Please let me know if this is an issue.
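For readers following the thread, the quoted suggestion could look roughly like the sketch below. It is only one possible reading of that comment, not code from this PR: the class name, the method name, the `footerStart` parameter, and the choice to treat `lastRowGroupEnd` as an exclusive bound are all assumptions made for illustration.

```java
import java.util.OptionalLong;

// Hypothetical helper, not part of Trino. It only illustrates the idea of
// bounding a bloom filter read by looking at where its offset falls.
final class BloomFilterReadLengthSketch
{
    private BloomFilterReadLengthSketch() {}

    // Returns an upper bound on the number of bytes to read starting at
    // bloomFilterOffset, or empty when the offset lies inside the row group
    // data and no bound can be derived this way.
    static OptionalLong maxReadLength(
            long bloomFilterOffset,
            long firstRowGroupStart,
            long lastRowGroupEnd, // assumed exclusive
            long footerStart)     // assumed start of the file footer
    {
        if (bloomFilterOffset < firstRowGroupStart) {
            // Filters stored ahead of the data: they must end before the first row group
            return OptionalLong.of(firstRowGroupStart - bloomFilterOffset);
        }
        if (bloomFilterOffset >= lastRowGroupEnd) {
            // Filters stored after the data: they must end before the footer
            return OptionalLong.of(footerStart - bloomFilterOffset);
        }
        return OptionalLong.empty();
    }
}
```

As the comment above notes, the first branch cannot be exercised against real files if no writer produces that layout, which is the stated reason for leaving it out.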
acb5e69 to e63ee6e
@leetcode-1533 this is great, if you want I can also try to add the
Hey, I can do it! I should be able to send it out by the end of today. Do you have other comments on the implementation?
Hi, sorry for the delay. I have updated the PR with a product test. The only remaining part of this PR, as far as I can see, is how to read the bloom filter using the offset index. I am okay with changing the implementation as @raunaqmorarka suggested, if he thinks it is better. @raunaqmorarka
ba99907 to 0317eb2
ec3a1c8 to 67787e4
Updated, please take a look!
Details about domain compaction thresholds: Hive: Read from config
8d3e2e3 to d2257bd
Maven checks are red:
27b4340 to 8ac560d
Hi, the unit tests have been updated and the Parquet bloom filter option has been disabled for Delta Lake. Please take another look.
8ac560d to 8e9969a
8e9969a to 808b098
808b098 to 5bcdeb9
Delta Lake uses ParquetPageSourceFactory#createPageSource, so we need to explicitly disable the bloom filter in ParquetReaderOptions in the Delta Lake connector to avoid enabling this feature unintentionally.
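A rough illustration of why the explicit opt-out matters when two connectors share one factory. Everything here is a hypothetical stand-in: `ReaderOptionsSketch`, its `withBloomFilter` and `useBloomFilter` methods, the assumed default of "on", and the string-returning `createPageSource` are invented for the sketch and should not be read as Trino's actual classes or signatures.

```java
// Hypothetical stand-in for a reader-options class. Names are illustrative,
// not Trino's actual API.
final class ReaderOptionsSketch
{
    private final boolean useBloomFilter;

    ReaderOptionsSketch()
    {
        this(true); // assumption for this sketch: the feature defaults to on
    }

    private ReaderOptionsSketch(boolean useBloomFilter)
    {
        this.useBloomFilter = useBloomFilter;
    }

    // Returns a modified copy, so one caller cannot flip the flag for another
    ReaderOptionsSketch withBloomFilter(boolean useBloomFilter)
    {
        return new ReaderOptionsSketch(useBloomFilter);
    }

    boolean useBloomFilter()
    {
        return useBloomFilter;
    }

    // Stand-in for a page source factory shared by several connectors
    static String createPageSource(String connector, ReaderOptionsSketch options)
    {
        return connector + (options.useBloomFilter() ? ": bloom filter pruning on" : ": bloom filter pruning off");
    }

    public static void main(String[] args)
    {
        ReaderOptionsSketch defaults = new ReaderOptionsSketch();
        // The first connector takes the shared default
        System.out.println(createPageSource("hive", defaults));
        // The second connector opts out explicitly before calling the shared factory
        System.out.println(createPageSource("delta", defaults.withBloomFilter(false)));
    }
}
```

Because the shared code path only looks at the options object, a connector that passes default options would pick up the new behavior without any change of its own, which is what the comment above guards against.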
5bcdeb9 to 82a27dd
💥 Well done everyone!
Description
Enable the Hive connector to read Parquet bloom filter statistics. Related RB: apache/parquet-java@806037c#diff-8da24c84aef62e6e836d073938f7843d289785baaeddf446f3afeae6d4ef4b10.
This feature can be controlled via the Hive config property "parquet.use-bloom-filter".
Implementation limitations:
Non-technical explanation
Enable the Hive connector to read Parquet bloom filter statistics. For more details, see https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
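To make the idea concrete, here is a minimal sketch of how a bloom filter lets a reader skip data for an equality predicate. It uses a classic bit-array filter with a simple seeded hash, which is an assumption chosen for brevity: it is not the split-block format described in the linked specification, and `BloomFilterSketch` and `canSkipRowGroup` are hypothetical names rather than code from this PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.List;

// Illustrative only: a classic bit-array Bloom filter, not Parquet's split-block format.
final class BloomFilterSketch
{
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount)
    {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    void add(String value)
    {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(value, i));
        }
    }

    // false means "definitely absent"; true means "possibly present"
    boolean mightContain(String value)
    {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(value, i))) {
                return false;
            }
        }
        return true;
    }

    private int position(String value, int seed)
    {
        // Seeded FNV-1a style hash; an arbitrary choice for this sketch
        int hash = 0x811C9DC5 ^ (seed * 0x9E3779B9);
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= 16777619;
        }
        return Math.floorMod(hash, size);
    }

    // A row group can be skipped for "column = value" only when the filter rules the value out
    static boolean canSkipRowGroup(BloomFilterSketch filter, String value)
    {
        return !filter.mightContain(value);
    }

    public static void main(String[] args)
    {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        for (String v : List.of("alice", "bob", "carol")) {
            filter.add(v);
        }
        // prints "false": a value that was added is never ruled out
        System.out.println(canSkipRowGroup(filter, "alice"));
    }
}
```

The useful property is the asymmetry: a negative answer is always correct, so skipping on it is safe, while a positive answer may be a false positive and only means the row group still has to be read.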
Release notes
( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text: