Add Parquet bloom filter write support to Iceberg connector #21602
Conversation
We should add some compatibility testing with Spark in TestIcebergSparkCompatibility to verify that it honors the bloom filter property set by us and can read files with bloom filters written by us.
Other than that lgtm
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergTableProperties.java (resolved comment)
@leetcode-1533 FYI
@jkylling gentle reminder about going forward with this contribution.
Force-pushed from 5a44eb2 to 619b0d9 (Compare)
Added a product test which verifies that Trino and Spark can read the Iceberg tables written by each other when the Bloom filter table properties are set. I've not verified whether the files written by Spark contain Bloom filters. Here's a little rant about the experience of writing product tests for this, with the hope that it might help improve the experience (there were more steps involved than the ones below):
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergUtil.java (outdated; resolved comment)
Force-pushed from 619b0d9 to 2c5d886 (Compare)
{
    return properties.entrySet().stream()
            .filter(entry -> entry.getKey().startsWith(PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX) && "true".equals(entry.getValue()))
            .map(entry -> entry.getKey().substring(PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX.length()))
Do we need to lowercase the column names?
We'd probably need a Spark compatibility test using case-sensitive column names to check this.
I see there's already testSparkReadingTrinoBloomFilters
@@ -45,6 +45,7 @@ public class IcebergTableProperties
    public static final String FORMAT_VERSION_PROPERTY = "format_version";
    public static final String ORC_BLOOM_FILTER_COLUMNS_PROPERTY = "orc_bloom_filter_columns";
    public static final String ORC_BLOOM_FILTER_FPP_PROPERTY = "orc_bloom_filter_fpp";
    public static final String PARQUET_BLOOM_FILTER_COLUMNS_PROPERTY = "parquet_bloom_filter_columns";
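For context, connector table properties like this are declared via PropertyMetadata. A minimal sketch of how the new property might be registered, assuming it mirrors the existing orc_bloom_filter_columns declaration in IcebergTableProperties (the actual PR code may differ):

    import static com.google.common.collect.ImmutableList.toImmutableList;
    import static io.trino.spi.type.VarcharType.VARCHAR;

    tableProperties.add(new PropertyMetadata<>(
            PARQUET_BLOOM_FILTER_COLUMNS_PROPERTY,
            "Parquet Bloom filter index columns",
            new ArrayType(VARCHAR),
            List.class,
            ImmutableList.of(),
            false,
            // decoder: validate and copy the user-supplied list of column names
            value -> ((List<?>) value).stream()
                    .map(String.class::cast)
                    .collect(toImmutableList()),
            value -> value));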
spark-sql (default)> CREATE TABLE t1 (testInteger INTEGER, testLong BIGINT, testString STRING, testDouble DOUBLE, testFloat REAL)
> USING iceberg
> TBLPROPERTIES (
> 'write.parquet.bloom-filter-enabled.column.testInteger' = true,
> 'write.parquet.bloom-filter-enabled.column.testLong' = true,
> 'write.parquet.bloom-filter-enabled.column.testString' = true,
> 'write.parquet.bloom-filter-enabled.column.testDouble' = true,
> 'write.parquet.bloom-filter-enabled.column.testFloat' = true
> );
trino> show create table iceberg.default.t1;
Create Table
------------------------------------------------------------------
CREATE TABLE iceberg.default.t1 (
testinteger integer,
testlong bigint,
teststring varchar,
testdouble double,
testfloat real
)
WITH (
format = 'PARQUET',
format_version = 2,
location = 'hdfs://hadoop-master:9000/user/hive/warehouse/t1'
)
(1 row)
Shouldn't we see the bloom filter columns in SHOW CREATE TABLE, now that we're dealing with a supported table property?
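For illustration, the WITH clause might then look like this (the exact rendering is an assumption):

WITH (
   format = 'PARQUET',
   format_version = 2,
   location = 'hdfs://hadoop-master:9000/user/hive/warehouse/t1',
   parquet_bloom_filter_columns = ARRAY['testinteger', 'testlong', 'teststring', 'testdouble', 'testfloat']
)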
Modify io.trino.plugin.iceberg.IcebergUtil#getIcebergTableProperties
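Something along these lines, as a sketch (assuming getIcebergTableProperties builds an ImmutableMap of table properties, as the existing format and format_version entries suggest; the helper name getParquetBloomFilterColumns follows this PR, the rest is illustrative):

    Set<String> parquetBloomFilterColumns = getParquetBloomFilterColumns(icebergTable.properties());
    if (!parquetBloomFilterColumns.isEmpty()) {
        // surface the columns so SHOW CREATE TABLE round-trips the property
        properties.put(PARQUET_BLOOM_FILTER_COLUMNS_PROPERTY, ImmutableList.copyOf(parquetBloomFilterColumns));
    }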
I tried the following on the above scaffolding:
SELECT COUNT(*) FROM iceberg.default.t1 where testInteger in (9444, -88777, 6711111);
and saw the following:
"queryStats" : {
....
"physicalInputDataSize" : "656400B",
"failedPhysicalInputDataSize" : "0B",
"physicalInputPositions" : 5,
This does not seem to match the expectations from io.trino.testing.BaseTestParquetWithBloomFilters#testBloomFilterRowGroupPruning(io.trino.spi.connector.CatalogSchemaTableName, java.lang.String)
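For comparison, in query runner tests the pruning is typically asserted roughly like this (a sketch using AbstractTestQueryFramework#assertQueryStats; the table name and expected counts are placeholders):

    assertQueryStats(
            getSession(),
            "SELECT count(*) FROM t1 WHERE testInteger IN (9444, -88777, 6711111)",
            // with effective bloom filters every row group is pruned,
            // so the scan should read (close to) zero positions
            queryStats -> assertThat(queryStats.getPhysicalInputPositions()).isEqualTo(0),
            results -> assertThat(results.getOnlyValue()).isEqualTo(0L));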
We could add a toLowerCase to getParquetBloomFilterColumns to handle this. It looks like we have the same issue for the Iceberg ORC Bloom filters. Should we handle case sensitivity in this PR, or in a follow-up?
Let's fix the functionality in the existing PR rather than deliver half-baked functionality which may potentially backfire with bugs.
An alternative with fewer headaches would be to register a pre-created resource table and check the query stats on it, similar to what has been done in https://github.com/trinodb/trino/blob/ca209630136eabda2449594ef2b6a4d82fb9c2e5/plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergReadVersionedTableByTemporal.java
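If we go that route, registering the pre-created table would look roughly like this (a sketch; the location and table name are placeholders, and it assumes the register_table procedure is enabled for the catalog):

trino> CALL iceberg.system.register_table(
    ->     schema_name => 'default',
    ->     table_name => 'bloom_filter_table',
    ->     table_location => 'hdfs://hadoop-master:9000/user/hive/warehouse/bloom_filter_table');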
Easy access to this would be useful in the product tests. It would allow the product tests in this PR to give more coverage. Unfortunately, product tests are not my cup of tea for Friday hacking 😅
We need a mechanism to get the query stats in the product tests to ensure that the bloom filter is actually effective and that we don't introduce regressions while refactoring.
Would someone be able to help add this logic? I don't have much experience with the product tests and unfortunately don't have much capacity to follow up on this at the moment. It would be much appreciated!
@findinpath aren't we already testing the effectiveness of the bloom filter in query runner tests? I'm not sure we should block this PR over checking this in product tests as well; we don't do that for bloom filters in the Hive connector either.
@@ -2924,6 +2924,67 @@ public void testSparkAlterStructColumnType(StorageFormat storageFormat)
        onSpark().executeQuery("DROP TABLE " + sparkTableName);
    }

    @Test(groups = {ICEBERG, PROFILE_SPECIFIC_TESTS})
    public void testSparkReadingTrinoBloomFilters()
Please also add a test which creates the table through Spark with case-sensitive column names.
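A rough sketch of such a test, following the conventions in TestIcebergSparkCompatibility (helpers like sparkTableName/trinoTableName and the onSpark()/onTrino() executors follow the existing tests; the assertion only checks the round trip, not that the files actually contain bloom filters):

    @Test(groups = {ICEBERG, PROFILE_SPECIFIC_TESTS})
    public void testTrinoWritingBloomFiltersForSparkCaseSensitiveColumns()
    {
        String baseTableName = "test_spark_case_sensitive_bloom_filters";
        // Spark preserves the column name's case in the table property key
        onSpark().executeQuery(format(
                "CREATE TABLE %s (testInteger INT) USING ICEBERG " +
                        "TBLPROPERTIES ('write.parquet.bloom-filter-enabled.column.testInteger' = true)",
                sparkTableName(baseTableName)));
        // Trino lowercases identifiers, so this write exercises the mismatch
        onTrino().executeQuery(format("INSERT INTO %s VALUES 1, 2, 3", trinoTableName(baseTableName)));
        assertThat(onSpark().executeQuery(format("SELECT count(*) FROM %s WHERE testInteger = 2", sparkTableName(baseTableName))))
                .containsOnly(row(1L));
        onSpark().executeQuery("DROP TABLE " + sparkTableName(baseTableName));
    }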
Case-sensitive column names do not seem to be handled as expected.
Force-pushed from 2c5d886 to 7a59385 (Compare)
@@ -170,6 +172,7 @@ private IcebergFileWriter createParquetWriter(
        .setMaxPageValueCount(getParquetWriterPageValueCount(session))
        .setMaxBlockSize(getParquetWriterBlockSize(session))
        .setBatchSize(getParquetWriterBatchSize(session))
        .setBloomFilterColumns(getParquetBloomFilterColumns(storageProperties))
A few lines above you have the original fileColumnNames; please correlate those with what is specified in the table properties (case-insensitive name matching) in getParquetBloomFilterColumns.
Also a new test to add: schema evolution. Create a table with a bunch of bloom filter columns, drop one of the columns which was specified as a bloom filter column, and make sure that you don't get any errors. I'm guessing we'd have to filter out, in getParquetBloomFilterColumns, the column names which don't exist anymore.
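A minimal sketch of that correlation, assuming getParquetBloomFilterColumns gains access to fileColumnNames (the signature change and the Guava toImmutableList/toImmutableSet static imports are assumptions):

    private static List<String> getParquetBloomFilterColumns(Map<String, String> storageProperties, List<String> fileColumnNames)
    {
        Set<String> requested = storageProperties.entrySet().stream()
                .filter(entry -> entry.getKey().startsWith(PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX) && "true".equals(entry.getValue()))
                .map(entry -> entry.getKey().substring(PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX.length()).toLowerCase(ENGLISH))
                .collect(toImmutableSet());
        // resolve against the writer's actual columns: case-insensitive match,
        // keep the file's casing, and silently drop columns that no longer exist
        return fileColumnNames.stream()
                .filter(column -> requested.contains(column.toLowerCase(ENGLISH)))
                .collect(toImmutableList());
    }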
The writing logic ignores non-existent columns for which the Bloom filter property is set.
@@ -2924,6 +2924,114 @@ public void testSparkAlterStructColumnType(StorageFormat storageFormat)
        onSpark().executeQuery("DROP TABLE " + sparkTableName);
    }

    @Test(groups = {ICEBERG, PROFILE_SPECIFIC_TESTS})
    public void testSparkReadingTrinoBloomFilters()
Either rename to testSparkReadingTrinoParquetBloomFilters or add logic to run for both Parquet and ORC. For an effective bloom filter in ORC you need over 10_000 rows:
if (rowsInRowGroup.isPresent() && stripe.getNumberOfRows() > rowsInRowGroup.getAsInt()) {
Let's not bring ORC tests into this PR; renaming to testSparkReadingTrinoParquetBloomFilters is fine.
...roduct-tests/src/main/java/io/trino/tests/product/iceberg/TestIcebergSparkCompatibility.java (resolved comment)
{
    return properties.entrySet().stream()
            .filter(entry -> entry.getKey().startsWith(PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX) && "true".equals(entry.getValue()))
            .map(entry -> entry.getKey().substring(PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX.length()).toLowerCase(Locale.ENGLISH))
This is incorrect; if you lowercase here, you are affecting the Parquet write logic as well.
Do we have other code in the Iceberg connector where we need to handle the case sensitivity of Iceberg columns and the case insensitivity of Trino columns? That would be useful for understanding how to handle this case.
@raunaqmorarka, @jkylling sorry for delaying this work with my comments. The scaffolding needed for getting query stats is not present at the moment in the product tests. @jkylling pls wrap up the work (I think
Force-pushed from 7a59385 to 6612eff (Compare)
...roduct-tests/src/main/java/io/trino/tests/product/iceberg/TestIcebergSparkCompatibility.java (resolved comment)
@jkylling will this improve reads from tables with bloom filters, or does it only deal with creating bloom filters?
@shohamyamin This only adds write support. Read support for Bloom filters was already added in 406. Please see #9471 for the issue which tracked this.
Force-pushed from 6612eff to ac3b281 (Compare)
Description
Additional context and related issues
Part of #21570
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: