
Support id column mapping for Delta Lake #13678

Merged: ebyhr merged 1 commit into master from ebi/delta-id-column-mapping on Sep 14, 2022

Conversation

ebyhr (Member) commented Aug 15, 2022

Description

Fixes #13629

Documentation

(x) No documentation is needed.

Release notes

(x) Release notes entries required with the following suggested text:

# Delta Lake
* Support reading tables with the property `delta.columnMapping.mode=id`. ({issue}`13629`)

@cla-bot cla-bot bot added the cla-signed label Aug 15, 2022
@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from e26384a to f9ec2dc on August 15, 2022 10:42
@ebyhr ebyhr marked this pull request as ready for review August 15, 2022 23:24
@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from f9ec2dc to 6c5a125 on August 16, 2022 00:29
ebyhr (Member, Author) commented Aug 16, 2022

@findinpath Addressed comments.

github-actions bot left a comment:

Found commits that should not be merged: 1 commit(s) that need to be squashed.

@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from 041a562 to f3171df on August 17, 2022 00:33
@ebyhr ebyhr requested a review from findinpath August 17, 2022 02:48
@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from f3171df to 0443023 on August 23, 2022 23:36
alexjo2144 (Member) left a comment:

Just skimming so far, but I had one high-level question.

Here we're passing the idea of column IDs all the way through to the Parquet reader. Would it be easier to leave the reader alone and use the existing index-based approach? We could read the schema when setting up the reader and use the IDs there to figure out the indexes in the Parquet file.

Does that make sense?
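
To make the suggestion concrete, here is a minimal sketch of that resolution step, assuming parquet-mr's MessageType; the helper name and shape are hypothetical illustrations, not code from this PR:

import java.util.HashMap;
import java.util.Map;

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

// Hypothetical helper: resolve each Delta field ID to the position of the
// matching top-level column in the Parquet file schema, so the reader itself
// can keep working with plain column indexes.
final class FieldIdToIndexResolver
{
    private FieldIdToIndexResolver() {}

    static Map<Integer, Integer> buildFieldIdToIndex(MessageType fileSchema)
    {
        Map<Integer, Integer> fieldIdToIndex = new HashMap<>();
        for (int index = 0; index < fileSchema.getFieldCount(); index++) {
            Type field = fileSchema.getFields().get(index);
            // With delta.columnMapping.mode=id, the writer stores the mapping ID
            // as the optional Parquet field id; columns without one are skipped
            if (field.getId() != null) {
                fieldIdToIndex.put(field.getId().intValue(), index);
            }
        }
        return fieldIdToIndex;
    }
}

Under this approach, the page source would consult such a map once per file and pass ordinary indexes down to the reader.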

ebyhr (Member, Author) commented Aug 26, 2022

Would it be easier to leave the reader alone and use the existing index based approach?

Let me try this approach.

@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from 0443023 to 6011e46 on August 30, 2022 06:00
@ebyhr ebyhr requested a review from alexjo2144 August 30, 2022 07:48
@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from 6011e46 to bc162ab on August 30, 2022 21:58
ebyhr (Member, Author) commented Aug 30, 2022

Addressed comments.

@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from bc162ab to b8a5b64 on August 31, 2022 01:31
alexjo2144 (Member) commented:

One last thing, but besides that it looks good to me.

You don't have to store the column mapping mode in every split; since it is a table-level property, it can't change from file to file. Instead, the PageSourceProvider has access to the DeltaLakeTableHandle, so you can get it from there.
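
A self-contained sketch of the shape being suggested; all type names here are illustrative stand-ins, not the actual Trino classes:

// Illustrative stand-ins for the real Trino types; the point is only where
// the column mapping mode lives.
enum ColumnMappingMode { NONE, NAME, ID }

record TableHandleSketch(String tableName, ColumnMappingMode columnMappingMode) {}

record SplitSketch(String filePath) {} // carries file-level info only, no mapping mode

final class PageSourceProviderSketch
{
    // The provider receives the table handle anyway, so the table-level
    // property is read here once instead of being serialized into each split.
    ColumnMappingMode resolveMappingMode(TableHandleSketch table, SplitSketch split)
    {
        return table.columnMappingMode();
    }
}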

@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from b8a5b64 to 8fc14ab on August 31, 2022 23:40
ebyhr (Member, Author) commented Aug 31, 2022

Thanks, updated.

@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch 2 times, most recently from 2494f72 to 95aa880 on September 2, 2022 09:30
switch (columnMapping) {
    case ID:
        // With id mapping, resolve the physical Parquet column through the field ID
        Integer fieldId = deltaLakeColumnHandle.getFieldId().orElseThrow(() -> new IllegalArgumentException("Field ID must exist"));
        String fieldName = requireNonNull(fieldIdToName.get(fieldId), "Field name is null");
A contributor commented:

@Test(groups = {DELTA_LAKE_DATABRICKS, DELTA_LAKE_OSS, DELTA_LAKE_EXCLUDE_73, DELTA_LAKE_EXCLUDE_91, PROFILE_SPECIFIC_TESTS})
public void testColumnMappingModeNameIdAddColumn()
{
    String tableName = "test_dl_column_mapping_mode_name_" + randomTableSuffix();

    onDelta().executeQuery("" +
            "CREATE TABLE default." + tableName +
            " (a_number INT, a_varchar STRING)" +
            " USING delta " +
            " LOCATION 's3://" + bucketName + "/databricks-compatibility-test-" + tableName + "'" +
            " TBLPROPERTIES (" +
            " 'delta.columnMapping.mode'='id'," +
            " 'delta.minReaderVersion'='2'," +
            " 'delta.minWriterVersion'='5')");

    try {
        onDelta().executeQuery("" +
                "INSERT INTO default." + tableName + " VALUES " +
                "(1, 'ala'), " +
                "(2, 'bala')");

        List<Row> expectedRows = ImmutableList.of(
                row(1, "ala"),
                row(2, "bala"));

        assertThat(onDelta().executeQuery("SELECT a_number, a_varchar FROM default." + tableName))
                .containsOnly(expectedRows);
        assertThat(onTrino().executeQuery("SELECT a_number, a_varchar FROM delta.default." + tableName))
                .containsOnly(expectedRows);

        // Verify the connector can read a newly added column correctly
        onDelta().executeQuery("ALTER TABLE default." + tableName + " ADD COLUMN another_varchar STRING");

        assertThat(onTrino().executeQuery("DESCRIBE delta.default." + tableName))
                .containsOnly(ImmutableList.of(
                        row("a_number", "integer", "", ""),
                        row("a_varchar", "varchar", "", ""),
                        row("another_varchar", "varchar", "", "")));

        onDelta().executeQuery("INSERT INTO default." + tableName + "(a_number, a_varchar, another_varchar) VALUES (3, 'porto', 'cala')");
        expectedRows = ImmutableList.of(
                row(1, "ala", null),
                row(2, "bala", null),
                row(3, "porto", "cala"));

        assertThat(onDelta().executeQuery("SELECT a_number, a_varchar, another_varchar FROM default." + tableName))
                .containsOnly(expectedRows);
        assertThat(onTrino().executeQuery("SELECT a_number, a_varchar, another_varchar FROM delta.default." + tableName))
                .containsOnly(expectedRows);
    }
    finally {
        onDelta().executeQuery("DROP TABLE default." + tableName);
    }
}

This product test in TestDeltaLakeColumnMappingMode fails because the newly added column is not found in the parquetFieldIdToName map (it did not exist yet when the file was created by the initial inserts).
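
In other words, a file written before the ALTER simply has no entry for the new field ID, so the lookup has to tolerate absence instead of failing. A hypothetical sketch of the null-tolerant variant (names are illustrative, not this PR's code):

import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: an absent field ID means the file predates the column,
// so the caller should synthesize an all-null column rather than throw.
final class MissingFieldIdHandling
{
    private MissingFieldIdHandling() {}

    static Optional<String> physicalName(Map<Integer, String> parquetFieldIdToName, int fieldId)
    {
        // Optional.empty() tells the page source to fill the column with nulls,
        // matching the usual schema-evolution behavior for old files
        return Optional.ofNullable(parquetFieldIdToName.get(fieldId));
    }
}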

findepi (Member) left a comment:

Skimming.
Thank you @alexjo2144 @findinpath for your reviews!

@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from 95aa880 to 79703a5 on September 7, 2022 08:11
ebyhr (Member, Author) commented Sep 7, 2022

CI hit #12818

assertThat(onDelta().executeQuery("SELECT * FROM default." + tableName))
        .containsOnly(expectedRows);
assertThat(onTrino().executeQuery("SELECT * FROM delta.default." + tableName))
        .containsOnly(expectedRows);
A contributor commented:

While testing, I added the following snippet to your test case:

onDelta().executeQuery("ALTER TABLE default." + tableName + " DROP COLUMN another_varchar");
            assertThat(onTrino().executeQuery("DESCRIBE delta.default." + tableName))
                    .containsOnly(row("a_number", "integer", "", ""));
            expectedRows = ImmutableList.of(row(1), row(2), row(3));
            assertThat(onDelta().executeQuery("SELECT * FROM default." + tableName))
                    .containsOnly(expectedRows);
            assertThat(onTrino().executeQuery("SELECT * FROM delta.default." + tableName))
                    .containsOnly(expectedRows);

Since we are testing how the column mapping behaves when adding a column, we may as well verify how it behaves when dropping a column.

I stumbled into an unrelated issue:

tests               | 2022-09-07 20:21:44 INFO: FAILURE     /    io.trino.tests.product.deltalake.TestDeltaLakeColumnMappingMode.testColumnMappingModeIdAddColumn (Groups: profile_specific_tests, delta-lake-exclude-91, delta-lake-databricks, delta-lake-exclude-73) took 25.6 seconds
tests               | 2022-09-07 20:21:44 SEVERE: Failure cause:
tests               | io.trino.tempto.query.QueryExecutionException: java.sql.SQLException: [Databricks][DatabricksJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: java.lang.UnsupportedOperationException: Unrecognized column change class org.apache.spark.sql.connector.catalog.TableChange$DeleteColumn. You may be running an out of date Delta Lake version.
tests               | 	at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:53)
tests               | 	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:435)
tests               | 	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:257)
tests               | 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
tests               | 	at org.apache.spark.sql.hive.thriftserver.ThriftLocalProperties.withLocalProperties(ThriftLocalProperties.scala:123)
tests               | 	at org.apache.spark.sql.hive.thriftserver.ThriftLocalProperties.withLocalProperties$(ThriftLocalProperties.scala:48)
tests               | 	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:52)
tests               | 	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:235)
tests               | 	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:220)
tests               | 	at java.security.AccessController.doPrivileged(Native Method)
tests               | 	at javax.security.auth.Subject.doAs(Subject.java:422)
tests               | 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
tests               | 	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:269)
tests               | 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
tests               | 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
tests               | 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
tests               | 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
tests               | 	at java.lang.Thread.run(Thread.java:748)
tests               | Caused by: java.lang.UnsupportedOperationException: Unrecognized column change class org.apache.spark.sql.connector.catalog.TableChange$DeleteColumn. You may be running an out of date Delta Lake version.

Will ask about it on the Databricks Slack:

https://delta-users.slack.com/archives/CGK79PLV6/p1662561719425569

A contributor commented:

Indeed. Thanks @alexjo2144. Testing with Databricks 11, the scenario described earlier proved successful.


verify(pageSource.getReaderColumns().isEmpty(), "All columns expected to be base columns");

return new DeltaLakePageSource(
        deltaLakeColumns,
        nullColumnNames.build(),
A member commented:

I think you could get by without passing the null column names all the way through. The ParquetPageSource should already return nulls for columns whose names don't show up in the file. Maybe, if the column is missing from the file, you could give it a name like missing_column_<uuid>?

I'm not sure if this is cleaner, though.
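
For illustration, a tiny sketch of the placeholder-name idea (purely hypothetical; it relies on the reader producing nulls for names absent from the file):

import java.util.UUID;

// Hypothetical sketch of the suggested workaround: give a column that is
// missing from the Parquet file a unique placeholder name, so no column in
// the file matches it and the reader falls back to producing nulls.
final class MissingColumnNames
{
    private MissingColumnNames() {}

    static String placeholderName()
    {
        // Random suffix avoids colliding with a real column name
        return "missing_column_" + UUID.randomUUID();
    }
}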

ebyhr (Member, Author) commented Sep 7, 2022

Actually, I tried the same approach first and settled on the current change because passing dummy names felt redundant, and the ParquetPageSource behavior may change in the future. I'm not wedded to the current approach, though.

@findepi Do you have any opinion?

A member replied:

I am not able to fully weigh the consequences, but I agree that dummy names look weird.
What would be the benefit of doing so?

A member replied:

It gets rid of the changes in DeltaLakePageSource. That's about it.

A member replied:

I don't really have a preference; happy to merge as is.

@ebyhr ebyhr requested a review from findepi September 9, 2022 04:38
@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from 79703a5 to 90b911e on September 12, 2022 00:46
ebyhr (Member, Author) commented Sep 12, 2022

Rebased on upstream to resolve conflicts.

@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from 90b911e to 32bce58 on September 12, 2022 13:16
ebyhr (Member, Author) commented Sep 12, 2022

Rebased on upstream to resolve conflicts.

@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from 32bce58 to 2cf1a9d on September 12, 2022 21:37
ebyhr (Member, Author) commented Sep 12, 2022

Addressed all comments.

ebyhr (Member, Author) commented Sep 13, 2022

CI hit #13199 in TestDeltaLakeDatabricksInsertCompatibility.testCompressionWithOptimizedWriter

@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from 2cf1a9d to 16e0bcc on September 13, 2022 08:52
ebyhr (Member, Author) commented Sep 13, 2022

Rebased on upstream to resolve conflicts.

Also, extract a method to verify the supported column mapping and make DeltaLakePageSourceProvider.getParquetTupleDomain public.
@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from 16e0bcc to 834b469 on September 14, 2022 02:49
@ebyhr ebyhr merged commit d468fb7 into master Sep 14, 2022
@ebyhr ebyhr deleted the ebi/delta-id-column-mapping branch September 14, 2022 07:23
@ebyhr ebyhr mentioned this pull request Sep 14, 2022
@github-actions github-actions bot added this to the 396 milestone Sep 14, 2022
@@ -57,26 +58,29 @@

private final String name;
private final Type type;
private final OptionalInt fieldId;
A member commented:

Include this in getRetainedSizeInBytes too.

cc @krvikash
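
For context, a simplified sketch of the memory-accounting pattern this refers to; the class name and field set are hypothetical, and it assumes the airlift SizeOf helpers and JOL's ClassLayout commonly used for this in Trino:

import java.util.OptionalInt;

import org.openjdk.jol.info.ClassLayout;

import static io.airlift.slice.SizeOf.estimatedSizeOf;
import static io.airlift.slice.SizeOf.sizeOf;
import static java.lang.Math.toIntExact;

// Simplified sketch: every field retained by the handle, including the newly
// added fieldId, should contribute to the retained-size estimate.
public class ColumnHandleSketch
{
    private static final int INSTANCE_SIZE = toIntExact(ClassLayout.parseClass(ColumnHandleSketch.class).instanceSize());

    private final String name;
    private final OptionalInt fieldId;

    public ColumnHandleSketch(String name, OptionalInt fieldId)
    {
        this.name = name;
        this.fieldId = fieldId;
    }

    public long getRetainedSizeInBytes()
    {
        return INSTANCE_SIZE
                + estimatedSizeOf(name)
                + sizeOf(fieldId); // account for the new OptionalInt field
    }
}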

Development

Successfully merging this pull request may close these issues.

Add support for delta.columnMapping.mode='id' in Delta Lake connector
5 participants