
Support id column mapping for Delta Lake #13678

Merged: ebyhr merged 1 commit into master from ebi/delta-id-column-mapping on Sep 14, 2022

Conversation

ebyhr (Member) commented Aug 15, 2022

Description

Fixes #13629

Documentation

(x) No documentation is needed.

Release notes

(x) Release notes entries required with the following suggested text:

# Delta Lake
* Support reading tables with the property `delta.columnMapping.mode=id`. ({issue}`13629`)

@cla-bot cla-bot bot added the cla-signed label Aug 15, 2022
@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from e26384a to f9ec2dc on August 15, 2022 10:42
@ebyhr ebyhr marked this pull request as ready for review August 15, 2022 23:24
@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from f9ec2dc to 6c5a125 on August 16, 2022 00:29
ebyhr (Member, Author) commented Aug 16, 2022

@findinpath Addressed comments.

github-actions bot left a comment:

Found commits that should not be merged: 1 commit(s) that need to be squashed.

@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from 041a562 to f3171df on August 17, 2022 00:33
@ebyhr ebyhr requested a review from findinpath August 17, 2022 02:48
@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from f3171df to 0443023 on August 23, 2022 23:36
alexjo2144 (Member) left a comment:

Just skimming so far, but I had one high-level question.

Here we're passing the idea of column IDs all the way through to the Parquet reader. Would it be easier to leave the reader alone and use the existing index-based approach? We could read the schema when setting up the reader and use the IDs there to figure out the indexes in the Parquet file.

Does that make sense?
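
To make the suggestion concrete, here is a minimal sketch of that resolution step, assuming parquet-mr's MessageType; the helper name and shape are hypothetical illustrations, not code from this PR:

import java.util.HashMap;
import java.util.Map;

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

// Hypothetical helper: resolve each Delta field ID to the position of the
// matching top-level column in the Parquet file schema, so the reader itself
// can keep working with plain column indexes.
final class FieldIdToIndexResolver
{
    private FieldIdToIndexResolver() {}

    static Map<Integer, Integer> buildFieldIdToIndex(MessageType fileSchema)
    {
        Map<Integer, Integer> fieldIdToIndex = new HashMap<>();
        for (int index = 0; index < fileSchema.getFieldCount(); index++) {
            Type field = fileSchema.getFields().get(index);
            // With delta.columnMapping.mode=id, the writer stores the mapping ID
            // as the optional Parquet field id; columns without one are skipped
            if (field.getId() != null) {
                fieldIdToIndex.put(field.getId().intValue(), index);
            }
        }
        return fieldIdToIndex;
    }
}

Under this approach, the page source would consult such a map once per file and pass ordinary indexes down to the reader.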

ebyhr (Member, Author) commented Aug 26, 2022

Would it be easier to leave the reader alone and use the existing index based approach?

Let me try this approach.

@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from 0443023 to 6011e46 on August 30, 2022 06:00
@ebyhr ebyhr requested a review from alexjo2144 August 30, 2022 07:48
@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from 6011e46 to bc162ab on August 30, 2022 21:58
ebyhr (Member, Author) commented Aug 30, 2022

Addressed comments.

@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from bc162ab to b8a5b64 on August 31, 2022 01:31
alexjo2144 (Member) commented:

One last thing, but besides that it looks good to me.

You don't have to store the column mapping mode in every split; since it is a table-level property, it can't change from file to file. Instead, the PageSourceProvider has access to the DeltaLakeTableHandle, so you can get it from there.
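
A self-contained sketch of the shape being suggested; all type names here are illustrative stand-ins, not the actual Trino classes:

// Illustrative stand-ins for the real Trino types; the point is only where
// the column mapping mode lives.
enum ColumnMappingMode { NONE, NAME, ID }

record TableHandleSketch(String tableName, ColumnMappingMode columnMappingMode) {}

record SplitSketch(String filePath) {} // carries file-level info only, no mapping mode

final class PageSourceProviderSketch
{
    // The provider receives the table handle anyway, so the table-level
    // property is read here once instead of being serialized into each split.
    ColumnMappingMode resolveMappingMode(TableHandleSketch table, SplitSketch split)
    {
        return table.columnMappingMode();
    }
}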

@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from b8a5b64 to 8fc14ab on August 31, 2022 23:40
ebyhr (Member, Author) commented Aug 31, 2022

Thanks, updated.

@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch 2 times, most recently from 2494f72 to 95aa880 on September 2, 2022 09:30
switch (columnMapping) {
    case ID:
        // With id mapping, resolve the physical Parquet column through the field ID
        Integer fieldId = deltaLakeColumnHandle.getFieldId().orElseThrow(() -> new IllegalArgumentException("Field ID must exist"));
        String fieldName = requireNonNull(fieldIdToName.get(fieldId), "Field name is null");
A contributor commented:

@Test(groups = {DELTA_LAKE_DATABRICKS, DELTA_LAKE_OSS, DELTA_LAKE_EXCLUDE_73, DELTA_LAKE_EXCLUDE_91, PROFILE_SPECIFIC_TESTS})
public void testColumnMappingModeNameIdAddColumn()
{
    String tableName = "test_dl_column_mapping_mode_name_" + randomTableSuffix();

    onDelta().executeQuery("" +
            "CREATE TABLE default." + tableName +
            " (a_number INT, a_varchar STRING)" +
            " USING delta " +
            " LOCATION 's3://" + bucketName + "/databricks-compatibility-test-" + tableName + "'" +
            " TBLPROPERTIES (" +
            " 'delta.columnMapping.mode'='id'," +
            " 'delta.minReaderVersion'='2'," +
            " 'delta.minWriterVersion'='5')");

    try {
        onDelta().executeQuery("" +
                "INSERT INTO default." + tableName + " VALUES " +
                "(1, 'ala'), " +
                "(2, 'bala')");

        List<Row> expectedRows = ImmutableList.of(
                row(1, "ala"),
                row(2, "bala"));

        assertThat(onDelta().executeQuery("SELECT a_number, a_varchar FROM default." + tableName))
                .containsOnly(expectedRows);
        assertThat(onTrino().executeQuery("SELECT a_number, a_varchar FROM delta.default." + tableName))
                .containsOnly(expectedRows);

        // Verify the connector can read a newly added column correctly
        onDelta().executeQuery("ALTER TABLE default." + tableName + " ADD COLUMN another_varchar STRING");

        assertThat(onTrino().executeQuery("DESCRIBE delta.default." + tableName))
                .containsOnly(ImmutableList.of(
                        row("a_number", "integer", "", ""),
                        row("a_varchar", "varchar", "", ""),
                        row("another_varchar", "varchar", "", "")));

        onDelta().executeQuery("INSERT INTO default." + tableName + "(a_number, a_varchar, another_varchar) VALUES (3, 'porto', 'cala')");
        expectedRows = ImmutableList.of(
                row(1, "ala", null),
                row(2, "bala", null),
                row(3, "porto", "cala"));

        assertThat(onDelta().executeQuery("SELECT a_number, a_varchar, another_varchar FROM default." + tableName))
                .containsOnly(expectedRows);
        assertThat(onTrino().executeQuery("SELECT a_number, a_varchar, another_varchar FROM delta.default." + tableName))
                .containsOnly(expectedRows);
    }
    finally {
        onDelta().executeQuery("DROP TABLE default." + tableName);
    }
}

This product test in TestDeltaLakeColumnMappingMode fails because the newly added column is not found in the parquetFieldIdToName map (it did not exist yet when the file was created by the initial inserts).
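
In other words, a file written before the ALTER simply has no entry for the new field ID, so the lookup has to tolerate absence instead of failing. A hypothetical sketch of the null-tolerant variant (names are illustrative, not this PR's code):

import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: an absent field ID means the file predates the column,
// so the caller should synthesize an all-null column rather than throw.
final class MissingFieldIdHandling
{
    private MissingFieldIdHandling() {}

    static Optional<String> physicalName(Map<Integer, String> parquetFieldIdToName, int fieldId)
    {
        // Optional.empty() tells the page source to fill the column with nulls,
        // matching the usual schema-evolution behavior for old files
        return Optional.ofNullable(parquetFieldIdToName.get(fieldId));
    }
}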

findepi (Member) left a comment:

Skimming.
Thank you @alexjo2144 @findinpath for your reviews!

@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from 95aa880 to 79703a5 on September 7, 2022 08:11
ebyhr (Member, Author) commented Sep 7, 2022

CI hit #12818

assertThat(onDelta().executeQuery("SELECT * FROM default." + tableName))
        .containsOnly(expectedRows);
assertThat(onTrino().executeQuery("SELECT * FROM delta.default." + tableName))
        .containsOnly(expectedRows);
A contributor commented:

While testing, I added the following snippet to your test case:

onDelta().executeQuery("ALTER TABLE default." + tableName + " DROP COLUMN another_varchar");
            assertThat(onTrino().executeQuery("DESCRIBE delta.default." + tableName))
                    .containsOnly(row("a_number", "integer", "", ""));
            expectedRows = ImmutableList.of(row(1), row(2), row(3));
            assertThat(onDelta().executeQuery("SELECT * FROM default." + tableName))
                    .containsOnly(expectedRows);
            assertThat(onTrino().executeQuery("SELECT * FROM delta.default." + tableName))
                    .containsOnly(expectedRows);

Since we are testing how the column mapping behaves when adding a column, we may as well verify how it behaves when dropping a column.

I stumbled into an unrelated issue:

tests               | 2022-09-07 20:21:44 INFO: FAILURE     /    io.trino.tests.product.deltalake.TestDeltaLakeColumnMappingMode.testColumnMappingModeIdAddColumn (Groups: profile_specific_tests, delta-lake-exclude-91, delta-lake-databricks, delta-lake-exclude-73) took 25.6 seconds
tests               | 2022-09-07 20:21:44 SEVERE: Failure cause:
tests               | io.trino.tempto.query.QueryExecutionException: java.sql.SQLException: [Databricks][DatabricksJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: java.lang.UnsupportedOperationException: Unrecognized column change class org.apache.spark.sql.connector.catalog.TableChange$DeleteColumn. You may be running an out of date Delta Lake version.
tests               | 	at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:53)
tests               | 	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:435)
tests               | 	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:257)
tests               | 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
tests               | 	at org.apache.spark.sql.hive.thriftserver.ThriftLocalProperties.withLocalProperties(ThriftLocalProperties.scala:123)
tests               | 	at org.apache.spark.sql.hive.thriftserver.ThriftLocalProperties.withLocalProperties$(ThriftLocalProperties.scala:48)
tests               | 	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:52)
tests               | 	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:235)
tests               | 	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:220)
tests               | 	at java.security.AccessController.doPrivileged(Native Method)
tests               | 	at javax.security.auth.Subject.doAs(Subject.java:422)
tests               | 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
tests               | 	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:269)
tests               | 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
tests               | 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
tests               | 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
tests               | 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
tests               | 	at java.lang.Thread.run(Thread.java:748)
tests               | Caused by: java.lang.UnsupportedOperationException: Unrecognized column change class org.apache.spark.sql.connector.catalog.TableChange$DeleteColumn. You may be running an out of date Delta Lake version.

Will ask about it on the Databricks Slack:

https://delta-users.slack.com/archives/CGK79PLV6/p1662561719425569

A contributor commented:

Indeed. Thanks @alexjo2144. Testing with Databricks 11, the scenario described earlier proved successful.


verify(pageSource.getReaderColumns().isEmpty(), "All columns expected to be base columns");

return new DeltaLakePageSource(
        deltaLakeColumns,
        nullColumnNames.build(),
A member commented:

I think you could get by without passing the null column names all the way through. The ParquetPageSource should already return nulls for columns whose names don't show up in the file. Maybe, if the column is missing from the file, you could give it a name like missing_column_<uuid>?

I'm not sure if this is cleaner, though.
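
For illustration, a tiny sketch of the placeholder-name idea (purely hypothetical; it relies on the reader producing nulls for names absent from the file):

import java.util.UUID;

// Hypothetical sketch of the suggested workaround: give a column that is
// missing from the Parquet file a unique placeholder name, so no column in
// the file matches it and the reader falls back to producing nulls.
final class MissingColumnNames
{
    private MissingColumnNames() {}

    static String placeholderName()
    {
        // Random suffix avoids colliding with a real column name
        return "missing_column_" + UUID.randomUUID();
    }
}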

ebyhr (Member, Author) commented Sep 7, 2022

Actually, I tried the same approach first and settled on the current change because passing dummy names felt redundant, and the ParquetPageSource behavior may change in the future. I'm not wedded to the current approach, though.

@findepi Do you have any opinion?

A member replied:

I am not able to fully weigh the consequences, but I agree that dummy names look weird.
What would be the benefit of doing so?

A member replied:

It gets rid of the changes in DeltaLakePageSource. That's about it.

A member replied:

I don't really have a preference; happy to merge as is.

@ebyhr ebyhr requested a review from findepi September 9, 2022 04:38
@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from 79703a5 to 90b911e on September 12, 2022 00:46
ebyhr (Member, Author) commented Sep 12, 2022

Rebased on upstream to resolve conflicts.

@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from 90b911e to 32bce58 on September 12, 2022 13:16
ebyhr (Member, Author) commented Sep 12, 2022

Rebased on upstream to resolve conflicts.

@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from 32bce58 to 2cf1a9d on September 12, 2022 21:37
ebyhr (Member, Author) commented Sep 12, 2022

Addressed all comments.

ebyhr (Member, Author) commented Sep 13, 2022

CI hit #13199 in TestDeltaLakeDatabricksInsertCompatibility.testCompressionWithOptimizedWriter

@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from 2cf1a9d to 16e0bcc on September 13, 2022 08:52
ebyhr (Member, Author) commented Sep 13, 2022

Rebased on upstream to resolve conflicts.

Also, extract a method to verify the supported column mapping and make DeltaLakePageSourceProvider.getParquetTupleDomain public.
@ebyhr ebyhr force-pushed the ebi/delta-id-column-mapping branch from 16e0bcc to 834b469 on September 14, 2022 02:49
@ebyhr ebyhr merged commit d468fb7 into master Sep 14, 2022
@ebyhr ebyhr deleted the ebi/delta-id-column-mapping branch September 14, 2022 07:23
@ebyhr ebyhr mentioned this pull request Sep 14, 2022
@github-actions github-actions bot added this to the 396 milestone Sep 14, 2022
@@ -57,26 +58,29 @@

private final String name;
private final Type type;
private final OptionalInt fieldId;
A member commented:

Include this in getRetainedSizeInBytes too.

cc @krvikash
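
For context, a simplified sketch of the memory-accounting pattern this refers to; the class name and field set are hypothetical, and it assumes the airlift SizeOf helpers and JOL's ClassLayout commonly used for this in Trino:

import java.util.OptionalInt;

import org.openjdk.jol.info.ClassLayout;

import static io.airlift.slice.SizeOf.estimatedSizeOf;
import static io.airlift.slice.SizeOf.sizeOf;
import static java.lang.Math.toIntExact;

// Simplified sketch: every field retained by the handle, including the newly
// added fieldId, should contribute to the retained-size estimate.
public class ColumnHandleSketch
{
    private static final int INSTANCE_SIZE = toIntExact(ClassLayout.parseClass(ColumnHandleSketch.class).instanceSize());

    private final String name;
    private final OptionalInt fieldId;

    public ColumnHandleSketch(String name, OptionalInt fieldId)
    {
        this.name = name;
        this.fieldId = fieldId;
    }

    public long getRetainedSizeInBytes()
    {
        return INSTANCE_SIZE
                + estimatedSizeOf(name)
                + sizeOf(fieldId); // account for the new OptionalInt field
    }
}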

Development

Successfully merging this pull request may close these issues.

Add support for delta.columnMapping.mode='id' in Delta Lake connector
5 participants