
Support converting column stats on row type to json in Delta Lake #14314

Merged (3 commits) Oct 11, 2022

Conversation

ebyhr (Member) commented Sep 27, 2022

Description

Fixes #13996

Release notes

(x) This is not user-visible or docs only and no release notes are required.

@cla-bot cla-bot bot added the cla-signed label Sep 27, 2022
@ebyhr ebyhr force-pushed the ebi/delta-json-stats-row-type branch from a5a1264 to 497e650 Compare October 4, 2022 08:26
@ebyhr ebyhr marked this pull request as ready for review October 4, 2022 08:29
@ebyhr ebyhr force-pushed the ebi/delta-json-stats-row-type branch from 497e650 to 8c61387 Compare October 5, 2022 00:25
ebyhr (Member Author) commented Oct 5, 2022

CI hit #14391 at

  • TestDeltaLakeWriteDatabricksCompatibility.testCaseUpdatePartitionColumnFails
  • TestDeltaLakeDatabricksPartitioningCompatibility.testTrinoCanReadFromTablePartitionChangedByDatabricks

```java
ImmutableMap.Builder<String, Object> fieldValues = ImmutableMap.builder();
for (int i = 0; i < rowBlock.getPositionCount(); i++) {
    RowType.Field field = rowType.getFields().get(i);
    Object fieldValue = readNativeValue(field.getType(), rowBlock.getChildren().get(i), i);
```
Member:
Rather than getChildren I think you want to convert the rowBlock to a ColumnarRow

Member Author (ebyhr):

The argument is a SingleRowBlock, which isn't supported by ColumnarRow#toColumnarRow.

Member:

That's surprising, toColumnarRow checks that the input is an instance of AbstractRowBlock, which SingleRowBlock extends. Seems like it should work.

Where does the error come from?

Member Author (ebyhr):

> toColumnarRow checks that the input is an instance of AbstractRowBlock, which SingleRowBlock extends. Seems like it should work.

SingleRowBlock extends AbstractSingleRowBlock, not AbstractRowBlock.

Member:

Ah, sorry I can't read
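The hierarchy point ebyhr makes above can be sketched with placeholder classes: because SingleRowBlock descends from AbstractSingleRowBlock rather than AbstractRowBlock, the instanceof guard in toColumnarRow rejects it. The class names below mirror Trino's, but the bodies are hypothetical empty stubs, not the real block implementations:

```java
// Minimal, self-contained mock of the class hierarchy discussed above.
// These are placeholder stubs; the real Trino classes carry actual block data.
abstract class Block {}
abstract class AbstractRowBlock extends Block {}
abstract class AbstractSingleRowBlock extends Block {}
class RowBlock extends AbstractRowBlock {}
class SingleRowBlock extends AbstractSingleRowBlock {}

public class ColumnarRowCheck {
    // Sketch of the instanceof guard performed at the top of toColumnarRow
    static boolean acceptedByToColumnarRow(Block block) {
        return block instanceof AbstractRowBlock;
    }

    public static void main(String[] args) {
        System.out.println(acceptedByToColumnarRow(new RowBlock()));       // true
        System.out.println(acceptedByToColumnarRow(new SingleRowBlock())); // false
    }
}
```

This is why the code under review falls back to reading the child blocks directly rather than going through ColumnarRow.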

ebyhr (Member Author) commented Oct 6, 2022

CI hit #14391

```java
// The first two entries created by Databricks have column stats. The last one doesn't
// have column stats because the connector doesn't support collecting it on row columns.
List<AddFileEntry> addFileEntries = getAddFileEntries("json_stats_on_row_type").stream()
        .sorted(comparing(AddFileEntry::getModificationTime))
        .collect(toImmutableList());
assertThat(addFileEntries).hasSize(3);
assertJsonStatistics(
```
Contributor:

The assertions for addFileEntries.get(0) and addFileEntries.get(1) are not relevant. The stats already existed there before running the test.

Member Author (ebyhr):

No, they're relevant. Those two assertions fail if we don't copy the statistics.
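The sort in the test excerpt above uses Guava's toImmutableList; the same sort-by-modification-time pattern can be sketched with only the JDK. AddFileEntry is simplified here to just the fields the sort touches (the real Delta Lake class carries many more):

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

// Simplified stand-in for Delta Lake's AddFileEntry
record AddFileEntry(String path, long modificationTime) {}

public class SortAddFileEntries {
    public static void main(String[] args) {
        // Sort entries oldest-first so positional assertions are deterministic
        List<AddFileEntry> sorted = Stream.of(
                        new AddFileEntry("part-2.parquet", 300L),
                        new AddFileEntry("part-0.parquet", 100L),
                        new AddFileEntry("part-1.parquet", 200L))
                .sorted(Comparator.comparing(AddFileEntry::modificationTime))
                .toList();
        System.out.println(sorted.get(0).path()); // part-0.parquet
    }
}
```

Sorting before asserting is what lets the test index addFileEntries positionally and know which entry is the one Trino just wrote.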

```java
import static io.trino.plugin.hive.HiveTestUtils.HDFS_ENVIRONMENT;
import static io.trino.testing.TestingConnectorSession.SESSION;

public final class TestDeltaLakeUtils
```
Member:
Test -> Testing

```java
{
    private TestDeltaLakeUtils() {}

    public static List<AddFileEntry> getAddFileEntries(SchemaTableName table, String tableLocation)
```
Member:
The table has no impact on the result of this method, so you can remove this parameter and use eg new SchemaTableName("dummy_schema_placeholder", "dummy_table_placeholder") below

@@ -222,6 +240,37 @@ private static Map<String, Object> jsonEncode(Map<String, Optional<Statistics<?>

```java
        .collect(toImmutableMap(Map.Entry::getKey, entry -> entry.getValue().get()));
}

public static Map<String, Object> toNullCounts(Map<String, Type> columnTypeMapping, Map<String, Object> values)
{
    verify(columnTypeMapping.keySet().containsAll(values.keySet()), "columnTypeMapping should contains all keys of values");
```
Member:

include the key sets in the message

also, would be nice to add a comment why this is expected. it's not obvious to me

Member:

btw instead of this check here, i'd rather have a non-null check on type after Type type = columnTypeMapping.get(value.getKey()); line
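The reviewer's alternative can be sketched as follows. This is a simplified illustration, not the merged code: column types are plain strings here, and the error-message format is hypothetical, though it includes both key sets as suggested above:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NullCountsSketch {
    // Per-key null check instead of an up-front containsAll verify:
    // each missing column fails with a message naming the key and the known columns
    static Map<String, Object> toNullCounts(Map<String, String> columnTypeMapping, Map<String, Object> values) {
        Map<String, Object> nullCounts = new LinkedHashMap<>();
        for (Map.Entry<String, Object> value : values.entrySet()) {
            String type = columnTypeMapping.get(value.getKey());
            if (type == null) {
                throw new IllegalStateException("No type for column '" + value.getKey()
                        + "'; known columns: " + columnTypeMapping.keySet());
            }
            nullCounts.put(value.getKey(), value.getValue());
        }
        return nullCounts;
    }

    public static void main(String[] args) {
        Map<String, String> types = Map.of("a", "bigint");
        System.out.println(toNullCounts(types, Map.of("a", 0L))); // {a=0}
        try {
            toNullCounts(types, Map.of("b", 1L));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The per-key check fails closer to the offending entry, so the message pinpoints exactly which column is missing from the mapping.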

```java
public void testConvertJsonStatisticsToParquetOnRowType()
        throws Exception
{
    verifySupportsInsert();
```
Member:

a "verify ..." should verify, i.e. ensure something is true

as a follow-up we could rename this to eg skipUnlessInsertsSupported

Member Author (ebyhr):

I will send a follow-up PR.

@@ -0,0 +1 @@

```json
{"version":2,"size":4}
```
Member:

For the test, do we need the transaction JSON files before the checkpoint (0 and 1)?

Member Author (ebyhr):

Those files aren't required. Removed.

```java
assertUpdate("INSERT INTO json_stats_on_row_type SELECT CAST(row(3) AS row(x bigint)), CAST(row(row('test insert')) AS row(y row(nested varchar)))", 1);

// The first two entries created by Databricks have column stats. The last one doesn't
// have column stats because the connector doesn't support collecting it on row columns.
List<AddFileEntry> addFileEntries = getAddFileEntries("json_stats_on_row_type").stream()
        .sorted(comparing(AddFileEntry::getModificationTime))
        .collect(toImmutableList());
```
Member:

do the getAddFileEntries come from a new snapshot that we just created, or from previous snapshot + transaction log files?

i think the intention is that we create transaction 4 and a checkpoint, so let's verify that happened

Comment on lines +1 to +21
Data generated using Databricks 10.4:

```sql
CREATE TABLE default.json_stats_on_row_type
(struct_col struct<x bigint>, nested_struct_col struct<y struct<nested string>>)
USING DELTA
LOCATION 's3://bucket/table'
TBLPROPERTIES (
delta.checkpointInterval = 2,
delta.checkpoint.writeStatsAsJson = false,
delta.checkpoint.writeStatsAsStruct = true
);

INSERT INTO default.json_stats_on_row_type SELECT named_struct('x', 1), named_struct('y', named_struct('nested', 'test'));
INSERT INTO default.json_stats_on_row_type SELECT named_struct('x', NULL), named_struct('y', named_struct('nested', NULL));

ALTER TABLE default.json_stats_on_row_type SET TBLPROPERTIES (
'delta.checkpoint.writeStatsAsJson' = true,
'delta.checkpoint.writeStatsAsStruct' = false
);
```
Member:

❤️

@ebyhr ebyhr force-pushed the ebi/delta-json-stats-row-type branch from 2312115 to bcbbc9f Compare October 11, 2022 01:44
@ebyhr ebyhr merged commit a9480bd into master Oct 11, 2022
@ebyhr ebyhr deleted the ebi/delta-json-stats-row-type branch October 11, 2022 04:58
@github-actions github-actions bot added this to the 400 milestone Oct 11, 2022
Development

Successfully merging this pull request may close these issues.

Support converting column stats on ROW type to JSON from Parquet in Delta Lake connector
4 participants