Collect Delta extended statistics when creating table #15878
Conversation
rebase on master to use CI fix #15879
Looks pretty good overall. Couple questions/nitpicks
@@ -68,6 +68,7 @@
     private Duration dynamicFilteringWaitTimeout = new Duration(0, SECONDS);
     private boolean tableStatisticsEnabled = true;
     private boolean extendedStatisticsEnabled = true;
+    private boolean collectExtendedStatisticsColumnStatisticsOnWrite = true;
Maybe just collectExtendedColumnStatisticsOnWrite?
Set<String> allColumnNames = extractColumnMetadata(metadata, typeManager).stream()
        .map(ColumnMetadata::getName)
        .collect(toImmutableSet());
Save the result of extractColumnMetadata so that you don't have to call it again at the bottom of this method.
Suggested change:
-Set<String> allColumnNames = extractColumnMetadata(metadata, typeManager).stream()
-        .map(ColumnMetadata::getName)
-        .collect(toImmutableSet());
+List<ColumnMetadata> columnMetadata = extractColumnMetadata(metadata, typeManager);
+Set<String> allColumnNames = columnMetadata.stream()
+        .map(ColumnMetadata::getName)
+        .collect(toImmutableSet());
@@ -2124,31 +2135,61 @@ public ConnectorAnalyzeMetadata getStatisticsCollectionMetadata(ConnectorSession
        handle.getReadVersion(),
        false);

TableStatisticsMetadata statisticsMetadata = getStatisticsCollectionMetadata(
        statistics,
        extractColumnMetadata(metadata, typeManager),
Per other comment, don't have to call extractColumnMetadata again.
        Optional.empty(),
        tableMetadata.getColumns(),
        allColumnNames,
        false);
Why not includeMaxFileModifiedTime in this situation?
Statistics aggregation during table creation does not have information about file_modified_time yet.
Right, right. Then if the modified time isn't present we just use the current time when the collection is done. Makes sense.
Can you please add a code comment explaining this consideration?
What do we need to have this information available?
@@ -1361,7 +1361,7 @@ private void testDeltaLakeTableLocationChanged(boolean fewerEntries, boolean fir
  * testing in {@link TestDeltaLakeAnalyze}.
  */
 @Test
-public void testAnalyze()
+public void testStatisticsGenerationDuringTableCreation()
We should still test the old thing too
agreed
@@ -147,22 +139,24 @@ private void testAnalyze(Optional<Integer> checkpointInterval)
 public void testAnalyzePartitioned()
 {
     String tableName = "test_analyze_" + randomNameSuffix();
-    assertUpdate("CREATE TABLE " + tableName
+    assertUpdate(
Please do create a compatibility test with Spark to verify that after a CTAS, DESC EXTENDED works as intended on Databricks.
Nevermind. Trino Delta Lake (on the storage layer) & Databricks (on the metastore properties) have outputs in different places.
    return collectExtendedStatisticsColumnStatisticsOnWrite;
}

@Config("delta.extended-statistics.collect-on-write")
Do consider documenting this new property in delta-lake.rst - either in this PR or a follow-up PR.
I would wait with the documentation until other write operations are implemented, if that's OK.
👍
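For readers who want to see how such a toggle gets wired up, the property binds through Airlift's configuration annotations. Below is a trimmed-down sketch using the field name suggested earlier in this review; only the new property is shown, and the description text is an assumption rather than the PR's exact diff.

import io.airlift.configuration.Config;
import io.airlift.configuration.ConfigDescription;

public class DeltaLakeConfig
{
    private boolean collectExtendedColumnStatisticsOnWrite = true;

    public boolean isCollectExtendedColumnStatisticsOnWrite()
    {
        return collectExtendedColumnStatisticsOnWrite;
    }

    @Config("delta.extended-statistics.collect-on-write")
    @ConfigDescription("Collect extended column statistics when data is written to the table") // wording assumed for illustration
    public DeltaLakeConfig setCollectExtendedColumnStatisticsOnWrite(boolean collectExtendedColumnStatisticsOnWrite)
    {
        this.collectExtendedColumnStatisticsOnWrite = collectExtendedColumnStatisticsOnWrite;
        return this;
    }
}

A delta-lake.rst entry would then describe delta.extended-statistics.collect-on-write alongside the existing extended-statistics properties.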
@@ -388,8 +382,7 @@ public void testAnalyzeSomeColumns()
 @Test
 public void testDropExtendedStats()
 {
-    try (TestTable table = new TestTable(
-            getQueryRunner()::execute,
+    try (TestTable table = new TestTable(getQueryRunner()::execute,
nit: There's no need to change this line. I would revert.
Reduce map iterations and lookups to a minimum, while also simplifying the code flow.
        mergedColumnStatistics.keySet(),
        analyzeHandle.getColumns().get()));
}
analyzeHandle.flatMap(AnalyzeHandle::getColumns).ifPresent(analyzeColumns -> {
nit: this kind of cosmetic change can be done in a separate commit.
@@ -2402,7 +2451,8 @@ private static Optional<Instant> getMaxFileModificationTime(Collection<ComputedS
            }
            return Optional.of(Instant.ofEpochMilli(unpackMillisUtc(TimestampWithTimeZoneType.TIMESTAMP_TZ_MILLIS.getLong(entry.getValue(), 0))));
        })
-       .collect(onlyElement());
+       .collect(toOptional())
separate commit
This change makes sense only with this commit, as it allows the collection to have 0 elements; before this commit that would throw an exception.
        Optional.empty(),
        tableMetadata.getColumns(),
        allColumnNames,
        false); // File modified time is not available during planning phase as table is not created yet. Time is added during statistics update.
"Time is added during statistics update." Do you mean the maximum file modified time?
        .max(Long::compare)
        .map(Instant::ofEpochMilli);

updateTableStatistics(session,
updateTableStatistics(
        session,
private void updateTableStatistics(ConnectorSession session, Optional<AnalyzeHandle> analyzeHandle, String location, Optional<Instant> maxFileModificationTime,
        Collection<ComputedStatistics> computedStatistics)
This line is now over the line length limit, so either put all arguments on one line, or each on a separate line.
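For illustration, the "each argument on its own line" variant of that signature would look like the sketch below; the parameter list is copied from the snippet above, only the formatting changes.

private void updateTableStatistics(
        ConnectorSession session,
        Optional<AnalyzeHandle> analyzeHandle,
        String location,
        Optional<Instant> maxFileModificationTime,
        Collection<ComputedStatistics> computedStatistics)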
@@ -2410,7 +2458,8 @@ private static Optional<Instant> getMaxFileModificationTime(Collection<ComputedS
            }
            return Optional.of(Instant.ofEpochMilli(unpackMillisUtc(TimestampWithTimeZoneType.TIMESTAMP_TZ_MILLIS.getLong(entry.getValue(), 0))));
        })
-       .collect(onlyElement());
+       .collect(toOptional())
+       .flatMap(identity());
That's a minimal change, but it's not how you'd write the code if you were writing it anew.
.flatMap(entry -> {
    ....
    if (....) {
        return Stream.of();
    }
    return Stream.of(Instant.ofEpochMilli(....));
})
.collect(toOptional());
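For context, Guava's MoreCollectors.toOptional() returns the stream's sole element, or an empty Optional for an empty stream, so flat-mapping absent values away avoids the Optional-inside-Optional plus flatMap(identity()) step. The following is a self-contained sketch of that pattern; the map shape and the key name are illustrative stand-ins, not the connector's actual types.

import static com.google.common.collect.MoreCollectors.toOptional;

import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Stream;

public final class MaxFileModificationTimeSketch
{
    private MaxFileModificationTimeSketch() {}

    // Hypothetical input: epoch-millis values keyed by statistic name, where null means "not available".
    static Optional<Instant> getMaxFileModificationTime(Map<String, Long> statistics)
    {
        return statistics.entrySet().stream()
                .filter(entry -> entry.getKey().equals("$file_modified_time")) // illustrative key name
                .flatMap(entry -> {
                    Long millis = entry.getValue();
                    if (millis == null) {
                        // Drop the absent value instead of wrapping it in a nested Optional
                        return Stream.<Instant>empty();
                    }
                    return Stream.of(Instant.ofEpochMilli(millis));
                })
                .collect(toOptional());
    }
}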
        Optional.empty(),
        tableMetadata.getColumns(),
        allColumnNames,
        false); // File modified time is not available during planning phase for writes. Maximum file modification time is obtained during statistics update.
It sounds like a problem and a workaround, but there isn't a problem:
// File modified time does not need to be collected as a statistic because it gets derived directly from the files being written
false);
@Test
public void testStatisticsGenerationDuringTableCreation()
{
    String tableName = "test_analyze_" + randomNameSuffix();
test_analyze_ -> test_ctas_stats_
@@ -1391,6 +1395,26 @@ public void testAnalyze()
         "(null, null, null, null, 25.0, null, null)");
 }

+@Test
+public void testStatisticsGenerationDuringTableCreation()
Can you paste this method's contents into testCreateTableAsStatistics above? testCreateTableAsStatistics has a good name and a javadoc; it's just that the contents are worse.
"CREATE TABLE " + tableName + " " + | ||
(checkpointInterval.isPresent() ? format(" WITH (checkpoint_interval = %s)", checkpointInterval.get()) : "") + | ||
" AS SELECT * FROM tpch.sf1.nation", | ||
25); |
nit: unrelated fmt change
+ " WITH (" | ||
+ " partitioned_by = ARRAY['regionkey']" | ||
+ ")" | ||
+ "AS SELECT * FROM tpch.sf1.nation", 25); | ||
+ "AS SELECT * FROM tpch.sf1.nation", | ||
25); |
nit: unrelated fmt change
@@ -276,7 +270,7 @@ public void testAnalyzeWithFilesModifiedAfter()
 {
     String tableName = "test_analyze_" + randomNameSuffix();

-    assertUpdate("CREATE TABLE " + tableName + " AS SELECT * FROM tpch.sf1.nation", 25);
+    assertUpdate(disableStatisticsCollectionOnWrite(getSession()), "CREATE TABLE " + tableName + " AS SELECT * FROM tpch.sf1.nation", 25);
nit: each arg on separate line
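As a sketch, the reformatted call could look like the lines below; the helper shown after it is hypothetical (its name comes from the diff above, and the catalog session property name is an assumption, not the PR's actual code).

assertUpdate(
        disableStatisticsCollectionOnWrite(getSession()),
        "CREATE TABLE " + tableName + " AS SELECT * FROM tpch.sf1.nation",
        25);

// Hypothetical helper: overrides the catalog session property that mirrors
// delta.extended-statistics.collect-on-write (the property name below is assumed).
private static Session disableStatisticsCollectionOnWrite(Session session)
{
    return Session.builder(session)
            .setCatalogSessionProperty(session.getCatalog().orElseThrow(), "extended_statistics_collect_on_write", "false")
            .build();
}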
"('name', null, null, 0.0, null, null, null)," + | ||
"(null, null, null, null, 25.0, null, null)"); | ||
|
||
runAnalyzeVerifySplitCount(tableName, 5); |
I know it's preexisting, but I don't think we need to assert the split count in every test method here. It blurs the test's intent. (Perhaps we don't need it in any test, I don't know, but I am not requesting any change to existing tests.) This would be better:
assertUpdate("ANALYZE " + tableName);
@pajaks @findinpath @alexjo2144 thank you, this is awesome!
In particular this improves Delta query performance on data sets created in the connector using CTAS.
Description
Collect Delta Lake statistics for CREATE TABLE AS.
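For illustration, in the style of the connector tests discussed above, the end-to-end effect is that a plain CTAS now leaves extended statistics behind without a follow-up ANALYZE. The sketch below is illustrative; the table name prefix and the assertions are not part of the PR.

String tableName = "test_ctas_stats_" + randomNameSuffix();
assertUpdate("CREATE TABLE " + tableName + " AS SELECT * FROM tpch.sf1.nation", 25);
// With delta.extended-statistics.collect-on-write enabled (the default), the NDV and
// null-fraction columns of SHOW STATS should already be populated at this point.
assertQuerySucceeds("SHOW STATS FOR " + tableName);
assertUpdate("DROP TABLE " + tableName);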
Additional context and related issues
Release notes
(x) Release notes are required, with the following suggested text: