
Handle partition schema evolution in partitions metadata #12416

Merged
2 commits merged into trinodb:master from homar/fix_iceberg_partitions_metadata on May 25, 2022

Conversation

@homar homar commented May 16, 2022

Description

Provides partitioning information based on the set of all columns that were used in any partition spec.

Related issues, pull requests, and links

Fixes: #12323

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@@ -199,8 +206,9 @@ private Map<StructLikeWrapper, IcebergStatistics> getStatisticsByPartition(Table
.acceptDataFile(dataFile, fileScanTask.spec());
}

- return partitions.entrySet().stream()
+ ImmutableMap<StructLikeWrapper, IcebergStatistics> collect = partitions.entrySet().stream()
Contributor

drop variable assignment.

@@ -140,6 +142,11 @@ public ConnectorTableMetadata getTableMetadata()
return connectorTableMetadata;
}

private List<PartitionField> getAllPartitionFields(Table icebergTable)
{
return icebergTable.specs().values().stream().flatMap(x -> x.fields().stream()).collect(toUnmodifiableList());
Contributor

Do we need to perform deduplication here?

Member

Yeah, I think we do

@findinpath
Contributor

Can you please add a test inspired by https://blog.starburst.io/trino-on-ice-ii-in-place-table-evolution-and-cloud-compatibility-with-iceberg that uses partition transforms (https://trino.io/docs/current/connector/iceberg.html#partitioned-tables), e.g. changing from month(ts) to day(ts)?

Also, a test that adds a partition field, drops it, and later adds it again would be welcome, to see whether deduplication of the partition fields is needed.
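For illustration, a minimal sketch of such a test in the style of the onTrino()/onSpark() helpers used elsewhere in this PR; the table names, the ts column, and the exact assertion are assumptions, not code from the PR:

// Hypothetical sketch only: evolve the partitioning from month(ts) to day(ts) and check $partitions.
onTrino().executeQuery(format(
        "CREATE TABLE %s (value INT, ts timestamp(6)) WITH (partitioning = ARRAY['month(ts)'])", trinoTableName));
onTrino().executeQuery(format("INSERT INTO %s VALUES (1, timestamp '2022-04-10 01:02:03')", trinoTableName));

// Change the transform on the Spark side (assumes Iceberg's Spark SQL extensions are enabled).
onSpark().executeQuery(format("ALTER TABLE %s REPLACE PARTITION FIELD months(ts) WITH days(ts)", sparkTableName));
onTrino().executeQuery(format("INSERT INTO %s VALUES (2, timestamp '2022-05-11 01:02:03')", trinoTableName));

// Both the month-based and the day-based partition fields should now show up in the metadata table.
List<Object> partitions = onTrino()
        .executeQuery("SELECT partition FROM iceberg.default.\"test_partition_transform_evolution$partitions\"")
        .column(1);
Assertions.assertThat(partitions).hasSize(2);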

@findinpath
Contributor

This PR could build much simpler tests on top of the functionality exposed by PR #12259.

@alexjo2144

@@ -140,6 +142,11 @@ public ConnectorTableMetadata getTableMetadata()
return connectorTableMetadata;
}

private List<PartitionField> getAllPartitionFields(Table icebergTable)
Member

Suggested change
private List<PartitionField> getAllPartitionFields(Table icebergTable)
private static List<PartitionField> getAllPartitionFields(Table icebergTable)

@@ -140,6 +142,11 @@ public ConnectorTableMetadata getTableMetadata()
return connectorTableMetadata;
}

private List<PartitionField> getAllPartitionFields(Table icebergTable)
{
return icebergTable.specs().values().stream().flatMap(x -> x.fields().stream()).collect(toUnmodifiableList());
Member

toUnmodifiableList() -> toImmutableList()

Member Author

Sorry!

@@ -180,7 +187,7 @@ public RecordCursor cursor(ConnectorTransactionHandle transactionHandle, Connect
.useSnapshot(snapshotId.get())
.includeColumnStats();
// TODO make the cursor lazy
- return buildRecordCursor(getStatisticsByPartition(tableScan), icebergTable.spec().fields());
+ return buildRecordCursor(getStatisticsByPartition(tableScan), getAllPartitionFields(icebergTable));
Member

Isn't there a class-level variable with these already?

Suggested change
return buildRecordCursor(getStatisticsByPartition(tableScan), getAllPartitionFields(icebergTable));
return buildRecordCursor(getStatisticsByPartition(tableScan), partitionFields);

Member Author

There isn't, but I can add one.

@homar homar force-pushed the homar/fix_iceberg_partitions_metadata branch 3 times, most recently from c0f61a5 to e3854c2 on May 18, 2022 07:24
@findepi
Member

findepi commented May 19, 2022

@alexjo2144 @findinpath PTAL

{
List<RowType.Field> partitionFields = fields.stream()
.map(field -> RowType.field(
field.name(),
toTrinoType(field.transform().getResultType(schema.findType(field.sourceId())), typeManager)))
.collect(toImmutableList());
List<Integer> ids = fields.stream()
.map(PartitionField::fieldId)
.collect(toImmutableList());
if (partitionFields.isEmpty()) {
Contributor

unrelated to the current commit:

if (partitionFields.isEmpty()) {
            return Optional.empty();
}

this can be replaced with a check

if (fields.isEmpty()) {
            return Optional.empty();
}

at the beginning of the method.

Member Author

I will add it as a separate commit


onTrino().executeQuery(format("CREATE TABLE %s (old_partition_key INT, new_partition_key INT, value date) WITH (PARTITIONING = array['old_partition_key'])", trinoTableName));
onTrino().executeQuery(format("INSERT INTO %s VALUES (1, 10, date '2022-04-10'), (2, 20, date '2022-05-11'), (3, 30, date '2022-06-12'), (2, 20, date '2022-06-13')", trinoTableName));

Contributor

Please check the partitioning before making changes to the partition fields.

.column(1);
Set<String> partitions = partitioning.stream().map(String::valueOf).collect(toUnmodifiableSet());
Assertions.assertThat(partitions.size()).isEqualTo(3);
Assertions.assertThat(partitions).containsAll(ImmutableList.of(
Contributor

@findinpath findinpath May 19, 2022

Why is the old_partition_key shown here?

The current snapshot of the table has only new_partition_key as partition key.

Checking the partitions metadata table with the Iceberg Spark implementation also shows only new_partition_key.

Member Author

Discussed offline; I will add more queries against Spark to show that the behaviour is the same.
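A sketch of what such a cross-check could look like; the `.partitions` metadata table is Iceberg's Spark-side counterpart of `$partitions`, and the assertion is illustrative:

// Hypothetical sketch: read Iceberg's partitions metadata table through Spark and compare with Trino's output.
List<Object> sparkPartitions = onSpark()
        .executeQuery(format("SELECT partition FROM %s.partitions", sparkTableName))
        .column(1);
// Expect the same partition structs as on the Trino side, i.e. entries that still carry the
// dropped old_partition_key next to new_partition_key.
Assertions.assertThat(sparkPartitions).hasSize(3);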

private static class IcebergPartitionColumn
{
private final RowType rowType;
private final List<Integer> ids;
Member

fieldIds?

io.trino.spi.type.Type trinoType = partitionColumnType.rowType.getFields().get(i).getType();
Object value = null;
for (int j = 0; j < partitionStruct.structType.fields().size(); j++) {
if (partitionStruct.structType.fields().get(j).fieldId() == partitionColumnType.ids.get(i)) {
Member

Can we use a Map from fieldId -> Type rather than doing the double iteration on the lists?

Member Author

I may be missing something, but I failed to create such a map.
The reason for this inner loop is not to find the type (that is found above it) but to assign data from partitionStruct to the correct partition field. The problem is that partitionStruct comes from reading the files via fileScanTasks, and here I am trying to match that with the partitioning columns that come from reading the table spec.
My assumption is that neither of these places contains all the information on its own, which is why I need to match them.
It would be great if I'm wrong about that; please point it out.
Though before I started changing this it was already done this way; maybe that was wrong and I should have tried to change it, but I doubt it.

Member Author

@homar homar May 20, 2022

Hmm, I may have figured out how to improve this, though.
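For context, the improvement that later lands in the PR is essentially a fieldId-to-position lookup built once per partition struct type; a rough sketch of the idea (surrounding code abbreviated, names follow the snippets in this review):

// Build the lookup once, when the wrapper around the partition struct is created.
Map<Integer, Integer> fieldIdToIndex = new HashMap<>();
List<Types.NestedField> fields = structType.fields();
for (int position = 0; position < fields.size(); position++) {
    fieldIdToIndex.put(fields.get(position).fieldId(), position);
}

// Later, the inner loop over partitionStruct.structType.fields() becomes a direct lookup:
Integer position = fieldIdToIndex.get(partitionColumnType.ids.get(i));
Object value = null;
if (position != null) {
    // read the raw value at that position from the partition struct; conversion to the
    // Trino type stays the same as in the existing cursor code
    value = partitionStruct.structLikeWrapper.get().get(position, Object.class);
}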

"{old_partition_key=2, new_partition_key=null}",
"{old_partition_key=3, new_partition_key=null}"));

onSpark().executeQuery(format("ALTER TABLE %s DROP PARTITION FIELD old_partition_key", sparkTableName));
Member

Can you add an INSERT along with each of these ALTER TABLE sections, so we can see the partitions using this field show up?
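For illustration, the kind of statement being asked for after each ALTER TABLE, following the INSERT pattern already used in this test (the values are placeholders):

// Hypothetical: insert a row after dropping old_partition_key, so a partition written under
// the new spec shows up in the $partitions output as well.
onTrino().executeQuery(format("INSERT INTO %s VALUES (4, 40, date '2022-07-14')", trinoTableName));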

@homar homar force-pushed the homar/fix_iceberg_partitions_metadata branch from e3854c2 to a8b1e3f on May 20, 2022 15:45
.filter(partitionField -> existingColumnsIds.contains(partitionField.sourceId()))
.collect(toImmutableList());

Set<Integer> alreadyExistingFieldIds = new HashSet<>();
Contributor

Nit (optional): extract the logic for filtering duplicates into a separate method, just to improve readability.

private final List<Types.NestedField> nonPartitionPrimitiveColumns;
private final Optional<RowType> partitionColumnType;
private final List<NestedField> nonPartitionPrimitiveColumns;
private final Optional<IcebergPartitionColumn> partitionColumnType;
private final List<io.trino.spi.type.Type> partitionColumnTypes;
Contributor

The field partitionColumnTypes can be dropped now.
It is used only once, in a code branch that is dependent on partitionColumnType.

{
this.structLikeWrapper = structLikeWrapper;
Map<Integer, Integer> fieldIdToIndex = new HashMap<>();
List<NestedField> fields = structType.fields();
Contributor

this.fieldIdToIndex = fields.stream().collect(Collectors.toMap(NestedField::fieldId, Function.identity()));

Member Author

This is incorrect. I need a mapping from fieldId to its index; your code gives fieldId -> NestedField. I could probably do it with some zipWithIndex-style method, but I think the standard way is more readable.
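For reference, the stream-based equivalent of that zipWithIndex idea would look roughly like this (a sketch; whether it beats the plain loop in readability is debatable):

// Hypothetical alternative: fieldId -> position via an index stream instead of an explicit loop.
Map<Integer, Integer> fieldIdToIndex = IntStream.range(0, fields.size())
        .boxed()
        .collect(toImmutableMap(position -> fields.get(position).fieldId(), position -> position));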

Contributor

Sorry, I didn't pay enough attention here. Thanks for the explanation.

@homar homar force-pushed the homar/fix_iceberg_partitions_metadata branch from a8b1e3f to 30f0d72 on May 23, 2022 10:08
Member

@alexjo2144 alexjo2144 left a comment

A couple of nits, but looks good to me.

- List<Types.NestedField> columns = icebergTable.schema().columns();
- List<PartitionField> partitionFields = icebergTable.spec().fields();
+ List<NestedField> columns = icebergTable.schema().columns();
+ partitionFields = getAllPartitionFields(icebergTable);
Member

Suggested change
partitionFields = getAllPartitionFields(icebergTable);
this.partitionFields = getAllPartitionFields(icebergTable);

.values().stream()
.flatMap(partitionSpec -> partitionSpec.fields().stream())
// skip columns that were dropped
.filter(partitionField -> existingColumnsIds.contains(partitionField.sourceId()))
Member

This is not possible right now because of apache/iceberg#4563, right?

Member Author

@homar homar May 25, 2022

Yes, but not only that; it is also to avoid name conflicts with columns that were renamed.

.filter(partitionField -> existingColumnsIds.contains(partitionField.sourceId()))
.collect(toImmutableList());

return filterOutDuplicates(visiblePartitionFields);
Member

Use Stream#distinct?

Member Author

I can't rely on PartitionField's implementation of hashCode and equals, as they take the transform into account, and I only care about the IDs. I could maybe provide my own comparator or something, but I think this is cleaner now.
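A sketch of what that ID-based deduplication can look like (names follow the snippets above; this is an illustration rather than the exact code from the PR):

private static List<PartitionField> filterOutDuplicates(List<PartitionField> fields)
{
    // Keep only the first occurrence of each partition field ID. PartitionField#equals also
    // compares the transform, which is why Stream#distinct would not do the right thing here.
    Set<Integer> alreadyExistingFieldIds = new HashSet<>();
    ImmutableList.Builder<PartitionField> result = ImmutableList.builder();
    for (PartitionField field : fields) {
        if (alreadyExistingFieldIds.add(field.fieldId())) {
            result.add(field);
        }
    }
    return result.build();
}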

public StructLikeWrapperWithStructType(StructLikeWrapper structLikeWrapper, Types.StructType structType)
{
this.structLikeWrapper = structLikeWrapper;
Map<Integer, Integer> fieldIdToIndex = new HashMap<>();
Member

Use ImmutableMap.Builder

Member Author

Sure, but that's more code.
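For comparison, an ImmutableMap.Builder variant of that constructor snippet might read as follows (a sketch; assumes a Guava version with buildOrThrow):

// Hypothetical variant of the constructor snippet using an immutable map.
ImmutableMap.Builder<Integer, Integer> fieldIdToIndex = ImmutableMap.builder();
List<Types.NestedField> fields = structType.fields();
for (int position = 0; position < fields.size(); position++) {
    fieldIdToIndex.put(fields.get(position).fieldId(), position);
}
this.fieldIdToIndex = fieldIdToIndex.buildOrThrow();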

@@ -292,4 +333,70 @@ private static Block getColumnMetricBlock(RowType columnMetricType, Object min,
rowBlockBuilder.closeEntry();
return columnMetricType.getObject(rowBlockBuilder, 0);
}

private static class StructLikeWrapperWithStructType
Member

Let's rename this, now that the field is fieldIdToIndex instead of the structType itself.

@homar homar force-pushed the homar/fix_iceberg_partitions_metadata branch from 30f0d72 to 4d5e849 on May 25, 2022 07:57
@homar homar force-pushed the homar/fix_iceberg_partitions_metadata branch from 4d5e849 to 9d31fc7 on May 25, 2022 10:14
@findepi findepi merged commit db099cf into trinodb:master May 25, 2022
@findepi
Member

findepi commented May 25, 2022

Merged, thanks!
Thank you @findinpath @alexjo2144 for your review

@github-actions github-actions bot added this to the 382 milestone May 25, 2022
Development

Successfully merging this pull request may close these issues.

Iceberg $partitions metadata table only uses the current Spec
4 participants