
Use table schema from the table handle #14076

Merged

Conversation


@findinpath findinpath commented Sep 9, 2022

Description

When dealing with an Iceberg table whose structure evolves over time (columns are added / dropped), a snapshot/time travel query now produces output whose schema matches the schema of the table snapshot being queried.

Fixes #14064
Relates to #12786

Non-technical explanation

For time travel queries, use the table schema corresponding to the queried snapshot of the table
to determine the columns of the output.
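The behavior described above can be sketched with a minimal, self-contained example. The types below (`Snapshot`, `SNAPSHOTS`, `columnsForSnapshot`) are hypothetical stand-ins for Trino's real `IcebergTableHandle` machinery, used only to illustrate why the output columns must come from the queried snapshot rather than the current table schema:

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch only: hypothetical types standing in for Trino's
// table-handle machinery, showing why a time-travel query must resolve
// its output columns from the queried snapshot's schema.
public class TimeTravelSchemaSketch
{
    record Snapshot(long id, List<String> columns) {}

    // schema history of a table whose column set evolved over time
    static final Map<Long, Snapshot> SNAPSHOTS = Map.of(
            1L, new Snapshot(1L, List.of("id", "name")),
            2L, new Snapshot(2L, List.of("id", "name", "email")), // column added
            3L, new Snapshot(3L, List.of("id", "email")));        // column dropped

    // Wrong: always using the latest schema would expose "email"
    // even when snapshot 1 is queried.
    static List<String> columnsFromCurrentSchema()
    {
        return SNAPSHOTS.get(3L).columns();
    }

    // Right: resolve the columns from the snapshot actually being queried.
    static List<String> columnsForSnapshot(long snapshotId)
    {
        return SNAPSHOTS.get(snapshotId).columns();
    }

    public static void main(String[] args)
    {
        System.out.println(columnsForSnapshot(1L)); // [id, name]
        System.out.println(columnsForSnapshot(3L)); // [id, email]
    }
}
```

In the actual fix, the table handle carries the snapshot's schema as JSON, so the column resolution happens against that pinned schema rather than against a freshly loaded table.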

Release notes

( ) This is not user-visible and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Iceberg
* Use table schema corresponding to snapshot in snapshot queries

@alexjo2144
Member

Looks like there are some other places we do loadTable().getSchema() like this:

  • getTableProperties - still needs to load the table, but should use the schema from the handle
  • getInsertLayout - this one doesn't practically matter because you can only insert on the latest snapshot, but should still change it
  • TableStatisticsMaker#makeTableStatistics - also shouldn't matter, but might want to change it

I think only the first one really matters

@findinpath
Contributor Author

@alexjo2144 this is actually a point that I wanted to bring up in this PR - IcebergMetadata at the time of this writing has 15 call sites of table.schema(). Would it be safe to reuse the schema passed through the IcebergTableHandle everywhere?

@alexjo2144
Member

alexjo2144 commented Sep 9, 2022

Yeah I think we should prefer the schema in the handle. The pattern I was looking for here was methods which call loadTable().getSchema() and have an IcebergTableHandle as input. If the method only takes a SchemaTableName, loading the table and using that schema is fine.
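The rule of thumb discussed here can be sketched as follows. The types (`IcebergTableHandle` as a plain record, `loadTableSchema`) are hypothetical stand-ins, not Trino's real API:

```java
// Illustrative sketch of the rule of thumb above, using hypothetical types:
// a method that already receives a table handle should reuse the schema the
// handle carries, while a method that only receives a table name has no
// pinned schema and may load the live table.
public class SchemaSourceSketch
{
    record IcebergTableHandle(String name, String tableSchemaJson) {}

    // The handle pins the schema of the queried snapshot: do not reload the table.
    static String schemaFor(IcebergTableHandle handle)
    {
        return handle.tableSchemaJson();
    }

    // Only a name is available: loading the table and using its current schema is fine.
    static String schemaFor(String schemaTableName)
    {
        return loadTableSchema(schemaTableName);
    }

    // stand-in for loading the table from the catalog
    static String loadTableSchema(String schemaTableName)
    {
        return "{\"type\":\"struct\",\"fields\":[]}";
    }

    public static void main(String[] args)
    {
        IcebergTableHandle handle = new IcebergTableHandle("orders", "{\"fields\":[\"id\"]}");
        System.out.println(schemaFor(handle));
    }
}
```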

@findepi
Member

findepi commented Sep 9, 2022

  • getInsertLayout - this one doesn't practically matter because you can only insert on the latest snapshot, but should still change it

let's make sure cleanups like this and the bug fix come in separate commits

@findinpath
Contributor Author

let's make sure cleanups like this and the bug fix come in separate commits

I created a separate PR to avoid cluttering the current changes

#14079

Due to internal caching within the method
`org.apache.iceberg.ManifestGroup.planFiles`
the returned file scan tasks may contain an invalid split
schema string.

Rely on the table schema from the table handle while reading
from AVRO data files.
When dealing with an Iceberg table whose structure evolves
over time (columns are added / dropped), the output schema of
a snapshot/time travel query matches the schema of the
queried table snapshot.
@findinpath findinpath force-pushed the iceberg-use-schema-from-table-handle branch from 5940942 to 009b735 on September 9, 2022 20:02
@@ -312,7 +312,7 @@ else if (identity.getId() == TRINO_MERGE_PARTITION_DATA) {
          partitionSpec.specId(),
          split.getPartitionDataJson(),
          split.getFileFormat(),
-         split.getSchemaAsJson().map(SchemaParser::fromJson),
+         SchemaParser.fromJson(table.getTableSchemaJson()),
Member

Due to internal caching within the method
org.apache.iceberg.ManifestGroup.planFiles
the returned file scan tasks may contain an invalid split
schema string.

is it testable?

Contributor Author

Yes, it is testable through io.trino.plugin.iceberg.TestIcebergAvroConnectorTest.

I was reluctant to squash the two commits of this PR because they address different issues.

The test io.trino.plugin.iceberg.TestIcebergAvroConnectorTest covers both of the issues.

  ImmutableMap.Builder<String, ColumnHandle> columnHandles = ImmutableMap.builder();
- for (IcebergColumnHandle columnHandle : getColumns(icebergTable.schema(), typeManager)) {
+ for (IcebergColumnHandle columnHandle : getColumns(SchemaParser.fromJson(table.getTableSchemaJson()), typeManager)) {
Member

This is a good change.

However, it looks like we call SchemaParser.fromJson(tableHandle.getTableSchemaJson()) multiple times on one table handle. Am I right?

SchemaParser.fromJson does cache internally (on a static field).
This isn't ideal, and we could do better by caching within the table handle object. Not sure it matters though -- depends on how frequently this is called.
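The per-handle caching idea mentioned here could look roughly like the sketch below. `CachingHandle` and `parseSchema` are hypothetical stand-ins (not Trino's `IcebergTableHandle` or Iceberg's `SchemaParser`), showing only the lazy-parse-once pattern:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of caching the parsed schema inside the handle, so the JSON is
// parsed at most once per handle instead of on every use. The parse step
// is a placeholder, not Iceberg's real SchemaParser.fromJson.
final class CachingHandle
{
    private final String schemaJson;
    private final AtomicReference<Object> parsed = new AtomicReference<>();

    CachingHandle(String schemaJson)
    {
        this.schemaJson = schemaJson;
    }

    Object schema()
    {
        Object cached = parsed.get();
        if (cached == null) {
            // benign race: two threads may both parse, but only one result wins
            parsed.compareAndSet(null, parseSchema(schemaJson));
            cached = parsed.get();
        }
        return cached;
    }

    private static Object parseSchema(String json)
    {
        return "parsed:" + json; // placeholder for real JSON schema parsing
    }
}
```

Repeated calls to schema() then return the same parsed instance, while handles that never need the parsed schema never pay the parsing cost.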

Contributor Author

@findinpath findinpath Sep 12, 2022

Should we switch to SchemaParser.fromJson(JsonNode)?

fromJson(JsonUtil.mapper().readValue(jsonKey, JsonNode.class))

Development

Successfully merging this pull request may close these issues.

Iceberg time travel should respect column definitions as of the snapshot version
3 participants