
Use table schema from the table handle #14076

Merged

Conversation


@findinpath findinpath commented Sep 9, 2022

Description

When dealing with an Iceberg table whose structure evolves over time (columns are added / dropped), a snapshot/time travel query now produces output whose schema matches the schema of the table snapshot being queried.

Fixes #14064
Relates to #12786

Non-technical explanation

For time travel queries, use the table schema corresponding to the queried snapshot of the table
to determine the columns of the output.
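The behavior described above can be sketched with a minimal, self-contained example. The types below (`Snapshot`, `SNAPSHOTS`, `columnsForSnapshot`) are hypothetical stand-ins for Trino's real `IcebergTableHandle` machinery, used only to illustrate why the output columns must come from the queried snapshot rather than the current table schema:

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch only: hypothetical types standing in for Trino's
// table-handle machinery, showing why a time-travel query must resolve
// its output columns from the queried snapshot's schema.
public class TimeTravelSchemaSketch
{
    record Snapshot(long id, List<String> columns) {}

    // schema history of a table whose column set evolved over time
    static final Map<Long, Snapshot> SNAPSHOTS = Map.of(
            1L, new Snapshot(1L, List.of("id", "name")),
            2L, new Snapshot(2L, List.of("id", "name", "email")), // column added
            3L, new Snapshot(3L, List.of("id", "email")));        // column dropped

    // Wrong: always using the latest schema would expose "email"
    // even when snapshot 1 is queried.
    static List<String> columnsFromCurrentSchema()
    {
        return SNAPSHOTS.get(3L).columns();
    }

    // Right: resolve the columns from the snapshot actually being queried.
    static List<String> columnsForSnapshot(long snapshotId)
    {
        return SNAPSHOTS.get(snapshotId).columns();
    }

    public static void main(String[] args)
    {
        System.out.println(columnsForSnapshot(1L)); // [id, name]
        System.out.println(columnsForSnapshot(3L)); // [id, email]
    }
}
```

In the actual fix, the table handle carries the snapshot's schema as JSON, so the column resolution happens against that pinned schema rather than against a freshly loaded table.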

Release notes

( ) This is not user-visible and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Iceberg
* Use table schema corresponding to snapshot in snapshot queries

@alexjo2144
Member

Looks like there are some other places we do loadTable().getSchema() like this:

  • getTableProperties - still needs to load the table, but should use the schema from the handle
  • getInsertLayout - this one doesn't practically matter because you can only insert on the latest snapshot, but should still change it
  • TableStatisticsMaker#makeTableStatistics - also shouldn't matter, but might want to change it

I think only the first one really matters

@findinpath
Contributor Author

@alexjo2144 this is actually a point that I wanted to bring up in this PR - IcebergMetadata at the time of this writing has 15 call sites of table.schema(). Would it be safe to reuse the schema passed through the IcebergTableHandle everywhere?

@alexjo2144
Member

alexjo2144 commented Sep 9, 2022

Yeah I think we should prefer the schema in the handle. The pattern I was looking for here was methods which call loadTable().getSchema() and have an IcebergTableHandle as input. If the method only takes a SchemaTableName, loading the table and using that schema is fine.
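The rule of thumb discussed here can be sketched as follows. The types (`IcebergTableHandle` as a plain record, `loadTableSchema`) are hypothetical stand-ins, not Trino's real API:

```java
// Illustrative sketch of the rule of thumb above, using hypothetical types:
// a method that already receives a table handle should reuse the schema the
// handle carries, while a method that only receives a table name has no
// pinned schema and may load the live table.
public class SchemaSourceSketch
{
    record IcebergTableHandle(String name, String tableSchemaJson) {}

    // The handle pins the schema of the queried snapshot: do not reload the table.
    static String schemaFor(IcebergTableHandle handle)
    {
        return handle.tableSchemaJson();
    }

    // Only a name is available: loading the table and using its current schema is fine.
    static String schemaFor(String schemaTableName)
    {
        return loadTableSchema(schemaTableName);
    }

    // stand-in for loading the table from the catalog
    static String loadTableSchema(String schemaTableName)
    {
        return "{\"type\":\"struct\",\"fields\":[]}";
    }

    public static void main(String[] args)
    {
        IcebergTableHandle handle = new IcebergTableHandle("orders", "{\"fields\":[\"id\"]}");
        System.out.println(schemaFor(handle));
    }
}
```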

@findepi
Member

findepi commented Sep 9, 2022

  • getInsertLayout - this one doesn't practically matter because you can only insert on the latest snapshot, but should still change it

let's make sure cleanups like this and the bug fix come in separate commits

@findinpath
Contributor Author

let's make sure cleanups like this and the bug fix come in separate commits

I created a separate PR to avoid cluttering the current changes

#14079

Due to internal caching within the method
`org.apache.iceberg.ManifestGroup.planFiles`
the returned file scan tasks may contain an invalid split
schema string.

Rely on the table schema from the table handle while reading
from AVRO data files.
When dealing with an Iceberg table whose structure evolves
over time (columns are added / dropped), the output schema of
a snapshot/time travel query matches the schema of the
queried table snapshot.
@findinpath findinpath force-pushed the iceberg-use-schema-from-table-handle branch from 5940942 to 009b735 on September 9, 2022 20:02
@@ -312,7 +312,7 @@ else if (identity.getId() == TRINO_MERGE_PARTITION_DATA) {
          partitionSpec.specId(),
          split.getPartitionDataJson(),
          split.getFileFormat(),
-         split.getSchemaAsJson().map(SchemaParser::fromJson),
+         SchemaParser.fromJson(table.getTableSchemaJson()),
Member

Due to internal caching within the method
org.apache.iceberg.ManifestGroup.planFiles
the returned file scan tasks may contain an invalid split
schema string.

is it testable?

Contributor Author

Yes, it is testable through io.trino.plugin.iceberg.TestIcebergAvroConnectorTest.

I was reluctant to squash the two commits of this PR because they address different issues.

The test io.trino.plugin.iceberg.TestIcebergAvroConnectorTest covers both of the issues.

  ImmutableMap.Builder<String, ColumnHandle> columnHandles = ImmutableMap.builder();
- for (IcebergColumnHandle columnHandle : getColumns(icebergTable.schema(), typeManager)) {
+ for (IcebergColumnHandle columnHandle : getColumns(SchemaParser.fromJson(table.getTableSchemaJson()), typeManager)) {
Member

This is a good change.

However, it looks like we call SchemaParser.fromJson(tableHandle.getTableSchemaJson()) multiple times on one table handle. Am I right?

SchemaParser.fromJson does cache internally (on a static field).
This isn't ideal, and we could do better by caching within the table handle object. Not sure it matters though -- depends on how frequently this is called.
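The per-handle caching idea mentioned here could look roughly like the sketch below. `CachingHandle` and `parseSchema` are hypothetical stand-ins (not Trino's `IcebergTableHandle` or Iceberg's `SchemaParser`), showing only the lazy-parse-once pattern:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of caching the parsed schema inside the handle, so the JSON is
// parsed at most once per handle instead of on every use. The parse step
// is a placeholder, not Iceberg's real SchemaParser.fromJson.
final class CachingHandle
{
    private final String schemaJson;
    private final AtomicReference<Object> parsed = new AtomicReference<>();

    CachingHandle(String schemaJson)
    {
        this.schemaJson = schemaJson;
    }

    Object schema()
    {
        Object cached = parsed.get();
        if (cached == null) {
            // benign race: two threads may both parse, but only one result wins
            parsed.compareAndSet(null, parseSchema(schemaJson));
            cached = parsed.get();
        }
        return cached;
    }

    private static Object parseSchema(String json)
    {
        return "parsed:" + json; // placeholder for real JSON schema parsing
    }
}
```

Repeated calls to schema() then return the same parsed instance, while handles that never need the parsed schema never pay the parsing cost.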

Contributor Author

@findinpath findinpath Sep 12, 2022

Should we switch to SchemaParser.fromJson(JsonNode)?

fromJson(JsonUtil.mapper().readValue(jsonKey, JsonNode.class))

Development

Successfully merging this pull request may close these issues.

Iceberg time travel should respect column definitions as of the snapshot version
3 participants