
Determine and support appropriate schema evolution semantics for Iceberg table with legacy Hive files #9843

Closed
Tracked by #1324
findepi opened this issue Nov 3, 2021 · 9 comments

findepi commented Nov 3, 2021

For a Hive table migrated to Iceberg with https://iceberg.apache.org/spark-procedures/#migrate-table-procedure:
if I drop a field and then add a new field with the same name, should I expect nulls, or the data from legacy files that do not carry field ID information?

Currently, the Trino implementation resolves columns by their current names when reading a legacy file (a file without field ID information), so I will see the old data being read.
I would expect legacy files to be read with the table's first Iceberg schema, so I would expect nulls to be read from them.


findepi commented Nov 3, 2021

cc @rdblue @losipiuk @alexjo2144 @phd3


rdblue commented Nov 5, 2021

We solved this problem by using a name mapping that is stored in table properties as schema.name-mapping.default (a sketch of building and storing one follows the list below). The name mapping is a bit more flexible than using the current field names:

  1. It supports multiple names per field, in case the name has changed in the past (Avro field aliases)
  2. It is not based on the current schema, so the schema can evolve normally (e.g. rename a field)
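
For reference, a minimal Java sketch of how such a mapping can be built from a schema and persisted, assuming an org.apache.iceberg.Table handle (the migrate procedure sets this property for you; the helper method name is hypothetical):

import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.mapping.MappingUtil;
import org.apache.iceberg.mapping.NameMapping;
import org.apache.iceberg.mapping.NameMappingParser;

static void setDefaultNameMapping(Table table) {
  // Derive a name -> field ID mapping from the table's current schema.
  NameMapping mapping = MappingUtil.create(table.schema());
  // Persist it as JSON under schema.name-mapping.default, where readers look for it.
  table.updateProperties()
      .set(TableProperties.DEFAULT_NAME_MAPPING, NameMappingParser.toJson(mapping))
      .commit();
}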

I've been debating whether to put this in the spec or to start a secondary optional spec for it. It would be great to hear what people here think.

Using a name mapping is fairly easy. When reading a data file, you check whether the file has any field IDs. If it does, then use those IDs. If not, apply the name mapping using one of the existing utility methods in Iceberg to produce a file schema with field IDs. Then just use that file schema instead of the original.
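
A minimal sketch of that read-time decision for Parquet, using Iceberg's existing utilities (the wrapper method here is hypothetical; the real readers wire this up internally):

import org.apache.iceberg.mapping.NameMapping;
import org.apache.iceberg.mapping.NameMappingParser;
import org.apache.iceberg.parquet.ParquetSchemaUtil;
import org.apache.parquet.schema.MessageType;

static MessageType assignIds(MessageType fileSchema, String mappingJson) {
  if (ParquetSchemaUtil.hasIds(fileSchema)) {
    // Written by Iceberg: trust the embedded field IDs.
    return fileSchema;
  }
  if (mappingJson != null) {
    // Legacy file: resolve columns by name and attach the mapped IDs.
    NameMapping mapping = NameMappingParser.fromJson(mappingJson);
    return ParquetSchemaUtil.applyNameMapping(fileSchema, mapping);
  }
  // No IDs and no mapping: to Iceberg this file has no resolvable columns.
  return fileSchema;
}

where mappingJson would come from table.properties().get(TableProperties.DEFAULT_NAME_MAPPING).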

It's also possible for us to create an ID to ID mapping, but we haven't yet. This would be needed to support formats like Protobuf or Thrift that use different IDs. We could also use it to map from another table's IDs to support moving files from one table to another. Right now, we have no plans to add this but I thought I'd at least outline the idea.


findepi commented Nov 8, 2021

Seems like schema.name-mapping.default addresses the question of what the semantics of a migrated table are. Thanks @rdblue for sharing this.
If this is (currently) outside the spec, what is the spec-defined behavior for migrated tables after the table schema is further evolved?


rdblue commented Nov 8, 2021

The spec does not have a special way to support "migrated" data files. Files are required to have correct field IDs. If they have no field IDs, then they have no columns.

The name to ID mapping is only applied when a file has no IDs, so it is used to convert from name-based column resolution into ID-based. Since data files either have IDs or not, schema evolution works just fine for all new data files and for all old data files.
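
To make the rename case concrete (hypothetical schema): suppose the migrated table had a column name with field ID 2, so the stored mapping is equivalent to:

import org.apache.iceberg.mapping.MappedField;
import org.apache.iceberg.mapping.NameMapping;

NameMapping mapping = NameMapping.of(MappedField.of(2, "name"));

After the column is renamed to full_name, the current schema still uses ID 2 for it, so a legacy file's name column resolves through the mapping to ID 2 and is projected as full_name, while new files carry ID 2 directly and never consult the mapping.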

alexjo2144 commented

Thanks for the overview @rdblue. I was taking a look at using the default name mapping for projections but ran into an inconsistency in the Spark readers.

In the case where schema.name-mapping.default is not set, and a file does not have field IDs:

  • The Parquet reader returns nulls for Struct fields
  • The ORC reader returns actual values for Struct fields

Reproduction (Spark SQL):
CREATE TABLE test_table_parquet (
  id STRING,
  nested_struct STRUCT<id:INT, name:STRING, address:STRUCT<street_number:INT, street_name:STRING>>)
USING PARQUET;

INSERT INTO TABLE test_table_parquet SELECT
  '213',
  named_struct('id', 1, 'name', 'P. Sherman', 'address', named_struct('street_number', 42, 'street_name', 'Wallaby Way'));

CALL system.migrate('default.test_table_parquet');
ALTER TABLE iceberg_test.default.test_table_parquet UNSET TBLPROPERTIES ('schema.name-mapping.default');
SELECT * FROM iceberg_test.default.test_table_parquet;

Does the spec say whether one of these is correct?


rdblue commented Nov 16, 2021

If there is no mapping and no field IDs, then to Iceberg the file has no columns. The Parquet behavior is correct and all Iceberg columns should be null.


findepi commented Nov 18, 2021

> then to Iceberg the file has no columns. The Parquet behavior is correct and all Iceberg columns should be null.

From an engineering perspective this sounds logical, but I am concerned about the practical implications of such an approach.
A file full of data that we don't attempt to read, returning nulls instead -- this is unlikely to be the user's intention, so failing could be a clearer and more useful reaction.
I.e., if something doesn't look like an Iceberg file, we should fail to read it as Iceberg, unless there is additional information (a mapping) indicating how to read non-Iceberg files.


rdblue commented Nov 18, 2021

@findepi, failing is also reasonable if you see that a file has no columns.

alexjo2144 commented

Resolved by #9959
