
Determine and support appropriate schema evolution semantics for Iceberg table with legacy Hive files #9843

Closed
Tracked by #1324
findepi opened this issue Nov 3, 2021 · 9 comments

findepi commented Nov 3, 2021

For a Hive table migrated to Iceberg with https://iceberg.apache.org/spark-procedures/#migrate-table-procedure:
if I drop a field and then add a new field with the same name, should I expect nulls, or the data from legacy files that do not carry field ID information?

Currently, the Trino implementation resolves columns by their current names when reading a legacy file (a file without field ID information), so I will see the old data being read.
I would expect legacy files to be read with the table's first Iceberg schema, so I would expect nulls to be read from them.


findepi commented Nov 3, 2021

cc @rdblue @losipiuk @alexjo2144 @phd3


rdblue commented Nov 5, 2021

We solved this problem by using a name mapping that is stored in table properties as schema.name-mapping.default (a sketch of building and storing one follows the list below). The name mapping is a bit more flexible than using the current field names:

  1. It supports multiple names per field, in case the name has changed in the past (Avro field aliases)
  2. It is not based on the current schema, so the schema can evolve normally (e.g. rename a field)
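
For reference, a minimal Java sketch of how such a mapping can be built from a schema and persisted, assuming an org.apache.iceberg.Table handle (the migrate procedure sets this property for you; the helper method name is hypothetical):

import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.mapping.MappingUtil;
import org.apache.iceberg.mapping.NameMapping;
import org.apache.iceberg.mapping.NameMappingParser;

static void setDefaultNameMapping(Table table) {
  // Derive a name -> field ID mapping from the table's current schema.
  NameMapping mapping = MappingUtil.create(table.schema());
  // Persist it as JSON under schema.name-mapping.default, where readers look for it.
  table.updateProperties()
      .set(TableProperties.DEFAULT_NAME_MAPPING, NameMappingParser.toJson(mapping))
      .commit();
}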

I've been debating whether to put this in the spec or to start a secondary optional spec for it. It would be great to hear what people here think.

Using a name mapping is fairly easy. When reading a data file, you check whether the file has any field IDs. If it does, then use those IDs. If not, apply the name mapping using one of the existing utility methods in Iceberg to produce a file schema with field IDs. Then just use that file schema instead of the original.
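
A minimal sketch of that read-time decision for Parquet, using Iceberg's existing utilities (the wrapper method here is hypothetical; the real readers wire this up internally):

import org.apache.iceberg.mapping.NameMapping;
import org.apache.iceberg.mapping.NameMappingParser;
import org.apache.iceberg.parquet.ParquetSchemaUtil;
import org.apache.parquet.schema.MessageType;

static MessageType assignIds(MessageType fileSchema, String mappingJson) {
  if (ParquetSchemaUtil.hasIds(fileSchema)) {
    // Written by Iceberg: trust the embedded field IDs.
    return fileSchema;
  }
  if (mappingJson != null) {
    // Legacy file: resolve columns by name and attach the mapped IDs.
    NameMapping mapping = NameMappingParser.fromJson(mappingJson);
    return ParquetSchemaUtil.applyNameMapping(fileSchema, mapping);
  }
  // No IDs and no mapping: to Iceberg this file has no resolvable columns.
  return fileSchema;
}

where mappingJson would come from table.properties().get(TableProperties.DEFAULT_NAME_MAPPING).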

It's also possible for us to create an ID to ID mapping, but we haven't yet. This would be needed to support formats like Protobuf or Thrift that use different IDs. We could also use it to map from another table's IDs to support moving files from one table to another. Right now, we have no plans to add this but I thought I'd at least outline the idea.


findepi commented Nov 8, 2021

Seems like schema.name-mapping.default addresses the question of what the semantics of a migrated table are. Thanks @rdblue for sharing this.
If this is (currently) outside the spec, what is the spec-defined behavior for migrated tables after the table schema is further evolved?


rdblue commented Nov 8, 2021

The spec does not have a special way to support "migrated" data files. Files are required to have correct field IDs. If they have no field IDs, then they have no columns.

The name to ID mapping is only applied when a file has no IDs, so it is used to convert from name-based column resolution into ID-based. Since data files either have IDs or not, schema evolution works just fine for all new data files and for all old data files.
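
To make the rename case concrete (hypothetical schema): suppose the migrated table had a column name with field ID 2, so the stored mapping is equivalent to:

import org.apache.iceberg.mapping.MappedField;
import org.apache.iceberg.mapping.NameMapping;

NameMapping mapping = NameMapping.of(MappedField.of(2, "name"));

After the column is renamed to full_name, the current schema still uses ID 2 for it, so a legacy file's name column resolves through the mapping to ID 2 and is projected as full_name, while new files carry ID 2 directly and never consult the mapping.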

alexjo2144 commented

Thanks for the overview @rdblue. I was taking a look at using the default name mapping for projections but ran into an inconsistency in the Spark readers.

In the case where schema.name-mapping.default is not set, and a file does not have field IDs:

  • The Parquet reader returns nulls for Struct fields
  • The ORC reader returns actual values for Struct fields

Reproduction (Spark SQL):
CREATE TABLE test_table_parquet (
  id STRING,
  nested_struct STRUCT<id:INT, name:STRING, address:STRUCT<street_number:INT, street_name:STRING>>)
USING PARQUET;

INSERT INTO TABLE test_table_parquet SELECT
  '213',
  named_struct('id', 1, 'name', 'P. Sherman', 'address', named_struct('street_number', 42, 'street_name', 'Wallaby Way'));

CALL system.migrate('default.test_table_parquet');
ALTER TABLE iceberg_test.default.test_table_parquet UNSET TBLPROPERTIES ('schema.name-mapping.default');
SELECT * FROM iceberg_test.default.test_table_parquet;

Does the spec say whether one of these is correct?


rdblue commented Nov 16, 2021

If there is no mapping and no field IDs, then to Iceberg the file has no columns. The Parquet behavior is correct and all Iceberg columns should be null.


findepi commented Nov 18, 2021

> then to Iceberg the file has no columns. The Parquet behavior is correct and all Iceberg columns should be null.

From an engineering perspective this sounds logical, but I am concerned about the practical implications of such an approach.
A file full of data that we don't attempt to read, returning nulls instead -- this is unlikely to be the user's intention, so failing could be a clearer and more useful reaction.
I.e., if something doesn't look like an Iceberg file, we should fail to read it as Iceberg, unless there is additional information (a mapping) indicating how to read non-Iceberg files.


rdblue commented Nov 18, 2021

@findepi, failing is also reasonable if you see that a file has no columns.

alexjo2144 commented

Resolved by #9959
