Determine and support appropriate schema evolution semantics for Iceberg table with legacy Hive files #9843
We solved this problem by using a name mapping that is stored in table properties.

I've been debating whether to put this in the spec or to start a secondary, optional spec for it. It would be great to hear what people here think.

Using a name mapping is fairly easy. When reading a data file, check whether the file has any field IDs. If it does, use those IDs. If not, apply the name mapping using one of the existing utility methods in Iceberg to produce a file schema with field IDs, and use that file schema instead of the original.

It's also possible for us to create an ID-to-ID mapping, but we haven't done that yet. This would be needed to support formats like Protobuf or Thrift that use different IDs. We could also use it to map from another table's IDs, to support moving files from one table to another. Right now we have no plans to add this, but I thought I'd at least outline the idea.
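The mechanism described above can be sketched roughly as follows. This is an illustrative sketch, not Iceberg's actual API; the function name, the dict-based schema representation, and the example column names are all assumptions for demonstration.

```python
# Hypothetical sketch: use a name-to-ID mapping (captured in table
# properties at migration time) to assign Iceberg field IDs to a file
# schema that was written without them, e.g. by Hive.

def apply_name_mapping(file_columns, name_mapping):
    """Return a file schema with field IDs resolved from the mapping.

    file_columns: list of column names found in the legacy data file.
    name_mapping: dict mapping column name -> Iceberg field ID.
    """
    resolved = []
    for name in file_columns:
        # Unmapped names get no field ID and will not match any
        # table column during ID-based projection.
        field_id = name_mapping.get(name)
        resolved.append({"name": name, "id": field_id})
    return resolved

# A legacy file with columns (id, data) and the mapping captured when
# the table was migrated:
mapping = {"id": 1, "data": 2}
schema = apply_name_mapping(["id", "data"], mapping)
```

After this step, the reader proceeds with ordinary ID-based column resolution, exactly as it would for a file that carried field IDs natively.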
The spec does not have a special way to support "migrated" data files. Files are required to have correct field IDs; if they have no field IDs, then they have no columns. The name-to-ID mapping is applied only when a file has no IDs, so it is used to convert name-based column resolution into ID-based resolution. Since data files either have IDs or they don't, schema evolution works just fine for all new data files and for all old data files.
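The either/or decision described above can be expressed as a small piece of logic. This is an assumption-laden sketch, not Iceberg's implementation; the schema representation is the same illustrative dict form as before.

```python
# Illustrative decision logic: choose the schema used to read a data
# file based on whether the file carries field IDs.

def file_read_schema(file_schema, name_mapping):
    """file_schema: list of {"name": str, "id": int or None} columns."""
    if any(col["id"] is not None for col in file_schema):
        # New file: trust the field IDs written into the file.
        return file_schema
    if name_mapping is not None:
        # Legacy file: resolve IDs by name once, then read by ID as usual.
        return [{"name": c["name"], "id": name_mapping.get(c["name"])}
                for c in file_schema]
    # No IDs and no mapping: to Iceberg, the file has no columns.
    return []
```

Because both branches end in an ID-annotated schema (possibly empty), every downstream step, including schema evolution, sees a uniform ID-based view.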
Thanks for the overview @rdblue. I was taking a look at using the default name mapping for projections but ran into an inconsistency in the Spark readers in the case where a file has no field IDs and no name mapping is set.

Does the spec say which of these behaviors is correct?
If there is no mapping and no field IDs, then to Iceberg the file has no columns. The Parquet behavior is correct: all Iceberg columns should be read as null.
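The "no columns, so everything is null" outcome follows directly from ID-based projection. A minimal sketch, with the same illustrative schema representation assumed as above:

```python
# Sketch of ID-based projection: each requested table field is matched
# to a file column by field ID; anything unmatched reads as null. With
# no field IDs and no mapping, the file schema is empty, so nothing
# matches and every column is null (the Parquet behavior described above).

def project(table_fields, file_schema):
    file_ids = {c["id"] for c in file_schema if c["id"] is not None}
    return {f["name"]: ("<file value>" if f["id"] in file_ids else None)
            for f in table_fields}

table = [{"name": "id", "id": 1}, {"name": "data", "id": 2}]
nulls = project(table, [])  # legacy file, no mapping: all columns null
```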
From an engineering perspective this sounds logical, but I am concerned about the practical implications of such an approach.
@findepi, failing is also reasonable if you see that a file has no columns.
Resolved by #9959 |
For a Hive table migrated to Iceberg with https://iceberg.apache.org/spark-procedures/#migrate-table-procedure: if I drop a field and then add a new field with the same name, should I expect nulls, or data from legacy files that do not use field ID mappings?

Currently, the Trino implementation uses current names when reading a legacy file (a file without field ID information), so I will see data being read. I would expect legacy files to be read with the first Iceberg table schema, so I would expect nulls to be read from legacy files.
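The drop-and-re-add scenario shows why the two resolution strategies diverge. A hedged sketch under illustrative assumptions (all names, IDs, and values below are made up for the example):

```python
# Why the behaviors differ for "drop column c, then add a new column c".
# The table's current schema maps the name "c" to a NEW field ID, while a
# legacy file (no field IDs) only holds the OLD data under the name "c".
# Resolving by current name reads stale data; resolving by ID through the
# mapping captured at migration yields null, because that mapping still
# points "c" at the dropped field's old ID.

migration_mapping = {"c": 1}               # captured when the table was migrated
current_schema = [{"name": "c", "id": 7}]  # "c" was dropped and re-added

legacy_file_values_by_name = {"c": "old-value"}

# Name-based resolution (the Trino behavior described above):
by_name = legacy_file_values_by_name.get("c")     # reads the stale value

# ID-based resolution through the name mapping (the expected behavior):
file_values_by_id = {migration_mapping["c"]: "old-value"}  # old ID 1
by_id = file_values_by_id.get(current_schema[0]["id"])     # new ID 7: null
```

Under ID-based resolution the re-added column is a genuinely new field, so legacy files correctly contribute nulls for it.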