Build new `TableMetadata` without reassigning field IDs #919

jdockerty · 2025-01-28T14:02:45Z

This is in part a question and open for discussion.

When building TableMetadata through the TableMetadataBuilder, all options of building "from scratch" force a reassignment of field IDs:

Using TableMetadataBuilder::new
Using TableMetadataBuilder::from_table_creation, as this is a wrapper over TableMetadataBuilder::new using the TableCreation struct.

I noticed that it would be possible to get any type of TableMetadata that was desired through using the object directly, but all of the fields are restricted to pub(crate) scope. I suspect the reason for this is safety, i.e. ensuring that creation occurs through the builder pattern where the relevant checks are performed on call to build().

Questions:

Would it be problematic to lift the restriction on the TableMetadata fields to be pub¹ or allow the creation of TableMetadata without reassigning field IDs?
If the above is not possible, is there an example of creating the iceberg metadata file hierarchy in the correct way?

For extra context, we're currently constructing Iceberg metadata around pre-existing parquet files written by another system; however, there is no Iceberg catalog or prior metadata JSON. I noticed there is also a StaticTable; however, this requires either pre-existing JSON from FileIO or an input TableMetadata, this 2nd option brings us back to the above issue.

This assignment leads to a mismatch in what is shown in the table metadata JSON vs the actual parquet file:

parquet schema

required group field_id=-1 arrow_schema {
  optional binary field_id=2 cpu (String);
  optional binary field_id=3 host1 (String);
  optional int64 field_id=1 time (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
}

iceberg metadata JSON schema snippet

This reassignment occurs to the order that they appear within the parquet/arrow Schema, rather than the given field IDs.

  "schemas": [
    {
      "schema-id": 0,
      "type": "struct",
      "fields": [
        {
          "id": 1, <-- field_id=2 in parquet
          "name": "cpu",
          "required": false,
          "type": "string"
        },
        {
          "id": 2, <-- field_id=3 in parquet
          "name": "host1",
          "required": false,
          "type": "string"
        },
        {
          "id": 3, <-- field_id=1 in parquet
          "name": "time",
          "required": false,
          "type": "timestamp"
        }
      ]
    }
  ],

This is also referenced by a question in the iceberg slack

Considering this conflicts with the native Java implementation, I would also suspect it is problematic to do in the Rust version. ↩

The text was updated successfully, but these errors were encountered:

kevinjqliu · 2025-02-05T20:37:05Z

It makes sense to be able to create an TableMetadata object without automatically assigning the field-ids. Esp for use cases such as what you described above.

How we can do so is another question. TableMetadata::new currently has this behavior and we need to figure out how to move forward without breaking existing integrations

in pyiceberg, we have a function that will reassign field-ids and return a TableMetadata object
https://github.com/apache/iceberg-python/blob/8ed8f6cae9f431be7eea7ea8d6aec6e6ec537f1f/pyiceberg/table/metadata.py#L571-L581
but we also allow creating TableMetadata object with provided Schema

jdockerty · 2025-02-05T21:39:25Z

in pyiceberg, we have a function that will reassign field-ids and return a TableMetadata object

Yeah, I missed the link into the main issue description, but I noticed this is also the case within the Java implementation too. I assume this carries over to all of the other language implementations as well.

we also allow creating TableMetadata object with provided Schema

Can you point me to where this is done?

I can only find the regular new_table_metadata in PyIceberg (which you linked), which looks equivalent to TableMetadataBuilder::new here, which is doing the same reassignment when passed an iceberg Schema to the function.

kevinjqliu · 2025-02-06T04:27:20Z

For python, you can just pass in the Schema to the TableMetadata constructor
https://github.com/apache/iceberg-python/blob/26eff992715fdf22c020c0f1da8a38afad55f7be/tests/catalog/test_hive.py#L311-L358

we had some discussions around this topic too
apache/iceberg-python#278
apache/iceberg-python#1284

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Build new `TableMetadata` without reassigning field IDs #919

Build new `TableMetadata` without reassigning field IDs #919

jdockerty commented Jan 28, 2025 •

edited

Loading

kevinjqliu commented Feb 5, 2025

jdockerty commented Feb 5, 2025

kevinjqliu commented Feb 6, 2025 •

edited

Loading

Build new TableMetadata without reassigning field IDs #919

Build new TableMetadata without reassigning field IDs #919

Comments

jdockerty commented Jan 28, 2025 • edited Loading

Footnotes

kevinjqliu commented Feb 5, 2025

jdockerty commented Feb 5, 2025

kevinjqliu commented Feb 6, 2025 • edited Loading

Build new `TableMetadata` without reassigning field IDs #919

Build new `TableMetadata` without reassigning field IDs #919

jdockerty commented Jan 28, 2025 •

edited

Loading

kevinjqliu commented Feb 6, 2025 •

edited

Loading