Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build new TableMetadata without reassigning field IDs #919

Open
jdockerty opened this issue Jan 28, 2025 · 3 comments
Open

Build new TableMetadata without reassigning field IDs #919

jdockerty opened this issue Jan 28, 2025 · 3 comments

Comments

@jdockerty
Copy link
Contributor

jdockerty commented Jan 28, 2025

This is in part a question and open for discussion.

When building TableMetadata through the TableMetadataBuilder, all options of building "from scratch" force a reassignment of field IDs:

I noticed that it would be possible to get any type of TableMetadata that was desired through using the object directly, but all of the fields are restricted to pub(crate) scope. I suspect the reason for this is safety, i.e. ensuring that creation occurs through the builder pattern where the relevant checks are performed on call to build().

Questions:

  1. Would it be problematic to lift the restriction on the TableMetadata fields to be pub1 or allow the creation of TableMetadata without reassigning field IDs?
  2. If the above is not possible, is there an example of creating the iceberg metadata file hierarchy in the correct way?

For extra context, we're currently constructing Iceberg metadata around pre-existing parquet files written by another system; however, there is no Iceberg catalog or prior metadata JSON. I noticed there is also a StaticTable; however, this requires either pre-existing JSON from FileIO or an input TableMetadata, this 2nd option brings us back to the above issue.

This assignment leads to a mismatch in what is shown in the table metadata JSON vs the actual parquet file:

parquet schema
required group field_id=-1 arrow_schema {
  optional binary field_id=2 cpu (String);
  optional binary field_id=3 host1 (String);
  optional int64 field_id=1 time (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
}
iceberg metadata JSON schema snippet

This reassignment occurs to the order that they appear within the parquet/arrow Schema, rather than the given field IDs.

  "schemas": [
    {
      "schema-id": 0,
      "type": "struct",
      "fields": [
        {
          "id": 1, <-- field_id=2 in parquet
          "name": "cpu",
          "required": false,
          "type": "string"
        },
        {
          "id": 2, <-- field_id=3 in parquet
          "name": "host1",
          "required": false,
          "type": "string"
        },
        {
          "id": 3, <-- field_id=1 in parquet
          "name": "time",
          "required": false,
          "type": "timestamp"
        }
      ]
    }
  ],

This is also referenced by a question in the iceberg slack

Footnotes

  1. Considering this conflicts with the native Java implementation, I would also suspect it is problematic to do in the Rust version.

@kevinjqliu
Copy link
Contributor

It makes sense to be able to create an TableMetadata object without automatically assigning the field-ids. Esp for use cases such as what you described above.

How we can do so is another question. TableMetadata::new currently has this behavior and we need to figure out how to move forward without breaking existing integrations

in pyiceberg, we have a function that will reassign field-ids and return a TableMetadata object
https://github.com/apache/iceberg-python/blob/8ed8f6cae9f431be7eea7ea8d6aec6e6ec537f1f/pyiceberg/table/metadata.py#L571-L581
but we also allow creating TableMetadata object with provided Schema

@jdockerty
Copy link
Contributor Author

in pyiceberg, we have a function that will reassign field-ids and return a TableMetadata object

Yeah, I missed the link into the main issue description, but I noticed this is also the case within the Java implementation too. I assume this carries over to all of the other language implementations as well.

we also allow creating TableMetadata object with provided Schema

Can you point me to where this is done?

I can only find the regular new_table_metadata in PyIceberg (which you linked), which looks equivalent to TableMetadataBuilder::new here, which is doing the same reassignment when passed an iceberg Schema to the function.

@kevinjqliu
Copy link
Contributor

kevinjqliu commented Feb 6, 2025

For python, you can just pass in the Schema to the TableMetadata constructor
https://github.com/apache/iceberg-python/blob/26eff992715fdf22c020c0f1da8a38afad55f7be/tests/catalog/test_hive.py#L311-L358

we had some discussions around this topic too
apache/iceberg-python#278
apache/iceberg-python#1284

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants