Create Iceberg Table from pyarrow Schema with no IDs #278
Comments
To add some more context: as also mentioned in the earlier conversation, I don't think assigning fresh IDs is safe: #219 (comment). On the Java side, this is also being deprecated: apache/iceberg#9324
Thank you for adding the context @Fokko :) Just so new readers aren't confused, do you think it's fair to say that we are talking about "assigning fresh IDs" in two separate ways: (1) re-assigning fresh IDs to a schema that already carries field IDs, and (2) assigning IDs for the first time to a schema that has none?
The discussions so far, and the PR above, have alluded to the fact that (1) is dangerous and will not be supported in Python or Java. (2), however, is necessary for applications to use PyIceberg the way Spark is used today: Spark DataFrames can currently be used to create new Iceberg tables without field_id metadata. In that sense, it would be helpful to design the function in a way that prevents users from introducing bad behavior through (1) and limits the scope of the Visitor and function to its intended usage, (2).
What do you think of the following approach:
This would require:
That sounds good @Fokko. I think having a _CreateMappingFromPyArrowSchema pre-order visitor does a good job of separating out the two concerns above.
I think this bit about not doing it by position is catching me a bit off guard, because I'm not convinced that we can assign IDs without relying on position when generating the name mapping. Just to make sure we are on the same page, this new Visitor will:
And then, we will use the name mapping generated from the pyarrow schema to assign field IDs by name and create a new Iceberg Schema. Does that approach sound consistent with your current thinking?
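A minimal sketch of the kind of pre-order visitor being discussed, assuming incremental IDs are handed out by position; the function name is hypothetical, and it only handles flat and struct fields (a real NameMapping also covers lists and maps):

```python
from typing import Dict

import pyarrow as pa


def name_mapping_from_pyarrow(schema: pa.Schema) -> Dict[str, int]:
    """Walk the schema in pre-order, handing out incremental IDs by position."""
    mapping: Dict[str, int] = {}
    next_id = 0

    def visit(field: pa.Field, prefix: str) -> None:
        nonlocal next_id
        next_id += 1
        mapping[prefix + field.name] = next_id  # parent is numbered before its children
        if pa.types.is_struct(field.type):
            for i in range(field.type.num_fields):
                visit(field.type.field(i), prefix + field.name + ".")

    for top_level in schema:
        visit(top_level, "")
    return mapping


print(name_mapping_from_pyarrow(pa.schema([
    pa.field("id", pa.int64()),
    pa.field("location", pa.struct([
        pa.field("lat", pa.float64()),
        pa.field("lon", pa.float64()),
    ])),
])))
# {'id': 1, 'location': 2, 'location.lat': 3, 'location.lon': 4}
```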
Thanks for summarizing the approaches and explaining the concerns.
Same question as @syun64 mentioned. I have another related question: if we first generate a name mapping from the pyarrow schema and then use it to create the Iceberg schema and the table,
what do we do with the name mapping created in step 1 after the table is created? Do we just discard it, or put it in schema.name-mapping.default?
Great question @HonahX. My understanding is that putting a name mapping into schema.name-mapping.default isn't done automatically by any operation; it requires the user to insert the name-mapping JSON as a table property on the Iceberg table.
I think that regardless of whether we create this visitor to produce a name mapping (which in turn is used to create an Iceberg schema) or to produce an Iceberg schema directly, it will need the ability to incrementally assign new IDs by position, because we are creating a new Iceberg schema from an Arrow schema that has no field_id metadata. Imagine we are trying to take a 100-column Parquet file from a vendor and create an Iceberg table from it, and it doesn't have PARQUET:field_id metadata on its columns. Currently, there's no way to create this Iceberg table and ingest the data without manually coding and labelling each and every column with the Iceberg schema types to build an Iceberg schema.
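If the user does want to keep the mapping, a sketch of opting in manually, assuming an already-loaded Table object, pyiceberg's Transaction.set_properties, and the `mapping` dict from the visitor sketch above (simplified: a real name mapping nests struct fields rather than dotting the names):

```python
import json

# Serialize the name -> ID dict into the name-mapping JSON shape from the spec.
name_mapping_json = json.dumps(
    [{"field-id": field_id, "names": [name]} for name, field_id in mapping.items()]
)

# Nothing writes this property automatically; the user opts in explicitly.
with tbl.transaction() as tx:
    tx.set_properties({"schema.name-mapping.default": name_mapping_json})
```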
Hello, I would like to put up a PR as per the discussion above if no one has started working on it already. Please let me know if this is fine. Also, @syun64 and I work together, hence I can get up to speed quickly with the discussion.
Alright, I went to the source and talked with @danielcweeks and @rdblue. It looks like we made things more complicated than actually needed. So when reading and writing Parquet, we need to make sure that the IDs are aligned properly. When we are working with runtime data (a pyarrow table or dataframe), we don't need to worry about aligning the IDs in the same way. I also discussed with Dan about adding Arrow types to the create_table:

```python
catalog = load_catalog()
catalog.create_table('some.table', df=df)
```

We can just convert the schema, and assign fresh IDs. And then:

```python
# It will wire up the schema by name
tbl.overwrite(df)
```

```python
# Should be quite easy with union by name:
tbl.append(df, merge_schema=True)
```

We don't want to keep the IDs around when we have the Arrow table. Think of the situation where you read from a table, and then write to another table. You don't want to re-use the IDs. Sorry for making this bigger than it actually was 🙏
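To make the hazard concrete, a sketch assuming pyiceberg's scan/append APIs (source_tbl and target_tbl are illustrative):

```python
# Read one table's data into Arrow, then land it in a different table.
df = source_tbl.scan().to_arrow()

# The Arrow data may still carry the source table's field IDs in its metadata,
# but the target table must not adopt them: columns are resolved by name.
target_tbl.append(df)
```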
That makes sense @Fokko. Just to make sure we are on the same page, does the following approach align with your thoughts? We are proposing to update the create_table API to accept a pyarrow Schema in addition to a pyiceberg Schema.
We will call the function roughly as in the sketch below.
And we will use the previously proposed Visitor: https://github.com/syun64/iceberg-python/blob/preorder-fresh-schema/pyiceberg/io/pyarrow.py#L994 since new_table_metadata has to take a pyiceberg Schema as its input.
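A sketch of how such a call could look, assuming the proposed pyarrow-schema support lands in create_table (catalog, namespace, and column names are illustrative):

```python
import pyarrow as pa

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")

pa_schema = pa.schema([
    ("city", pa.string()),
    ("inhabitants", pa.int32()),
])

# The proposal: accept the Arrow schema directly and assign fresh IDs internally.
tbl = catalog.create_table("default.cities", schema=pa_schema)
```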
@syun64 Yes, that sounds like a reasonable proposal to me. One thing to mention: we would also like to be able to use PyIceberg without Arrow, and we can do this by making the type annotation lazy:
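A minimal sketch of what such a lazy annotation could look like, assuming the standard TYPE_CHECKING pattern and a simplified signature:

```python
from typing import TYPE_CHECKING, Union

from pyiceberg.schema import Schema

if TYPE_CHECKING:
    # Only imported by type checkers, so PyIceberg stays importable without Arrow.
    import pyarrow as pa


def create_table(identifier: str, schema: Union[Schema, "pa.Schema"]) -> None:
    # The string annotation "pa.Schema" is never evaluated at runtime, so
    # pyarrow is only required when an Arrow schema is actually passed in.
    ...
```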
Oh! That's very neat. Thank you for the suggestion :)
Feature Request / Improvement
I see three ways a user would want to create an Iceberg table:

1. From a pyiceberg.Schema that the user defines by hand
2. From a pyarrow Schema that carries no field IDs
3. From a pyarrow Schema whose fields carry ID metadata (PARQUET:field_id)

The create_table function currently takes a pyiceberg.Schema as the input. The existing visitors support patterns (1) and (3), but not (2).
This is because the creation of a pyiceberg.Schema is only supported in two ways today: defining it manually with the pyiceberg types, or converting a pyarrow schema whose fields already carry field_id metadata.
Therefore, we need to update an existing Visitor, or create a new Visitor, in order to support the generation of a pyiceberg.Schema from a pyarrow Schema with no IDs, as illustrated below.
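To make the distinction concrete, a sketch of patterns (3) and (2) as pyarrow schemas (column names illustrative):

```python
import pyarrow as pa

# Pattern (3): field IDs carried in the field metadata; the existing
# visitor can read these and build a pyiceberg.Schema.
with_ids = pa.schema([
    pa.field("city", pa.string(), metadata={"PARQUET:field_id": "1"}),
    pa.field("inhabitants", pa.int32(), metadata={"PARQUET:field_id": "2"}),
])

# Pattern (2): the same columns with no ID metadata, which is what a plain
# DataFrame or a vendor-supplied Parquet file typically gives you.
without_ids = pa.schema([
    ("city", pa.string()),
    ("inhabitants", pa.int32()),
])
```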
On #219 the following approaches have been discussed so far:
As we entertain different ideas for reducing code duplication in the new visitor, we need to keep in mind that assigning fresh IDs works best in pre-order traversal; this is how _SetFreshIDs works now. All of the existing schema visitors discussed above that construct the NameMapping or the pyiceberg Schema run in post-order traversal, as the illustration below shows.
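A compact illustration of why the traversal order matters, using plain comments over a pyarrow schema rather than pyiceberg's actual visitor machinery:

```python
import pyarrow as pa

nested = pa.schema([
    pa.field("location", pa.struct([pa.field("lat", pa.float64())])),
    pa.field("name", pa.string()),
])

# Pre-order visits parents before children: location, location.lat, name.
# An incrementing counter then yields location=1, location.lat=2, name=3,
# so every struct gets a lower ID than the fields nested inside it.
#
# Post-order visits children before parents: location.lat, location, name.
# The same counter would number location.lat before location, which is why
# fresh-ID assignment fits naturally into a pre-order visitor.
```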