parquet: support setting the field_id with an ArrowWriter #4702
Comments
Couple of notes from digging into this: From https://iceberg.apache.org/spec/#column-projection:
So it would appear that field mappings are not strictly required to be present; this may be a way to avoid needing to rewrite data lacking such attributes. Additionally, also from https://iceberg.apache.org/spec/#column-projection:
This would appear to suggest that Iceberg only requires field IDs to be present for the bottom of the three-level list declaration.
I think the approach suggested in this PR is perfectly acceptable, as whilst it provides no mechanism to provide a field id for |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We would like to use the parquet files written from a set of arrow record batches as part of an apache-iceberg snapshot without modification. The apache-iceberg parquet specification requires that field-ids are present.
Describe the solution you'd like
The solution implemented by (at least) the Go parquet package seems reasonable. This uses a metadata value with the key PARQUET:field_id to determine the field_id when converting an Arrow schema into a Parquet schema. If there is no such metadata entry, then the field_id will not be present.
Describe alternatives you've considered
An alternative would be to add a mechanism to WriterProperties to specify the field_id to use with a column. This presumably would work in a similar manner to encoding.
Additional context
N/A
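To make the metadata-based proposal concrete, here is a minimal sketch of the lookup it describes, written against the standard library only. SketchField, field_id and parquet_field_ids are hypothetical names for illustration and are not the arrow-rs or parquet API; only the PARQUET:field_id key comes from the description above. Treating an unparsable value the same as a missing one is an assumption of this sketch.

```rust
use std::collections::HashMap;

/// Metadata key named in the issue description.
const PARQUET_FIELD_ID_KEY: &str = "PARQUET:field_id";

/// Hypothetical stand-in for an Arrow field: a name plus string metadata.
/// This is NOT the arrow-rs `Field` type.
pub struct SketchField {
    pub name: String,
    pub metadata: HashMap<String, String>,
}

/// Returns the field ID stored under `PARQUET:field_id`, if any.
/// Assumption: a value that does not parse as an i32 is treated as absent.
pub fn field_id(field: &SketchField) -> Option<i32> {
    field
        .metadata
        .get(PARQUET_FIELD_ID_KEY)?
        .parse::<i32>()
        .ok()
}

/// Pairs each field name with its optional field ID, mimicking what a
/// schema converter would attach to each Parquet column.
pub fn parquet_field_ids(fields: &[SketchField]) -> Vec<(String, Option<i32>)> {
    fields
        .iter()
        .map(|f| (f.name.clone(), field_id(f)))
        .collect()
}
```

A field without the metadata entry yields None, which corresponds to the field_id simply not being written, so existing schemas would be unaffected.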