Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

iceberg: add missing field to manifest_entry avro #21481

Merged

Conversation

andrwng
Copy link
Contributor

@andrwng andrwng commented Jul 17, 2024

Adds the file_sequence_number to the manifest_entry schema, as indicated by the spec[1].

It's unclear why the schema didn't have them in the first place (the schema was taken from DuckDB), but it's clearly defined upstream[2] and is important for efficiently writing metadata[3].

There is another field, distinct_counts, in the spec that is not in the manifest_entry schema. Interestingly enough, it doesn't appear to be defined at all in upstream Java code[4] (it is defined in iceberg-go, but not iceberg-python or iceberg-rust), so I'm leaving it out.

[1] https://iceberg.apache.org/spec/#manifests
[2] https://github.com/apache/iceberg/blob/319f29ea860e42e7cc21cda8c05d882134e6431f/core/src/main/java/org/apache/iceberg/ManifestEntry.java#L48-L49
[3] https://iceberg.apache.org/spec/#manifest-entry-fields
[4] https://github.com/apache/iceberg/blob/319f29ea860e42e7cc21cda8c05d882134e6431f/api/src/main/java/org/apache/iceberg/DataFile.java#L37-L105

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

  • none

Adds the file_sequence_number to the manifest_entry schema, as indicated by the
spec[1].

It's unclear why the schema didn't have them in the first place (the
schema was taken from DuckDB), but it's clearly defined upstream[2] and
is important for efficiently writing metadata[3].

There is another field, distinct_counts, in the spec that is not in the
manifest_entry schema. Interestingly enough, it doesn't appear to be
defined at all in upstream Java code[4] (it is defined in iceberg-go,
but not iceberg-python or iceberg-rust), so I'm leaving it out.

[1] https://iceberg.apache.org/spec/#manifests
[2] https://github.com/apache/iceberg/blob/319f29ea860e42e7cc21cda8c05d882134e6431f/core/src/main/java/org/apache/iceberg/ManifestEntry.java#L48-L49
[3] https://iceberg.apache.org/spec/#manifest-entry-fields
[4] https://github.com/apache/iceberg/blob/319f29ea860e42e7cc21cda8c05d882134e6431f/api/src/main/java/org/apache/iceberg/DataFile.java#L37-L105
@vbotbuildovich
Copy link
Collaborator

@andrwng andrwng merged commit 637d31f into redpanda-data:dev Jul 17, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants