Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spec disagrees with implementation on field names #8684

Closed
JFinis opened this issue Sep 29, 2023 · 3 comments · Fixed by #5338
Closed

Spec disagrees with implementation on field names #8684

JFinis opened this issue Sep 29, 2023 · 3 comments · Fixed by #5338

Comments

@JFinis
Copy link
Contributor

JFinis commented Sep 29, 2023

Apache Iceberg version

1.3.1 (latest release)

Query engine

Spark

Please describe the bug 🐞

The spec calls the following fields as follows:

504 added_files_count
505 existing_files_count
506 deleted_files_count

The implementation calls them as follows:

optional(504, "added_data_files_count"
  optional(505, "existing_data_files_count"
  optional(506, "deleted_data_files_count"

(The _data is in the implementation, but not in the spec)

And hence the produced iceberg use these field names, which deviate from the names in the spec. Either the spec or the implementation should be updated.

@JFinis JFinis changed the title Spec disagrees with implementation on file names Spec disagrees with implementation on field names Sep 29, 2023
@Fokko
Copy link
Contributor

Fokko commented Sep 29, 2023

Thanks for raising this Jan. I made a first attempt to fix this a while ago: #5338

For context, this has been changed with v2, because then there were both data and delete files. Since Iceberg does the lookup by field-id, this should not impose an issue, but it is good to get it fixed.

@JFinis
Copy link
Contributor Author

JFinis commented Oct 2, 2023

@Fokko What is the intended future of your PR? If I read the discussion in there correctly, then the spec is correct, as the same field is also used to track delete files, so data_ is fully incorrect here and thus instead the field names should be adapted to the spec.

Do I get the gist of the discussion correctly? If so, do I understand correctly that we all agree that this should be fixed in the implementation, it was just not merged, yet?

(I'm trying to gauge in which direction this will go, so I can adapt my implementation to either also write data_ or not.)

@Fokko
Copy link
Contributor

Fokko commented Oct 2, 2023

@JFinis You're in the right direction. data_ should be left out.

Do I get the gist of the discussion correctly? If so, do I understand correctly that we all agree that this should be fixed in the implementation, it was just not merged, yet?

It is on my list for today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants