Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix broken tombstones metadata when extended_file_metadata is different between tomstones in state #450

Merged
merged 2 commits into from
Oct 6, 2021

Conversation

mosyp
Copy link
Contributor

@mosyp mosyp commented Oct 6, 2021

This is a follow-up fix for the #427.
Although we don't write extended metadata fields if extended_file_metadata is false. The actual value is not mapped to false, hence the reader will see that the flags is true but the actual columns are missing from the schema.

Also in addition to that fix, we still want to prevent the broken tombstone to cause the error, hence we add the check on size column if extended_file_metadata=true, since this is the only required record and spark fails to read it if it's null and extended metadata is true.

@mosyp mosyp requested review from rtyler, xianwill and houqp October 6, 2021 16:14
xianwill
xianwill previously approved these changes Oct 6, 2021
@mosyp mosyp enabled auto-merge (squash) October 6, 2021 16:18
@mosyp mosyp requested a review from xianwill October 6, 2021 16:19
@mosyp mosyp merged commit 773864d into delta-io:main Oct 6, 2021
@mosyp mosyp deleted the fix-broken-tombstones-metadata branch October 6, 2021 16:38
// on the `extended_file_metadata` flag, hence this is safer to set `extended_file_metadata=false`
// and omit metadata columns if at least one remove action has `extended_file_metadata=false`.
// We've added the additional check on `size.is_some` because in delta-spark the primitive long type
// is used, hence we want to omit possible errors when `extended_file_metadata=true`, but `size=null`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

under what circumstance would we have remove file size set to null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in checkpoints created by delta-rs prior to this fix where in state we have both checkpoints with extended metadata true and false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this mapping https://github.com/delta-io/delta-rs/pull/450/files#diff-826e312bd2f2ca61f53dbe4a0fafeea492621f5b89c0f4de5a27442b23242b2aR186-R190 was not applied for these tombstones, we've ended up with metadata=true, but size is null

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is a in correct behavior is specific to delta-rs? The loop that forces all extended_file_metadata to Some(false) seems like a workaround that doesn't address the root cause of the problem. Do we know exactly which part of the delta-rs code base could create an invalid remove action in memory that has size=null and extended_file_metadata=true? I am concerned that this invalid remove action bug will cause other problems in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filed #466 for tracking purpose

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants