-
Notifications
You must be signed in to change notification settings - Fork 433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix broken tombstones metadata when extended_file_metadata is different between tomstones in state #450
Conversation
…nt between tomstones in state
// on the `extended_file_metadata` flag, hence this is safer to set `extended_file_metadata=false` | ||
// and omit metadata columns if at least one remove action has `extended_file_metadata=false`. | ||
// We've added the additional check on `size.is_some` because in delta-spark the primitive long type | ||
// is used, hence we want to omit possible errors when `extended_file_metadata=true`, but `size=null` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
under what circumstance would we have remove file size set to null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in checkpoints created by delta-rs prior to this fix where in state we have both checkpoints with extended metadata true and false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this mapping https://github.com/delta-io/delta-rs/pull/450/files#diff-826e312bd2f2ca61f53dbe4a0fafeea492621f5b89c0f4de5a27442b23242b2aR186-R190 was not applied for these tombstones, we've ended up with metadata=true, but size is null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is a in correct behavior is specific to delta-rs? The loop that forces all extended_file_metadata
to Some(false)
seems like a workaround that doesn't address the root cause of the problem. Do we know exactly which part of the delta-rs code base could create an invalid remove action in memory that has size=null
and extended_file_metadata=true
? I am concerned that this invalid remove action bug will cause other problems in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filed #466 for tracking purpose
This is a follow-up fix for the #427.
Although we don't write extended metadata fields if
extended_file_metadata
isfalse
. The actual value is not mapped to false, hence the reader will see that the flags istrue
but the actual columns are missing from the schema.Also in addition to that fix, we still want to prevent the broken tombstone to cause the error, hence we add the check on
size
column ifextended_file_metadata=true
, since this is the only required record and spark fails to read it if it's null and extended metadata is true.