Fix checkpoint compatibility for remove fields #427
…o writer-map-support
So am I correct that DBR 8.x is still trying to read these extension fields even though |
@houqp So from the testing, it seems like at some of DBR 8.x, if |
// create remove fields with or without extendedFileMetadata
let mut remove_fields = REMOVE_FIELDS.clone();
if use_extended_remove_schema {
    remove_fields.extend(REMOVE_EXTENDED_FILE_METADATA_FIELDS.clone());
}
@xianwill btw, wouldn't an easier hotfix be to just write extended_file_metadata=false without any of the other columns?
However, for full compatibility with Delta 1.0, I agree that we must include them.
@mosyp - Just setting extended_file_metadata to false doesn't fix the issue; the same error still happens. If it's false, the schema for the other three fields must be omitted entirely.
let use_extended_remove_schema = tombstones
    .iter()
    .all(|r| r.extended_file_metadata == Some(true));
Is that to avoid extending the Parquet schema with null metadata? I suppose that is what makes DBR 8.x fail.
Exactly - this is to avoid writing out the additional fields in the schema and prevent the break in DBR 8.x.
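For readers following the thread, here is a minimal, self-contained sketch of how the `all` check above behaves. The `Tombstone` struct is a hypothetical, pared-down stand-in for the real remove action type, and the function wrapper is added only for illustration. One point to keep in mind: `Iterator::all` returns `true` for an empty iterator, so a checkpoint with no tombstones would select the extended schema under this logic.

```rust
/// Hypothetical, pared-down stand-in for a remove action (tombstone).
#[derive(Debug, Clone)]
pub struct Tombstone {
    pub path: String,
    /// `None` when the writer did not record the flag at all.
    pub extended_file_metadata: Option<bool>,
}

/// Returns true only if every tombstone explicitly sets the flag to `true`.
/// A single `None` or `Some(false)` switches the whole checkpoint to the
/// basic remove schema. An empty slice yields `true` (vacuous truth).
pub fn use_extended_remove_schema(tombstones: &[Tombstone]) -> bool {
    tombstones
        .iter()
        .all(|r| r.extended_file_metadata == Some(true))
}
```

Comparing against `Some(true)` treats a missing flag the same as an explicit `false`, which matches the intent stated in this thread: any remove without extended metadata keeps the extra fields out of the schema.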
@houqp - here is the comment from the reference code base, with the relevant portion emphasized:
So I am reading this to mean - if |
@mosyp - I tested with 9.0 and that failed as well when the fields are present.
* Bump arrow deps and bring map support to schema
* Fix datafustion deps
* Fix checkpoint and timestamp bugs (#351)
* post merge fixes
* Add tests for new checkpoint API
* Post merge from main
* Reverse integrate main to writer-map-support
* post merge fixes
* cargo fmt
* Fix checkpoint compatibility for remove fields (#427)
* Add datafusion PR link
Co-authored-by: Christian Williams <christianw@scribd.com>
Co-authored-by: xianwill <christianwilliams79@gmail.com>
Description
NOTE: This PR targets the writer-map-support branch. The affected fields of the remove action are map types, so they are not present or testable on main.
We discovered a production bug in which delta-rs checkpoints that include remove actions added by DBR 7.x (OPTIMIZE) are not readable by DBR cluster versions >= 8.x. The log corruption is due to null extended file metadata fields in the checkpoint schema for remove actions. DBR 7.x does not write extended file metadata for removes. Because these fields are always included in the checkpoint schema written by delta-rs, we write them as nulls, and DBR 8.x cannot read them back.
This PR addresses the issue by conditionally selecting the remove schema to use. If all remove actions in the checkpoint include extended file metadata, we include extended file metadata in the remove schema. Otherwise, we do not.
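The selection described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's actual code: the real implementation builds schema field definitions, while this sketch uses plain field-name lists. The constant names mirror those in the diff, but their contents here are assumed from the remove action fields in the Delta protocol, and `remove_schema_fields` is a hypothetical helper.

```rust
/// Base remove fields. Contents are an assumption for illustration,
/// taken from the Delta protocol's remove action.
const REMOVE_FIELDS: &[&str] = &[
    "path",
    "deletionTimestamp",
    "dataChange",
    "extendedFileMetadata",
];

/// Fields that are only meaningful when extendedFileMetadata is true.
const REMOVE_EXTENDED_FILE_METADATA_FIELDS: &[&str] = &["partitionValues", "size", "tags"];

/// Hypothetical helper: returns the remove field names to put in the
/// checkpoint schema. The extended fields are appended only when the
/// caller has determined that every remove action carries them.
pub fn remove_schema_fields(use_extended_remove_schema: bool) -> Vec<&'static str> {
    let mut fields = REMOVE_FIELDS.to_vec();
    if use_extended_remove_schema {
        fields.extend_from_slice(REMOVE_EXTENDED_FILE_METADATA_FIELDS);
    }
    fields
}
```

The decision is made once per checkpoint rather than per row because a Parquet file has a single schema: either the three extended columns exist for all remove rows or for none.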
Four fields are related to remove action extended file metadata. extendedFileMetadata specifies whether partitionValues, size and tags are included. Following comments in the reference implementation, new writers should always write this field. The other three fields should be omitted from the schema if extendedFileMetadata is false.
Documentation