Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix checkpoint compatibility for remove fields #427

Merged

Conversation

xianwill
Copy link
Collaborator

Description

NOTE: This PR targets the writer-map-support branch. The affecting fields of the remove action are map types, so they are not present or testable on main.

We discovered a production bug in which delta-rs checkpoints that include remove actions added by DBR 7.x (OPTIMIZE), are not readable by DBR cluster versions >= 8.x. The log corruption is due to null extended file metadata fields in the checkpoint schema for remove actions. DBR 7.x does not write extended file metadata for removes. Since they are always included in the checkpoint schema written by delta-rs, we write them as nulls, and DBR 8.x cannot read them back.

This PR addresses the issue by conditionally selecting the remove schema to use. If all remove actions in the checkpoint include extended file metadata, we include extended file metadata in the remove schema. Otherwise, we do not.

Four fields are related to remove action extended file metadata. extendedFileMetadata specifies whether partitionValues, size and tags are included. Following comments in the reference implementation, new writers should always write this field. The other three fields should be omitted from the schema if extendedFileMetadata is false.

Documentation

@xianwill xianwill requested review from rtyler, houqp and mosyp September 12, 2021 17:25
@houqp
Copy link
Member

houqp commented Sep 13, 2021

So am I correct that DBR 8.x is still trying to read these extension fields even though extendedFileMetadata is set to false for a particular remove action? This seems to contradict with the comment in the reference code base.

@mosyp
Copy link
Contributor

mosyp commented Sep 13, 2021

@houqp So from the testing, it seems like at some of DBR 8.x, if extendedFileMetadata is false then there's should not be any of size, partitionValues and tags columns. Otherwise, even if they're null, spark fails. Although, I see that BDR 9.0 succeeds with it. But I feel like for the compability purposes we should omit them writing if extendedFileMetadata is false.

Comment on lines +375 to +380
// create remove fields with or without extendedFileMetadata
let mut remove_fields = REMOVE_FIELDS.clone();
if use_extended_remove_schema {
remove_fields.extend(REMOVE_EXTENDED_FILE_METADATA_FIELDS.clone());
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xianwill btw, wouldn't the easier hotfix will be a just write a extended_file_metadata=false without any other columns

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

however for the full compitability with delta 1.0 I agree that we must include them

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mosyp - Just setting extended_file_metadata to false doesn't fix the issue. Same error still happens. If its false, the schema for the other three must be omitted entirely.

Comment on lines +175 to +177
let use_extended_remove_schema = tombstones
.iter()
.all(|r| r.extended_file_metadata == Some(true));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is to avoid extending parquet schema with null metadata? E.g. so it'll make DBR 8.x to fail I suppose

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly - this is to avoid writing out the additional fields in the schema and prevent the break in DBR 8.x.

@xianwill
Copy link
Collaborator Author

xianwill commented Sep 13, 2021

So am I correct that DBR 8.x is still trying to read these extension fields even though extendedFileMetadata is set to false for a particular remove action? This seems to contradict with the comment in the reference code base.

@houqp - here is comment from reference code base with emphasized portion:

Note that for protocol compatibility reasons, the fields partitionValues, size, and tags
are only present when the extendedFileMetadata flag is true
. New writers should generally be
setting this flag, but old writers (and FSCK) won't, so readers must check this flag before
attempting to consume those values.

So I am reading this to mean - if extendedFileMetadata is false, you must completely omit the fields from the schema. Setting values to null is not good enough. The tests bear that out as well.

@xianwill
Copy link
Collaborator Author

@houqp So from the testing, it seems like at some of DBR 8.x, if extendedFileMetadata is false then there's should not be any of size, partitionValues and tags columns. Otherwise, even if they're null, spark fails. Although, I see that BDR 9.0 succeeds with it. But I feel like for the compability purposes we should omit them writing if extendedFileMetadata is false.

@mosyp - i tested w/ 9.0 and that failed as well when fields are present.

@xianwill xianwill merged commit eb0ecf7 into delta-io:writer-map-support Sep 13, 2021
@xianwill xianwill deleted the bugfix/writer-map-support branch September 14, 2021 12:57
mosyp added a commit that referenced this pull request Sep 14, 2021
* Bump arrow deps and bring map support to schema

* Fix datafustion deps

* Fix checkpoint and timestamp bugs (#351)

* post merge fixes

* Add tests for new checkpoint API

* Post merge from main

* Reverse integrate main to writer-map-support

* post merge fixes

* cargo fmt

* Fix checkpoint compatibility for remove fields (#427)

* Add datafusion PR link

Co-authored-by: Christian Williams <christianw@scribd.com>
Co-authored-by: xianwill <christianwilliams79@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants