Robust schema usage and checks for batch CDF queries #1509
*
* However, if there are schema changes between analysis and execution, since we froze this
* schema, our schema incompatibility checks will kick in during the scan so we will always
* be safe - Although it is a notable caveat that user should be aware of because the CDC query
Are users aware of this? Should we update any documentation to make this clearer?
Nothing is documented yet for this behavior, but we probably should. cc @jose-torres
* may break.
*/
private lazy val endingVersionForBatchSchema: Long = endingVersion.map { v =>
// As defined in docs, if ending version is greater than the latest version, we will just use
Which docs? Method docs? Public website docs?
Method doc. I could make it clearer.
sql(
  s"""
     |ALTER TABLE delta.`${dir.getCanonicalPath}` ADD COLUMN (name string)
     |""".stripMargin)
Can this be 1 line instead of 4?
LGTM! Left minor comments.
Description
Properly handle read-incompatible schema changes when querying batch CDF (e.g. the `table_changes()` TVF and the SQL/DF APIs). Right now, batch CDF almost always serves past data using the latest schema, but the latest schema may not be read-compatible with the data files.

This PR introduces the following:

A `changeDataFeed.defaultSchemaModeForColumnMappingTable` option that can be set to `endVersion`, `latest` or `legacy`. If `legacy` is set, it falls back to the current behavior, in which either the latest schema or a time-travel version's schema is used. If `endVersion` is set, we use the end version's schema to serve the batch. If `latest` is set, the latest schema is always used. Note that this is orthogonal to 1): checks are always triggered to ensure read-compatibility, even when `endVersion` is used.

[...] `versionOf` is specified. Apparently people can time-travel the schema while querying batch CDF. This is probably unintentional, but since it exists, I explicitly blocked the two options from being used concurrently.

How was this patch tested?
New Unit tests.
Does this PR introduce any user-facing changes?
No
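
For readers following the schema modes listed under Description, here is a minimal sketch of how the three values might map to the table version whose schema serves a batch CDF query. It is not the code from this PR: `SchemaMode`, `BatchCdfSchema.schemaVersion` and all parameter names are hypothetical. It also assumes that the two options the PR blocks from being combined are a non-`legacy` mode and `versionOf`, and that `endVersion` mode falls back to the latest version when no end version is given.

```scala
// Hypothetical sketch only: these names are illustrative, not Delta Lake's API.
sealed trait SchemaMode
object SchemaMode {
  case object EndVersion extends SchemaMode
  case object Latest extends SchemaMode
  case object Legacy extends SchemaMode

  // Parse the configured value; case-insensitive matching is an assumption.
  def parse(value: String): SchemaMode = value.toLowerCase match {
    case "endversion" => EndVersion
    case "latest"     => Latest
    case "legacy"     => Legacy
    case other        => throw new IllegalArgumentException(s"Unknown schema mode: $other")
  }
}

object BatchCdfSchema {
  /** Returns the table version whose schema would be used to serve the batch. */
  def schemaVersion(
      mode: SchemaMode,
      latestVersion: Long,
      endVersion: Option[Long],
      versionOf: Option[Long]): Long = {
    // Assumption: a non-legacy mode may not be combined with a time-travel version.
    require(
      mode == SchemaMode.Legacy || versionOf.isEmpty,
      "schema mode and versionOf cannot be used together")

    mode match {
      // Legacy: the time-travel version's schema if given, otherwise the latest schema.
      case SchemaMode.Legacy     => versionOf.getOrElse(latestVersion)
      // EndVersion: the end version's schema (assumed fallback: latest).
      case SchemaMode.EndVersion => endVersion.getOrElse(latestVersion)
      // Latest: always the latest schema.
      case SchemaMode.Latest     => latestVersion
    }
  }
}
```

In this sketch the read-compatibility checks mentioned in the description are deliberately left out, since they run regardless of which mode is chosen.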