-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Protocol update for column defaults #2240
Conversation
cc @tdas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @felipepessoto for your review, responded to your comments, please take another look!
PROTOCOL.md
Outdated
|
||
When enabled: | ||
- The `metadata` for the column in the table schema MAY contain the key `CURRENT_DEFAULT`. | ||
- The value of `CURRENT_DEFAULT` SHOULD be parsed as a SQL expression. Any engine that assigns this value can use its own SQL dialect of choice to represent the expression as a string, and use that same dialect to evaluate that expression later for future writes. If one engine writes the string metadata using its own SQL dialect and another engine then consumes it later when performing writes, the results are undefined. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any engine that assigns this value can use its own SQL dialect of choice to represent the expression as a string, and use that same dialect to evaluate that expression later for future writes. If one engine writes the string metadata using its own SQL dialect and another engine then consumes it later when performing writes, the results are undefined.
For other features, like invariant and check constraint, we only say SQL expression, I assume it would be standard SQL. Accepting any SQL dialect seems dangerous to me as it would make the table dependent on a specific engine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I updated this to just say "SQL expression" like we do for the "Generated Columns" section above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dtenedor, at #2238 you mentioned this https://issues.apache.org/jira/browse/SPARK-38334. Does it mean we are following the same behavior?
It says when you add a new column, the existing rows are updated with the default value, I think we need to describe this behavior in the protocol:
CREATE TABLE T(a INT, b INT NOT NULL);
-- The default default is NULL
INSERT INTO T VALUES (DEFAULT, 0);
INSERT INTO T(b) VALUES (1);
SELECT * FROM T;
(NULL, 0)
(NULL, 1)
-- Adding a default to a table with rows, sets the values for the
-- existing rows (exist default) and new rows (current default).
ALTER TABLE T ADD COLUMN c INT DEFAULT 5;
INSERT INTO T VALUES (1, 2, DEFAULT);
SELECT * FROM T;
(NULL, 0, 5)
(NULL, 1, 5)
(1, 2, 5)
@felipepessoto good point, we are not including this part for the Delta Lake data source type in Spark because it adds a lot of complexity on the reader path. We will follow the rest of the Spark specification for the writer path (e.g. for future DML commands like INSERT and UPDATE) since the implementation is much simpler. |
Got it. I think we need to explicitly mention that, as most of users will assume the same behavior as Spark. |
@dtenedor. Can we do this in a follow up PR? I think this is important and can cause confusion to library authors. |
@felipepessoto yes, sorry I missed your latest comment. I will prepare a follow-up PR to fix this. |
@felipepessoto here's the requested update with respect to read operations: #2266 |
…support column default values. This PR updates the description to explicitly mention that this feature only applies to write operations, not reads. Closes #2266 GitOrigin-RevId: a1749871b8d2ae33496136a8bf26bd7ac2bd4437
Description
Update Delta log protocol to support column default values.
This will support column default values for Delta Lake tables.
Users should be able to associate default values with Delta Lake columns at table creation time or thereafter.
Support for column defaults is a key requirement to facilitate updating the table schema over time and performing DML operations on wide tables with sparse data.
Please refer to an open design doc here.
How was this patch tested?
N/A
Does this PR introduce any user-facing changes?
No, this is just a protocol change.