
Protocol update for column defaults #2240

Closed
wants to merge 8 commits

Conversation

@dtenedor (Contributor) commented Oct 25, 2023

Description

Update Delta log protocol to support column default values.

This will support column default values for Delta Lake tables.
Users should be able to associate default values with Delta Lake columns at table creation time or thereafter.

Support for column defaults is a key requirement to facilitate updating the table schema over time and performing DML operations on wide tables with sparse data.
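
As a concrete illustration, a minimal sketch assuming Spark SQL syntax on a Delta table (the table and column names are hypothetical, and whether a given engine accepts this syntax depends on its support for the feature):

CREATE TABLE events (id BIGINT, status STRING DEFAULT 'active') USING DELTA;

-- Associate or change a default after table creation:
ALTER TABLE events ALTER COLUMN status SET DEFAULT 'inactive';

-- Writes that omit the column would then pick up the default:
INSERT INTO events (id) VALUES (1);  -- status is written as 'inactive'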

Please refer to an open design doc here.

How was this patch tested?

N/A

Does this PR introduce any user-facing changes?

No, this is just a protocol change.

@dtenedor (Contributor, Author)

cc @tdas

@dtenedor (Contributor, Author) left a comment


Thanks @felipepessoto for your review; I responded to your comments, please take another look!

PROTOCOL.md (outdated excerpt):

When enabled:
- The `metadata` for the column in the table schema MAY contain the key `CURRENT_DEFAULT`.
- The value of `CURRENT_DEFAULT` SHOULD be parsed as a SQL expression. Any engine that assigns this value can use its own SQL dialect of choice to represent the expression as a string, and use that same dialect to evaluate that expression later for future writes. If one engine writes the string metadata using its own SQL dialect and another engine then consumes it later when performing writes, the results are undefined.
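
For illustration only, a minimal sketch of how an engine might record this key, assuming Spark SQL syntax and a hypothetical table `t`; the JSON rendering of the schema field is an assumption about layout, not text from the protocol:

CREATE TABLE t (c INT) USING DELTA;
ALTER TABLE t ALTER COLUMN c SET DEFAULT 42;

-- Conceptually, the schema field for `c` could then carry the expression
-- string in its column metadata, along the lines of:
--   {"name": "c", "type": "integer", "nullable": true,
--    "metadata": {"CURRENT_DEFAULT": "42"}}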
@felipepessoto (Contributor) commented Oct 26, 2023


Any engine that assigns this value can use its own SQL dialect of choice to represent the expression as a string, and use that same dialect to evaluate that expression later for future writes. If one engine writes the string metadata using its own SQL dialect and another engine then consumes it later when performing writes, the results are undefined.

For other features, like invariants and check constraints, we only say "SQL expression"; I assume it would be standard SQL. Accepting any SQL dialect seems dangerous to me, as it would make the table dependent on a specific engine.

@dtenedor (Contributor, Author)


OK, I updated this to just say "SQL expression" like we do for the "Generated Columns" section above.

@felipepessoto (Contributor) left a comment


@dtenedor, at #2238 you mentioned https://issues.apache.org/jira/browse/SPARK-38334. Does this mean we are following the same behavior?

It says that when you add a new column, the existing rows are updated with the default value. I think we need to describe this behavior in the protocol:

CREATE TABLE T(a INT, b INT NOT NULL);

-- The default default is NULL
INSERT INTO T VALUES (DEFAULT, 0);
INSERT INTO T(b) VALUES (1);
SELECT * FROM T;
(NULL, 0)
(NULL, 1)

-- Adding a default to a table with rows sets the values for the
-- existing rows (exist default) and new rows (current default).
ALTER TABLE T ADD COLUMN c INT DEFAULT 5;
INSERT INTO T VALUES (1, 2, DEFAULT);
SELECT * FROM T;
(NULL, 0, 5)
(NULL, 1, 5)
(1, 2, 5) 

@dtenedor (Contributor, Author) commented Oct 31, 2023

It says that when you add a new column, the existing rows are updated with the default value. I think we need to describe this behavior in the protocol.

@felipepessoto good point, we are not including this part for the Delta Lake data source type in Spark because it adds a lot of complexity on the reader path. We will follow the rest of the Spark specification for the writer path (e.g. for future DML commands like INSERT and UPDATE) since the implementation is much simpler.
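
A minimal sketch of the writer-path-only behavior described above, assuming Spark SQL syntax on a Delta table; the treatment of existing rows is an interpretation of the comment, not protocol text:

CREATE TABLE t (a INT, b INT) USING DELTA;
INSERT INTO t VALUES (1, 1);

-- Changing the default only updates column metadata; existing data files
-- are not rewritten and no "exist default" is applied on the read path.
ALTER TABLE t ALTER COLUMN b SET DEFAULT 42;

INSERT INTO t (a) VALUES (2);  -- the writer fills in b = 42
SELECT * FROM t;
-- (1, 1)   existing row is returned unchanged
-- (2, 42)  new row was written with the default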

@felipepessoto (Contributor)

Got it. I think we need to explicitly mention that, as most users will assume the same behavior as Spark.

@felipepessoto (Contributor)

Got it. I think we need to explicitly mention that, as most users will assume the same behavior as Spark.

@dtenedor, can we do this in a follow-up PR? I think this is important and can cause confusion for library authors.

@dtenedor (Contributor, Author) commented Nov 3, 2023

@dtenedor, can we do this in a follow-up PR? I think this is important and can cause confusion for library authors.

@felipepessoto yes, sorry I missed your latest comment. I will prepare a follow-up PR to fix this.

@dtenedor (Contributor, Author) commented Nov 6, 2023

@felipepessoto here's the requested update with respect to read operations: #2266

tdas pushed a commit that referenced this pull request Nov 14, 2023
…support column default values. This PR updates the description to explicitly mention that this feature only applies to write operations, not reads.

Closes #2266

GitOrigin-RevId: a1749871b8d2ae33496136a8bf26bd7ac2bd4437