
Protocol update for column defaults #2240

Closed
wants to merge 8 commits

Conversation

@dtenedor (Contributor) commented Oct 25, 2023

Description

Update Delta log protocol to support column default values.

This will support column default values for Delta Lake tables.
Users should be able to associate default values with Delta Lake columns at table creation time or thereafter.

Support for column defaults is a key requirement to facilitate updating the table schema over time and performing DML operations on wide tables with sparse data.
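
As a concrete illustration, a minimal sketch assuming Spark SQL syntax on a Delta table (the table and column names are hypothetical, and whether a given engine accepts this syntax depends on its support for the feature):

CREATE TABLE events (id BIGINT, status STRING DEFAULT 'active') USING DELTA;

-- Associate or change a default after table creation:
ALTER TABLE events ALTER COLUMN status SET DEFAULT 'inactive';

-- Writes that omit the column would then pick up the default:
INSERT INTO events (id) VALUES (1);  -- status is written as 'inactive'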

Please refer to an open design doc here.

How was this patch tested?

N/A

Does this PR introduce any user-facing changes?

No, this is just a protocol change.

@dtenedor (Contributor, Author)

cc @tdas

@dtenedor (Contributor, Author) left a comment


Thanks @felipepessoto for your review; I responded to your comments, please take another look!

PROTOCOL.md (outdated excerpt):

When enabled:
- The `metadata` for the column in the table schema MAY contain the key `CURRENT_DEFAULT`.
- The value of `CURRENT_DEFAULT` SHOULD be parsed as a SQL expression. Any engine that assigns this value can use its own SQL dialect of choice to represent the expression as a string, and use that same dialect to evaluate that expression later for future writes. If one engine writes the string metadata using its own SQL dialect and another engine then consumes it later when performing writes, the results are undefined.
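
For illustration only, a minimal sketch of how an engine might record this key, assuming Spark SQL syntax and a hypothetical table `t`; the JSON rendering of the schema field is an assumption about layout, not text from the protocol:

CREATE TABLE t (c INT) USING DELTA;
ALTER TABLE t ALTER COLUMN c SET DEFAULT 42;

-- Conceptually, the schema field for `c` could then carry the expression
-- string in its column metadata, along the lines of:
--   {"name": "c", "type": "integer", "nullable": true,
--    "metadata": {"CURRENT_DEFAULT": "42"}}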
@felipepessoto (Contributor) commented Oct 26, 2023


Any engine that assigns this value can use its own SQL dialect of choice to represent the expression as a string, and use that same dialect to evaluate that expression later for future writes. If one engine writes the string metadata using its own SQL dialect and another engine then consumes it later when performing writes, the results are undefined.

For other features, like invariants and check constraints, we only say "SQL expression"; I assume it would be standard SQL. Accepting any SQL dialect seems dangerous to me, as it would make the table dependent on a specific engine.

@dtenedor (Contributor, Author)


OK, I updated this to just say "SQL expression" like we do for the "Generated Columns" section above.

@felipepessoto (Contributor) left a comment


@dtenedor, at #2238 you mentioned https://issues.apache.org/jira/browse/SPARK-38334. Does this mean we are following the same behavior?

It says that when you add a new column, the existing rows are updated with the default value. I think we need to describe this behavior in the protocol:

CREATE TABLE T(a INT, b INT NOT NULL);

-- The default default is NULL
INSERT INTO T VALUES (DEFAULT, 0);
INSERT INTO T(b) VALUES (1);
SELECT * FROM T;
(NULL, 0)
(NULL, 1)

-- Adding a default to a table with rows sets the values for the
-- existing rows (exist default) and new rows (current default).
ALTER TABLE T ADD COLUMN c INT DEFAULT 5;
INSERT INTO T VALUES (1, 2, DEFAULT);
SELECT * FROM T;
(NULL, 0, 5)
(NULL, 1, 5)
(1, 2, 5) 

@dtenedor (Contributor, Author) commented Oct 31, 2023

It says that when you add a new column, the existing rows are updated with the default value. I think we need to describe this behavior in the protocol.

@felipepessoto good point, we are not including this part for the Delta Lake data source type in Spark because it adds a lot of complexity on the reader path. We will follow the rest of the Spark specification for the writer path (e.g. for future DML commands like INSERT and UPDATE) since the implementation is much simpler.
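
A minimal sketch of the writer-path-only behavior described above, assuming Spark SQL syntax on a Delta table; the treatment of existing rows is an interpretation of the comment, not protocol text:

CREATE TABLE t (a INT, b INT) USING DELTA;
INSERT INTO t VALUES (1, 1);

-- Changing the default only updates column metadata; existing data files
-- are not rewritten and no "exist default" is applied on the read path.
ALTER TABLE t ALTER COLUMN b SET DEFAULT 42;

INSERT INTO t (a) VALUES (2);  -- the writer fills in b = 42
SELECT * FROM t;
-- (1, 1)   existing row is returned unchanged
-- (2, 42)  new row was written with the default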

@felipepessoto (Contributor)

Got it. I think we need to explicitly mention that, as most users will assume the same behavior as Spark.

@felipepessoto (Contributor)

Got it. I think we need to explicitly mention that, as most users will assume the same behavior as Spark.

@dtenedor, can we do this in a follow-up PR? I think this is important and can cause confusion for library authors.

@dtenedor (Contributor, Author) commented Nov 3, 2023

@dtenedor, can we do this in a follow-up PR? I think this is important and can cause confusion for library authors.

@felipepessoto yes, sorry I missed your latest comment. I will prepare a follow-up PR to fix this.

@dtenedor (Contributor, Author) commented Nov 6, 2023

@felipepessoto here's the requested update with respect to read operations: #2266

tdas pushed a commit that referenced this pull request Nov 14, 2023
…support column default values. This PR updates the description to explicitly mention that this feature only applies to write operations, not reads.

Closes #2266

GitOrigin-RevId: a1749871b8d2ae33496136a8bf26bd7ac2bd4437