-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROTOCOL RFC] Variant data type #2867
Conversation
protocol_rfcs/variant-type.md
Outdated
Struct fields which start with `_` (underscore) can be safely ignored. | ||
The only non-ignorable fields must be `value` and `metadata`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to say (somewhat implicitly) that any field that is not value
or metadata
should start with an underscore? I don't see why that is necessary, if we define that only value
and metadata
should be non-ignorable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something we would like to support in the future is to introduce another non-ignorable struct field. That would require a new table feature, but it would depend on this table feature. In that case, how can we specify in the protocol that the new table feature would use an additional, non-ignorable field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Delta spec already states that clients should ignore unrecognized fields, does that not suffice?
If the idea is to carve out a "safe" namespace for custom fields, then _
prefix seems reasonable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I removed this unnecessary line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bart-samwel Thanks! I updated the PR.
I also left a question about how we could specify the non-ignorable fields spec regarding potential future enhancements?
protocol_rfcs/variant-type.md
Outdated
Struct fields which start with `_` (underscore) can be safely ignored. | ||
The only non-ignorable fields must be `value` and `metadata`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something we would like to support in the future is to introduce another non-ignorable struct field. That would require a new table feature, but it would depend on this table feature. In that case, how can we specify in the protocol that the new table feature would use an additional, non-ignorable field?
protocol_rfcs/variant-type.md
Outdated
|
||
## Variant data in Parquet | ||
|
||
The Variant data type is represented as two binary encoded values, according to the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Variant data type is represented as two binary encoded values, according to the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md). | |
The Variant data type is represented as two binary encoded values, according to the [Spark Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
protocol_rfcs/variant-type.md
Outdated
Struct fields which start with `_` (underscore) can be safely ignored. | ||
The only non-ignorable fields must be `value` and `metadata`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Delta spec already states that clients should ignore unrecognized fields, does that not suffice?
If the idea is to carve out a "safe" namespace for custom fields, then _
prefix seems reasonable?
protocol_rfcs/variant-type.md
Outdated
## Reader Requirements for Variant Data Type | ||
|
||
When Variant type is supported (`readerFeatures` field of a table's `protocol` action contains `variantType`), readers: | ||
- must be able to read the two parquet struct fields, `value` and `metadata` and interpret them as a Variant in concordance with the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As worded here, this "must be able to read... and interpret... in accordance with the... specification" would seem to contradict the next line that allows to not fully support the logical datatype?
I guess the main requirements are really:
- Reader must recognize and tolerate a
variant
data type in a Delta schema - Reader must use the correct physical schema (struct-of-binary) when reading a Variant column from file
- Reader must make the column available to queries:
- [Recommended] Expose the logical Variant column, if the engine supports Variant.
- [Alternate] Expose the raw physical column, e.g. if the engine does not support Variant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scovich Thanks! Updated the PR.
protocol_rfcs/variant-type.md
Outdated
|
||
## Variant data in Parquet | ||
|
||
The Variant data type is represented as two binary encoded values, according to the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
protocol_rfcs/variant-type.md
Outdated
## Reader Requirements for Variant Data Type | ||
|
||
When Variant type is supported (`readerFeatures` field of a table's `protocol` action contains `variantType`), readers: | ||
- must be able to read the two parquet struct fields, `value` and `metadata` and interpret them as a Variant in concordance with the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Updated.
protocol_rfcs/variant-type.md
Outdated
Struct fields which start with `_` (underscore) can be safely ignored. | ||
The only non-ignorable fields must be `value` and `metadata`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I removed this unnecessary line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. One question.
- must use the correct physical schema (struct-of-binary, with fields `value` and `metadata`) when reading a Variant data type from file | ||
- must make the column available to the engine: | ||
- [Recommended] Expose and interpret the struct-of-binary as a single Variant field in accordance with the [Spark Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md). | ||
- [Alternate] Expose the raw physical struct-of-binary, e.g. if the engine does not support Variant. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it also an acceptable alternate, to expose a string column by internally converting the struct-of-binary? Gives up all the benefits of the variant encoding, but maximizes compat with engines that don't allow users to load the library code that would interpret the struct-of-binary directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that should work. I added that as another alternate.
protocol_rfcs/README.md
Outdated
@@ -22,6 +22,7 @@ Here is the history of all the RFCs propose/accepted/rejected since Feb 6, 2024, | |||
| 2023-02-09 | [type-widening.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/type-widening.md) | https://github.com/delta-io/delta/issues/2623 | Type Widening | | |||
| 2023-02-14 | [managed-commits.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/managed-commits.md) | https://github.com/delta-io/delta/issues/2598 | Managed Commits | | |||
| 2023-02-26 | [column-mapping-usage.tracking.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/column-mapping-usage-tracking.md)) | https://github.com/delta-io/delta/issues/2682 | Column Mapping Usage Tracking | | |||
| 2023-04-08 | [variant-type.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/variant-type.md) | https://github.com/delta-io/delta/issues/2864 | Variant Data Type | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the gap in this date, and when we are ready to merge is large. can you update this to todays; date
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the date.
…2923) <!-- Thanks for sending a pull request! Here are some tips for you: 1. If this is your first time, please read our contributor guidelines: https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md 2. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP] Your PR title ...'. 3. Be sure to keep the PR description updated to reflect all changes. 4. Please write your PR title to summarize what this PR proposes. 5. If possible, provide a concise example to reproduce the issue for a faster review. 6. If applicable, include the corresponding issue number in the PR title and link it in the body. --> #### Which Delta project/connector is this regarding? <!-- Please add the component selected below to the beginning of the pull request title For example: [Spark] Title of my pull request --> - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description <!-- - Describe what this PR changes. - Describe why we need the change. If this PR resolves an issue be sure to include "Resolves #XXX" to correctly link and close the issue upon merge. --> Adds the variant table feature to minimally implement the variant type as described in the RFC in #2867. Also disables using variant columns as partition columns. ## How was this patch tested? Added some UTs. More UTs will be merged in followup PRs tested against both spark 3.5 and 4.0 snapshot with ``` build/sbt -DsparkVersion=latest spark/'testOnly org.apache.spark.sql.delta.DeltaVariantSuite' build/sbt -DsparkVersion=master spark/'testOnly org.apache.spark.sql.delta.DeltaVariantSuite' ``` ## Does this PR introduce _any_ user-facing changes? <!-- If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible. If possible, please also clarify if this is a user-facing change compared to the released Delta Lake versions or within the unreleased branches such as master. If no, write 'No'. --> no
Which Delta project/connector is this regarding?
Description
Adds the protocol changes for the Variant data type (see #2864) to the RFC folder.
How was this patch tested?
N/A
Does this PR introduce any user-facing changes?
N/A