Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROTOCOL RFC] Variant data type #2867

Merged
merged 7 commits into from
Apr 25, 2024
Merged

[PROTOCOL RFC] Variant data type #2867

merged 7 commits into from
Apr 25, 2024

Conversation

gene-db
Copy link
Contributor

@gene-db gene-db commented Apr 8, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Adds the protocol changes for the Variant data type (see #2864) to the RFC folder.

How was this patch tested?

N/A

Does this PR introduce any user-facing changes?

N/A

protocol_rfcs/variant-type.md Outdated Show resolved Hide resolved
protocol_rfcs/variant-type.md Outdated Show resolved Hide resolved
protocol_rfcs/variant-type.md Outdated Show resolved Hide resolved
protocol_rfcs/variant-type.md Outdated Show resolved Hide resolved
protocol_rfcs/variant-type.md Outdated Show resolved Hide resolved
Comment on lines 59 to 60
Struct fields which start with `_` (underscore) can be safely ignored.
The only non-ignorable fields must be `value` and `metadata`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to say (somewhat implicitly) that any field that is not value or metadata should start with an underscore? I don't see why that is necessary, if we define that only value and metadata should be non-ignorable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something we would like to support in the future is to introduce another non-ignorable struct field. That would require a new table feature, but it would depend on this table feature. In that case, how can we specify in the protocol that the new table feature would use an additional, non-ignorable field?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Delta spec already states that clients should ignore unrecognized fields, does that not suffice?
If the idea is to carve out a "safe" namespace for custom fields, then _ prefix seems reasonable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I removed this unnecessary line.

protocol_rfcs/variant-type.md Outdated Show resolved Hide resolved
protocol_rfcs/variant-type.md Outdated Show resolved Hide resolved
protocol_rfcs/variant-type.md Outdated Show resolved Hide resolved
Copy link
Contributor Author

@gene-db gene-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bart-samwel Thanks! I updated the PR.

I also left a question about how we could specify the non-ignorable fields spec regarding potential future enhancements?

protocol_rfcs/variant-type.md Outdated Show resolved Hide resolved
protocol_rfcs/variant-type.md Outdated Show resolved Hide resolved
protocol_rfcs/variant-type.md Outdated Show resolved Hide resolved
protocol_rfcs/variant-type.md Outdated Show resolved Hide resolved
protocol_rfcs/variant-type.md Outdated Show resolved Hide resolved
protocol_rfcs/variant-type.md Outdated Show resolved Hide resolved
protocol_rfcs/variant-type.md Outdated Show resolved Hide resolved
Comment on lines 59 to 60
Struct fields which start with `_` (underscore) can be safely ignored.
The only non-ignorable fields must be `value` and `metadata`.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something we would like to support in the future is to introduce another non-ignorable struct field. That would require a new table feature, but it would depend on this table feature. In that case, how can we specify in the protocol that the new table feature would use an additional, non-ignorable field?

protocol_rfcs/variant-type.md Outdated Show resolved Hide resolved
protocol_rfcs/variant-type.md Outdated Show resolved Hide resolved

## Variant data in Parquet

The Variant data type is represented as two binary encoded values, according to the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The Variant data type is represented as two binary encoded values, according to the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).
The Variant data type is represented as two binary encoded values, according to the [Spark Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 59 to 60
Struct fields which start with `_` (underscore) can be safely ignored.
The only non-ignorable fields must be `value` and `metadata`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Delta spec already states that clients should ignore unrecognized fields, does that not suffice?
If the idea is to carve out a "safe" namespace for custom fields, then _ prefix seems reasonable?

## Reader Requirements for Variant Data Type

When Variant type is supported (`readerFeatures` field of a table's `protocol` action contains `variantType`), readers:
- must be able to read the two parquet struct fields, `value` and `metadata` and interpret them as a Variant in concordance with the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As worded here, this "must be able to read... and interpret... in accordance with the... specification" would seem to contradict the next line that allows to not fully support the logical datatype?

I guess the main requirements are really:

  1. Reader must recognize and tolerate a variant data type in a Delta schema
  2. Reader must use the correct physical schema (struct-of-binary) when reading a Variant column from file
  3. Reader must make the column available to queries:
    • [Recommended] Expose the logical Variant column, if the engine supports Variant.
    • [Alternate] Expose the raw physical column, e.g. if the engine does not support Variant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Updated.

Copy link
Contributor Author

@gene-db gene-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scovich Thanks! Updated the PR.


## Variant data in Parquet

The Variant data type is represented as two binary encoded values, according to the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

## Reader Requirements for Variant Data Type

When Variant type is supported (`readerFeatures` field of a table's `protocol` action contains `variantType`), readers:
- must be able to read the two parquet struct fields, `value` and `metadata` and interpret them as a Variant in concordance with the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Updated.

Comment on lines 59 to 60
Struct fields which start with `_` (underscore) can be safely ignored.
The only non-ignorable fields must be `value` and `metadata`.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I removed this unnecessary line.

Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One question.

- must use the correct physical schema (struct-of-binary, with fields `value` and `metadata`) when reading a Variant data type from file
- must make the column available to the engine:
- [Recommended] Expose and interpret the struct-of-binary as a single Variant field in accordance with the [Spark Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).
- [Alternate] Expose the raw physical struct-of-binary, e.g. if the engine does not support Variant.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it also an acceptable alternate, to expose a string column by internally converting the struct-of-binary? Gives up all the benefits of the variant encoding, but maximizes compat with engines that don't allow users to load the library code that would interpret the struct-of-binary directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that should work. I added that as another alternate.

@@ -22,6 +22,7 @@ Here is the history of all the RFCs propose/accepted/rejected since Feb 6, 2024,
| 2023-02-09 | [type-widening.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/type-widening.md) | https://github.com/delta-io/delta/issues/2623 | Type Widening |
| 2023-02-14 | [managed-commits.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/managed-commits.md) | https://github.com/delta-io/delta/issues/2598 | Managed Commits |
| 2023-02-26 | [column-mapping-usage.tracking.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/column-mapping-usage-tracking.md)) | https://github.com/delta-io/delta/issues/2682 | Column Mapping Usage Tracking |
| 2023-04-08 | [variant-type.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/variant-type.md) | https://github.com/delta-io/delta/issues/2864 | Variant Data Type |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the gap in this date, and when we are ready to merge is large. can you update this to todays; date

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the date.

@gene-db gene-db requested a review from tdas April 25, 2024 02:53
@tdas tdas merged commit 1e2c74f into delta-io:master Apr 25, 2024
7 of 8 checks passed
scottsand-db pushed a commit that referenced this pull request Apr 25, 2024
…2923)

<!--
Thanks for sending a pull request!  Here are some tips for you:
1. If this is your first time, please read our contributor guidelines:
https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md
2. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]
Your PR title ...'.
  3. Be sure to keep the PR description updated to reflect all changes.
  4. Please write your PR title to summarize what this PR proposes.
5. If possible, provide a concise example to reproduce the issue for a
faster review.
6. If applicable, include the corresponding issue number in the PR title
and link it in the body.
-->

#### Which Delta project/connector is this regarding?
<!--
Please add the component selected below to the beginning of the pull
request title
For example: [Spark] Title of my pull request
-->

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

<!--
- Describe what this PR changes.
- Describe why we need the change.
 
If this PR resolves an issue be sure to include "Resolves #XXX" to
correctly link and close the issue upon merge.
-->

Adds the variant table feature to minimally implement the variant type
as described in the RFC in #2867.

Also disables using variant columns as partition columns.

## How was this patch tested?

Added some UTs. More UTs will be merged in followup PRs

tested against both spark 3.5 and 4.0 snapshot with
```
build/sbt -DsparkVersion=latest spark/'testOnly org.apache.spark.sql.delta.DeltaVariantSuite'
build/sbt -DsparkVersion=master spark/'testOnly org.apache.spark.sql.delta.DeltaVariantSuite'
```


## Does this PR introduce _any_ user-facing changes?

<!--
If yes, please clarify the previous behavior and the change this PR
proposes - provide the console output, description and/or an example to
show the behavior difference if possible.
If possible, please also clarify if this is a user-facing change
compared to the released Delta Lake versions or within the unreleased
branches such as master.
If no, write 'No'.
-->

no
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants