Skip to content

Commit

Permalink
[PROTOCOL RFC] Variant data type (delta-io#2867)
Browse files Browse the repository at this point in the history
#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Adds the protocol changes for the Variant data type (see
delta-io#2864) to the RFC folder.

## How was this patch tested?

N/A

## Does this PR introduce _any_ user-facing changes?

N/A
  • Loading branch information
gene-db committed Apr 25, 2024
1 parent 88c03d3 commit 1e2c74f
Show file tree
Hide file tree
Showing 2 changed files with 100 additions and 0 deletions.
1 change: 1 addition & 0 deletions protocol_rfcs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ Here is the history of all the RFCs propose/accepted/rejected since Feb 6, 2024,
| 2023-02-09 | [type-widening.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/type-widening.md) | https://github.com/delta-io/delta/issues/2623 | Type Widening |
| 2023-02-14 | [managed-commits.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/managed-commits.md) | https://github.com/delta-io/delta/issues/2598 | Managed Commits |
| 2023-02-26 | [column-mapping-usage.tracking.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/column-mapping-usage-tracking.md)) | https://github.com/delta-io/delta/issues/2682 | Column Mapping Usage Tracking |
| 2023-04-24 | [variant-type.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/variant-type.md) | https://github.com/delta-io/delta/issues/2864 | Variant Data Type |

### Accepted RFCs

Expand Down
99 changes: 99 additions & 0 deletions protocol_rfcs/variant-type.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,99 @@
# Variant Data Type
**Associated Github issue for discussions: https://github.com/delta-io/delta/issues/2864**

This protocol change adds support for the Variant data type.
The Variant data type is beneficial for storing and processing semi-structured data.

--------

> ***New Section after the [Clustered Table](#clustered-table) section***
# Variant Data Type

This feature enables support for the Variant data type, for storing semi-structured data.
The schema serialization method is described in [Schema Serialization Format](#schema-serialization-format).

To support this feature:
- The table must be on Reader Version 3 and Writer Version 7
- The feature `variantType` must exist in the table `protocol`'s `readerFeatures` and `writerFeatures`.

## Example JSON-Encoded Delta Table Schema with Variant types

```
{
"type" : "struct",
"fields" : [ {
"name" : "raw_data",
"type" : "variant",
"nullable" : true,
"metadata" : { }
}, {
"name" : "variant_array",
"type" : {
"type" : "array",
"elementType" : {
"type" : "variant"
},
"containsNull" : false
},
"nullable" : false,
"metadata" : { }
} ]
}
```

## Variant data in Parquet

The Variant data type is represented as two binary encoded values, according to the [Spark Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).
The two binary values are named `value` and `metadata`.

When writing Variant data to parquet files, the Variant data is written as a single Parquet struct, with the following fields:

Struct field name | Parquet primitive type | Description
-|-|-
value | binary | The binary-encoded Variant value, as described in [Variant binary encoding](https://github.com/apache/spark/blob/master/common/variant/README.md)
metadata | binary | The binary-encoded Variant metadata, as described in [Variant binary encoding](https://github.com/apache/spark/blob/master/common/variant/README.md)

The parquet struct must include the two struct fields `value` and `metadata`.
Supported writers must write the two binary fields, and supported readers must read the two binary fields.
Struct fields which start with `_` (underscore) can be safely ignored.

## Writer Requirements for Variant Data Type

When Variant type is supported (`writerFeatures` field of a table's `protocol` action contains `variantType`), writers:
- must write a column of type `variant` to parquet as a struct containing the fields `value` and `metadata` and storing values that conform to the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md)
- must not write additional, non-ignorable parquet struct fields. Writing additional struct fields with names starting with `_` (underscore) is allowed.

## Reader Requirements for Variant Data Type

When Variant type is supported (`readerFeatures` field of a table's `protocol` action contains `variantType`), readers:
- must recognize and tolerate a `variant` data type in a Delta schema
- must use the correct physical schema (struct-of-binary, with fields `value` and `metadata`) when reading a Variant data type from file
- must make the column available to the engine:
- [Recommended] Expose and interpret the struct-of-binary as a single Variant field in accordance with the [Spark Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).
- [Alternate] Expose the raw physical struct-of-binary, e.g. if the engine does not support Variant.
- [Alternate] Convert the struct-of-binary to a string, and expose the string representation, e.g. if the engine does not support Variant.

## Compatibility with other Delta Features

Feature | Support for Variant Data Type
-|-
Partition Columns | **Supported:** A Variant column is allowed to be a non-partitioned column of a partitioned table. <br/> **Unsupported:** Variant is not a comparable data type, so it cannot be included in a partition column.
Clustered Tables | **Supported:** A Variant column is allowed to be a non-clustering column of a clustered table. <br/> **Unsupported:** Variant is not a comparable data type, so it cannot be included in a clustering column.
Delta Column Statistics | **Supported:** A Variant column supports the `nullCount` statistic. <br/> **Unsupported:** Variant is not a comparable data type, so a Variant column does not support the `minValues` and `maxValues` statistics.
Generated Columns | **Supported:** A Variant column is allowed to be used as a source in a generated column expression, as long as the Variant type is not the result type of the generated column expression. <br/> **Unsupported:** The Variant data type is not allowed to be the result type of a generated column expression.
Delta CHECK Constraints | **Supported:** A Variant column is allowed to be used for a CHECK constraint expression.
Default Column Values | **Supported:** A Variant column is allowed to have a default column value.
Change Data Feed | **Supported:** A table using the Variant data type is allowed to enable the Delta Change Data Feed.

--------

> ***New Sub-Section after the [Map Type](#map-type) sub-section within the [Schema Serialization Format](#schema-serialization-format) section***
### Variant Type

Variant data uses the Delta type name `variant` for Delta schema serialization.

Field Name | Description
-|-
type | Always the string "variant"

0 comments on commit 1e2c74f

Please sign in to comment.