Add MetadataDomain action to delta spec by Jintao Shen
dabao521 committed May 5, 2023
1 parent 5c3f4d3 commit c30d3da
Showing 1 changed file with 39 additions and 4 deletions.
43 changes: 39 additions & 4 deletions PROTOCOL.md
@@ -24,6 +24,7 @@
- [Protocol Evolution](#protocol-evolution)
- [Commit Provenance Information](#commit-provenance-information)
- [Increase Row ID High-Water Mark](#increase-row-id-high-watermark)
- [Domain Metadata](#domain-metadata)
- [Action Reconciliation](#action-reconciliation)
- [Table Features](#table-features)
- [Table Features for new and Existing Tables](#table-features-for-new-and-existing-tables)
@@ -261,7 +262,7 @@ uppercase and lowercase as part of percent-encoding. Thus, we require a stricter
Actions modify the state of the table and they are stored both in delta files and in checkpoints.
This section lists the space of available actions as well as their schema.

### Change Metadata
The `metaData` action changes the current metadata of the table.
The first version of a table must contain a `metaData` action.
Subsequent `metaData` actions completely overwrite the current metadata of the table.
@@ -564,22 +565,54 @@ The following is an example `rowIdHighWaterMark` action:
}
```

### Domain Metadata
The domain metadata action contains a configuration (string-string map) for a named metadata domain. Two overlapping transactions conflict if they both contain a domain metadata action for the same metadata domain.
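The conflict rule above can be illustrated with a minimal sketch. It assumes each transaction is represented as a list of action dictionaries in the JSON shape used by this spec; the function name `conflicting_domains` is hypothetical and not part of any Delta client API.

```python
def conflicting_domains(txn_a_actions, txn_b_actions):
    """Return the metadata domains written by both overlapping transactions.

    A non-empty result means the two transactions conflict under the
    domain metadata rule. Other conflict rules are out of scope here.
    """
    def domains(actions):
        # Collect the domain name of every domainMetadata action.
        return {
            action["domainMetadata"]["domain"]
            for action in actions
            if "domainMetadata" in action
        }

    return domains(txn_a_actions) & domains(txn_b_actions)


txn_a = [{"domainMetadata": {"domain": "com.example.app", "configuration": {}, "removed": False}}]
txn_b = [{"domainMetadata": {"domain": "com.example.app", "configuration": {"k": "v"}, "removed": False}}]
txn_c = [{"domainMetadata": {"domain": "org.example.other", "configuration": {}, "removed": False}}]

print(conflicting_domains(txn_a, txn_b))  # the shared domain: conflict
print(conflicting_domains(txn_a, txn_c))  # empty set: no conflict on this rule
```

Note that a tombstone (`removed` set to `true`) is still a domain metadata action for its domain, so this sketch treats it as conflicting too.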

There are two types of metadata domains:
1. **User-controlled metadata domains** have names that start with anything other than the `delta.` prefix. Any Delta client implementation or user application can modify these metadata domains, and can allow users to modify them arbitrarily. Delta clients and user applications are encouraged to use a naming convention designed to avoid conflicts with other clients' or users' metadata domains (e.g. `com.databricks.*` or `org.apache.*`).
2. **System-controlled metadata domains** have names that start with the `delta.` prefix. Only Delta client implementations are allowed to modify the metadata for system-controlled domains. A Delta client implementation should only update metadata for system-controlled domains that it knows about and understands. System-controlled metadata domains are used by various table features, and each table feature may impose additional semantics on the metadata domains it uses. The `delta.` prefix is reserved for metadata domains mentioned in the Delta spec (e.g. as part of some table feature).
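As a rough sketch of how a client might apply the two domain types above, the following classifies a domain by its prefix and decides whether a caller may modify it. The names `is_system_controlled`, `may_modify`, and `understood_system_domains` are hypothetical, chosen only for illustration.

```python
SYSTEM_PREFIX = "delta."  # reserved prefix for system-controlled domains


def is_system_controlled(domain: str) -> bool:
    """True if the domain name starts with the reserved `delta.` prefix."""
    return domain.startswith(SYSTEM_PREFIX)


def may_modify(domain: str,
               is_delta_client: bool,
               understood_system_domains: frozenset = frozenset()) -> bool:
    """Decide whether a caller may modify the given metadata domain.

    User-controlled domains may be modified by anyone. System-controlled
    domains may only be modified by a Delta client implementation, and
    only when that client knows about and understands the domain.
    """
    if not is_system_controlled(domain):
        return True
    return is_delta_client and domain in understood_system_domains


known = frozenset({"delta.deltaTableFeatureX"})
print(may_modify("com.databricks.example", is_delta_client=False))
print(may_modify("delta.deltaTableFeatureX", True, known))
print(may_modify("delta.unknownFeature", True, known))
```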

The schema of the `domainMetadata` action is as follows:

Field Name | Data Type | Description
-|-|-
domain | String | Identifier for this domain (system- or user-provided)
configuration | Map[String, String] | A map containing configuration for the metadata domain
removed | Boolean | When `true`, the action serves as a tombstone to logically delete a metadata domain

Enablement:
- The table must be on Writer Version 7.
- The feature name `domainMetadata` must exist in the table's `writerFeatures`.
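The two enablement conditions can be checked against a table's `protocol` action. The sketch below assumes the protocol is available as a dictionary with the `minWriterVersion` and `writerFeatures` fields; the helper name `domain_metadata_enabled` is hypothetical.

```python
def domain_metadata_enabled(protocol: dict) -> bool:
    """Check both enablement conditions for domain metadata.

    `protocol` is assumed to be the body of the table's `protocol` action.
    """
    on_writer_version_7 = protocol.get("minWriterVersion") == 7
    # writerFeatures may be absent or null on tables without table features.
    writer_features = protocol.get("writerFeatures") or []
    return on_writer_version_7 and "domainMetadata" in writer_features


print(domain_metadata_enabled(
    {"minReaderVersion": 1, "minWriterVersion": 7, "writerFeatures": ["domainMetadata"]}
))
print(domain_metadata_enabled({"minReaderVersion": 1, "minWriterVersion": 2}))
```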

The following is an example `domainMetadata` action:
```json
{
  "domainMetadata": {
    "domain": "delta.deltaTableFeatureX",
    "configuration": {"key1": "..."},
    "removed": false
  }
}
```
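A writer could assemble such an action as follows. This is a minimal sketch using only the three fields from the schema table; the helper name `domain_metadata_action` and the example domain `com.example.app` are hypothetical.

```python
import json


def domain_metadata_action(domain: str, configuration: dict, removed: bool = False) -> dict:
    """Build a `domainMetadata` action as a JSON-serializable dictionary."""
    # The schema requires a string-to-string map for the configuration.
    for key, value in configuration.items():
        if not isinstance(key, str) or not isinstance(value, str):
            raise TypeError("configuration must map strings to strings")
    return {
        "domainMetadata": {
            "domain": domain,
            "configuration": dict(configuration),
            "removed": removed,
        }
    }


action = domain_metadata_action("com.example.app", {"key1": "value1"})
tombstone = domain_metadata_action("com.example.app", {}, removed=True)

# One JSON object per line, as actions are written in delta files.
print(json.dumps(action))
print(json.dumps(tombstone))
```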

# Action Reconciliation
A given snapshot of the table can be computed by replaying the events committed to the table in ascending order by commit version. A given snapshot of a Delta table consists of:

- A single `protocol` action
- A single `metaData` action
- At most one `rowIdHighWaterMark` action
- A collection of `txn` actions with unique `appId`s
- A collection of `domainMetadata` actions with unique `domain`s.
- A collection of `add` actions with unique `(path, deletionVector.uniqueId)` keys.
- A collection of `remove` actions with unique `(path, deletionVector.uniqueId)` keys. The intersection of the primary keys in the `add` collection and `remove` collection must be empty. That means a logical file cannot exist in both the `remove` and `add` collections at the same time; however, the same *data file* can exist with *different* DVs in the `remove` collection, as logically they represent different content. The `remove` actions act as _tombstones_, and only exist for the benefit of the VACUUM command. Snapshot reads only return `add` actions on the read path.

To achieve the requirements above, related actions from different delta files need to be reconciled with each other:

- The latest `protocol` action seen wins
- The latest `metaData` action seen wins
- The latest `rowIdHighWaterMark` action seen wins
- For `txn` actions, the latest `version` seen for a given `appId` wins
- For `domainMetadata`, the latest `domainMetadata` seen for a given `domain` wins. The actions with `removed=true` act as tombstones to suppress earlier versions. Snapshot reads do _not_ return removed `domainMetadata` actions.
- Logical files in a table are identified by their `(path, deletionVector.uniqueId)` primary key. File actions (`add` or `remove`) reference logical files, and a log can contain any number of references to a single file.
- To replay the log, scan all file actions and keep only the newest reference for each logical file.
- `add` actions in the result identify logical files currently present in the table (for queries). `remove` actions in the result identify tombstones of logical files no longer present in the table (for VACUUM).
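The `domainMetadata` rule in the list above can be sketched as a simple replay. The sketch assumes commits are supplied in ascending version order, each as a list of action dictionaries, and that a later action within the same commit overrides an earlier one for the same domain (an assumption of this sketch, not something the text above states). The function name `reconcile_domain_metadata` is hypothetical.

```python
def reconcile_domain_metadata(commits):
    """Replay commits and return the live domain metadata by domain name.

    `commits` is an iterable of action lists, ordered by ascending
    commit version.
    """
    latest = {}
    for actions in commits:
        for action in actions:
            dm = action.get("domainMetadata")
            if dm is not None:
                # The latest action seen for a given domain wins.
                latest[dm["domain"]] = dm
    # Tombstones suppress earlier versions and are not returned to readers.
    return {
        domain: dm for domain, dm in latest.items()
        if not dm.get("removed", False)
    }


log = [
    [{"domainMetadata": {"domain": "com.example.a", "configuration": {"v": "1"}, "removed": False}},
     {"domainMetadata": {"domain": "com.example.b", "configuration": {"v": "1"}, "removed": False}}],
    [{"domainMetadata": {"domain": "com.example.a", "configuration": {"v": "2"}, "removed": False}}],
    [{"domainMetadata": {"domain": "com.example.b", "configuration": {}, "removed": True}}],
]

snapshot = reconcile_domain_metadata(log)
print(snapshot)
```

In this example only `com.example.a` survives, with the configuration from the second commit, because the third commit writes a tombstone for `com.example.b`.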
@@ -861,6 +894,7 @@ Checkpoint files must be written in [Apache Parquet](https://parquet.apache.org/
* The [metadata](#Change-Metadata) of the table
* Files that have been [added and removed](#Add-File-and-Remove-File)
* [Transaction identifiers](#Transaction-Identifiers)
* [Domain Metadata](#Domain-Metadata)

Commit provenance information does not need to be included in the checkpoint. Each of these actions is stored in its own Parquet column as a struct field.

@@ -1019,6 +1053,7 @@ Feature | Name | Readers or Writers?
[Deletion Vectors](#deletion-vectors) | `deletionVectors` | Readers and writers
[Row IDs](#row-ids) | `rowIds` | Writers only
[Timestamp without Timezone](#timestamp-ntz) | `timestampNTZ` | Readers and writers
[Domain Metadata](#domain-metadata) | `domainMetadata` | Writers only

## Deletion Vector Format
