From 9d8037c146184221e5678d94d365a7b768495e80 Mon Sep 17 00:00:00 2001 From: Yan Yan Date: Fri, 8 Jan 2021 17:16:29 -0800 Subject: [PATCH 1/3] add sort order to spec --- site/docs/spec.md | 46 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 46 insertions(+) diff --git a/site/docs/spec.md b/site/docs/spec.md index 367e10bf1a2e..404e101adb72 100644 --- a/site/docs/spec.md +++ b/site/docs/spec.md @@ -254,6 +254,24 @@ Notes: 2. The width, `W`, used to truncate decimal values is applied using the scale of the decimal column to avoid additional (and potentially conflicting) parameters. +### Sorting + +Users could sort their data within partitions by columns to gain performance. The information on how the data is sorted could be declared per data or delete file, by a **sort order**. + +A sort order is defined by an sort order id and a list of sort fields. The order of the sort fields within the list defines the order in which the sort is applied to the data. Each sort field consists of: + +* A **source column id** from the table's schema +* A **transform** that is used to produce values to be sorted on from the source column. This is the same transform as described in [partition transforms](#partition-transforms). +* A **sort direction**, that can only be either `asc` or `desc` +* A **null order** that describes the order of null values when sorted. Can only be either `nulls-first` or `nulls-last` + +Order id `0` is reserved for the unsorted order. + +A data or delete file is associated with a sort order by the sort order's id within [a manifest](#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. 
This default could be overridden per file basis if the file is sorted differently, such as if the engine is incapable of ensure ordering of the data on write, the generated files should be annotated with sort order id 0 (unsorted). + +Note that only data files and equality delete files could have sort order. [Position deletes](#position-delete-files) should not have sort order, since they have their own sorting requirements. + + ### Manifests A manifest is an immutable Avro file that lists data files or delete files, along with each file’s partition data tuple, metrics, and tracking information. One or more manifest files are used to store a [snapshot](#snapshots), which tracks all of the files in a table at some point in time. Manifests are tracked by a [manifest list](#manifest-lists) for each table snapshot. @@ -305,6 +323,7 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo | _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption | | _optional_ | _optional_ | **`132 split_offsets`** | `list<133: long>` | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending | | | _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file | +| _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file. | Notes: @@ -480,6 +499,8 @@ Table metadata consists of the following fields: | _optional_ | _optional_ | **`current-snapshot-id`**| `long` ID of the current table snapshot. | | _optional_ | _optional_ | **`snapshots`**| A list of valid snapshots. Valid snapshots are snapshots for which all data files exist in the file system. 
A data file must not be deleted from the file system until the last snapshot in which it was listed is garbage collected. | | _optional_ | _optional_ | **`snapshot-log`**| A list (optional) of timestamp and snapshot ID pairs that encodes changes to the current snapshot for the table. Each time the current-snapshot-id is changed, a new entry should be added with the last-updated-ms and the new current-snapshot-id. When snapshots are expired from the list of valid snapshots, all entries before a snapshot that has expired should be removed. | +| _optional_ | _optional_ | **`sort-orders`**| A list of sort orders, stored as full sort order objects. | +| _optional_ | _optional_ | **`default-sort-order-id`**| Default sort order id of the table. Note that this could be used by writers, but is not used when reading because reads use the specs stored in manifest files. | For serialization details, see Appendix C. @@ -831,6 +852,29 @@ Each partition field in the fields list is stored as an object. See the table fo In some cases partition specs are stored using only the field list instead of the object format that includes the spec ID, like the deprecated `partition-spec` field in table metadata. The object format should be used unless otherwise noted in this spec. +### Sort Orders + +Sort orders are serialized as a list of JSON objects, each of which contains the following fields: + +|Field|JSON representation|Example| +|--- |--- |--- | +|**`order-id`**|`JSON int`|`1`| +|**`fields`**|`JSON list: [`
  `,`
  `...`
`]`|`[ {`
  ` "transform": "identity",`
  ` "source-id": 2,`
  ` "direction": "asc",`
  ` "null-order": "nulls-first"`
  `}, {`
  ` "transform": "bucket[4]",`
  ` "source-id": 3,`
  ` "direction": "desc",`
  ` "null-order": "nulls-last"`
`} ]`| + +Each sort field in the fields list is stored as an object with the following properties: + +|Field|JSON representation|Example| +|--- |--- |--- | +|**`Sort Field`**|`JSON object: {`
  `"transform": ,`
  `"source-id": ,`
  `"direction": ,`
  `"null-order": `
`}`|`{`
  ` "transform": "bucket[4]",`
  ` "source-id": 3,`
  ` "direction": "desc",`
  ` "null-order": "nulls-last"`
`}`| + +The following table describes the possible values for some of the fields within a sort field: + +|Field|JSON representation|Possible values| +|--- |--- |--- | +|**`direction`**|`JSON string`|`"asc", "desc"`| +|**`null-order`**|`JSON string`|`"nulls-first", "nulls-last"`| + + ### Table Metadata and Snapshots Table metadata is serialized as a JSON object according to the following table. Snapshots are not serialized separately. Instead, they are stored in the table metadata JSON. @@ -850,6 +894,8 @@ Table metadata is serialized as a JSON object according to the following table. |**`current-snapshot-id`**|`JSON long`|`3051729675574597004`| |**`snapshots`**|`JSON list of objects: [ {`
  `"snapshot-id": ,`
  `"timestamp-ms": ,`
  `"summary": {`
    `"operation": ,`
    `... },`
  `"manifest-list": ""`
  `},`
  `...`
`]`|`[ {`
  `"snapshot-id": 3051729675574597004,`
  `"timestamp-ms": 1515100955770,`
  `"summary": {`
    `"operation": "append"`
  `},`
  `"manifest-list": "s3://b/wh/.../s1.avro"`
`} ]`| |**`snapshot-log`**|`JSON list of objects: [`
  `{`
  `"snapshot-id": ,`
  `"timestamp-ms": `
  `},`
  `...`
`]`|`[ {`
  `"snapshot-id": 30517296...,`
  `"timestamp-ms": 1515100...`
`} ]`| +|**`sort-orders`**|`JSON sort orders (list of sort field object)`|`See above`| +|**`default-sort-order-id`**|`JSON int`|`0`| ## Appendix D: Single-value serialization From 796a91b711a56b83f0c87ba40eb0085f097002dc Mon Sep 17 00:00:00 2001 From: Yan Yan Date: Mon, 11 Jan 2021 17:27:39 -0800 Subject: [PATCH 2/3] update based on comments --- site/docs/spec.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/site/docs/spec.md b/site/docs/spec.md index 404e101adb72..51717de40c75 100644 --- a/site/docs/spec.md +++ b/site/docs/spec.md @@ -267,9 +267,7 @@ A sort order is defined by an sort order id and a list of sort fields. The order Order id `0` is reserved for the unsorted order. -A data or delete file is associated with a sort order by the sort order's id within [a manifest](#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. This default could be overridden per file basis if the file is sorted differently, such as if the engine is incapable of ensure ordering of the data on write, the generated files should be annotated with sort order id 0 (unsorted). - -Note that only data files and equality delete files could have sort order. [Position deletes](#position-delete-files) should not have sort order, since they have their own sorting requirements. +A data or delete file is associated with a sort order by the sort order's id within [a manifest](#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes. 
### Manifests @@ -323,12 +321,14 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo | _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption | | _optional_ | _optional_ | **`132 split_offsets`** | `list<133: long>` | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending | | | _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file | -| _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file. | +| _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [2]. | Notes: 1. Single-value serialization for lower and upper bounds is detailed in Appendix D. +2. If sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files could have valid sort orders, and [position deletes](#position-delete-files) are required to be sorted by file and position. The manifest should not be written with an order ID for position delete files, and readers must ignore this field for those files. + The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec used to write the manifest file. In v2, the partition struct's field ids must match the ids from the partition spec. The column metrics maps are used when filtering to select both data and delete files. For delete files, the metrics must store bounds and counts for all deleted rows, or must be omitted. Storing metrics for deleted rows ensures that the values can be used during job planning to find delete files that must be merged during a scan. 
From 1c76ebd15bb8c1fb2685f1239c08da8196826936 Mon Sep 17 00:00:00 2001 From: Yan Yan Date: Tue, 19 Jan 2021 17:10:22 -0800 Subject: [PATCH 3/3] update based on comments --- site/docs/spec.md | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/site/docs/spec.md b/site/docs/spec.md index 51717de40c75..98bd90f15f1c 100644 --- a/site/docs/spec.md +++ b/site/docs/spec.md @@ -256,7 +256,7 @@ Notes: ### Sorting -Users could sort their data within partitions by columns to gain performance. The information on how the data is sorted could be declared per data or delete file, by a **sort order**. +Users can sort their data within partitions by columns to gain performance. The information on how the data is sorted can be declared per data or delete file, by a **sort order**. A sort order is defined by an sort order id and a list of sort fields. The order of the sort fields within the list defines the order in which the sort is applied to the data. Each sort field consists of: @@ -267,6 +267,8 @@ A sort order is defined by an sort order id and a list of sort fields. The order Order id `0` is reserved for the unsorted order. +Sorting floating-point numbers should produce the following behavior: `-NaN` < `-Infinity` < `-value` < `-0` < `0` < `value` < `Infinity` < `NaN`. This aligns with the implementation of Java floating-point types comparisons. + A data or delete file is associated with a sort order by the sort order's id within [a manifest](#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes. @@ -327,7 +329,7 @@ Notes: 1. Single-value serialization for lower and upper bounds is detailed in Appendix D. -2. 
If sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files could have valid sort orders, and [position deletes](#position-delete-files) are required to be sorted by file and position. The manifest should not be written with an order ID for position delete files, and readers must ignore this field for those files. +2. If sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files should be written with a non-null order id. [Position deletes](#position-delete-files) are required to be sorted by file and position, not a table order, and should set sort order id to null. Readers must ignore sort order id for position delete files. The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec used to write the manifest file. In v2, the partition struct's field ids must match the ids from the partition spec. @@ -499,8 +501,8 @@ Table metadata consists of the following fields: | _optional_ | _optional_ | **`current-snapshot-id`**| `long` ID of the current table snapshot. | | _optional_ | _optional_ | **`snapshots`**| A list of valid snapshots. Valid snapshots are snapshots for which all data files exist in the file system. A data file must not be deleted from the file system until the last snapshot in which it was listed is garbage collected. | | _optional_ | _optional_ | **`snapshot-log`**| A list (optional) of timestamp and snapshot ID pairs that encodes changes to the current snapshot for the table. Each time the current-snapshot-id is changed, a new entry should be added with the last-updated-ms and the new current-snapshot-id. When snapshots are expired from the list of valid snapshots, all entries before a snapshot that has expired should be removed. | -| _optional_ | _optional_ | **`sort-orders`**| A list of sort orders, stored as full sort order objects. 
| -| _optional_ | _optional_ | **`default-sort-order-id`**| Default sort order id of the table. Note that this could be used by writers, but is not used when reading because reads use the specs stored in manifest files. | +| _optional_ | _required_ | **`sort-orders`**| A list of sort orders, stored as full sort order objects. | +| _optional_ | _required_ | **`default-sort-order-id`**| Default sort order id of the table. Note that this could be used by writers, but is not used when reading because reads use the specs stored in manifest files. | For serialization details, see Appendix C. @@ -947,5 +949,7 @@ Writing v2 metadata: * Snapshot added required field field `sequence-number`. * Snapshot now requires field `manifest-list`. * Snapshot field `manifests` is no longer allowed. +* Table metadata now requires field `sort-orders`. +* Table metadata now requires field `default-sort-order-id`. Note that these requirements apply when writing data to a v2 table. Tables that are upgraded from v1 may contain metadata that does not follow these requirements. Implementations should remain backward-compatible with v1 metadata requirements.
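
For readers of this patch series, the sort order semantics described in the Sorting section (fields applied in list order, each with its own `direction` and `null-order`) can be sketched as below. This is an illustrative sketch only, not the reference implementation: `SORT_ORDER`, `compare_values`, and `sort_rows` are hypothetical names, rows are modeled as dicts keyed by source column id, only the `identity` transform is handled, and it assumes that `null-order` places nulls independently of `direction`.

```python
from functools import cmp_to_key

# Hypothetical sort order, in the JSON shape shown in the "Sort Orders" appendix.
SORT_ORDER = {
    "order-id": 1,
    "fields": [
        {"transform": "identity", "source-id": 2,
         "direction": "asc", "null-order": "nulls-first"},
        {"transform": "identity", "source-id": 3,
         "direction": "desc", "null-order": "nulls-last"},
    ],
}


def compare_values(a, b, direction, null_order):
    """Compare two values for a single sort field."""
    # Assumption: null placement is decided by null-order alone,
    # regardless of the sort direction.
    if a is None and b is None:
        return 0
    if a is None:
        return -1 if null_order == "nulls-first" else 1
    if b is None:
        return 1 if null_order == "nulls-first" else -1
    result = (a > b) - (a < b)
    return result if direction == "asc" else -result


def sort_rows(rows, sort_order):
    """Sort rows (dicts keyed by column id) by each sort field in list order."""
    def compare_rows(left, right):
        for field in sort_order["fields"]:
            if field["transform"] != "identity":
                # Other transforms (e.g. bucket[4]) are out of scope for this sketch.
                raise NotImplementedError(field["transform"])
            column_id = field["source-id"]
            result = compare_values(left.get(column_id), right.get(column_id),
                                    field["direction"], field["null-order"])
            if result != 0:
                return result
        return 0

    # sorted() is stable, so an empty field list (the unsorted order)
    # leaves the input order unchanged.
    return sorted(rows, key=cmp_to_key(compare_rows))
```

With an empty `fields` list, as for the reserved unsorted order id `0`, `sort_rows` returns the rows in their original order.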