doc(book): describe details about derived identifiers
Explain that it is possible to automatically derive identifiers instead
of always explicitly defining them. Also, move some encoding details
from the ideas section to the wire format as those are already
implemented.
dnaka91 committed Jan 9, 2024
1 parent d27361e commit 616dfbf
Showing 3 changed files with 47 additions and 22 deletions.
19 changes: 0 additions & 19 deletions book/src/ideas.md
@@ -31,22 +31,3 @@ When decoding a value, it may contain new fields and enum variants that are not
The same can happen the other way around. For example, if the data was saved in some form of storage and the schema evolved in the meantime, the decoder might encounter old data that lacks the newer content.

In both cases, the schema must be able to handle missing or unknown fields. Several rules must be upheld when updating a schema, to ensure it is both forward and backward compatible.

### Skip fields without knowing the exact type

This section explains how a decoder is able to process payloads that contain newer or unknown fields, given these were introduced in a backward compatible way.

Without the new schema it's not possible to make decisions about the data that follows after a field identifier. To work around this, reduced information can be encoded into the identifier.

Only a few details are important for the decoder to proceed, not needing full type information:

- Is the value a variable integer?
  - Skip over individual bytes until the end marker is found.
- Is the value length delimited?
  - Parse the delimiter, which is always a _varint_, and skip over the length.
- Is the value a nested struct or enum?
  - Step into the nested type and skip over all its fields.
- Is the value of fixed length?
  - Skip over the fixed length of 1 (`bool`, `u8` and `i8`), 4 (`f32`) or 8 (`f64`) bytes.

Furthermore, this information is only needed for direct elements of a struct or enum variant, as that is enough to skip over the whole field. Types nested inside another, like a `vec<u32>` for example, don't need to provide this information for each element again.
36 changes: 35 additions & 1 deletion book/src/reference/schema/index.md
@@ -148,7 +148,41 @@ Byte arrays are mutable in other languages as well, but they don't have a reason

Identifiers are an integral part of schemas and are attached to named and unnamed fields inside a struct or enum.

As the wire format doesn't contain any field names, fields have to be identified in some way. This is done by identifiers, which are [varint](../wire-format#varint-encoding) encoded integers.
As the wire format doesn't contain any field names, fields have to be identified in some way. This is done by identifiers, which are [Varint](../wire-format#varint-encoding) encoded **32-bit unsigned integers**.

Depending on the type of identifier (field or variant), they might carry some additional information. This is further explained in the [Wire Format](../wire-format).

### Deriving identifiers

Similar to classic enums in most languages, the identifiers can be omitted. In that case the compiler derives the identifiers automatically. Explicit and derived identifiers can be mixed freely within the same struct or enum.

Whenever an identifier is explicitly defined, it becomes the starting point for deriving the identifier that follows it. After all, it's just an integer counter.

::: info
Identifiers don't have to be strictly increasing. They can appear in any order and jump between ranges, like `1, 100, 5, 200, ...`

The only requirement is that they are unique within a struct or enum variant (for fields), and unique within an enum (for variants).
:::

For example, the following schema applies a mix of explicitly defined and derived identifiers on a single struct:

```mabo
struct Sample {
    field1: u32,
    field2: u32 @100,
    field3: u32,
    field4: u32 @10,
    field5: u32,
}
```

The final identifiers are as follows:

- field1: `1` as the minimum identifier is one.
- field2: `100` because it's explicitly defined.
- field3: `101` the next value after 100.
- field4: `10` explicitly defined again.
- field5: `11` the next value after 10.
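
The derivation rule can be sketched in a few lines of Rust. This is only an illustration of the counter behavior described above, not the compiler's actual code: the function name `derive_ids` and the representation of fields as `Option<u32>` (with `None` meaning "no explicit identifier") are assumptions made for this example.

```rust
/// Hypothetical helper: compute the final identifiers for a list of fields,
/// where `Some(n)` is an explicit `@n` and `None` means "derive it".
fn derive_ids(explicit: &[Option<u32>]) -> Vec<u32> {
    // The minimum identifier is one, so the counter starts there.
    let mut next: u32 = 1;

    explicit
        .iter()
        .map(|id| {
            // Use the explicit identifier if present, otherwise the counter.
            let current = id.unwrap_or(next);
            // Whatever was used becomes the base for the next derived value.
            // `saturating_add` only avoids an overflow panic in this sketch;
            // how the real compiler treats `u32::MAX` is not covered here.
            next = current.saturating_add(1);
            current
        })
        .collect()
}
```

Calling `derive_ids(&[None, Some(100), None, Some(10), None])` yields `[1, 100, 101, 10, 11]`, matching the list above. The sketch does not check uniqueness, which a real implementation would still have to enforce.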

## Naming

14 changes: 12 additions & 2 deletions book/src/reference/wire-format.md
@@ -55,7 +55,7 @@ Both tuples and arrays have a known length as defined in the schema. Therefore,

## Identifiers

Identifiers are an essential part of the format. They mark the start of a field or enum variant and decribe which one it is, so the decoder knows how to parse the following data and assign it to the right element of a struct or enum.
Identifiers are an essential part of the format. They mark the start of a field or enum variant and describe which one it is, so the decoder knows how to parse the following data and assign it to the right element of a struct or enum.

These IDs are regular **32-bit unsigned integers**, and may encode additional information together with the field or variant number.

@@ -75,8 +75,18 @@ This encoding marker is placed in the first 3 bits and the field number in shift

It means the maximum possible field number is **2<sup>29</sup> - 1** (**536,870,911**) instead of the integer type's maximum of **2<sup>32</sup> - 1** (**4,294,967,295**). This amount is still sufficient and very unlikely to ever be reached, as it is not considered realistic to have a struct or enum variant with that many fields.

The possible encodings are:

- `0`/`b000` Variable integer: Skip over individual bytes until the end marker is found.
- `1`/`b001` Length delimited: Parse the delimiter, which is always a _varint_, and skip over the length.
- Is the value a nested struct or enum?
- Step into the nested type and skip over all its fields.
- `2`/`b010` Fixed 1-byte length: Skip over the fixed length of 1 byte (`bool`, `u8` and `i8`).
- `3`/`b011` Fixed 4-byte length: Skip over the fixed length of 4 bytes (`f32`).
- `4`/`b100` Fixed 8-byte length: Skip over the fixed length of 8 bytes (`f64`).
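
As a rough sketch of how a decoder might use these markers to skip an unknown field, consider the Rust code below. Several things are assumptions for illustration only: the names `Encoding`, `split_field_id`, `read_varint` and `skip_value` are hypothetical, the marker is assumed to sit in the lowest 3 bits with the field number shifted left by 3, and the varint is assumed to use a continuation bit in the high bit of each byte (LEB128 style). The nested struct or enum case is left out because the list assigns no marker value to it.

```rust
use std::convert::TryFrom;

/// Encoding markers as listed above (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Encoding {
    Varint,
    LengthDelimited,
    Fixed1,
    Fixed4,
    Fixed8,
}

/// Split a raw field identifier into (field number, encoding).
/// Assumption: the marker occupies the lowest 3 bits.
fn split_field_id(raw: u32) -> Option<(u32, Encoding)> {
    let encoding = match raw & 0b111 {
        0 => Encoding::Varint,
        1 => Encoding::LengthDelimited,
        2 => Encoding::Fixed1,
        3 => Encoding::Fixed4,
        4 => Encoding::Fixed8,
        _ => return None, // markers 5..=7 are not listed
    };
    Some((raw >> 3, encoding))
}

/// Read a varint, returning (value, bytes consumed).
/// Assumption: 7 payload bits per byte, high bit set on all but the last.
fn read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, byte) in buf.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None // ran out of input or the varint is too long
}

/// Number of bytes to skip for a value with the given encoding,
/// or `None` if the buffer is too short or malformed.
fn skip_value(buf: &[u8], encoding: Encoding) -> Option<usize> {
    let len = match encoding {
        Encoding::Varint => read_varint(buf)?.1,
        Encoding::LengthDelimited => {
            let (len, used) = read_varint(buf)?;
            used.checked_add(usize::try_from(len).ok()?)?
        }
        Encoding::Fixed1 => 1,
        Encoding::Fixed4 => 4,
        Encoding::Fixed8 => 8,
    };
    (len <= buf.len()).then_some(len)
}
```

Under these assumptions, a raw identifier of `(5 << 3) | 3` splits into field number `5` with the fixed 4-byte encoding, and `u32::MAX >> 3` gives the maximum field number of 536,870,911 mentioned above.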

### Variant identifiers

The variant identifiers currently don't carry any additional information and encode the number as is.

Therefore the current maximum possible variant number is **2<sup>32</sup> - 1** (**4,294,967,295**), although unlikely to ever be reached when using sequential numbers without gaps.
Therefore the current maximum possible variant number is **2<sup>32</sup> - 1** (**4,294,967,295**), although unlikely to ever be reached when using sequential numbers without gaps.
