Skip to content

Encoding

Tony Arcieri edited this page Jun 3, 2017 · 2 revisions

zser uses a binary encoding format similar to Google Protocol Buffers designed to have a compact, self-describing structure that supports "Merkleized" content authentication using a git-like tree hashing algorithm.

zsint: Prefix Varints with a 64-bit Range

Like Protocol Buffers, zser uses variable-width integers to provide a more compact encoding. Protocol Buffers use a format called [LEB128], a Base128 encoding not unlike a 128-bit version of Base64. This encoding uses the high bit to signal there are more bytes remaining, and serializes values in little endian order. This format can serialize integers of arbitrary size.

However, while LEB128 makes it easy to serialize numbers of any size, it makes it harder to serialize numbers of a fixed width. There isn't a natural way to cap LEB128 to 64-bits. A 64-bit number represented with LEB128 requires 10 bytes.

An alternative approach to a variable-width integer encoding is employed by formats like UTF-8. This format encodes the number of total bytes into the prefix byte of the integer: in UTF-8's case, by the number of leading 1 digits in the first byte.

zsints use a similar encoding: they include a count of the number of bytes in the number using the trailing zeroes in the first byte:

Prefix Precision Total Bytes
xxxxxxx1 7 bits 1 byte
xxxxxx10 14 bits 2 bytes
xxxxx100 21 bits 3 bytes
xxxx1000 28 bits 4 bytes
xxx10000 35 bits 5 bytes
xx100000 42 bits 6 bytes
x1000000 49 bits 7 bytes
10000000 56 bits 8 bytes
00000000 64 bits 9 bytes

Using trailing zeroes means we can take advantage of CPU instructions such as CTZ (count trailing zeroes) or BSF (bit scan forward) to count the number of zeroes in the prefix byte.

This approach also provides a natural way to encode a 64-bit integer, since 8 * 8 = 64. Eight zeroes in the first byte (i.e. if the first byte is 0) indicates the next 8 bytes form a 64-bit little endian integer value, allowing the full 64-bit range to be naturally expressed in 9-bytes at most.

Anything smaller is still encoded as little endian, but shifted over by the number of bits in the length prefix. Once we've decoded the bytes as little endian, all we have to do to finish decoding is a right bitwise shift to remove the prefix bits.

This means the decoder can be implemented completely free of loops and optimized implementations can decode varints in around 15 nanoseconds.

Clone this wiki locally