binschema data format
---------------------

Binschema deals with 3 main concepts:

- encoded messages, which are sequences of bytes
- values, which are structured trees of data
- schemas, which are like data types for values
  - a schema defines a set of possible values
  - a schema defines how each value in that set is represented in
    encoded messages
## possible schemas

The following "leaf" schemas exist:

- u8 and i8: (ints) encoded as-is
- u16 and i16: (ints) encoded little-endian
- u32, u64, and u128: (ints) encoded as var-len uints
- i32, i64, and i128: (ints) encoded as var-len sints
- f32 and f64: (floats) encoded little-endian
- char: (unicode scalar) encoded as var-len uint
- bool: (boolean) encoded as a single byte, 0 if false and 1 if true
- str: (valid unicode string) encoded as:
  - byte length of the UTF-8, encoded as var-len uint
  - the UTF-8 bytes
- bytes: (string of arbitrary bytes) encoded as:
  - byte length, encoded as var-len uint
  - the bytes
- unit: (unitary data type) encoded as nothing (the empty byte
  sequence)
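To make the str rule concrete, here is a standalone sketch of encoding a short string by hand. `encode_str` is a hypothetical helper written for this document, not part of binschema's API; it only handles strings shorter than 128 bytes so the var-len-uint length fits in one byte.

```rust
// Hypothetical sketch of the str encoding above, not binschema's API.
fn encode_str(s: &str) -> Vec<u8> {
    // keep the sketch simple: lengths below 128 are one var-len-uint byte
    assert!(s.len() < 128);
    let mut out = Vec::new();
    // byte length of the UTF-8, encoded as a var-len uint
    out.push(s.len() as u8);
    // the UTF-8 bytes themselves
    out.extend_from_slice(s.as_bytes());
    out
}

fn main() {
    // "hi" -> length 2, then the bytes of 'h' and 'i'
    assert_eq!(encode_str("hi"), vec![0x02, 0x68, 0x69]);
}
```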
The following "branch" types of schemas exist, which include inner
schemas in their definitions:

- option
  - defined by: a single inner schema
  - possible values (any of):
    - none
    - some, containing a value of its inner schema
  - represented as (concatenation of):
    - a single "someness" byte, 0 if none and 1 if some
    - if some: the representation of the inner value
- seq
  - defined by (all of):
    - optionally: a fixed length
    - a single inner schema
  - possible values:
    a sequence of values of its inner schema. if the seq schema has
    a fixed length, the number of values in the sequence must equal
    that fixed length.
  - represented as (concatenation of):
    - if the seq schema **does not** have a fixed length: the number
      of values in the sequence, encoded as a var-len uint
    - the concatenation of the representations of the inner values
- tuple
  - defined by: a sequence of inner schemas
  - possible values:
    a sequence of values, one for each element in the sequence of
    inner schemas, each one a value of the corresponding inner
    schema
  - represented as:
    the concatenation of the representations of the inner values
- struct
  - defined by:
    a sequence of "fields", wherein each field is defined by:
    - the field's name (a string)
    - the field's inner schema
  - possible values:
    a sequence of values, one for each field in the sequence of
    fields, each one a value of the corresponding field's inner
    schema
  - represented as:
    the concatenation of the representations of the inner values
- enum
  - defined by:
    a sequence of "variants", wherein each variant is defined by:
    - the variant's name (a string)
    - the variant's inner schema
  - possible values:
    a selection of one of the variants in the sequence of variants,
    and a value of the selected variant's inner schema
  - represented as (concatenation of):
    - the index of the selected variant within the sequence of
      variants, ordinal-encoded with the maximum value for ordinal
      encoding being the length of the sequence of variants minus
      one
    - the representation of the inner value
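As a concrete illustration of the option rule, here is a hand-rolled sketch (a hypothetical helper, not binschema's API) encoding values of an option schema whose inner schema is bool:

```rust
// Hypothetical sketch: encode a value of the schema option<bool>.
fn encode_option_bool(v: Option<bool>) -> Vec<u8> {
    match v {
        // someness byte 0, and no inner representation
        None => vec![0],
        // someness byte 1, then the bool's single byte
        Some(b) => vec![1, b as u8],
    }
}

fn main() {
    assert_eq!(encode_option_bool(None), vec![0]);
    assert_eq!(encode_option_bool(Some(true)), vec![1, 1]);
    assert_eq!(encode_option_bool(Some(false)), vec![1, 0]);
}
```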
Finally, the "recurse" schema exists, to allow the representation of
recursive schemas. A "recurse" schema is defined by an integer, the
recurse level. In terms of possible values and representation, a
recurse schema simply acts as a reference to the schema that many
levels above it in the schema tree.

For example, a linked list schema could be constructed as:

- struct
    field 0 (name = "value"):
      - i32
    field 1 (name = "next"):
      - option:
        - recurse (level = 2)

In that example, recurse(0) would refer to the recurse itself,
recurse(1) would refer to the option, and recurse(2) refers to the
struct.
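Putting the struct, option, and recurse rules together, a hand-encoding of the list [1, 2] under that linked list schema can be sketched as below. The `Node` type and `encode` function are hypothetical, written for this document; it relies on the fact (from the var-len sint encoding later in this document) that small non-negative values below 64 encode as a single byte equal to themselves.

```rust
// Hypothetical sketch: hand-encode a value of the linked list schema
// above. Not binschema's API.
struct Node {
    value: i32,
    next: Option<Box<Node>>,
}

fn encode(node: &Node, out: &mut Vec<u8>) {
    // keep the sketch simple: values below 64 are one var-len-sint byte
    assert!((0..64).contains(&node.value));
    out.push(node.value as u8); // field "value" (i32, var-len sint)
    match &node.next {          // field "next" (option)
        None => out.push(0),    // someness byte 0
        Some(next) => {
            out.push(1);        // someness byte 1
            encode(next, out);  // recurse(2) points back at the struct
        }
    }
}

fn main() {
    // the list [1, 2]
    let list = Node {
        value: 1,
        next: Some(Box::new(Node { value: 2, next: None })),
    };
    let mut out = Vec::new();
    encode(&list, &mut out);
    assert_eq!(out, vec![0x01, 0x01, 0x02, 0x00]);
}
```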
## the meta-schema

Schemas are themselves values which can be encoded. Schemas are thus
encoded with the following schema, the "meta-schema":

- enum <-------------------------------\-\-\-\-\
    variant 0 (name = "Scalar"):       | | | | |
      - enum                           | | | | |
          variant 0 (name = "U8"):     | | | | |
            - unit                     | | | | |
          variant 1 (name = "U16"):    | | | | |
            - unit                     | | | | |
          variant 2 (name = "U32"):    | | | | |
            - unit                     | | | | |
          variant 3 (name = "U64"):    | | | | |
            - unit                     | | | | |
          variant 4 (name = "U128"):   | | | | |
            - unit                     | | | | |
          variant 5 (name = "I8"):     | | | | |
            - unit                     | | | | |
          variant 6 (name = "I16"):    | | | | |
            - unit                     | | | | |
          variant 7 (name = "I32"):    | | | | |
            - unit                     | | | | |
          variant 8 (name = "I64"):    | | | | |
            - unit                     | | | | |
          variant 9 (name = "I128"):   | | | | |
            - unit                     | | | | |
          variant 10 (name = "F32"):   | | | | |
            - unit                     | | | | |
          variant 11 (name = "F64"):   | | | | |
            - unit                     | | | | |
          variant 12 (name = "Char"):  | | | | |
            - unit                     | | | | |
          variant 13 (name = "Bool"):  | | | | |
            - unit                     | | | | |
    variant 1 (name = "Str"):          | | | | |
      - unit                           | | | | |
    variant 2 (name = "Bytes"):        | | | | |
      - unit                           | | | | |
    variant 3 (name = "Unit"):         | | | | |
      - unit                           | | | | |
    variant 4 (name = "Option"):       | | | | |
      - recurse (level = 1) -----------/ | | | |
    variant 5 (name = "Seq"):            | | | |
      - struct                           | | | |
          field 0 (name = "len"):        | | | |
            - option:                    | | | |
              - u64                      | | | |
          field 1 (name = "inner"):      | | | |
            - recurse (level = 2) -------/ | | |
    variant 6 (name = "Tuple"):            | | |
      - seq (variable length):             | | |
        - recurse (level = 2) -------------/ | |
    variant 7 (name = "Struct"):             | |
      - seq (variable length):               | |
        - struct                             | |
            field 0 (name = "name"):         | |
              - str                          | |
            field 1 (name = "inner"):        | |
              - recurse (level = 3) ---------/ |
    variant 8 (name = "Enum"):                 |
      - seq (variable length):                 |
        - struct                               |
            field 0 (name = "name"):           |
              - str                            |
            field 1 (name = "inner"):          |
              - recurse (level = 3) -----------/
    variant 9 (name = "Recurse"):
      - u64
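For example, encoding the schema option<u8> under the meta-schema works out, by hand, to three bytes: the outer enum has 10 variants (max ordinal 9), so its ordinals fit in one byte each, and the unit inners of "Scalar" and "U8" contribute nothing. A sketch, with the bytes derived manually from the diagram above:

```rust
fn main() {
    // hand-derived encoding of the schema option<u8> under the
    // meta-schema above (a worked example, not produced by binschema)
    let encoded: Vec<u8> = vec![
        0x04, // outer enum ordinal 4 = "Option" (10 variants -> 1 byte)
        0x00, // the option's inner schema: ordinal 0 = "Scalar"
        0x00, // scalar enum ordinal 0 = "U8" (unit inner: no bytes)
    ];
    assert_eq!(encoded, vec![0x04, 0x00, 0x00]);
}
```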
## variable length int encodings

#### var-len uint encoding

The lowest 7 bits of each encoded byte are the next-lowest 7 bits of
the actual integer. The highest bit of each byte is 1 if there is at
least one additional byte after the current byte in the encoded
integer, and 0 if the current byte is the last byte in the encoded
integer.

Reference code:

    const MORE_BIT: u8 = 0b10000000;
    const LO_7_BITS: u8 = 0b01111111;

    /// Write a variable length unsigned int.
    pub fn write_var_len_uint<W>(
        write: &mut W,
        mut n: u128,
    ) -> Result<()>
    where
        W: Write,
    {
        let mut more = true;
        while more {
            let curr_7_bits = (n & (LO_7_BITS as u128)) as u8;
            n >>= 7;
            more = n != 0;
            let curr_byte = ((more as u8) << 7) | curr_7_bits;
            write.write_all(&[curr_byte])?;
        }
        Ok(())
    }

    /// Read a variable length unsigned int.
    pub fn read_var_len_uint<R>(
        read: &mut R,
    ) -> Result<u128>
    where
        R: Read,
    {
        let mut n: u128 = 0;
        let mut shift = 0;
        let mut more = true;
        while more {
            ensure!(
                shift < 128,
                "malformed data: too many bytes in var len uint",
            );
            let mut buf = [0];
            read.read_exact(&mut buf)?;
            let [curr_byte] = buf;
            n |= ((curr_byte & LO_7_BITS) as u128) << shift;
            shift += 7;
            more = (curr_byte & MORE_BIT) != 0;
        }
        Ok(n)
    }
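A quick round-trip through a self-contained copy of the writer shows the expected bytes. This sketch uses std::io::Result instead of the `Result`/`ensure!` used by the reference code (which presumably come from an error-handling crate), so it compiles on its own:

```rust
use std::io::{Result, Write};

// Self-contained copy of write_var_len_uint from above, using
// std::io::Result so the sketch stands alone.
fn write_var_len_uint<W: Write>(write: &mut W, mut n: u128) -> Result<()> {
    let mut more = true;
    while more {
        let curr_7_bits = (n & 0x7f) as u8;
        n >>= 7;
        more = n != 0;
        write.write_all(&[((more as u8) << 7) | curr_7_bits])?;
    }
    Ok(())
}

fn main() -> Result<()> {
    let mut buf = Vec::new();
    write_var_len_uint(&mut buf, 300)?;
    // 300 = 0b10_0101100: low 7 bits (0101100) with the more bit set,
    // then the remaining bits (10) in a final byte
    assert_eq!(buf, vec![0xac, 0x02]);

    buf.clear();
    write_var_len_uint(&mut buf, 0)?;
    // zero still emits one byte
    assert_eq!(buf, vec![0x00]);
    Ok(())
}
```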
#### var-len sint encoding

This is a modification of the var-len uint encoding.

If the integer is negative, the "neg" bit is considered to be 1;
otherwise it is considered to be 0. Before the integer is encoded,
and after it is decoded, if the neg bit is 1, the integer is
bitwise-negated.

In the first encoded byte, the second-to-highest bit is used to store
the neg bit. As such, in the first encoded byte, only the 6 lowest
bits are used to encode bits of the actual integer, as opposed to the
typical 7. All subsequent bytes in the encoded message, if any exist,
carry the typical 7 content bits.
Reference code:

    const MORE_BIT: u8 = 0b10000000;
    const LO_7_BITS: u8 = 0b01111111;
    const ENCODED_SIGN_BIT: u8 = 0b01000000;
    const LO_6_BITS: u8 = 0b00111111;

    /// Write a variable length signed int.
    pub fn write_var_len_sint<W>(
        write: &mut W,
        mut n: i128,
    ) -> Result<()>
    where
        W: Write,
    {
        let neg = n < 0;
        if neg {
            n = !n;
        }
        // the first byte carries the neg bit plus 6 content bits
        let curr_6_bits =
            ((neg as u8) << 6)
            | (n & (LO_6_BITS as i128)) as u8;
        n >>= 6;
        let mut more = n != 0;
        let curr_byte = ((more as u8) << 7) | curr_6_bits;
        write.write_all(&[curr_byte])?;
        while more {
            let curr_7_bits = (n & (LO_7_BITS as i128)) as u8;
            n >>= 7;
            more = n != 0;
            let curr_byte = ((more as u8) << 7) | curr_7_bits;
            write.write_all(&[curr_byte])?;
        }
        Ok(())
    }

    /// Read a variable length signed int.
    pub fn read_var_len_sint<R>(
        read: &mut R,
    ) -> Result<i128>
    where
        R: Read,
    {
        let mut n: i128 = 0;
        let mut buf = [0];
        read.read_exact(&mut buf)?;
        let [curr_byte] = buf;
        let neg = (curr_byte & ENCODED_SIGN_BIT) != 0;
        n |= (curr_byte & LO_6_BITS) as i128;
        let mut more = (curr_byte & MORE_BIT) != 0;
        let mut shift = 6;
        while more {
            // TODO: should use crate-specific error types
            ensure!(
                shift < 128,
                "malformed data: too many bytes in var len sint",
            );
            let mut buf = [0];
            read.read_exact(&mut buf)?;
            let [curr_byte] = buf;
            n |= ((curr_byte & LO_7_BITS) as i128) << shift;
            shift += 7;
            more = (curr_byte & MORE_BIT) != 0;
        }
        if neg {
            n = !n;
        }
        Ok(n)
    }
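Two hand-checked byte sequences illustrate the scheme, again via a self-contained copy of the writer using std::io::Result in place of the reference code's error-handling types:

```rust
use std::io::{Result, Write};

// Self-contained copy of write_var_len_sint from above, using
// std::io::Result so the sketch stands alone.
fn write_var_len_sint<W: Write>(write: &mut W, mut n: i128) -> Result<()> {
    let neg = n < 0;
    if neg {
        n = !n;
    }
    // first byte: more bit, neg bit, low 6 content bits
    let first = ((neg as u8) << 6) | (n & 0x3f) as u8;
    n >>= 6;
    let mut more = n != 0;
    write.write_all(&[((more as u8) << 7) | first])?;
    while more {
        let bits = (n & 0x7f) as u8;
        n >>= 7;
        more = n != 0;
        write.write_all(&[((more as u8) << 7) | bits])?;
    }
    Ok(())
}

fn main() -> Result<()> {
    let mut buf = Vec::new();
    write_var_len_sint(&mut buf, -3)?;
    // -3: neg bit set, !(-3) = 2 fits in the first byte's 6 bits
    assert_eq!(buf, vec![0x42]);

    buf.clear();
    write_var_len_sint(&mut buf, 100)?;
    // 100 = 0b1100100: low 6 bits plus the more bit, then the top bit
    assert_eq!(buf, vec![0xa4, 0x01]);
    Ok(())
}
```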
#### ordinal encoding

Ordinal encoding is used to encode integers in a number of bytes that
is a function of some statically known maximum value. The number of
bytes is calculated as the minimum number of bytes necessary to store
the maximum value. The value is then encoded little-endian in that
many bytes.

Reference code:

    /// Number of bytes needed to encode an ordinal based on the max
    /// ordinal value.
    fn ord_byte_len(max_ord: usize) -> usize {
        let mut mask = !0;
        let mut bytes = 0;
        while (mask & max_ord) != 0 {
            mask <<= 8;
            bytes += 1;
        }
        bytes
    }

    /// Write an enum ordinal. Assumes `ord` < `num_variants`.
    pub fn write_ord<W>(
        write: &mut W,
        ord: usize,
        num_variants: usize,
    ) -> Result<()>
    where
        W: Write,
    {
        debug_assert!(ord < num_variants, "enum ord out of bounds");
        // if the ord is greater than 2^64... congratulations, future
        // man, on having several dozen exabytes of RAM. you get a
        // free bug.
        let all_bytes = u64::to_le_bytes(ord as _);
        let byte_len = ord_byte_len(num_variants - 1);
        let used_bytes = &all_bytes[..byte_len];
        write.write_all(used_bytes)
    }
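A self-contained copy of ord_byte_len makes the byte counts concrete; note in particular that a single-variant enum (max ordinal 0) contributes zero bytes:

```rust
// Self-contained copy of ord_byte_len from above.
fn ord_byte_len(max_ord: usize) -> usize {
    let mut mask: usize = !0;
    let mut bytes = 0;
    while (mask & max_ord) != 0 {
        mask <<= 8;
        bytes += 1;
    }
    bytes
}

fn main() {
    assert_eq!(ord_byte_len(0), 0);   // 1-variant enum: zero bytes
    assert_eq!(ord_byte_len(9), 1);   // the meta-schema's outer enum
    assert_eq!(ord_byte_len(255), 1); // largest 1-byte ordinal
    assert_eq!(ord_byte_len(256), 2); // first 2-byte ordinal
}
```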