Serialization format for template references
In the context of saved variables with template types: given an object a[1].b.c[2][3], we need to decide how to store that reference in a checkpoint.
Our discussion included three alternatives. Given an object a[1].b.c[2][3]:
1. Simple string:
"a[1].b.c[2][3]"
2. printf-style format string plus indices:
["a[%u].b.c[%u][%u]",[1,2,3]]
3. Full structure of name components and indices:
[["a",[1]],["b",[]],["c",[2,3]]]
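To make the relationship between the alternatives concrete, here is a minimal Python sketch that writes the same reference in each form and expands the two structured forms back into the simple string. The variable names are only illustrative, and the sketch assumes that the checkpoint values map directly onto Python strings and lists.

```python
# The same reference, a[1].b.c[2][3], in each of the three candidate formats.
simple = "a[1].b.c[2][3]"
fmt_indices = ["a[%u].b.c[%u][%u]", [1, 2, 3]]
structured = [["a", [1]], ["b", []], ["c", [2, 3]]]

# The printf-style form expands to the simple string with ordinary
# %-formatting (Python accepts %u as an alias for %d).
fmt, indices = fmt_indices
expanded = fmt % tuple(indices)

# The fully structured form expands by joining each name component
# with its own bracketed indices.
joined = ".".join(
    name + "".join(f"[{i}]" for i in idx) for name, idx in structured
)

print(expanded)
print(joined)
```

Both expansions reproduce the simple string, so the three forms carry the same information; they differ only in how much parsing a reader or a script has to do.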
We want a format that is easy to understand for a human reader; in particular, when you inspect a checkpoint, you should be able to guess that it's a reference. We also want a format that is easy to manipulate by a script, such as a checkpoint updater.
Alternative 1 is the most human-readable form and 3 the least; conversely, 3 is the most script-friendly format and 1 the least.
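We wrote Python expressions to translate between the three formats; here 3 was a clear winner. For instance, with re and itertools imported and x holding the reference to convert (itertools.pairwise requires Python 3.10 or later):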
- 1->3:
[[part.split('[', 1)[0], [int(i) for i in re.findall(r'\[(\d+)\]', part)]] for part in x.split('.')]
- 2->3:
list(zip((part.split('[',1)[0] for part in x[0].split('.')), (x[1][a:b] for (a,b) in itertools.pairwise(itertools.accumulate([0] + [part.count('[%u]') for part in x[0].split('.')])))))
- 3->2:
['.'.join(s+'[%u]'*len(indices) for (s, indices) in x), [i for (s, indices) in x for i in indices]]
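For readers who find the one-liners dense, the same conversions can be unpacked into named functions. This is only a sketch: the function names are made up for illustration, and it assumes that indices are non-negative decimal integers and that name components contain neither '.' nor '['.

```python
import re
from itertools import accumulate


def simple_to_structured(s):
    """Format 1 -> format 3: split on '.', then pull out each [N] index."""
    return [
        [part.split("[", 1)[0], [int(i) for i in re.findall(r"\[(\d+)\]", part)]]
        for part in s.split(".")
    ]


def structured_to_fmt(ref):
    """Format 3 -> format 2: one [%u] per index, indices flattened in order."""
    fmt = ".".join(name + "[%u]" * len(idx) for name, idx in ref)
    return [fmt, [i for _, idx in ref for i in idx]]


def fmt_to_structured(ref):
    """Format 2 -> format 3: hand each component as many indices as it has [%u]."""
    fmt, indices = ref
    parts = fmt.split(".")
    counts = [part.count("[%u]") for part in parts]
    ends = list(accumulate(counts))   # slice end for each component
    starts = [0] + ends[:-1]          # slice start for each component
    return [
        [part.split("[", 1)[0], indices[a:b]]
        for part, a, b in zip(parts, starts, ends)
    ]


structured = simple_to_structured("a[1].b.c[2][3]")
as_fmt = structured_to_fmt(structured)
round_trip = fmt_to_structured(as_fmt)
```

Converting out of format 3 needs no parsing at all, while the other two directions both have to scan strings, which matches the observation that 3 is the easiest form to translate from.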
However, when we look closer at the operations you typically want to perform, they amount to identifying references to a particular object, perhaps renaming that object, and perhaps splitting a one-dimensional array into a two-dimensional one. For instance, transform a[i<4].b.* into a[i<2][j<2].d.*. Here, the three formats fare as follows:
1. Resort to regexps.
2. refs = [["a[%u][%u].d" + name[7:], [indices[0] // 2, indices[0] % 2] + indices[1:]] if name[:8].rstrip('.') == 'a[%u].b' else [name, indices] for [name, indices] in refs]
3. refs = [[[ref[0][0], [ref[0][1][0] // 2, ref[0][1][0] % 2]], ["d", []]] + ref[2:] if ref[0][0] == 'a' and len(ref) > 1 and ref[1][0] == 'b' else ref for ref in refs]
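As a worked illustration of the a[i<4].b.* to a[i<2][j<2].d.* transformation, the following sketch spells it out as one function per format and applies each to a sample reference. The function names and sample data are hypothetical; the mapping of old index i to the new pair (i // 2, i % 2) follows the expressions discussed here.

```python
def split_fmt(ref):
    """Format 2: rewrite a[i<4].b.* into a[i<2][j<2].d.*."""
    name, indices = ref
    # Taking 8 characters and stripping a trailing '.' accepts both
    # 'a[%u].b' itself and anything below it, but rejects 'a[%u].bb'.
    if name[:8].rstrip(".") == "a[%u].b":
        i = indices[0]
        return ["a[%u][%u].d" + name[7:], [i // 2, i % 2] + indices[1:]]
    return ref


def split_structured(ref):
    """Format 3: the same rewrite on the list-of-components form."""
    if len(ref) >= 2 and ref[0][0] == "a" and ref[1][0] == "b":
        (i,) = ref[0][1]  # a is one-dimensional before the split
        return [["a", [i // 2, i % 2]], ["d", ref[1][1]]] + ref[2:]
    return ref


new_fmt = split_fmt(["a[%u].b.c[%u][%u]", [3, 2, 3]])
new_structured = split_structured([["a", [3]], ["b", []], ["c", [2, 3]]])
untouched = split_fmt(["a[%u].bb", [3]])
```

In both versions the bulk of the work is the index arithmetic, which is identical; the formats differ mainly in how the match on a and b is expressed.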
Here, 1 is much worse, but there is no clear winner between 2 and 3. The added structure of indices in 3 doesn't really help, since the script author typically knows this statically for a given transformation, and direct slicing of strings in 2 is rather convenient and readable. But 2 also has a pitfall: it's easy to accidentally include an object named a[%u].bb.
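The pitfall can be shown in a few lines. The sketch below compares a plain prefix test with a test that only matches on a component boundary; refers_to is a hypothetical helper, not part of any existing updater API, and it assumes the object of interest is not itself indexed further.

```python
def refers_to(name, prefix):
    """True if the format-2 name is the object `prefix` or something below it."""
    return name == prefix or name.startswith(prefix + ".")


names = ["a[%u].b", "a[%u].b.c[%u]", "a[%u].bb", "a[%u].bb.c"]

# A plain prefix test also catches the unrelated object bb.
naive = [n for n in names if n.startswith("a[%u].b")]

# Requiring an exact match or a following '.' excludes it.
careful = [n for n in names if refers_to(n, "a[%u].b")]

print(naive)
print(careful)
```

A small helper of this kind keeps the convenience of string slicing in alternative 2 while avoiding the accidental match.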
The conclusion is that we go for alternative 2: compared to 3 it's significantly easier to read for a human, without any big practical disadvantages in scripting; compared to 1 it is much friendlier for scripting and still acceptably human readable.