
Serialization format for template references

Erik Carstensen edited this page Sep 22, 2022 · 1 revision

In the context of saved variables with template types, we need to decide how a reference to an object such as a[1].b.c[2][3] is stored in a checkpoint.

Our discussion covered three alternatives for representing a[1].b.c[2][3]:

  1. Simple string: "a[1].b.c[2][3]"
  2. printf-style format string plus indices: ["a[%u].b.c[%u][%u]",[1,2,3]]
  3. Full structure of name components and indices: [["a",[1]],["b",[]],["c",[2,3]]]

We want a format that is easy to understand for a human reader; in particular, when you inspect a checkpoint, you should be able to guess that it's a reference. We also want a format that is easy to manipulate by a script, such as a checkpoint updater.

Alternative 1 is the most human-readable form and 3 the least; conversely, 3 is the most script-friendly format and 1 the least.

We wrote Python expressions to translate between the three formats; here 3 was a clear winner. For instance, with x holding the reference in the source format and re and itertools imported:

  • 1->3: [[part.split('[', 1)[0], [int(i) for i in re.findall(r'\[(\d+)\]', part)]] for part in x.split('.')]
  • 2->3: list(zip((part.split('[',1)[0] for part in x[0].split('.')), (x[1][a:b] for (a,b) in itertools.pairwise(itertools.accumulate([0] + [part.count('[%u]') for part in x[0].split('.')])))))
  • 3->2: ['.'.join(s+'[%u]'*len(indices) for (s, indices) in x), [i for (s, indices) in x for i in indices]]
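
For reference, the same conversions can be written as small named functions. This is only a sketch: the function names are made up for this page, and fmt_to_str (format 2 to format 1) is an addition that relies on Python's % operator treating %u the same as %d.

```python
import itertools
import re

def str_to_struct(s):
    """Format 1 -> format 3."""
    return [[part.split('[', 1)[0],
             [int(i) for i in re.findall(r'\[(\d+)\]', part)]]
            for part in s.split('.')]

def fmt_to_struct(ref):
    """Format 2 -> format 3."""
    fmt, indices = ref
    parts = fmt.split('.')
    counts = [part.count('[%u]') for part in parts]
    # Running totals give the slice boundaries into the flat index list.
    bounds = list(itertools.accumulate([0] + counts))
    return [[part.split('[', 1)[0], indices[a:b]]
            for part, a, b in zip(parts, bounds, bounds[1:])]

def struct_to_fmt(struct):
    """Format 3 -> format 2."""
    fmt = '.'.join(name + '[%u]' * len(idx) for name, idx in struct)
    return [fmt, [i for _, idx in struct for i in idx]]

def fmt_to_str(ref):
    """Format 2 -> format 1 (hypothetical helper, not from the discussion)."""
    fmt, indices = ref
    return fmt % tuple(indices)

struct = str_to_struct("a[1].b.c[2][3]")
print(struct)                             # [['a', [1]], ['b', []], ['c', [2, 3]]]
print(struct_to_fmt(struct))              # ['a[%u].b.c[%u][%u]', [1, 2, 3]]
print(fmt_to_str(struct_to_fmt(struct)))  # a[1].b.c[2][3]
```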

However, when we look closer at the operations a script typically needs to do, they are to identify references to a particular object, maybe rename that object, and perhaps split a one-dimensional array into a two-dimensional one. For instance, transform a[i<4].b.* into a[i<2][j<2].d.*. Here, with refs holding the list of references,

  1. resort to regexps
  2. refs = [["a[%u][%u].d" + name[7:], [indices[0] // 2, indices[0] % 2] + indices[1:]] if name[:8].rstrip('.') == 'a[%u].b' else [name, indices] for [name, indices] in refs]
  3. refs = [[[ref[0][0], [ref[0][1][0] // 2, ref[0][1][0] % 2]], ["d", []]] + ref[2:] if len(ref) > 1 and ref[0][0] == 'a' and ref[1][0] == 'b' else ref for ref in refs]
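
To make the comparison concrete, the transformation can be spelled out as one function per format. This is a sketch under assumptions: the function names are hypothetical, each function takes a list of references in its own format, and the regexp for format 1 is just one possible way to "resort to regexps".

```python
import re

def split_a_str(refs):
    """Format 1: rewrite a[i].b.* as a[i // 2][i % 2].d.* with a regexp."""
    # The lookahead requires '.b' to end at a component boundary.
    pat = re.compile(r'^a\[(\d+)\]\.b(?=\.|$)')
    return [pat.sub(lambda m: 'a[%d][%d].d' % divmod(int(m.group(1)), 2), r)
            for r in refs]

def split_a_fmt(refs):
    """Format 2: slice the format string, rewrite the first index."""
    old = 'a[%u].b'
    out = []
    for name, indices in refs:
        # Match the prefix only when it is a whole component.
        if name == old or name.startswith(old + '.'):
            i = indices[0]
            out.append(['a[%u][%u].d' + name[len(old):],
                        [i // 2, i % 2] + indices[1:]])
        else:
            out.append([name, indices])
    return out

def split_a_struct(refs):
    """Format 3: rewrite the first two components."""
    out = []
    for ref in refs:
        if len(ref) > 1 and ref[0][0] == 'a' and ref[1][0] == 'b':
            i = ref[0][1][0]
            out.append([['a', [i // 2, i % 2]], ['d', []]] + ref[2:])
        else:
            out.append(ref)
    return out

print(split_a_str(['a[3].b.c[5]', 'a[1].bb']))
# ['a[1][1].d.c[5]', 'a[1].bb']
print(split_a_fmt([['a[%u].b.c[%u]', [3, 5]], ['a[%u].bb', [1]]]))
# [['a[%u][%u].d.c[%u]', [1, 1, 5]], ['a[%u].bb', [1]]]
```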

Here, 1 is much worse, but there is no clear winner between 2 and 3. The added structure of indices in 3 doesn't really help, since the script author typically knows this statically for a given transformation, and direct slicing of strings in 2 is rather convenient and readable. But 2 also has a pitfall: when matching on a string prefix such as a[%u].b, it's easy to accidentally include an object named a[%u].bb.
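
The pitfall can be illustrated with two ways of testing a format-2 name against the prefix a[%u].b. Both helper names are hypothetical; the second is one possible guard, not necessarily what a checkpoint updater should use.

```python
PREFIX = 'a[%u].b'

def naive_match(name):
    # Plain prefix test: also true for siblings such as a[%u].bb.
    return name.startswith(PREFIX)

def boundary_match(name):
    # True only if the prefix ends at a component boundary.
    return name == PREFIX or name.startswith(PREFIX + '.')

names = ['a[%u].b', 'a[%u].b.c[%u]', 'a[%u].bb']
print([naive_match(n) for n in names])     # [True, True, True]
print([boundary_match(n) for n in names])  # [True, True, False]
```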

The conclusion is that we go for alternative 2: compared to 3 it's significantly easier to read for a human, without any big practical disadvantages in scripting; compared to 1 it is much friendlier for scripting and still acceptably human readable.