So, Complex Data Types

Lists, Maps, Arrays, Sorted Lists, etc. Oh my!

We want to express things in complex nested structures, but we want access to serialization to be really fast. We want to work with them interactively, but read inside them without tons of overhead cost.

Everyone loves JSON…

JSON itself is kinda a mess
Not strongly typed.
Is really a string based system. Annoying to serialize tightly.
If you say you support JSON, you are stuck with a mushy standard, with many odd corners.
Can’t afford to tie our type system to a third-party library.
But lots of utility in JSON the string representation…
lots of publicly available JSON out there

For CDT⇐⇒ binary, any doc format needs

Needs to be able to be serialized very quickly
Needs interactive mutation as well
Needs to be readable without deserializing — lift data elements on demand right out of ByteBuffers
Needs to be typed — at least failing predictably when type mismatches occur
Needs to have something akin to JSON’s lists/maps

“Piles” of data

Basic Unit of Human Organization: The Pile 💩

Drawing from both Runnel and ValuePiles and other things, we have a “Pile” package.

Piles are structured data sequences, which allow for a list of data elements in order, followed by a meta stanza that organizes types and sizes of each element.

Ints (8/16/32/64), bools, varints, floats (32/64), strings, nulls, byte[], and nested Piles
Nested Piles are the big deal, each nested pile has its own data block and metadata. (2 types of piles allowed)
Piles are serialized entirely in order, no backtracking; closed with a meta block
Strongly typed access.
Basically a variable sized packed array

Again, a Pile is a heterogeneous Array with Typing

An example Pile ‘p’ can be thought of as an typed array: [ 10 (int), 12L (long), “foo” (string) ]

p.size() == 3
p.int32(0) == 10
p.int64(1) == 12L
p.string(2) == “foo”
p.string(0) --> ClassCastException
p.int64(0) == 10L (upconverts automatically, debatable decision)

Code… ish.

w = new PileWriter();
w.int8((byte)10);
w.str(“hello”);
w.bool(true);
sub = w.pile();
sub.str(“inner”);
sub.zigzag32(10020);
sub.end();
w.end();

buf=w.getBuffer();

rdr = new PileReader(buf);
byte i = rdr.int8(0); // works, returns 10
String j = rdr.str(0); // throws exception
inner = rdr.pile1(3);
String k = inner.str(0); // returns “inner”

inner.count() == 1

rdr.count() == 4

More Details

int8, int16, int32, int64, bool, null, float32, float64, char (fixed size)
zigzag32, zigzag64 (variable encoded signed ints, 1-9 bytes)
- int8/int16/int32/zigzag32 can be read as int32
- int8/int16/int32/zigzag32/int64/zigzag64 can be read as int64
string, as you might expect (var size)
(signifier, byte[]) pair: extra byte “signifier” for typing arrays.
Nested piles, 2 types, variable size.

Lots of bit twiddling tricks to save space. Fixed size ones need no size entry in metadata block. Highly dense structure, but every element completely randomly accessible with only metadata deserialization.

So that’s great, but primitive, right?

Intentional. Not a thing most people will interact with.
Tried to build the smallest thing that was useful.
Very testable, easily to understand.
No third party dependencies, only real code lift was our StringTool class.
- Clifford still has the fastest String→ByteBuffer code in the land!
Only existed to let me build something far more useful on top of it.

“Son” of a gun

Not JSON, just SON.

So, how do we get from low level Piles, to something JSON-esque?

Binary friendly, type extensible, compact and fast?
Maps and lists, and compact and fast?
Deal efficiently with repeated string key names (but compact and fast)?
Useful for CDTs ⇐⇒ binary interchange (messaging, storage at rest, mutable structures, text representation), still needs to be fast (and compact)!
Terracotta-SON? TC-SON

Built over Piles, but with more structure…

Streaming writer writes the same sort of data as piles, but as they are written…
Every map/list has a metadata stanza of typing. (this happens in the pile)
Every map gets a meta data block of <keyId>::<element offset>
Top level maps also have keep track of names used as keys in maps across all maps, included nested maps.
At the end of the top level Map, write the table of <names>::<key id>
High level, more complicated than that, but suffices for explanation.
Deserialization involves lazily materializing metadata, then accessing.

SonMaps and SonLists and SonValues

SonMap implements Iterable<SonMapValue>{
   SonMapValue get(String name);
   size();
   ...
}

SonList implements Iterable<SonValue> {
   SonValue get(int idx);
   size();
   ...
}

SonValue{
   SonType getType();
   int intValue();
   String stringValue();
   SonMap mapValue();
   SonList listValue();
   ...
}

This gives you type-checked access to values.

With apologies to Baskin Robbins, 3 not 31 flavors.

StreamingSonMapWriter & StreamingSonListWriter
- Write in order of values. Very fast for serialization to a buffer
- Manages an resizable byte buffer
ReadableSonMap and ReadableSonList
- Laid over a buffer
- Provides read access to any element, no matter how nested, without only metadat deser.
- Can be turned into …
MutableSonMap & MutableSonList
- Fully mutable, in memory representation, no buffer involved
- Good for interactive mutation
- builders included as well
- Mutable Maps and Lists are serializable as well.
- Can be turned into a buffer, suitable for ReadableSonMap.

So you can move from writing only, to read only buffer-based, to mutable and back willy nilly.

All the access to all the publicly intended API is through the Son.java interface, which as static factory methods for everything.

Extra types…

You can use the byte[] ‘signifier’ to overlay further types

For example, added:

Date (UTC, millis accuracy)
- 8 byte long + 1 signifier byte.
UUID - 2 longs
- 16 bytes + 1 signifier byte)

Both Date and UUID are supported in the parser and pretty printers. Reserved -1 … -128 for internal SON usage, 0 means plain byte[], SON Users could specify >0 as needed for typing

Pretty printing

4 different variants, each can be done compactly (no lin breaks) or normally (line breaks).

SON - normal son printing, only typing info for types that need it
SON_VERBOSE - type information for all
JSON_EXTENDED - turns non JSON types into strings, or longs (for ints) or doubles for floats. Best attempts.
JSON_LOSSY - you only get JSON native types.

Accessible via Son.SON_PRINTERS enum

Dot notation

Built a parser and matcher for general text notation matching:

.["al\'dksfh"]
."al\'dksfh"
.aldksfh
.[1]
.[-21]
.[1:-21]
.[1:-21, 10]
.[]

Simple approximate BNF, close, probs not accurate:

    <dotspec> := <singlestep>* <eof>
    <singlestep> := '.'? <angleStep> | <quoteStep> | <simpleStep>
    <simpleStep> := ("a-zA-Z" ("a-zA-Z0-9_)*)
    <quoteStep> := '"' <escapedChars> '"'
    <angleStep> := '[' ( ']' | ((<arrayStep> | <quoteStep> | ~[']']+ ) ']'))

    <arrayStep> := <arraySingle> (',' <arraySingle>>)*
    <arraySingle> := <integer> | <integer> ':' <integer>

    <integer> ('+'|'-')? ( ('0'-'9')+ )

So, a string like ".foo.bar.baz" walks three key→map transitions. ".foo.[0:3,7]" walks one map entry, "foo", then looks for it as an array, and matches 0,1,2,3,7 indexes if they exist.

You can skip the first period if you like, such as "foo.[0:3].[2]" is the same as ".foo.[0:3].[2]".

Dot notation here is implemented to walk the structural members of a map/list, not the primitive types. So you can’t match a literal string or integer, you can just traverse to a map or list.

You can use .[] to wildcard all array/map entries. See unit tests for more specifics.

Accessible via Son.dotParser() static method.

Missing?

No enums. Seems dumb, the cost of serializing the class name is prohibitive
SonMap is not a java.util.Map, SonList is not a java.util.List
- seems like the right decision, but debatable I suppose

Upshot

We get a rich, mutable list/map structured format, with typing, that is also a compact and fast-to-serialize binary-at-rest format, which can then be accessed randomly without the need for deserialization.

Historical Dev Notes

These are just running notes of how it was developed, of course just in plaintext originally, but kept here for … reasons.

UID notes

The UUID string representation is as described by this BNF:

   UUID                   = <time_low> "-" <time_mid> "-"
                            <time_high_and_version> "-"
                            <variant_and_sequence> "-"
                            <node>
   time_low               = 4*<hexOctet>
   time_mid               = 2*<hexOctet>
   time_high_and_version  = 2*<hexOctet>
   variant_and_sequence   = 2*<hexOctet>
   node                   = 6*<hexOctet>
   hexOctet               = <hexDigit><hexDigit>
   hexDigit               =
         "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
         | "a" | "b" | "c" | "d" | "e" | "f"
         | "A" | "B" | "C" | "D" | "E" | "F"

First cut done

first, respin this as nested value pile, that makes more sense.

	PileWriter - streaming writer
	PileReader - random access reader
	PileUpdater - random access updater (for fixed size elements)


    So, tail stanza could be much more efficient

    needs to have:
			<varint:count>
			[
				<byte:type>{<varint:size>},
				...
			]
			<int:metadata size>

		* so, compression:
			** rll encode the type/size stanza?
				*** for fixed size only
				*** this would make the internal pile size quite big, unless we got very clever. NO	** better to allow for fixed size array **
			** encode count in the size?
			** can we assume metadata size is a byte?
				*** <size><<2|01 is a byte size 6 bits of size (64 bits max size)
				*** <size><<2|10 is a short size (2^^14) size == 16k size
				*** <size><<2|11 is an int size (2^30) bits == more than enough
			** version is an issue.
                *** ignore

	we have how many types:	boolean, char, int, long, double, string, barr,

	JSON is, trivially:

		<doc> := { <element pair> , ... }

		<element pair> :=	<element name> : <element>

		<element name> := {<type>} <dqstring>

		<array> := [ { <element>, ...]

		<element> := <map>
						| <list>
						| <char>
						| <boolean>
						| <int>
						| <long>
						| <double>
						| <string>
						| <barr>

		<type> := 'a' | 'c' | 'b' | 'i' | 'l' | 'd' | 's' | 'x' | 'o'

		except there is no <char>, only <long>/<double>, no <barr>

		so you need {type} to allow for keyhole

		Hmmm. But the writer should be very different, correct?

		new SonWriter(ByteBuffer)
			.append("key", "val")
			.list("key")
				.val(10)
				.val("str")
				.end()
			.map("name").
				.append("key", "val")
				.end()
			.

		reader=new SonReader(ByteBuffer);
		reader.keys();
		reader.get<type>("name");
	    reader.getInt8("name"); etc.

		scope of property names is ... problematic;	you want to resuse them across levels, but not within.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

overview.adoc

overview.adoc

So, Complex Data Types

Everyone loves JSON…

For CDT⇐⇒ binary, any doc format needs

“Piles” of data

Basic Unit of Human Organization: The Pile 💩

Again, a Pile is a heterogeneous Array with Typing

Code… ish.

More Details

So that’s great, but primitive, right?

“Son” of a gun

Not JSON, just SON.

Built over Piles, but with more structure…

SonMaps and SonLists and SonValues

With apologies to Baskin Robbins, 3 not 31 flavors.

Extra types…

Pretty printing

Dot notation

Missing?

Upshot

Historical Dev Notes

UID notes

First cut done

Files

overview.adoc

Latest commit

History

overview.adoc

File metadata and controls

So, Complex Data Types

Everyone loves JSON…​

For CDT⇐⇒ binary, any doc format needs

“Piles” of data

Basic Unit of Human Organization: The Pile 💩

Again, a Pile is a heterogeneous Array with Typing

Code…​ ish.

More Details

So that’s great, but primitive, right?

“Son” of a gun

Not JSON, just SON.

Built over Piles, but with more structure…​

SonMaps and SonLists and SonValues

With apologies to Baskin Robbins, 3 not 31 flavors.

Extra types…​

Pretty printing

Dot notation

Missing?

Upshot

Historical Dev Notes

UID notes

First cut done

Everyone loves JSON…

Code… ish.

Built over Piles, but with more structure…

Extra types…