Skip to content

Latest commit

 

History

History
387 lines (293 loc) · 11.8 KB

overview.adoc

File metadata and controls

387 lines (293 loc) · 11.8 KB

So, Complex Data Types

Lists, Maps, Arrays, Sorted Lists, etc. Oh my!

We want to express things in complex nested structures, but we want access to serialization to be really fast. We want to work with them interactively, but read inside them without tons of overhead cost.

Everyone loves JSON…​

  • JSON itself is kinda a mess

  • Not strongly typed.

  • Is really a string based system. Annoying to serialize tightly.

  • If you say you support JSON, you are stuck with a mushy standard, with many odd corners.

  • Can’t afford to tie our type system to a third-party library.

  • But lots of utility in JSON the string representation…​

  • lots of publicly available JSON out there

For CDT⇐⇒ binary, any doc format needs

  • Needs to be able to be serialized very quickly

  • Needs interactive mutation as well

  • Needs to be readable without deserializing — lift data elements on demand right out of ByteBuffers

  • Needs to be typed — at least failing predictably when type mismatches occur

  • Needs to have something akin to JSON’s lists/maps

“Piles” of data

Basic Unit of Human Organization: The Pile 💩

Drawing from both Runnel and ValuePiles and other things, we have a “Pile” package.

Piles are structured data sequences, which allow for a list of data elements in order, followed by a meta stanza that organizes types and sizes of each element.

  • Ints (8/16/32/64), bools, varints, floats (32/64), strings, nulls, byte[], and nested Piles

  • Nested Piles are the big deal, each nested pile has its own data block and metadata. (2 types of piles allowed)

  • Piles are serialized entirely in order, no backtracking; closed with a meta block

  • Strongly typed access.

  • Basically a variable sized packed array

Again, a Pile is a heterogeneous Array with Typing

An example Pile ‘p’ can be thought of as an typed array: [ 10 (int), 12L (long), “foo” (string) ]

p.size() == 3
p.int32(0) == 10
p.int64(1) == 12L
p.string(2) == “foop.string(0) --> ClassCastException
p.int64(0) == 10L (upconverts automatically, debatable decision)

Code…​ ish.

w = new PileWriter();
w.int8((byte)10);
w.str(“hello”);
w.bool(true);
sub = w.pile();
sub.str(“inner”);
sub.zigzag32(10020);
sub.end();
w.end();

buf=w.getBuffer();

rdr = new PileReader(buf);
byte i = rdr.int8(0); // works, returns 10
String j = rdr.str(0); // throws exception
inner = rdr.pile1(3);
String k = inner.str(0); // returns “inner”

inner.count() == 1

rdr.count() == 4

More Details

  • int8, int16, int32, int64, bool, null, float32, float64, char (fixed size)

  • zigzag32, zigzag64 (variable encoded signed ints, 1-9 bytes)

    • int8/int16/int32/zigzag32 can be read as int32

    • int8/int16/int32/zigzag32/int64/zigzag64 can be read as int64

  • string, as you might expect (var size)

  • (signifier, byte[]) pair: extra byte “signifier” for typing arrays.

  • Nested piles, 2 types, variable size.

Lots of bit twiddling tricks to save space. Fixed size ones need no size entry in metadata block. Highly dense structure, but every element completely randomly accessible with only metadata deserialization.

So that’s great, but primitive, right?

  • Intentional. Not a thing most people will interact with.

  • Tried to build the smallest thing that was useful.

  • Very testable, easily to understand.

  • No third party dependencies, only real code lift was our StringTool class.

    • Clifford still has the fastest String→ByteBuffer code in the land!

  • Only existed to let me build something far more useful on top of it.

“Son” of a gun

Not JSON, just SON.

So, how do we get from low level Piles, to something JSON-esque?

  • Binary friendly, type extensible, compact and fast?

  • Maps and lists, and compact and fast?

  • Deal efficiently with repeated string key names (but compact and fast)?

  • Useful for CDTs ⇐⇒ binary interchange (messaging, storage at rest, mutable structures, text representation), still needs to be fast (and compact)!

  • Terracotta-SON? TC-SON

Built over Piles, but with more structure…​

  • Streaming writer writes the same sort of data as piles, but as they are written…​

  • Every map/list has a metadata stanza of typing. (this happens in the pile)

  • Every map gets a meta data block of <keyId>::<element offset>

  • Top level maps also have keep track of names used as keys in maps across all maps, included nested maps.

  • At the end of the top level Map, write the table of <names>::<key id>

  • High level, more complicated than that, but suffices for explanation.

  • Deserialization involves lazily materializing metadata, then accessing.

SonMaps and SonLists and SonValues

SonMap implements Iterable<SonMapValue>{
   SonMapValue get(String name);
   size();
   ...
}

SonList implements Iterable<SonValue> {
   SonValue get(int idx);
   size();
   ...
}

SonValue{
   SonType getType();
   int intValue();
   String stringValue();
   SonMap mapValue();
   SonList listValue();
   ...
}

This gives you type-checked access to values.

With apologies to Baskin Robbins, 3 not 31 flavors.

  • StreamingSonMapWriter & StreamingSonListWriter

    • Write in order of values. Very fast for serialization to a buffer

    • Manages an resizable byte buffer

  • ReadableSonMap and ReadableSonList

    • Laid over a buffer

    • Provides read access to any element, no matter how nested, without only metadat deser.

    • Can be turned into …​

  • MutableSonMap & MutableSonList

    • Fully mutable, in memory representation, no buffer involved

    • Good for interactive mutation

    • builders included as well

    • Mutable Maps and Lists are serializable as well.

    • Can be turned into a buffer, suitable for ReadableSonMap.

So you can move from writing only, to read only buffer-based, to mutable and back willy nilly.

All the access to all the publicly intended API is through the Son.java interface, which as static factory methods for everything.

Extra types…​

You can use the byte[] ‘signifier’ to overlay further types

For example, added:

  • Date (UTC, millis accuracy)

    • 8 byte long + 1 signifier byte.

  • UUID - 2 longs

    • 16 bytes + 1 signifier byte)

Both Date and UUID are supported in the parser and pretty printers. Reserved -1 …​ -128 for internal SON usage, 0 means plain byte[], SON Users could specify >0 as needed for typing

Pretty printing

4 different variants, each can be done compactly (no lin breaks) or normally (line breaks).

  1. SON - normal son printing, only typing info for types that need it

  2. SON_VERBOSE - type information for all

  3. JSON_EXTENDED - turns non JSON types into strings, or longs (for ints) or doubles for floats. Best attempts.

  4. JSON_LOSSY - you only get JSON native types.

Accessible via Son.SON_PRINTERS enum

Dot notation

Built a parser and matcher for general text notation matching:

.["al\'dksfh"]
."al\'dksfh"
.aldksfh
.[1]
.[-21]
.[1:-21]
.[1:-21, 10]
.[]

Simple approximate BNF, close, probs not accurate:

    <dotspec> := <singlestep>* <eof>
    <singlestep> := '.'? <angleStep> | <quoteStep> | <simpleStep>
    <simpleStep> := ("a-zA-Z" ("a-zA-Z0-9_)*)
    <quoteStep> := '"' <escapedChars> '"'
    <angleStep> := '[' ( ']' | ((<arrayStep> | <quoteStep> | ~[']']+ ) ']'))

    <arrayStep> := <arraySingle> (',' <arraySingle>>)*
    <arraySingle> := <integer> | <integer> ':' <integer>

    <integer> ('+'|'-')? ( ('0'-'9')+ )

So, a string like ".foo.bar.baz" walks three key→map transitions. ".foo.[0:3,7]" walks one map entry, "foo", then looks for it as an array, and matches 0,1,2,3,7 indexes if they exist.

You can skip the first period if you like, such as "foo.[0:3].[2]" is the same as ".foo.[0:3].[2]".

Dot notation here is implemented to walk the structural members of a map/list, not the primitive types. So you can’t match a literal string or integer, you can just traverse to a map or list.

You can use .[] to wildcard all array/map entries. See unit tests for more specifics.

Accessible via Son.dotParser() static method.

Missing?

  • No enums. Seems dumb, the cost of serializing the class name is prohibitive

  • SonMap is not a java.util.Map, SonList is not a java.util.List

    • seems like the right decision, but debatable I suppose

Upshot

We get a rich, mutable list/map structured format, with typing, that is also a compact and fast-to-serialize binary-at-rest format, which can then be accessed randomly without the need for deserialization.

Historical Dev Notes

These are just running notes of how it was developed, of course just in plaintext originally, but kept here for …​ reasons.

UID notes

The UUID string representation is as described by this BNF:
   UUID                   = <time_low> "-" <time_mid> "-"
                            <time_high_and_version> "-"
                            <variant_and_sequence> "-"
                            <node>
   time_low               = 4*<hexOctet>
   time_mid               = 2*<hexOctet>
   time_high_and_version  = 2*<hexOctet>
   variant_and_sequence   = 2*<hexOctet>
   node                   = 6*<hexOctet>
   hexOctet               = <hexDigit><hexDigit>
   hexDigit               =
         "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
         | "a" | "b" | "c" | "d" | "e" | "f"
         | "A" | "B" | "C" | "D" | "E" | "F"

First cut done

  • first, respin this as nested value pile, that makes more sense.

	PileWriter - streaming writer
	PileReader - random access reader
	PileUpdater - random access updater (for fixed size elements)


    So, tail stanza could be much more efficient

    needs to have:
			<varint:count>
			[
				<byte:type>{<varint:size>},
				...
			]
			<int:metadata size>

		* so, compression:
			** rll encode the type/size stanza?
				*** for fixed size only
				*** this would make the internal pile size quite big, unless we got very clever. NO	** better to allow for fixed size array **
			** encode count in the size?
			** can we assume metadata size is a byte?
				*** <size><<2|01 is a byte size 6 bits of size (64 bits max size)
				*** <size><<2|10 is a short size (2^^14) size == 16k size
				*** <size><<2|11 is an int size (2^30) bits == more than enough
			** version is an issue.
                *** ignore

	we have how many types:	boolean, char, int, long, double, string, barr,

	JSON is, trivially:

		<doc> := { <element pair> , ... }

		<element pair> :=	<element name> : <element>

		<element name> := {<type>} <dqstring>

		<array> := [ { <element>, ...]

		<element> := <map>
						| <list>
						| <char>
						| <boolean>
						| <int>
						| <long>
						| <double>
						| <string>
						| <barr>

		<type> := 'a' | 'c' | 'b' | 'i' | 'l' | 'd' | 's' | 'x' | 'o'

		except there is no <char>, only <long>/<double>, no <barr>

		so you need {type} to allow for keyhole

		Hmmm. But the writer should be very different, correct?

		new SonWriter(ByteBuffer)
			.append("key", "val")
			.list("key")
				.val(10)
				.val("str")
				.end()
			.map("name").
				.append("key", "val")
				.end()
			.

		reader=new SonReader(ByteBuffer);
		reader.keys();
		reader.get<type>("name");
	    reader.getInt8("name"); etc.

		scope of property names is ... problematic;	you want to resuse them across levels, but not within.