Migration of first IDataView docs #173

Merged · 5 commits · May 22, 2018
471 changes: 471 additions & 0 deletions docs/code/IDataViewDesignPrinciples.md

Large diffs are not rendered by default.

518 changes: 518 additions & 0 deletions docs/code/IDataViewImplementation.md

Large diffs are not rendered by default.

843 changes: 843 additions & 0 deletions docs/code/IDataViewTypeSystem.md

Large diffs are not rendered by default.

191 changes: 191 additions & 0 deletions docs/code/IdvFileFormat.md
@@ -0,0 +1,191 @@
# IDV File Format

This document describes ML.NET's Binary dataview file format, version 1.1.1.5
written by the `BinarySaver` and `BinaryLoader` classes, commonly known as the
`.idv` format.

## Goal of the Format

A dataview is a collection of columns, over some number of rows. (Do not
confuse columns with features. Columns can be and often are vector valued,
and it is expected, though not required, that all the features will commonly
be together in one vector-valued column.)

The actual values are stored in blocks. A block holds values for a single
column across multiple rows. The block format is dictated by a codec. There is
a table of contents and a lookup table to facilitate quasi-random access to
particular blocks. (Quasi in the sense that you can only seek to a block, not
to a particular row within a block.)

## General Data Format

Before we discuss the format itself we will establish some conventions on how
individual scalar values, strings, and other data is serialized. All basic
pieces of data (e.g., a single number, or a single string) are encoded in ways
reflecting the semantics of the .NET `BinaryWriter` class, those semantics
being:

* All numbers are stored as little-endian, using their natural fixed-length
  binary encoding.

* Strings are stored using an unsigned
[LEB128](https://en.wikipedia.org/wiki/LEB128) number describing the number
of bytes, followed by that many bytes containing the UTF-8 encoded string.

A note about this: LEB128 is a simple encoding for arbitrarily large
integers. Each 8-bit byte follows this convention: the most significant bit
is 0 if and only if this is the last byte of the LEB128 encoding, and the
remaining 7 bits are part of the number being encoded. The bytes are stored
little-endian, that is, the first byte holds the 7 least significant bits, the
second byte (if applicable) holds the next 7 least significant bits, etc., and
the last byte holds the 7 most significant bits. LEB128 is used in one or two
places in this format. (I might tend to prefer LEB128 in places where we are
writing values that, on balance, we expect to be relatively small, and only
where there is no potential benefit from random access to the associated
stream, since LEB128 is incompatible with random access. However, this is not
formulated into anything approaching a definite policy.)
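
To make the convention concrete, here is a minimal sketch of unsigned LEB128
encoding and decoding in C#. The class and method names are illustrative, not
part of ML.NET; this is the same convention `BinaryWriter` uses internally for
its string length prefixes.

```csharp
using System.IO;

static class Leb128
{
    // Writes value as unsigned LEB128: 7 bits per byte, least significant
    // group first, with the high bit set on every byte except the last.
    public static void WriteLeb128(Stream stream, ulong value)
    {
        while (value >= 0x80)
        {
            stream.WriteByte((byte)(value | 0x80));
            value >>= 7;
        }
        stream.WriteByte((byte)value);
    }

    // Reads an unsigned LEB128 number: accumulate 7 bits per byte until a
    // byte with the high bit clear marks the end of the encoding.
    public static ulong ReadLeb128(Stream stream)
    {
        ulong result = 0;
        int shift = 0;
        while (true)
        {
            int b = stream.ReadByte();
            if (b < 0)
                throw new EndOfStreamException();
            result |= (ulong)(b & 0x7F) << shift;
            if ((b & 0x80) == 0)
                return result;
            shift += 7;
        }
    }
}
```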

## Header

Every binary instances stream has a header composed of 256 bytes, at the start
of the stream. Not all bytes are used. Those bytes that are not explicitly
used have undefined content, and can have anything in them. We strongly
encourage writers of this format to insert obscene messages in this dead
space. The content is defined as follows (the offsets being the byte position
at which that field starts).

Offsets | Type | Name and Description
--------|-------|---------------------
0 | ulong | **Signature**: The magic number of this file.
8 | ulong | **Version**: Indicates the version of the data file.
16 | ulong | **CompatibleVersion**: Indicates the minimum reader version that can interpret this file, possibly with some data loss.
24 | long | **TableOfContentsOffset**: The offset to the column table of contents structure.
32 | long | **TailOffset**: The eight-byte tail signature starts at this offset. So, the entire dataset stream should be considered to have byte length of eight plus this value.
40 | long | **RowCount**: The number of rows in this data file.
48 | int | **ColumnCount**: The number of columns in this data file.

Notes on these:

* The signature of this file is `0x00425644004C4D43`, which is, when written
little-endian to a file, `CML DVB ` with null characters in the place of
spaces. These letters are intended to suggest "CloudML DataView Binary."

* The tail signature is the byte-reversed version of this, that is,
`0x434D4C0044564200`.

* Versions are encoded as four 16-bit unsigned numbers packed into a single
  ulong, with higher order bits being a more major version. The first
  supported version of the format is 1.1.1.4, that is, `0x0001000100010004`.
  (Versions prior to 1.1.1.4 did exist, but were not released, so we do not
  support them, though we do describe them in this document for the sake of
  completeness.)
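
As a sketch of how a reader might consume these fields, using the
`BinaryReader` semantics established earlier (the `Header` type below is
illustrative, not ML.NET's actual implementation):

```csharp
using System.IO;

class Header
{
    public const ulong SignatureValue = 0x00425644004C4D43UL;

    public ulong Signature, Version, CompatibleVersion;
    public long TableOfContentsOffset, TailOffset, RowCount;
    public int ColumnCount;

    // Reads the defined fields at the start of the 256-byte header.
    public static Header Read(BinaryReader reader)
    {
        var h = new Header
        {
            Signature = reader.ReadUInt64(),
            Version = reader.ReadUInt64(),
            CompatibleVersion = reader.ReadUInt64(),
            TableOfContentsOffset = reader.ReadInt64(),
            TailOffset = reader.ReadInt64(),
            RowCount = reader.ReadInt64(),
            ColumnCount = reader.ReadInt32()
        };
        if (h.Signature != SignatureValue)
            throw new InvalidDataException("Not an IDV file.");
        return h;
    }

    // Renders a packed version like 0x0001000100010004 as "1.1.1.4".
    public static string VersionToString(ulong v) =>
        $"{(v >> 48) & 0xFFFF}.{(v >> 32) & 0xFFFF}.{(v >> 16) & 0xFFFF}.{v & 0xFFFF}";
}
```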

## Table of Contents Format

The table of contents is a sequence of packed entries, with as many entries as
there are columns. The version field here indicates the versions in which that
entry is written: ≥ indicates the field occurs in that version and all later
versions, = indicates the field occurs only in that version.

Description | Entry Type | Version
------------|------------|--------
Column name | string | ≥1.1.1.1
Codec loadname | string | ≥1.1.1.1
Codec parameterization length | LEB128 integer | ≥1.1.1.1
Codec parameterization, which must have precisely the length indicated above | arbitrary, but with specified length | ≥1.1.1.1
Compression kind | CompressionKind (byte) | ≥1.1.1.1
Rows per block in this column | LEB128 integer | ≥1.1.1.1
Lookup table offset | long | ≥1.1.1.1
Slot names offset, or 0 if this column has no slot names (when reading a 1.1.1.2 file, behave as if there are no slot names, i.e., as if this had value 0) | long | =1.1.1.3
Slot names byte size (present only if slot names offset is greater than 0) | long | =1.1.1.3
Slot names count (present only if slot names offset is greater than 0) | int | =1.1.1.3
Metadata table of contents offset, or 0 if there is no metadata (1.1.1.4) | long | ≥1.1.1.4

For those working in the ML.NET codebase: the three `Codec` fields are handled
by the `CodecFactory.WriteCodec`/`TryReadCodec` methods, with the definition
stream positioned at the start of the codec loadname beforehand, and at the
end of the codec parameterization afterward, in both the success and failure
cases.
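
As a sketch, reading a single entry from a version ≥ 1.1.1.4 file might look
like the following, reusing the `Leb128` helper from above (illustrative only;
a real reader must also handle the version-dependent slot-name fields):

```csharp
// Sketch: one table-of-contents entry, assuming format version >= 1.1.1.4.
string columnName = reader.ReadString();       // LEB128 length + UTF-8 bytes
string codecLoadName = reader.ReadString();
ulong codecParamLength = Leb128.ReadLeb128(reader.BaseStream);
byte[] codecParameterization = reader.ReadBytes((int)codecParamLength);
byte compressionKind = reader.ReadByte();      // see the enum below
ulong rowsPerBlock = Leb128.ReadLeb128(reader.BaseStream);
long lookupTableOffset = reader.ReadInt64();
long metadataTocOffset = reader.ReadInt64();   // 0 if the column has no metadata
```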

The `CompressionKind` enum is described below; it identifies the compression
algorithm used to compress blocks.

### Compression Kind

The enum for compression kind is one byte, and follows this scheme:

Compression Kind | Code
---------------------------------------------------------------|-----
None | 0
DEFLATE (i.e., [RFC1951](http://www.ietf.org/rfc/rfc1951.txt)) | 1
zlib (i.e., [RFC1950](http://www.ietf.org/rfc/rfc1950.txt)) | 2

None means no compression. DEFLATE is the default scheme. There is a tendency
to conflate zlib and DEFLATE, so to be clear: zlib can be (somewhat inexactly)
considered a wrapped version of DEFLATE, but it is still a distinct (but
closely related) format. However, both are implemented by the zlib library,
which is probably the source of the confusion.
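
In C#, this maps naturally onto a one-byte enum. The following is a sketch;
the actual type in the ML.NET codebase may be named or arranged differently:

```csharp
// One-byte codes for the compression scheme applied to blocks.
public enum CompressionKind : byte
{
    None = 0,    // block bytes stored uncompressed
    Deflate = 1, // raw DEFLATE stream (RFC 1951); the default
    Zlib = 2     // zlib-wrapped DEFLATE (RFC 1950)
}
```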

## Metadata Table of Contents Format

The metadata table of contents begins with a LEB128 integer giving the number
of entries. (This should be a positive value, since if a column has no
metadata the expectation is that the offset of the metadata TOC will be stored
as 0.) Following this are that many packed entries. Each entry is somewhat
akin to a column table of contents entry, with some simplifications reflecting
that there will be exactly one "block" with one item.

Description | Entry Type
-------------------------------------------------------|------------
Metadata kind | string
Codec loadname | string
Codec parameterization length | LEB128 integer
Codec parameterization, which must have precisely the length indicated above | arbitrary, but with specified length
Compression kind | CompressionKind (byte)
Offset of the block where the metadata item is written | long
Byte length of the block | LEB128 integer

The "block" written is written in exactly same format as the main content
blocks. This will be very slightly inefficient as that scheme is sometimes
written to accommodate many entries, but I don't expect that to be much of a
burden.
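
A sketch of reading such a metadata table of contents, reusing the helpers
from earlier (illustrative only, not ML.NET's actual code):

```csharp
// Sketch: reading a metadata table of contents.
ulong entryCount = Leb128.ReadLeb128(reader.BaseStream);
for (ulong i = 0; i < entryCount; i++)
{
    string metadataKind = reader.ReadString();
    string codecLoadName = reader.ReadString();
    ulong paramLength = Leb128.ReadLeb128(reader.BaseStream);
    byte[] codecParameterization = reader.ReadBytes((int)paramLength);
    byte compressionKind = reader.ReadByte();
    long blockOffset = reader.ReadInt64();   // where the metadata block lives
    ulong blockByteLength = Leb128.ReadLeb128(reader.BaseStream);
}
```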

## Lookup Table Format

Each table of contents entry is associated with a lookup table starting at the
indicated lookup table offset. It is written as packed binary, with each
lookup entry consisting of 16 bytes. So in all, the lookup table takes 16
bytes times the total number of blocks for this column.

Description | Entry Type
----------------------------------------------------------|-----------
Block offset, position in the file where the block starts | long
Block length, its size in bytes in the file | int
Uncompressed block length, its size in bytes if the block bytes were decompressed according to the column's compression codec | int
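
Reading one such entry under the `BinaryReader` conventions is straightforward
(a sketch; the variable names are illustrative):

```csharp
// Sketch: one 16-byte lookup table entry (8 + 4 + 4 bytes).
long blockOffset = reader.ReadInt64();        // where the block starts in the file
int blockLength = reader.ReadInt32();         // compressed size in bytes
int decompressedLength = reader.ReadInt32();  // size after decompression
```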

## Slot Names

If slot names are stored, they are stored as integer index/string pairs. As
many pairs are stored as the slot names count in the table of contents entry
indicates. Note that this scheme appeared only in version 1.1.1.3. With
1.1.1.4 and later, slot names were just considered yet another piece of
metadata.

Description | Entry Type
------------------|-----------
Index of the slot | int
The slot name | string

## Block Format

Column values are grouped into blocks, with each block holding the binary
encoded values for one particular column across a range of rows. So for
example, if the column's table of contents entry describes it as having 1000
rows per block, the first block will contain the values of that column for
rows 0 through 999, the second block rows 1000 through 1999, etc., with all
blocks containing the same number of rows, except the last block, which will
contain fewer items (unless the number of rows just so happens to be a
multiple of the block size).

Each block is a possibly compressed sequence of bytes, compressed according
to the compression kind field in the table of contents. It begins and ends at
the offsets indicated in its lookup table entry (or, for a metadata block, in
its metadata table of contents entry). The uncompressed bytes will be stored
in the format described by the column's codec.
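
Putting the pieces together, a reader locating the value of a column at a
particular row would do roughly the following arithmetic (a sketch under the
assumptions above, not ML.NET's actual code):

```csharp
// Sketch: locate the block holding a given row of one column.
long rowsPerBlock = 1000;                    // from the column's TOC entry
long row = 123456;
long blockIndex = row / rowsPerBlock;        // which lookup table entry to use
long indexWithinBlock = row % rowsPerBlock;  // which value inside that block
// The lookup table entry at blockIndex gives the block's file offset and
// lengths; decompress per the column's CompressionKind, then decode the
// uncompressed bytes with the column's codec to reach the desired value.
```
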
149 changes: 149 additions & 0 deletions docs/code/KeyValues.md
@@ -0,0 +1,149 @@
# Key Values

Most commonly, key values are used to encode items where it is convenient or
efficient to represent values using numbers, but you want to maintain the
logical "idea" that these numbers are keys indexing some underlying, implicit
set of values, in a way more explicit than simply mapping to a number would
allow.

A more formal description of key values and types is
[here](IDataViewTypeSystem.md#key-types). *This* document's motivation is less
to describe what key types and values are, and more to describe why key types
are necessary and helpful things to have. Necessarily, this document is more
anecdotal in its descriptions, the better to motivate its content.

Let's take a few examples of transforms that produce keys:

* The `TermTransform` forms a dictionary of unique observed values to a key.
The key type's count indicates the number of items in the set, and through
the `KeyValue` metadata "remembers" what each key is representing.

* The `HashTransform` performs a hash of input values, and produces a key
  value with count equal to the range of the hash function: if a b-bit hash
  is used, the count will be 2ᵇ.

* The `CharTokenizeTransform` will take input strings and produce key values
representing the characters observed in the string.

## Keys as Intermediate Values

Explicitly invoking transforms that produce key values, and using those key
values, is sometimes helpful. However, given that most trainers expect the
feature vector to be a vector of floating point values and *not* keys, in
typical usage keys mostly serve as some sort of intermediate value on the way
to that final feature vector. (Unless, say, one is preparing labels for a
multiclass learner.)

So why not go directly to the feature vector, and forget this key stuff?
Actually, to take text as the canonical example, we used to. However, by
structuring the transforms from, say, text to key to vector, rather than text
to vector *directly*, we are able to simplify a lot of code on the
implementation side, which is both less for us to maintain, and also gives
users consistency in behavior.

So for example, the `CharTokenize` above might appear to be a strange choice:
*why* represent characters as keys? The reason is that the n-gram transform is
written to ingest keys, not text, and so we can use the same transform for
both the n-gram featurization of words, as well as n-char grams.

---

**@justinormont** (Contributor), May 17, 2018:

We have hyphenated n-grams here (3x w/ n-char). My liking is "ngram" and "chargram". #Pending

**@TomFinley** (Contributor, Author), May 17, 2018:

So, could you explain this? Because this is a C# library its identifiers can't be n-gram, so we call it NGram, but everywhere I look the actual prose usage of the term is n-gram, including back when I was a wee little grad student. I see paper titles from ICML as "N-gram" or "n-gram"; I don't see an "ngram." Unless the "cool kids" are doing something different nowadays, with their long hair and rock music?

**@justinormont** (Contributor), May 18, 2018:

I started an email poll for terminology. No conclusion currently, but there was feedback that we'll need to also (besides for documentation) define the terms when used in code:

* "Within source code, we should probably use nGram as a variable name, as it is technically two words, so we should use camel case."
* "I won't express an opinion but can you please also determine how it should be cased in PascalCasing (i.e., in types and methods). Specifically, should the G be capitalized – Ngram or NGram?" #Resolved

**@TomFinley** (Contributor, Author), May 22, 2018:

Hi Justin, leaving as n-gram. As near as I can see, when written in prose it's nearly universally referred to this way. Maybe in less formal writing someone might omit the hyphen, and I see that Google actually has a piece of software branded "NGram", but otherwise I do not see that your preferred usage is used at all. Thanks though!

---

Now, much of this complexity is hidden from the user: most users will just use
the `text` transform, select some options for n-grams and char-grams, and not
be aware of these internal invisible keys. Similarly, they will use the
categorical or categorical hash transforms, without knowing that internally
these are just the term or hash transform followed by a `KeyToVector`
transform. But keys are still there, and it would be impossible to really
understand ML.NET's featurization pipeline without understanding keys. Any
user who wants to understand how, say, the text transform resulted in a
particular featurization will have to inspect the key values to get that
understanding.

## Keys are not Numbers

As an actual CLR data type, key values are stored as some form of unsigned
integer (most commonly `uint`). The most common confusion that arises from
this is to ascribe too much importance to the fact that it is a `uint`, and
think these are somehow just numbers. This is incorrect.

For keys, the concept of order and difference has no inherent, real meaning as
it does for numbers, or at least, the meaning is different and highly domain
dependent. Consider a numeric `U4` type, with values `0`, `1`, and `2`. The
difference between `0` and `1` is `1`, and the difference between `1` and `2`
is `1`, because they're numbers. Very well: now consider that you train a term
transform over the input tokens `apple`, `pear`, and `orange`: this will also
map to the keys logically represented as the numbers `0`, `1`, and `2`
respectively. Yet for a key, is the difference between keys `0` and `1`, `1`?
No, the difference is `0` maps to `apple` and `1` to `pear`. Also order
doesn't mean one key is somehow "larger," it just means we saw one before
another -- or something else, if sorting by value happened to be selected.

Also: ML.NET's vectors can be sparse. Implicit entries in a sparse vector are
assumed to have the `default` value for that type -- that is, implicit values
for numeric types will be zero. But what would the implicit default value for
a key value be? Take the `apple`, `pear`, and `orange` example above: it would
be inappropriate for the default value to be `0`, because that would mean the
implicit result is `apple`, which is hardly appropriate. The only really
appropriate "default" choice is that the value is unknown, that is, missing.

An implication of this is that there is a distinction between the logical
value of a key-value, and the actual physical value of the value in the
underlying type. This will be covered more later.

## As an Enumeration of a Set: `KeyValues` Metadata

While keys can be used for many purposes, they are often used to enumerate
items from some underlying set. In order to map keys back to this original
set, many transforms producing key values will also produce `KeyValues`
metadata associated with that output column.

Valid `KeyValues` metadata is a vector of length equal to the count of the key
type of the column. This can be of varying types: it is often text, but does
not need to be. For example, a `term` transform applied to a column would have
`KeyValues` metadata of item type equal to the item type of the input data.

How this metadata is used downstream depends on the purposes of whoever is
consuming it, but common uses are: in multiclass classification, determining
the human readable class names, or, if used in featurization, determining the
names of the features.

Note that `KeyValues` metadata is optional, and sometimes is not even
sensible. For example, consider a clustering algorithm: the prediction for an
example would be the key of the cluster it was assigned to. So, if there were
five clusters, then the prediction would indicate the cluster by `U4<0-4>`.
Yet these clusters were found by the algorithm itself, and they have no
natural descriptions.

## Actual Implementation

This may be of use only to writers or extenders of ML.NET, or users of our
API. How key values are presented *logically* to users of ML.NET is distinct
from how they are actually stored *physically* in memory, both in ML.NET
source and through the API. For key values:

* All key values are stored in unsigned integers.
* The missing key value is always stored as `0`. See the note above about the
  default value, to see why this must be so.
* Valid non-missing key values are stored from `1` onwards, irrespective of
  what the key type claims the minimum value is.

So when, in the prior example, the term transform would map `apple`, `pear`,
and `orange` seemingly to `0`, `1`, and `2`, values of `U4<0-2>`, in reality,
if you were to fire up the debugger you would see that they were stored as
`1`, `2`, and `3`, with unrecognized values being mapped to the "default"
missing value of `0`.
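
A minimal sketch of this physical-to-logical mapping, assuming a key type
with a given logical minimum (the helper below is illustrative, not part of
ML.NET's API):

```csharp
// Sketch: physical 0 means missing; physical p >= 1 means logical (min + p - 1).
static bool TryGetLogicalKey(uint physical, uint min, out uint logical)
{
    if (physical == 0)
    {
        logical = 0;  // missing; no logical value
        return false;
    }
    logical = min + physical - 1;
    return true;
}

// Term transform example: U4<0-2> stores apple/pear/orange physically as
// 1, 2, 3, i.e., logically 0, 1, 2. For the hypothetical U1<4000-4002> type
// discussed below, physical (byte)1 means logical 4000.
```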

Nevertheless, we almost never talk about this, no more than we would talk
about our "strings" really being implemented as string slices: this is purely
an implementation detail, relevant only to people working with key values at
the source level. To a regular non-API user of ML.NET, key values appear
*externally* to be simply values, just as strings appear to be simply strings,
and so forth.

There is another implication: a hypothetical type `U1<4000-4002>` is actually
a sensible type in this scheme. The `U1` indicates that it is stored in one
byte,
which would on first glance seem to conflict with values like `4000`, but
remember that the first valid key-value is stored as `1`, and we've identified
the valid range as spanning the three values 4000 through 4002. That is,
`4000` would be represented physically as `1`.

The reality cannot be seen by any conventional means I am aware of, save for
viewing ML.NET's workings in the debugger or using the API and inspecting
these raw values yourself: that `4000` you would see is really stored as the
`byte` `1`, `4001` as `2`, `4002` as `3`, and a missing value stored as `0`.