Migration of first IDataView docs #173

Merged · 5 commits · May 22, 2018
471 changes: 471 additions & 0 deletions docs/code/IDataViewDesignPrinciples.md

Large diffs are not rendered by default.

518 changes: 518 additions & 0 deletions docs/code/IDataViewImplementation.md

Large diffs are not rendered by default.

843 changes: 843 additions & 0 deletions docs/code/IDataViewTypeSystem.md

Large diffs are not rendered by default.

191 changes: 191 additions & 0 deletions docs/code/IdvFileFormat.md
@@ -0,0 +1,191 @@
# IDV File Format

This document describes ML.NET's Binary dataview file format, version 1.1.1.5
written by the `BinarySaver` and `BinaryLoader` classes, commonly known as the
`.idv` format.

## Goal of the Format

A dataview is a collection of columns, over some number of rows. (Do not
confuse columns with features. Columns can be and often are vector valued,
and it is expected, though not required, that all the features will commonly
be together in one vector-valued column.)

The actual values are stored in blocks. A block holds values for a single
column across multiple rows. The block format is dictated by a codec. There is
a table of contents and a lookup table to facilitate quasi-random access to
particular blocks. (Quasi in the sense that you can only seek to a block, not
to a particular row within a block.)

## General Data Format

Before we discuss the format itself we will establish some conventions on how
individual scalar values, strings, and other data is serialized. All basic
pieces of data (e.g., a single number, or a single string) are encoded in ways
reflecting the semantics of the .NET `BinaryWriter` class, those semantics
being:

* All numbers are stored as little-endian, using their natural fixed-length
  binary encoding.

* Strings are stored using an unsigned
[LEB128](https://en.wikipedia.org/wiki/LEB128) number describing the number
of bytes, followed by that many bytes containing the UTF-8 encoded string.

A note about this: LEB128 is a simple encoding for arbitrarily large
integers. Each 8-bit byte follows this convention: the most significant bit
is 0 if and only if this is the last byte of the LEB128 encoding, and the
remaining 7 bits are part of the number being encoded. The bytes are stored
little-endian, that is, the first byte holds the 7 least significant bits, the
second byte (if applicable) holds the next 7 least significant bits, etc., and
the last byte holds the 7 most significant bits. LEB128 is used in one or two
places in this format. (I might tend to prefer LEB128 in places where we are
writing values that, on balance, we expect to be relatively small, and only
where there is no potential benefit from random access to the associated
stream, since LEB128 is incompatible with random access. However, this is not
formulated into anything approaching a definite policy.)
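
To make the convention concrete, here is a minimal sketch of unsigned LEB128
encoding and decoding in C#. The class and method names are illustrative, not
part of ML.NET; this is the same convention `BinaryWriter` uses internally for
its string length prefixes.

```csharp
using System.IO;

static class Leb128
{
    // Writes value as unsigned LEB128: 7 bits per byte, least significant
    // group first, with the high bit set on every byte except the last.
    public static void WriteLeb128(Stream stream, ulong value)
    {
        while (value >= 0x80)
        {
            stream.WriteByte((byte)(value | 0x80));
            value >>= 7;
        }
        stream.WriteByte((byte)value);
    }

    // Reads an unsigned LEB128 number: accumulate 7 bits per byte until a
    // byte with the high bit clear marks the end of the encoding.
    public static ulong ReadLeb128(Stream stream)
    {
        ulong result = 0;
        int shift = 0;
        while (true)
        {
            int b = stream.ReadByte();
            if (b < 0)
                throw new EndOfStreamException();
            result |= (ulong)(b & 0x7F) << shift;
            if ((b & 0x80) == 0)
                return result;
            shift += 7;
        }
    }
}
```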

## Header

Every binary instances stream has a header composed of 256 bytes, at the start
of the stream. Not all bytes are used. Those bytes that are not explicitly
used have undefined content, and can have anything in them. We strongly
encourage writers of this format to insert obscene messages in this dead
space. The content is defined as follows (the offsets being the byte position
at which that field starts).

Offsets | Type | Name and Description
--------|-------|---------------------
0 | ulong | **Signature**: The magic number of this file.
8 | ulong | **Version**: Indicates the version of the data file.
16 | ulong | **CompatibleVersion**: Indicates the minimum reader version that can interpret this file, possibly with some data loss.
24 | long | **TableOfContentsOffset**: The offset to the column table of contents structure.
32 | long | **TailOffset**: The eight-byte tail signature starts at this offset. So, the entire dataset stream should be considered to have byte length of eight plus this value.
40 | long | **RowCount**: The number of rows in this data file.
48 | int | **ColumnCount**: The number of columns in this data file.

Notes on these:

* The signature of this file is `0x00425644004C4D43`, which is, when written
little-endian to a file, `CML DVB ` with null characters in the place of
spaces. These letters are intended to suggest "CloudML DataView Binary."

* The tail signature is the byte-reversed version of this, that is,
`0x434D4C0044564200`.

* Versions are encoded as four 16-bit unsigned numbers packed into a single
  ulong, with higher order bits being a more major version. The first
  supported version of the format is 1.1.1.4, that is, `0x0001000100010004`.
  (Versions prior to 1.1.1.4 did exist, but were not released, so we do not
  support them, though we do describe them in this document for the sake of
  completeness.)
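
As a sketch of how a reader might consume these fields, using the
`BinaryReader` semantics established earlier (the `Header` type below is
illustrative, not ML.NET's actual implementation):

```csharp
using System.IO;

class Header
{
    public const ulong SignatureValue = 0x00425644004C4D43UL;

    public ulong Signature, Version, CompatibleVersion;
    public long TableOfContentsOffset, TailOffset, RowCount;
    public int ColumnCount;

    // Reads the defined fields at the start of the 256-byte header.
    public static Header Read(BinaryReader reader)
    {
        var h = new Header
        {
            Signature = reader.ReadUInt64(),
            Version = reader.ReadUInt64(),
            CompatibleVersion = reader.ReadUInt64(),
            TableOfContentsOffset = reader.ReadInt64(),
            TailOffset = reader.ReadInt64(),
            RowCount = reader.ReadInt64(),
            ColumnCount = reader.ReadInt32()
        };
        if (h.Signature != SignatureValue)
            throw new InvalidDataException("Not an IDV file.");
        return h;
    }

    // Renders a packed version like 0x0001000100010004 as "1.1.1.4".
    public static string VersionToString(ulong v) =>
        $"{(v >> 48) & 0xFFFF}.{(v >> 32) & 0xFFFF}.{(v >> 16) & 0xFFFF}.{v & 0xFFFF}";
}
```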

## Table of Contents Format

The table of contents is a sequence of packed entries, with as many entries as
there are columns. The version field here indicates the versions in which that
entry is written: ≥ indicates the field occurs in that version and all later
versions, = indicates the field occurs only in that version.

Description | Entry Type | Version
------------|------------|--------
Column name | string | ≥1.1.1.1
Codec loadname | string | ≥1.1.1.1
Codec parameterization length | LEB128 integer | ≥1.1.1.1
Codec parameterization, which must have precisely the length indicated above | arbitrary, but with specified length | ≥1.1.1.1
Compression kind | CompressionKind (byte) | ≥1.1.1.1
Rows per block in this column | LEB128 integer | ≥1.1.1.1
Lookup table offset | long | ≥1.1.1.1
Slot names offset, or 0 if this column has no slot names (when reading a 1.1.1.2 file, behave as if there are no slot names, i.e., as if this had value 0) | long | =1.1.1.3
Slot names byte size (present only if slot names offset is greater than 0) | long | =1.1.1.3
Slot names count (present only if slot names offset is greater than 0) | int | =1.1.1.3
Metadata table of contents offset, or 0 if there is no metadata (1.1.1.4) | long | ≥1.1.1.4

For those working in the ML.NET codebase: the three `Codec` fields are handled
by the `CodecFactory.WriteCodec`/`TryReadCodec` methods, with the definition
stream positioned at the start of the codec loadname beforehand, and at the
end of the codec parameterization afterward, in both the success and failure
cases.
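
As a sketch, reading a single entry from a version ≥ 1.1.1.4 file might look
like the following, reusing the `Leb128` helper from above (illustrative only;
a real reader must also handle the version-dependent slot-name fields):

```csharp
// Sketch: one table-of-contents entry, assuming format version >= 1.1.1.4.
string columnName = reader.ReadString();       // LEB128 length + UTF-8 bytes
string codecLoadName = reader.ReadString();
ulong codecParamLength = Leb128.ReadLeb128(reader.BaseStream);
byte[] codecParameterization = reader.ReadBytes((int)codecParamLength);
byte compressionKind = reader.ReadByte();      // see the enum below
ulong rowsPerBlock = Leb128.ReadLeb128(reader.BaseStream);
long lookupTableOffset = reader.ReadInt64();
long metadataTocOffset = reader.ReadInt64();   // 0 if the column has no metadata
```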

The `CompressionKind` enum is described below; it identifies the compression
algorithm used to compress blocks.

### Compression Kind

The enum for compression kind is one byte, and follows this scheme:

Compression Kind | Code
---------------------------------------------------------------|-----
None | 0
DEFLATE (i.e., [RFC1951](http://www.ietf.org/rfc/rfc1951.txt)) | 1
zlib (i.e., [RFC1950](http://www.ietf.org/rfc/rfc1950.txt)) | 2

None means no compression. DEFLATE is the default scheme. There is a tendency
to conflate zlib and DEFLATE, so to be clear: zlib can be (somewhat inexactly)
considered a wrapped version of DEFLATE, but it is still a distinct (but
closely related) format. However, both are implemented by the zlib library,
which is probably the source of the confusion.
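
In C#, this maps naturally onto a one-byte enum. The following is a sketch;
the actual type in the ML.NET codebase may be named or arranged differently:

```csharp
// One-byte codes for the compression scheme applied to blocks.
public enum CompressionKind : byte
{
    None = 0,    // block bytes stored uncompressed
    Deflate = 1, // raw DEFLATE stream (RFC 1951); the default
    Zlib = 2     // zlib-wrapped DEFLATE (RFC 1950)
}
```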

## Metadata Table of Contents Format

The metadata table of contents begins with a LEB128 integer giving the number
of entries. (This should be a positive value, since if a column has no
metadata the expectation is that the offset of the metadata TOC will be stored
as 0.) Following this are that many packed entries. Each entry is somewhat
akin to a column table of contents entry, with some simplifications reflecting
that there will be exactly one "block" with one item.

Description | Entry Type
-------------------------------------------------------|------------
Metadata kind | string
Codec loadname | string
Codec parameterization length | LEB128 integer
Codec parameterization, which must have precisely the length indicated above | arbitrary, but with specified length
Compression kind | CompressionKind (byte)
Offset of the block where the metadata item is written | long
Byte length of the block | LEB128 integer

The "block" written is written in exactly same format as the main content
blocks. This will be very slightly inefficient as that scheme is sometimes
written to accommodate many entries, but I don't expect that to be much of a
burden.
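
A sketch of reading such a metadata table of contents, reusing the helpers
from earlier (illustrative only, not ML.NET's actual code):

```csharp
// Sketch: reading a metadata table of contents.
ulong entryCount = Leb128.ReadLeb128(reader.BaseStream);
for (ulong i = 0; i < entryCount; i++)
{
    string metadataKind = reader.ReadString();
    string codecLoadName = reader.ReadString();
    ulong paramLength = Leb128.ReadLeb128(reader.BaseStream);
    byte[] codecParameterization = reader.ReadBytes((int)paramLength);
    byte compressionKind = reader.ReadByte();
    long blockOffset = reader.ReadInt64();   // where the metadata block lives
    ulong blockByteLength = Leb128.ReadLeb128(reader.BaseStream);
}
```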

## Lookup Table Format

Each table of contents entry is associated with a lookup table starting at the
indicated lookup table offset. It is written as packed binary, with each
lookup entry consisting of 16 bytes. So in all, the lookup table takes 16
bytes times the total number of blocks for this column.

Description | Entry Type
----------------------------------------------------------|-----------
Block offset, position in the file where the block starts | long
Block length, its size in bytes in the file | int
Uncompressed block length, its size in bytes if the block bytes were decompressed according to the column's compression codec | int
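
Reading one such entry under the `BinaryReader` conventions is straightforward
(a sketch; the variable names are illustrative):

```csharp
// Sketch: one 16-byte lookup table entry (8 + 4 + 4 bytes).
long blockOffset = reader.ReadInt64();        // where the block starts in the file
int blockLength = reader.ReadInt32();         // compressed size in bytes
int decompressedLength = reader.ReadInt32();  // size after decompression
```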

## Slot Names

If slot names are stored, they are stored as integer index/string pairs. As
many pairs are stored as the slot names count in the table of contents entry
indicates. Note that this scheme appeared only in version 1.1.1.3. With
1.1.1.4 and later, slot names were just considered yet another piece of
metadata.

Description | Entry Type
------------------|-----------
Index of the slot | int
The slot name | string

## Block Format

Column values are grouped into blocks, with each block holding the binary
encoded values for one particular column across a range of rows. So for
example, if the column's table of contents entry describes it as having 1000
rows per block, the first block will contain the values of that column for
rows 0 through 999, the second block rows 1000 through 1999, etc., with all
blocks containing the same number of rows, except the last block, which will
contain fewer items (unless the number of rows just so happens to be a
multiple of the block size).

Each block is a possibly compressed sequence of bytes, compressed according
to the compression kind field in the table of contents. It begins and ends at
the offsets indicated in its lookup table entry (or, for a metadata block, in
its metadata table of contents entry). The uncompressed bytes will be stored
in the format described by the column's codec.
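
Putting the pieces together, a reader locating the value of a column at a
particular row would do roughly the following arithmetic (a sketch under the
assumptions above, not ML.NET's actual code):

```csharp
// Sketch: locate the block holding a given row of one column.
long rowsPerBlock = 1000;                    // from the column's TOC entry
long row = 123456;
long blockIndex = row / rowsPerBlock;        // which lookup table entry to use
long indexWithinBlock = row % rowsPerBlock;  // which value inside that block
// The lookup table entry at blockIndex gives the block's file offset and
// lengths; decompress per the column's CompressionKind, then decode the
// uncompressed bytes with the column's codec to reach the desired value.
```
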
149 changes: 149 additions & 0 deletions docs/code/KeyValues.md
@@ -0,0 +1,149 @@
# Key Values

Most commonly, key values are used to encode items where it is convenient or
efficient to represent values using numbers, but you want to maintain the
logical "idea" that these numbers are keys indexing some underlying, implicit
set of values, in a way more explicit than simply mapping to a number would
allow.

A more formal description of key values and types is
[here](IDataViewTypeSystem.md#key-types). *This* document's motivation is less
to describe what key types and values are, and more to describe why key types
are necessary and helpful things to have. Necessarily, this document is more
anecdotal in its descriptions, the better to motivate its content.

Let's take a few examples of transforms that produce keys:

* The `TermTransform` forms a dictionary of unique observed values to a key.
The key type's count indicates the number of items in the set, and through
the `KeyValue` metadata "remembers" what each key is representing.

* The `HashTransform` performs a hash of input values, and produces a key
  value with count equal to the range of the hash function: if a b-bit hash
  is used, the count will be 2ᵇ.

* The `CharTokenizeTransform` will take input strings and produce key values
representing the characters observed in the string.

## Keys as Intermediate Values

Explicitly invoking transforms that produce key values, and using those key
values, is sometimes helpful. However, given that most trainers expect the
feature vector to be a vector of floating point values and *not* keys, in
typical usage keys mostly serve as some sort of intermediate value on the way
to that final feature vector. (Unless, say, one is preparing labels for a
multiclass learner.)

So why not go directly to the feature vector, and forget this key stuff?
Actually, to take text as the canonical example, we used to. However, by
structuring the transforms from, say, text to key to vector, rather than text
to vector *directly*, we are able to simplify a lot of code on the
implementation side, which is both less for us to maintain, and also gives
users consistency in behavior.

So for example, the `CharTokenize` above might appear to be a strange choice:
*why* represent characters as keys? The reason is that the n-gram transform is
written to ingest keys, not text, and so we can use the same transform for
both the n-gram featurization of words, as well as n-char grams.

---

**@justinormont** (Contributor), May 17, 2018:

We have hyphenated n-grams here (3x w/ n-char). My liking is "ngram" and "chargram". #Pending

**@TomFinley** (Contributor, Author), May 17, 2018:

So, could you explain this? Because this is a C# library its identifiers can't be n-gram, so we call it NGram, but everywhere I look the actual prose usage of the term is n-gram, including back when I was a wee little grad student. I see paper titles from ICML as "N-gram" or "n-gram"; I don't see an "ngram." Unless the "cool kids" are doing something different nowadays, with their long hair and rock music?

**@justinormont** (Contributor), May 18, 2018:

I started an email poll for terminology. No conclusion currently, but there was feedback that we'll need to also (besides for documentation) define the terms when used in code:

* "Within source code, we should probably use nGram as a variable name, as it is technically two words, so we should use camel case."
* "I won't express an opinion but can you please also determine how it should be cased in PascalCasing (i.e., in types and methods). Specifically, should the G be capitalized – Ngram or NGram?" #Resolved

**@TomFinley** (Contributor, Author), May 22, 2018:

Hi Justin, leaving as n-gram. As near as I can see, when written in prose it's nearly universally referred to this way. Maybe in less formal writing someone might omit the hyphen, and I see that Google actually has a piece of software branded "NGram", but otherwise I do not see that your preferred usage is used at all. Thanks though!

---

Now, much of this complexity is hidden from the user: most users will just use
the `text` transform, select some options for n-grams and char-grams, and not
be aware of these internal invisible keys. Similarly, they will use the
categorical or categorical hash transforms, without knowing that internally
these are just the term or hash transform followed by a `KeyToVector`
transform. But keys are still there, and it would be impossible to really
understand ML.NET's featurization pipeline without understanding keys. Any
user who wants to understand how, say, the text transform resulted in a
particular featurization will have to inspect the key values to get that
understanding.

## Keys are not Numbers

As an actual CLR data type, key values are stored as some form of unsigned
integer (most commonly `uint`). The most common confusion that arises from
this is to ascribe too much importance to the fact that it is a `uint`, and
think these are somehow just numbers. This is incorrect.

For keys, the concept of order and difference has no inherent, real meaning as
it does for numbers, or at least, the meaning is different and highly domain
dependent. Consider a numeric `U4` type, with values `0`, `1`, and `2`. The
difference between `0` and `1` is `1`, and the difference between `1` and `2`
is `1`, because they're numbers. Very well: now consider that you train a term
transform over the input tokens `apple`, `pear`, and `orange`: this will also
map to the keys logically represented as the numbers `0`, `1`, and `2`
respectively. Yet for a key, is the difference between keys `0` and `1`, `1`?
No, the difference is `0` maps to `apple` and `1` to `pear`. Also order
doesn't mean one key is somehow "larger," it just means we saw one before
another -- or something else, if sorting by value happened to be selected.

Also: ML.NET's vectors can be sparse. Implicit entries in a sparse vector are
assumed to have the `default` value for that type -- that is, implicit values
for numeric types will be zero. But what would the implicit default value for
a key value be? Take the `apple`, `pear`, and `orange` example above: it would
be inappropriate for the default value to be `0`, because that would mean the
implicit result is `apple`, which is hardly appropriate. The only really
appropriate "default" choice is that the value is unknown, that is, missing.

An implication of this is that there is a distinction between the logical
value of a key-value, and the actual physical value of the value in the
underlying type. This will be covered more later.

## As an Enumeration of a Set: `KeyValues` Metadata

While keys can be used for many purposes, they are often used to enumerate
items from some underlying set. In order to map keys back to this original
set, many transforms producing key values will also produce `KeyValues`
metadata associated with that output column.

Valid `KeyValues` metadata is a vector of length equal to the count of the key
type of the column. This can be of varying types: it is often text, but does
not need to be. For example, a `term` transform applied to a column would have
`KeyValues` metadata of item type equal to the item type of the input data.

How this metadata is used downstream depends on the purposes of whoever is
consuming it, but common uses are: in multiclass classification, determining
the human readable class names, or, if used in featurization, determining the
names of the features.

Note that `KeyValues` metadata is optional, and sometimes is not even
sensible. For example, consider a clustering algorithm: the prediction for an
example would be the key of the cluster it was assigned to. So, if there were
five clusters, then the prediction would indicate the cluster by `U4<0-4>`.
Yet these clusters were found by the algorithm itself, and they have no
natural descriptions.

## Actual Implementation

This may be of use only to writers or extenders of ML.NET, or users of our
API. How key values are presented *logically* to users of ML.NET is distinct
from how they are actually stored *physically* in memory, both in ML.NET
source and through the API. For key values:

* All key values are stored in unsigned integers.
* The missing key value is always stored as `0`. See the note above about the
  default value, to see why this must be so.
* Valid non-missing key values are stored from `1` onwards, irrespective of
  what the key type claims the minimum value is.

So when, in the prior example, the term transform would map `apple`, `pear`,
and `orange` seemingly to `0`, `1`, and `2`, values of `U4<0-2>`, in reality,
if you were to fire up the debugger you would see that they were stored as
`1`, `2`, and `3`, with unrecognized values being mapped to the "default"
missing value of `0`.
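
A minimal sketch of this physical-to-logical mapping, assuming a key type
with a given logical minimum (the helper below is illustrative, not part of
ML.NET's API):

```csharp
// Sketch: physical 0 means missing; physical p >= 1 means logical (min + p - 1).
static bool TryGetLogicalKey(uint physical, uint min, out uint logical)
{
    if (physical == 0)
    {
        logical = 0;  // missing; no logical value
        return false;
    }
    logical = min + physical - 1;
    return true;
}

// Term transform example: U4<0-2> stores apple/pear/orange physically as
// 1, 2, 3, i.e., logically 0, 1, 2. For the hypothetical U1<4000-4002> type
// discussed below, physical (byte)1 means logical 4000.
```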

Nevertheless, we almost never talk about this, no more than we would talk
about our "strings" really being implemented as string slices: this is purely
an implementation detail, relevant only to people working with key values at
the source level. To a regular non-API user of ML.NET, key values appear
*externally* to be simply values, just as strings appear to be simply strings,
and so forth.

There is another implication: a hypothetical type `U1<4000-4002>` is actually
a sensible type in this scheme. The `U1` indicates that it is stored in one
byte,
which would on first glance seem to conflict with values like `4000`, but
remember that the first valid key-value is stored as `1`, and we've identified
the valid range as spanning the three values 4000 through 4002. That is,
`4000` would be represented physically as `1`.

The reality cannot be seen by any conventional means I am aware of, save for
viewing ML.NET's workings in the debugger or using the API and inspecting
these raw values yourself: that `4000` you would see is really stored as the
`byte` `1`, `4001` as `2`, `4002` as `3`, and a missing value stored as `0`.