Commit

Justin semantic comments

TomFinley committed May 17, 2018
1 parent 9ace94b commit 8f1da5c
Showing 3 changed files with 73 additions and 56 deletions.
42 changes: 17 additions & 25 deletions docs/code/IDataViewDesignPrinciples.md
@@ -64,10 +64,9 @@ The IDataView design fulfills the following design requirements:
kinds, and supports composing multiple primitive components to achieve
higher-level semantics. See [here](#components).

* **Open component system**: While the AzureML Algorithms team has developed,
and continues to develop, a large library of IDataView components,
additional components that interoperate with these may be implemented in
other code bases. See [here](#components).
* **Open component system**: While the ML.NET codebase has a large and
growing library of IDataView components, additional components that
interoperate with these may be implemented in other code bases. See
[here](#components).

* **Cursoring**: The rows of a view are accessed sequentially via a row
cursor. Multiple cursors can be active on the same view, both sequentially
@@ -136,11 +135,8 @@ The IDataView system design does *not* include the following:

* **Data file formats**: The IDataView system does not dictate storage or
transport formats. It *does* include interfaces for loader and saver
components. The AzureML Algorithms team has implemented loaders and savers
for some binary and text file formats, but additional loaders and savers can
(and will) be implemented. In particular, implementing a loader from XDF
will be straightforward. Implementing a saver to XDF will likely require the
XDF format to be extended to support vector-valued columns.
components. The ML.NET codebase has implementations of loaders and savers for
some binary and text file formats.

* **Multi-node computation over multiple data partitions**: The IDataView
design is focused on single node computation. We expect that in multi-node
@@ -197,16 +193,16 @@ experience and performance.

Machine learning and advanced analytics applications often involve high-
dimensional data. For example, a common technique for learning from text,
known as bag-of-words, represents each word in the text as a numeric feature
containing the number of occurrences of that word. Another technique is
indicator or one-hot encoding of categorical values, where, for example, a
text-valued column containing a person's last name is expanded to a set of
features, one for each possible name (Tesla, Lincoln, Gandhi, Zhang, etc.),
with a value of one for the feature corresponding to the name, and the
remaining features having value zero. Variations of these techniques use
hashing in place of dictionary lookup. With hashing, it is common to use 20
bits or more for the hash value, producing $2^20$ (about a million) features
or more.
known as [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model),
represents each word in the text as a numeric feature containing the number of
occurrences of that word. Another technique is indicator or one-hot encoding
of categorical values, where, for example, a text-valued column containing a
person's last name is expanded to a set of features, one for each possible
name (Tesla, Lincoln, Gandhi, Zhang, etc.), with a value of one for the
feature corresponding to the name, and the remaining features having value
zero. Variations of these techniques use hashing in place of dictionary
lookup. With hashing, it is common to use 20 bits or more for the hash value,
producing `2^20` (about a million) features or more.
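To make the two encodings concrete, here is a minimal sketch (plain Python as a language-agnostic stand-in, not ML.NET API; `one_hot`, `hashed_index`, and the toy vocabulary are illustrative names only):

```python
# Sketch: dictionary-based one-hot encoding vs. the hashing alternative.
# All names and sizes here are illustrative, not part of ML.NET.

def one_hot(value, vocabulary):
    """Expand a categorical value into indicator features, one per known value."""
    vec = [0] * len(vocabulary)
    vec[vocabulary[value]] = 1
    return vec

def hashed_index(value, bits=20):
    """Map a value to one of 2^bits feature slots, with no dictionary needed."""
    return hash(value) % (1 << bits)

vocab = {"Tesla": 0, "Lincoln": 1, "Gandhi": 2, "Zhang": 3}
print(one_hot("Gandhi", vocab))               # [0, 0, 1, 0]
print(0 <= hashed_index("Zhang") < (1 << 20))  # True
```

Note the trade-off the hashing variant makes: no vocabulary has to be built or stored, at the cost of a fixed 2^20-slot feature space and possible collisions.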

These techniques typically generate an enormous number of features.
Representing each feature as an individual column is far from ideal, both from
@@ -225,8 +221,8 @@ corresponding vector values may have any length. A tokenization transform,
that maps a text value to the sequence of individual terms in that text,
naturally produces variable-length vectors of text. Then, a hashing ngram
transform may map the variable-length vectors of text to a bag-of-ngrams
representation, which naturally produces numeric vectors of length $2^k$, where
$k$ is the number of bits used in the hash function.
representation, which naturally produces numeric vectors of length `2^k`,
where `k` is the number of bits used in the hash function.
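The tokenize-then-hash composition described above can be sketched as follows (a hypothetical Python stand-in; the real components are ML.NET transforms, and `tokenize`/`hashed_ngram_bag` are invented names):

```python
# Sketch: text -> variable-length token vector -> fixed-length
# bag-of-ngrams of length 2^k via hashing. Illustrative only.

def tokenize(text):
    """Variable-length output: one text value per term."""
    return text.lower().split()

def hashed_ngram_bag(tokens, n=2, bits=4):
    """Fixed-length output: counts hashed into 2^k slots (k = bits)."""
    vec = [0] * (1 << bits)
    for i in range(len(tokens) - n + 1):
        ngram = " ".join(tokens[i:i + n])
        vec[hash(ngram) % (1 << bits)] += 1
    return vec

tokens = tokenize("the quick brown fox")  # 4 tokens, length varies per row
bag = hashed_ngram_bag(tokens)            # always length 2^4 = 16
print(len(bag), sum(bag))                 # 16 3  (three bigrams counted)
```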

### Key Types

@@ -409,10 +405,6 @@ needed, the operating system disk cache transparently enhances performance.
Further, when the data is known to fit in memory, caching, as described above,
provides even better performance.

Note: Implementing a loader for XDF files should be straightforward. To
implement a saver, the XDF format will likely need to be extended to support
vector-valued columns, and perhaps metadata encoding.

### Randomization

Some training algorithms benefit from randomizing the order of rows produced
84 changes: 55 additions & 29 deletions docs/code/IDataViewImplementation.md
@@ -73,7 +73,7 @@ result that if a pipeline was composed in some other fashion, there would be
some error.

The only thing you can really assume is that an `IDataView` behaves "sanely"
according to the contracts of the `IDataView` interface, so that future TLC
according to the contracts of the `IDataView` interface, so that future ML.NET
developers can form some reasonable expectations of how your code behaves, and
also have a prayer of knowing how to maintain the code. It is hard enough to
write software correctly even when the code you're working with actually does
@@ -166,8 +166,8 @@ has the following problems:
* **Every** call had to verify that the column was active,
* **Every** call had to verify that `TValue` was of the right type,
* When these were part of, say, a transform in a chain (as they often are,
considering how common transforms are used by TLC's users) each access would
be accompanied by a virtual method call to the upstream cursor's
considering how commonly transforms are used by ML.NET's users) each access
would be accompanied by a virtual method call to the upstream cursor's
`GetColumnValue`.
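To make that overhead concrete, here is a hypothetical sketch (Python closures standing in for the C# delegates; `Row`, `get_column_value`, and `get_getter` are illustrative names, not ML.NET API) of per-call checked access versus a validate-once getter:

```python
# Sketch: validate on *every* access vs. validate once and hand back a
# cached accessor ("getter"). Hypothetical names, not ML.NET API.

class Row:
    def __init__(self, values, active):
        self._values, self._active = values, active

    def get_column_value(self, col):
        # These checks are repeated on every single call.
        if col not in self._active:
            raise ValueError(f"column {col!r} not active")
        return self._values[col]

    def get_getter(self, col):
        # The same checks, done exactly once; the closure then skips them.
        if col not in self._active:
            raise ValueError(f"column {col!r} not active")
        values = self._values
        return lambda: values[col]

row = Row({"Features": 1.5}, active={"Features"})
getter = row.get_getter("Features")
print(getter())  # 1.5, with no per-call validation
```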

In contrast, consider the situation with these getter delegates. The
@@ -211,14 +211,14 @@ consuming different data from the contemporaneous cursor? There are many
examples of this throughout the codebase.

Nevertheless: in very specific circumstances we have relaxed this. For
example, the TLC API serves up corrupt `IDataView` implementations that have
their underlying data change, since reconstituting a data pipeline on fresh
data is at the present moment too resource intensive. Nonetheless, this is
wrong: for example, the `TrainingCursorBase` and related subclasses rely upon
the data not changing. Since, however, that is used for *training* and the
prediction engines of the API as used for *scoring*, we accept these. However
this is not, strictly speaking, correct, and this sort of corruption of
`IDataView` should only be considered as a last resort, and only when some
example, some ML.NET API code serves up corrupt `IDataView` implementations
that have their underlying data change, since reconstituting a data pipeline
on fresh data is at the present moment too resource intensive. Nonetheless,
this is wrong: for example, the `TrainingCursorBase` and related subclasses
rely upon the data not changing. Since, however, that is used for *training*
and the prediction engines of the API are used for *scoring*, we accept these.
However this is not, strictly speaking, correct, and this sort of corruption
of `IDataView` should only be considered as a last resort, and only when some
great good can be accomplished through this. We certainly did not accept this
corruption lightly!

@@ -265,19 +265,19 @@ same data view.) So some rules:
## Versioning

This requirement for consistency of a data model often has implications across
versions of TLC, and our requirements for data model backwards compatibility.
As time has passed, we often feel like it would make sense if a transform
behaved *differently*, that is, if it organized or calculated its output in a
different way than it currently does. For example, suppose we wanted to switch
the hash transform to something a bit more efficient than murmur hashes, for
example. If we did so, presumably the same input values would map to different
outputs. We are free to do so, of course, yet: when we deserialize a hash
transform from before we made this change, that hash transform should continue
to output values as it did, before we made that change. (This, of course,
assuming that the transform was released as part of a "blessed" non-preview
point release of TLC. We can, and have, broken backwards compatibility for
something that has not yet been incorporated in any sort of blessed release,
though we prefer to not.)
versions of ML.NET, and our requirements for data model backwards
compatibility. As time has passed, we often feel like it would make sense if a
transform behaved *differently*, that is, if it organized or calculated its
output in a different way than it currently does. For example, suppose we
wanted to switch the hash transform to something a bit more efficient than
murmur hashes, for example. If we did so, presumably the same input values
would map to different outputs. We are free to do so, of course, yet: when we
deserialize a hash transform from before we made this change, that hash
transform should continue to output values as it did, before we made that
change. (This, of course, assumes that the transform was released as part of
a "blessed" non-preview point release of ML.NET. We can, and have, broken
backwards compatibility for something that has not yet been incorporated in
any sort of blessed release, though we prefer to not.)
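One common way to honor that requirement is to serialize a version marker and dispatch on it at load time, so old data models keep producing their old outputs. The sketch below is hypothetical (Python, invented names and hash functions; not how ML.NET's hash transform is actually structured):

```python
# Sketch: a model saved with version 1 must keep hashing the v1 way,
# even after v2 becomes the default. Hypothetical, not ML.NET code.

def hash_v1(value, bits):
    """The "old" hash function, kept alive for old data models."""
    return sum(ord(c) for c in value) % (1 << bits)

def hash_v2(value, bits):
    """The "new, more efficient" hash function used by new models."""
    return (hash(value) * 2654435761) % (1 << bits)

HASH_VERSIONS = {1: hash_v1, 2: hash_v2}

def load_hash_transform(model):
    # Dispatch on the version recorded in the serialized data model.
    fn = HASH_VERSIONS[model["version"]]
    bits = model["bits"]
    return lambda value: fn(value, bits)

old = load_hash_transform({"version": 1, "bits": 8})
print(old("abc"))  # (97 + 98 + 99) % 256 = 38, forever, by contract
```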

## What is Not Functionally Identical

@@ -334,10 +334,9 @@ aside (which we can hardly help), we expect the models to be the same.

# On Loaders, Data Models, and Empty `IMultiStreamSource`s

When you run TLC you have the option of specifying not only *one* data input,
but any number of data input files, including zero. :) This is how [the
examples here](../public/command/DataCommands.md#look-ma-no-files) work. But
there's also a more general principle at work here: when deserializing a data
When you create a loader you have the option of specifying not only *one* data
input, but any number of data input files, including zero. But there's also a
more general principle at work here with zero files: when deserializing a data
loader from a data model with an `IMultiStreamSource` with `Count == 0` (e.g.,
as would be constructed with `new MultiFileSource(null)`), we have a protocol
that *every* `IDataLoader` should work in that circumstance, and merely be a
@@ -472,7 +471,34 @@ indication that this function will not move the cursor (in which case `IRow`
is helpful), or that will not access any values (in which case `ICursor` is
helpful).

# Metadata
# Schema

The schema contains information about the columns. As we see in [the design
principles](IDataViewDesignPrinciples.md), each column has a name, an index, a
data type, and optional metadata.

While *programmatic* accesses to an `IDataView` are by column index, from a
user's perspective columns are identified by name; most training algorithms
conceptually train on the `Features` column (under default settings). For this
reason nearly all usages of an `IDataView` will be prefixed with a call to the
schema's `TryGetColumnIndex`.

Regarding name hiding, the principles mention that when multiple columns have
the same name, all but one are "hidden." The convention all implementations of
`ISchema` obey is that the column with the *largest* index remains visible.
Note however that this is merely convention, not part of the definition of
`ISchema`.

Implementations of `TryGetColumnIndex` should be O(1), that is, practically,
this mapping ought to be backed with a dictionary in most cases. (There are
obvious exceptions, like `LineLoader`, which produces exactly one column.
There, a simple equality test suffices.)
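A dictionary built in index order also yields the largest-index name-hiding convention for free, since later entries overwrite earlier ones. A minimal sketch (Python, with invented names standing in for the `ISchema` members):

```python
# Sketch: O(1) name -> index lookup where the *largest* index wins on a
# name collision. Hypothetical stand-in for an ISchema implementation.

class Schema:
    def __init__(self, names):
        self._names = names
        # Built in index order, so a duplicated name keeps the largest index.
        self._index = {name: i for i, name in enumerate(names)}

    def try_get_column_index(self, name):
        i = self._index.get(name)
        return (i is not None), i

schema = Schema(["Label", "Features", "Features"])
found, idx = schema.try_get_column_index("Features")
print(found, idx)  # True 2 -- the copy at index 1 is "hidden"
```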

It is best if `GetColumnType` returns the *same* object every time. That is,
things like key-types and vector-types, when returned, should not be created
in the function itself (thereby creating a new object every time), but rather
stored somewhere and returned.
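The point about returning the *same* type object can be sketched like so (illustrative Python; `VectorType` and `get_column_type` are stand-ins, not ML.NET's actual type classes):

```python
# Sketch: construct column types once, store them, and return the cached
# objects, rather than building a fresh type object on every call.

class VectorType:
    def __init__(self, item_kind, size):
        self.item_kind, self.size = item_kind, size

class Schema:
    def __init__(self):
        # Created once up front; get_column_type never allocates.
        self._types = [VectorType("float", 1 << 20)]

    def get_column_type(self, col):
        return self._types[col]

schema = Schema()
# Repeated calls hand back the identical object, not merely an equal one.
print(schema.get_column_type(0) is schema.get_column_type(0))  # True
```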

## Metadata

Since metadata is *optional*, one is not obligated to necessarily produce it,
or conform to any particular schemas for any particular kinds (beyond, say,
3 changes: 1 addition & 2 deletions docs/code/IDataViewTypeSystem.md
@@ -16,8 +16,7 @@ the specific interface is written using fixed pitch font as `IDataView`.

IDataView is the data pipeline machinery for ML.NET. The ML.NET codebase has
an extensive library of IDataView related components (loaders, transforms,
savers, trainers, predictors, etc.). The team is actively working on many
more.
savers, trainers, predictors, etc.). More are being worked on.

The name IDataView was inspired from the database world, where the term table
typically indicates a mutable body of data, while a view is the result of a

0 comments on commit 8f1da5c
