Commit

Justin semantic comments

TomFinley committed May 17, 2018
1 parent 9ace94b commit 8f1da5c
Showing 3 changed files with 73 additions and 56 deletions.
42 changes: 17 additions & 25 deletions docs/code/IDataViewDesignPrinciples.md
@@ -64,10 +64,9 @@ The IDataView design fulfills the following design requirements:
kinds, and supports composing multiple primitive components to achieve
higher-level semantics. See [here](#components).

* **Open component system**: While the AzureML Algorithms team has developed,
and continues to develop, a large library of IDataView components,
additional components that interoperate with these may be implemented in
other code bases. See [here](#components).
* **Open component system**: While the ML.NET codebase has a large and
growing library of IDataView components, additional components that
interoperate with these may be implemented in other code bases. See
[here](#components).

* **Cursoring**: The rows of a view are accessed sequentially via a row
cursor. Multiple cursors can be active on the same view, both sequentially
@@ -136,11 +135,8 @@ The IDataView system design does *not* include the following:

* **Data file formats**: The IDataView system does not dictate storage or
transport formats. It *does* include interfaces for loader and saver
components. The AzureML Algorithms team has implemented loaders and savers
for some binary and text file formats, but additional loaders and savers can
(and will) be implemented. In particular, implementing a loader from XDF
will be straightforward. Implementing a saver to XDF will likely require the
XDF format to be extended to support vector-valued columns.
components. The ML.NET codebase has implementations of loaders and savers for
some binary and text file formats.

* **Multi-node computation over multiple data partitions**: The IDataView
design is focused on single node computation. We expect that in multi-node
@@ -197,16 +193,16 @@ experience and performance.

Machine learning and advanced analytics applications often involve high-
dimensional data. For example, a common technique for learning from text,
known as bag-of-words, represents each word in the text as a numeric feature
containing the number of occurrences of that word. Another technique is
indicator or one-hot encoding of categorical values, where, for example, a
text-valued column containing a person's last name is expanded to a set of
features, one for each possible name (Tesla, Lincoln, Gandhi, Zhang, etc.),
with a value of one for the feature corresponding to the name, and the
remaining features having value zero. Variations of these techniques use
hashing in place of dictionary lookup. With hashing, it is common to use 20
bits or more for the hash value, producing $2^20$ (about a million) features
or more.
known as [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model),
represents each word in the text as a numeric feature containing the number of
occurrences of that word. Another technique is indicator or one-hot encoding
of categorical values, where, for example, a text-valued column containing a
person's last name is expanded to a set of features, one for each possible
name (Tesla, Lincoln, Gandhi, Zhang, etc.), with a value of one for the
feature corresponding to the name, and the remaining features having value
zero. Variations of these techniques use hashing in place of dictionary
lookup. With hashing, it is common to use 20 bits or more for the hash value,
producing `2^20` (about a million) features or more.
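To make the two encodings concrete, here is a minimal sketch (plain Python as a language-agnostic stand-in, not ML.NET API; `one_hot`, `hashed_index`, and the toy vocabulary are illustrative names only):

```python
# Sketch: dictionary-based one-hot encoding vs. the hashing alternative.
# All names and sizes here are illustrative, not part of ML.NET.

def one_hot(value, vocabulary):
    """Expand a categorical value into indicator features, one per known value."""
    vec = [0] * len(vocabulary)
    vec[vocabulary[value]] = 1
    return vec

def hashed_index(value, bits=20):
    """Map a value to one of 2^bits feature slots, with no dictionary needed."""
    return hash(value) % (1 << bits)

vocab = {"Tesla": 0, "Lincoln": 1, "Gandhi": 2, "Zhang": 3}
print(one_hot("Gandhi", vocab))               # [0, 0, 1, 0]
print(0 <= hashed_index("Zhang") < (1 << 20))  # True
```

Note the trade-off the hashing variant makes: no vocabulary has to be built or stored, at the cost of a fixed 2^20-slot feature space and possible collisions.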

These techniques typically generate an enormous number of features.
Representing each feature as an individual column is far from ideal, both from
@@ -225,8 +221,8 @@ corresponding vector values may have any length. A tokenization transform,
that maps a text value to the sequence of individual terms in that text,
naturally produces variable-length vectors of text. Then, a hashing ngram
transform may map the variable-length vectors of text to a bag-of-ngrams
representation, which naturally produces numeric vectors of length $2^k$, where
$k$ is the number of bits used in the hash function.
representation, which naturally produces numeric vectors of length `2^k`,
where `k` is the number of bits used in the hash function.
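The tokenize-then-hash composition described above can be sketched as follows (a hypothetical Python stand-in; the real components are ML.NET transforms, and `tokenize`/`hashed_ngram_bag` are invented names):

```python
# Sketch: text -> variable-length token vector -> fixed-length
# bag-of-ngrams of length 2^k via hashing. Illustrative only.

def tokenize(text):
    """Variable-length output: one text value per term."""
    return text.lower().split()

def hashed_ngram_bag(tokens, n=2, bits=4):
    """Fixed-length output: counts hashed into 2^k slots (k = bits)."""
    vec = [0] * (1 << bits)
    for i in range(len(tokens) - n + 1):
        ngram = " ".join(tokens[i:i + n])
        vec[hash(ngram) % (1 << bits)] += 1
    return vec

tokens = tokenize("the quick brown fox")  # 4 tokens, length varies per row
bag = hashed_ngram_bag(tokens)            # always length 2^4 = 16
print(len(bag), sum(bag))                 # 16 3  (three bigrams counted)
```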

### Key Types

@@ -409,10 +405,6 @@ needed, the operating system disk cache transparently enhances performance.
Further, when the data is known to fit in memory, caching, as described above,
provides even better performance.

Note: Implementing a loader for XDF files should be straightforward. To
implement a saver, the XDF format will likely need to be extended to support
vector-valued columns, and perhaps metadata encoding.

### Randomization

Some training algorithms benefit from randomizing the order of rows produced
84 changes: 55 additions & 29 deletions docs/code/IDataViewImplementation.md
@@ -73,7 +73,7 @@ result that if a pipeline was composed in some other fashion, there would be
some error.

The only thing you can really assume is that an `IDataView` behaves "sanely"
according to the contracts of the `IDataView` interface, so that future TLC
according to the contracts of the `IDataView` interface, so that future ML.NET
developers can form some reasonable expectations of how your code behaves, and
also have a prayer of knowing how to maintain the code. It is hard enough to
write software correctly even when the code you're working with actually does
@@ -166,8 +166,8 @@ has the following problems:
* **Every** call had to verify that the column was active,
* **Every** call had to verify that `TValue` was of the right type,
* When these were part of, say, a transform in a chain (as they often are,
considering how common transforms are used by TLC's users) each access would
be accompanied by a virtual method call to the upstream cursor's
considering how commonly transforms are used by ML.NET's users) each access
would be accompanied by a virtual method call to the upstream cursor's
`GetColumnValue`.
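To make that overhead concrete, here is a hypothetical sketch (Python closures standing in for the C# delegates; `Row`, `get_column_value`, and `get_getter` are illustrative names, not ML.NET API) of per-call checked access versus a validate-once getter:

```python
# Sketch: validate on *every* access vs. validate once and hand back a
# cached accessor ("getter"). Hypothetical names, not ML.NET API.

class Row:
    def __init__(self, values, active):
        self._values, self._active = values, active

    def get_column_value(self, col):
        # These checks are repeated on every single call.
        if col not in self._active:
            raise ValueError(f"column {col!r} not active")
        return self._values[col]

    def get_getter(self, col):
        # The same checks, done exactly once; the closure then skips them.
        if col not in self._active:
            raise ValueError(f"column {col!r} not active")
        values = self._values
        return lambda: values[col]

row = Row({"Features": 1.5}, active={"Features"})
getter = row.get_getter("Features")
print(getter())  # 1.5, with no per-call validation
```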

In contrast, consider the situation with these getter delegates. The
@@ -211,14 +211,14 @@ consuming different data from the contemporaneous cursor? There are many
examples of this throughout the codebase.

Nevertheless: in very specific circumstances we have relaxed this. For
example, the TLC API serves up corrupt `IDataView` implementations that have
their underlying data change, since reconstituting a data pipeline on fresh
data is at the present moment too resource intensive. Nonetheless, this is
wrong: for example, the `TrainingCursorBase` and related subclasses rely upon
the data not changing. Since, however, that is used for *training* and the
prediction engines of the API as used for *scoring*, we accept these. However
this is not, strictly speaking, correct, and this sort of corruption of
`IDataView` should only be considered as a last resort, and only when some
example, some ML.NET API code serves up corrupt `IDataView` implementations
that have their underlying data change, since reconstituting a data pipeline
on fresh data is at the present moment too resource intensive. Nonetheless,
this is wrong: for example, the `TrainingCursorBase` and related subclasses
rely upon the data not changing. Since, however, that is used for *training*
and the prediction engines of the API are used for *scoring*, we accept these.
However this is not, strictly speaking, correct, and this sort of corruption
of `IDataView` should only be considered as a last resort, and only when some
great good can be accomplished through this. We certainly did not accept this
corruption lightly!

@@ -265,19 +265,19 @@ same data view.) So some rules:
## Versioning

This requirement for consistency of a data model often has implications across
versions of TLC, and our requirements for data model backwards compatibility.
As time has passed, we often feel like it would make sense if a transform
behaved *differently*, that is, if it organized or calculated its output in a
different way than it currently does. For example, suppose we wanted to switch
the hash transform to something a bit more efficient than murmur hashes, for
example. If we did so, presumably the same input values would map to different
outputs. We are free to do so, of course, yet: when we deserialize a hash
transform from before we made this change, that hash transform should continue
to output values as it did, before we made that change. (This, of course,
assuming that the transform was released as part of a "blessed" non-preview
point release of TLC. We can, and have, broken backwards compatibility for
something that has not yet been incorporated in any sort of blessed release,
though we prefer to not.)
versions of ML.NET, and our requirements for data model backwards
compatibility. As time has passed, we often feel like it would make sense if a
transform behaved *differently*, that is, if it organized or calculated its
output in a different way than it currently does. For example, suppose we
wanted to switch the hash transform to something a bit more efficient than
murmur hashes, for example. If we did so, presumably the same input values
would map to different outputs. We are free to do so, of course, yet: when we
deserialize a hash transform from before we made this change, that hash
transform should continue to output values as it did, before we made that
change. (This, of course, assumes that the transform was released as part of
a "blessed" non-preview point release of ML.NET. We can, and have, broken
backwards compatibility for something that has not yet been incorporated in
any sort of blessed release, though we prefer to not.)
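One common way to honor that requirement is to serialize a version marker and dispatch on it at load time, so old data models keep producing their old outputs. The sketch below is hypothetical (Python, invented names and hash functions; not how ML.NET's hash transform is actually structured):

```python
# Sketch: a model saved with version 1 must keep hashing the v1 way,
# even after v2 becomes the default. Hypothetical, not ML.NET code.

def hash_v1(value, bits):
    """The "old" hash function, kept alive for old data models."""
    return sum(ord(c) for c in value) % (1 << bits)

def hash_v2(value, bits):
    """The "new, more efficient" hash function used by new models."""
    return (hash(value) * 2654435761) % (1 << bits)

HASH_VERSIONS = {1: hash_v1, 2: hash_v2}

def load_hash_transform(model):
    # Dispatch on the version recorded in the serialized data model.
    fn = HASH_VERSIONS[model["version"]]
    bits = model["bits"]
    return lambda value: fn(value, bits)

old = load_hash_transform({"version": 1, "bits": 8})
print(old("abc"))  # (97 + 98 + 99) % 256 = 38, forever, by contract
```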

## What is Not Functionally Identical

@@ -334,10 +334,9 @@ aside (which we can hardly help), we expect the models to be the same.

# On Loaders, Data Models, and Empty `IMultiStreamSource`s

When you run TLC you have the option of specifying not only *one* data input,
but any number of data input files, including zero. :) This is how [the
examples here](../public/command/DataCommands.md#look-ma-no-files) work. But
there's also a more general principle at work here: when deserializing a data
When you create a loader you have the option of specifying not only *one* data
input, but any number of data input files, including zero. But there's also a
more general principle at work here with zero files: when deserializing a data
loader from a data model with an `IMultiStreamSource` with `Count == 0` (e.g.,
as would be constructed with `new MultiFileSource(null)`), we have a protocol
that *every* `IDataLoader` should work in that circumstance, and merely be a
@@ -472,7 +471,34 @@ indication that this function will not move the cursor (in which case `IRow`
is helpful), or that will not access any values (in which case `ICursor` is
helpful).

# Metadata
# Schema

The schema contains information about the columns. As we see in [the design
principles](IDataViewDesignPrinciples.md), each column has a name, an index, a
data type, and optional metadata.

While *programmatic* accesses to an `IDataView` are by column index, from a
user's perspective columns are identified by name; most training algorithms
conceptually train on the `Features` column (under default settings). For this
reason nearly all usages of an `IDataView` will be prefixed with a call to the
schema's `TryGetColumnIndex`.

Regarding name hiding, the principles mention that when multiple columns have
the same name, all but one are "hidden." The convention all implementations of
`ISchema` obey is that the column with the *largest* index remains visible.
Note however that this is merely convention, not part of the definition of
`ISchema`.

Implementations of `TryGetColumnIndex` should be O(1), that is, practically,
this mapping ought to be backed with a dictionary in most cases. (There are
obvious exceptions, like `LineLoader`, which produces exactly one column.
There, a simple equality test suffices.)
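A dictionary built in index order also yields the largest-index name-hiding convention for free, since later entries overwrite earlier ones. A minimal sketch (Python, with invented names standing in for the `ISchema` members):

```python
# Sketch: O(1) name -> index lookup where the *largest* index wins on a
# name collision. Hypothetical stand-in for an ISchema implementation.

class Schema:
    def __init__(self, names):
        self._names = names
        # Built in index order, so a duplicated name keeps the largest index.
        self._index = {name: i for i, name in enumerate(names)}

    def try_get_column_index(self, name):
        i = self._index.get(name)
        return (i is not None), i

schema = Schema(["Label", "Features", "Features"])
found, idx = schema.try_get_column_index("Features")
print(found, idx)  # True 2 -- the copy at index 1 is "hidden"
```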

It is best if `GetColumnType` returns the *same* object every time. That is,
things like key-types and vector-types, when returned, should not be created
in the function itself (thereby creating a new object every time), but rather
stored somewhere and returned.
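The point about returning the *same* type object can be sketched like so (illustrative Python; `VectorType` and `get_column_type` are stand-ins, not ML.NET's actual type classes):

```python
# Sketch: construct column types once, store them, and return the cached
# objects, rather than building a fresh type object on every call.

class VectorType:
    def __init__(self, item_kind, size):
        self.item_kind, self.size = item_kind, size

class Schema:
    def __init__(self):
        # Created once up front; get_column_type never allocates.
        self._types = [VectorType("float", 1 << 20)]

    def get_column_type(self, col):
        return self._types[col]

schema = Schema()
# Repeated calls hand back the identical object, not merely an equal one.
print(schema.get_column_type(0) is schema.get_column_type(0))  # True
```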

## Metadata

Since metadata is *optional*, one is not obligated to necessarily produce it,
or conform to any particular schemas for any particular kinds (beyond, say,
3 changes: 1 addition & 2 deletions docs/code/IDataViewTypeSystem.md
@@ -16,8 +16,7 @@ the specific interface is written using fixed pitch font as `IDataView`.

IDataView is the data pipeline machinery for ML.NET. The ML.NET codebase has
an extensive library of IDataView related components (loaders, transforms,
savers, trainers, predictors, etc.). The team is actively working on many
more.
savers, trainers, predictors, etc.). More are being worked on.

The name IDataView was inspired from the database world, where the term table
typically indicates a mutable body of data, while a view is the result of a

0 comments on commit 8f1da5c
