Schema comprehension doc #572

Zruty0 · 2018-07-23T18:02:38Z

Added a document that describes typed schema comprehension.

Fixes #554

dnfclas · 2018-07-23T18:02:49Z

All CLA requirements met. #Closed

Zruty0 · 2018-07-23T19:32:51Z

@dotnet-bot test Linux Release
#Closed

eerhardt · 2018-07-23T19:42:39Z

@Zruty0 - we are in the middle of moving our CI system from Jenkins to VSTS. You can ignore those 2 failed runs. (Plus you are just modifying .md files anyway.) #Resolved

eerhardt

Thanks for this helpful document, @Zruty0. It looks really good.

eerhardt · 2018-07-23T19:53:56Z

docs/code/SchemaComprehension.md

+
+### Streaming data views
+
+What if the original data doesn't support seeking, kile if it's some form of `IEnumerable<IrisData>` instead of `IList<IrisData>`? Well, we can simply use another helper function:


(typo) "kile" should be "like". #Resolved

eerhardt · 2018-07-23T19:57:08Z

docs/code/SchemaComprehension.md

+
+Let's see how we can create a new `IDataView` out of an in-memory array, run some operations on it, and then read it back into the array.
+
+```(csharp)


I'm not sure what exactly about your string isn't working, but I don't get syntax highlighting when viewing the document.

Typically, I use the format ```C# instead. #Resolved

eerhardt · 2018-07-23T20:03:19Z

docs/code/SchemaComprehension.md

+Below are the most notable examples of the differences:
+
+* `IDataView` vector columns may have a fixed (and known) size, C# arrays can not. You can use `[VectorType(N)]` attribute to an array field to specify that the column is a vector of fixed size N. This is often necessary: most ML components don't work with variable-size vectors, they require fixed-size ones.
+* `IDataView`'s **key types** don't have an underlying C# type either. To declare a key-type column, you need to make your field an `uint`, and decorate it with `[KeyType(Min=A, Count=B)]` to denote that the field is a key with the specified range of values.


This makes it sound like Min and Count are required values, but they do not appear to be. (And I'm assuming there are plenty of scenarios where a user doesn't know up front what all the possible values are.) #Resolved

eerhardt · 2018-07-23T20:06:55Z

docs/code/SchemaComprehension.md

+var predictionEngine = env.CreatePredictionEngine<IrisData, IrisVectorData>(dv, outputSchemaDefinition: schemaDef);
+```
+
+In addition to the above, you can use `SchemaDefinition` to add per-column metadata, or even a 'value generator' (so that the column value is not read from the field, but computed using a delegate).


It might be interesting to have a code snippet example for this scenario? #Resolved

In fact, I think the 'generator' bit didn't make it into ML.NET

In reply to: 204536819 [](ancestors = 204536819)

eerhardt · 2018-07-23T20:08:31Z

docs/code/SchemaComprehension.md

+* Reading a different subset of columns on every row: the cursor always populates the entire row object.
+* Reading column metadata from the data view.
+* Accessing the 'hidden' data view columns by index.
+* Creating 'cursor sets'.


A link or definition of cursor sets may be helpful here. #Resolved

Zruty0 · 2018-07-24T15:06:38Z

I hope they will still go away, because they block the merging.

In reply to: 407177294 [](ancestors = 407177294)

sfilipi · 2018-07-24T16:12:59Z

docs/code/SchemaComprehension.md

@@ -0,0 +1,210 @@
+# Schema comprehension in ML.NET
+
+This document describes in detail the under-the-hood mechanism that ML.NET uses to automate the creation of `IDataView` schema, with the goal to make it as convenient to the end user as possible, while not incurring extra computational costs.


IDataView` schem [](start = 109, length = 16)

Might be useful to link to the IDV doc. #Closed

sfilipi · 2018-07-24T16:13:52Z

docs/code/SchemaComprehension.md

+
+## Introduction
+
+Every dataset in ML.NET is an `IDataView`, which is, for the purposes of this document, a collection of rows that share the same columns. The set of columns, their names, types and other metadata is known as the *schema* of the `IDataView`, and it's represented as an `ISchema` object.


is an [](start = 24, length = 5)

Would it be clearer if it said "gets loaded into an IDV"? #Closed

Hmm, I think it's more correct to say 'is represented as', because you don't necessarily LOAD a dataset.

In reply to: 204819663 [](ancestors = 204819663)

sfilipi · 2018-07-24T16:16:50Z

docs/code/SchemaComprehension.md

+
+## Introduction
+
+Every dataset in ML.NET is an `IDataView`, which is, for the purposes of this document, a collection of rows that share the same columns. The set of columns, their names, types and other metadata is known as the *schema* of the `IDataView`, and it's represented as an `ISchema` object.


schema [](start = 212, length = 9)

link to the schema section of the IDV Design Principles #Closed

Added one above

In reply to: 204820706 [](ancestors = 204820706)

sfilipi · 2018-07-24T16:19:49Z

docs/code/SchemaComprehension.md

+These items above are very similar to the definition of fields in a C# class: names and types of columns correspond to names and types of fields, and metadata can correspond to field attributes. 
+Because of this similarity, ML.NET offers a common convenient mechanism for creating a schema: it is done via defining a C# class.
+
+For example, the below class definition can be used to define a data view with 5 float columns:


data view [](start = 64, length = 9)

wondering if it would help to state at the beginning that IDataView and 'data view' are interchangeable, because you give the definition of one, and use the other term for it. #Closed
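
To make the field-to-column correspondence in the quoted passage concrete, here is a minimal sketch that lists a class's public fields as name/type pairs using plain .NET reflection. It only illustrates the idea; it is not ML.NET's implementation. The field names on `IrisData` and the `SchemaSketch` helper are assumptions made for this example.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical data class: five float fields, matching the "5 float columns" in the quoted text.
public class IrisData
{
    public float Label;
    public float SepalLength;
    public float SepalWidth;
    public float PetalLength;
    public float PetalWidth;
}

// Hypothetical helper, not part of ML.NET.
public static class SchemaSketch
{
    // Returns one "Name: Type" entry per public instance field of T.
    // Ordering by MetadataToken follows declaration order in practice,
    // although reflection does not formally guarantee any field order.
    public static string[] DescribeColumns<T>() =>
        typeof(T)
            .GetFields(BindingFlags.Public | BindingFlags.Instance)
            .OrderBy(f => f.MetadataToken)
            .Select(f => $"{f.Name}: {f.FieldType.Name}")
            .ToArray();
}
```

Calling `SchemaSketch.DescribeColumns<IrisData>()` yields five entries, one per field; a `float` field is reported with its CLR type name, `Single`.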

sfilipi · 2018-07-24T16:23:42Z

docs/code/SchemaComprehension.md

+        .ToArray();
+}
+```
+After this code runs, `arr` will contain two `IrisVectorData` objects, each having `Features` filled with the actual values of the features.


features [](start = 131, length = 8)

I'd add (the 4 concatenated columns) after features, to make it more explicit. #Closed

sfilipi · 2018-07-24T16:25:49Z

docs/code/SchemaComprehension.md

+```(csharp)
+var streamingDv = env.CreateStreamingDataView<IrisData>(dataEnumerable);
+```
+The only subtle difference is, the resulting `streamingDv` will not support shuffling (a property that's useful to some ML application).


shuffling [](start = 76, length = 9)

Maybe link to what data shuffling is.
#Closed

I'm not sure what to link to...

In reply to: 204823733 [](ancestors = 204823733)

Yeah, the Wikipedia one isn't great..

In reply to: 204855447 [](ancestors = 204855447,204823733)
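
Since no good link for shuffling turned up, a short sketch may help show why it depends on seekable data. The code below is a generic illustration in plain C#, not ML.NET code, and `ShuffleSketch` is a hypothetical name. It assumes the usual reason: an in-place shuffle needs indexed access, which `IList<T>` offers and a forward-only `IEnumerable<T>` does not.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of ML.NET.
public static class ShuffleSketch
{
    // In-place Fisher-Yates shuffle. It reads and writes items by index,
    // so it requires random access (IList<T>).
    public static void ShuffleInPlace<T>(IList<T> items, Random rng)
    {
        for (int i = items.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1); // uniform in 0..i inclusive
            (items[i], items[j]) = (items[j], items[i]);
        }
    }

    // A forward-only sequence has no indexer, so the whole sequence
    // has to be buffered in memory before it can be shuffled.
    public static List<T> BufferThenShuffle<T>(IEnumerable<T> source, Random rng)
    {
        var buffer = new List<T>(source);
        ShuffleInPlace(buffer, rng);
        return buffer;
    }
}
```

The second method shows the cost of shuffling a stream: every row is materialized first, which defeats the purpose of streaming for large inputs.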

sfilipi · 2018-07-24T16:48:13Z

docs/code/SchemaComprehension.md

+`IDataView` [type system](IDataViewTypeSystem.md) differs slightly from the C# type system, so a 1-1 mapping between column types and C# types is not always feasible. 
+Below are the most notable examples of the differences:
+
+* `IDataView` vector columns may have a fixed (and known) size, C# arrays can not. You can use `[VectorType(N)]` attribute to an array field to specify that the column is a vector of fixed size N. This is often necessary: most ML components don't work with variable-size vectors, they require fixed-size ones.


C# arrays can not [](start = 64, length = 17)

this might get confusing if you think about initialized arrays. #Closed

Tried to clarify

In reply to: 204831222 [](ancestors = 204831222)

sfilipi · 2018-07-24T16:54:14Z

docs/code/SchemaComprehension.md

+Below are the most notable examples of the differences:
+
+* `IDataView` vector columns may have a fixed (and known) size, C# arrays can not. You can use `[VectorType(N)]` attribute to an array field to specify that the column is a vector of fixed size N. This is often necessary: most ML components don't work with variable-size vectors, they require fixed-size ones.
+* `IDataView`'s **key types** don't have an underlying C# type either. To declare a key-type column, you need to make your field an `uint`, and decorate it with `[KeyType(Min=A, Count=B)]` to denote that the field is a key with the specified range of values.


*key types [](start = 16, length = 12)

link maybe #Closed
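
As a rough illustration of what "a key with the specified range of values" can mean, the sketch below maps a logical value in the range `[min, min + count)` to a `uint` representation. It assumes the convention, as I understand it from the IDataView type system document, that representation `0` means "missing" and representation `1` corresponds to `min`. `KeySketch` and its methods are hypothetical and are not ML.NET APIs.

```csharp
using System;

// Hypothetical helper, not part of ML.NET.
public static class KeySketch
{
    // Maps a logical value to its key representation.
    // Assumption: 0 is reserved for "missing"; valid values map to 1..count.
    public static uint Encode(ulong logical, ulong min, ulong count)
    {
        if (logical < min || logical - min >= count)
            return 0; // out of range is treated as missing
        return (uint)(logical - min + 1);
    }

    // Maps a key representation back to the logical value, or null if missing.
    public static ulong? Decode(uint representation, ulong min) =>
        representation == 0 ? (ulong?)null : min + representation - 1;
}
```

With `min = 5` and `count = 3`, the logical values 5, 6 and 7 encode to 1, 2 and 3, and anything else encodes to 0.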

sfilipi · 2018-07-24T17:04:41Z

docs/code/SchemaComprehension.md

+| `BL`             | `DvBool`           | `bool`, `bool?`         |
+| `TS`             | `DvTimeSpan`       |                         |
+| `DT`             | `DvDateTime`       |                         |
+| `DZ`             | `DvDateTimeZone`   |                         |


They don't map to TimeSpan, DateTime and DateTimeZone? #Resolved

No

In reply to: 204836126 [](ancestors = 204836126)

sfilipi · 2018-07-24T17:14:23Z

docs/code/SchemaComprehension.md

+It was our design decision to not allow these scenarios, thus simplifying the other, more common scenarios. 
+
+Here is the list of things that are only possible via the low-level interface:
+* Creating or reading a data view, where even column *types* are not known at compile time (so you cannot create a C# class to define the schema)


Creating or reading a data view, where even column types are not known at compile time (so you cannot create a C# class to define the schema) [](start = 2, length = 143)

example of scenario when this might occur #Closed

sfilipi · 2018-07-24T17:14:46Z

docs/code/SchemaComprehension.md

+* Creating or reading a data view, where even column *types* are not known at compile time (so you cannot create a C# class to define the schema)
+* Reading a different subset of columns on every row: the cursor always populates the entire row object.
+* Reading column metadata from the data view.
+* Accessing the 'hidden' data view columns by index.


'hidden' data view column [](start = 16, length = 25)

link or define "hidden" #Closed

sfilipi · 2018-07-24T17:15:09Z

docs/code/SchemaComprehension.md

+
+Here is the list of things that are only possible via the low-level interface:
+* Creating or reading a data view, where even column *types* are not known at compile time (so you cannot create a C# class to define the schema)
+* Reading a different subset of columns on every row: the cursor always populates the entire row object.


different subset of columns on every row [](start = 12, length = 40)

what does 'different' mean here? #Closed

Tried to rephrase

In reply to: 204839399 [](ancestors = 204839399)

sfilipi

* Added a doc for schema comprehension

8a50aaf

Zruty0 requested review from eerhardt, TomFinley and sfilipi July 23, 2018 18:02

eerhardt approved these changes Jul 23, 2018

sfilipi reviewed Jul 24, 2018

PR comments

6b93312

sfilipi approved these changes Jul 24, 2018

Zruty0 merged commit 8cfa2ed into dotnet:master Jul 24, 2018

Zruty0 deleted the feature/554-schema-doc branch July 24, 2018 19:32

eerhardt pushed a commit to eerhardt/machinelearning that referenced this pull request Jul 27, 2018

Schema comprehension doc (dotnet#572)

a6f4635

* Added a doc for schema comprehension

codemzs pushed a commit to codemzs/machinelearning that referenced this pull request Aug 1, 2018

Schema comprehension doc (dotnet#572)

5287ce8

* Added a doc for schema comprehension

ghost locked as resolved and limited conversation to collaborators Mar 29, 2022
