Schema comprehension doc #572

Merged: 2 commits, Jul 24, 2018

@Zruty0 (Contributor) commented Jul 23, 2018

Added a document that describes typed schema comprehension.

Fixes #554

@dnfclas commented Jul 23, 2018

CLA assistant check
All CLA requirements met. #Closed

@Zruty0 (Contributor Author) commented Jul 23, 2018

@dotnet-bot test Linux Release
#Closed

@eerhardt (Member) commented Jul 23, 2018

@Zruty0 - we are in the middle of moving our CI system from Jenkins to VSTS. You can ignore those 2 failed runs. (Plus you are just modifying .md files anyway.) #Resolved

@eerhardt (Member) left a comment

Thanks for this helpful document, @Zruty0. It looks really good.


### Streaming data views

What if the original data doesn't support seeking, kile if it's some form of `IEnumerable<IrisData>` instead of `IList<IrisData>`? Well, we can simply use another helper function:
@eerhardt (Member), Jul 23, 2018

(type-o) kile #Resolved


Let's see how we can create a new `IDataView` out of an in-memory array, run some operations on it, and then read it back into the array.

```(csharp)
@eerhardt (Member), Jul 23, 2018

I'm not sure what exactly about your string isn't working, but I don't get syntax highlighting when viewing the document.

Typically, I use the format ```C# instead. #Resolved

Below are the most notable examples of the differences:

* `IDataView` vector columns may have a fixed (and known) size, C# arrays can not. You can use `[VectorType(N)]` attribute to an array field to specify that the column is a vector of fixed size N. This is often necessary: most ML components don't work with variable-size vectors, they require fixed-size ones.
* `IDataView`'s **key types** don't have an underlying C# type either. To declare a key-type column, you need to make your field an `uint`, and decorate it with `[KeyType(Min=A, Count=B)]` to denote that the field is a key with the specified range of values.
@eerhardt (Member), Jul 23, 2018

This makes it sound like Min and Count are required values, but they do not appear to be. (And I'm assuming there are plenty of scenarios where a user doesn't know up front what all the possible values are.) #Resolved

var predictionEngine = env.CreatePredictionEngine<IrisData, IrisVectorData>(dv, outputSchemaDefinition: schemaDef);
```

In addition to the above, you can use `SchemaDefinition` to add per-column metadata, or even a 'value generator' (so that the column value is not read from the field, but computed using a delegate).
@eerhardt (Member), Jul 23, 2018

It might be interesting to have a code snippet example for this scenario? #Resolved

@Zruty0 (Contributor Author)

In fact, I think the 'generator' bit didn't make it into ML.NET


In reply to: 204536819 [](ancestors = 204536819)

* Reading a different subset of columns on every row: the cursor always populates the entire row object.
* Reading column metadata from the data view.
* Accessing the 'hidden' data view columns by index.
* Creating 'cursor sets'.
@eerhardt (Member), Jul 23, 2018

A link or definition of cursor sets may be helpful here. #Resolved

@Zruty0 (Contributor Author) commented Jul 24, 2018

I hope they will still go away, because they block the merging.


In reply to: 407177294 [](ancestors = 407177294)

@@ -0,0 +1,210 @@
# Schema comprehension in ML.NET

This document describes in detail the under-the-hood mechanism that ML.NET uses to automate the creation of `IDataView` schema, with the goal to make it as convenient to the end user as possible, while not incurring extra computational costs.
@sfilipi (Member), Jul 24, 2018

IDataView` schem [](start = 109, length = 16)

Might be useful to link to the IDV doc. #Closed


## Introduction

Every dataset in ML.NET is an `IDataView`, which is, for the purposes of this document, a collection of rows that share the same columns. The set of columns, their names, types and other metadata is known as the *schema* of the `IDataView`, and it's represented as an `ISchema` object.
@sfilipi (Member), Jul 24, 2018

is an [](start = 24, length = 5)

would it be more clear if it says "gets loaded into an IDV" #Closed

@Zruty0 (Contributor Author)

Hmm, I think it's more correct to say 'is represented as', because you don't necessarily LOAD a dataset.


In reply to: 204819663 [](ancestors = 204819663)


## Introduction

Every dataset in ML.NET is an `IDataView`, which is, for the purposes of this document, a collection of rows that share the same columns. The set of columns, their names, types and other metadata is known as the *schema* of the `IDataView`, and it's represented as an `ISchema` object.
@sfilipi (Member), Jul 24, 2018

schema [](start = 212, length = 9)

link to the schema section of the IDV Design Principles #Closed

@Zruty0 (Contributor Author)

Added one above


In reply to: 204820706 [](ancestors = 204820706)

These items above are very similar to the definition of fields in a C# class: names and types of columns correspond to names and types of fields, and metadata can correspond to field attributes.
Because of this similarity, ML.NET offers a common convenient mechanism for creating a schema: it is done via defining a C# class.

For example, the below class definition can be used to define a data view with 5 float columns:
@sfilipi (Member), Jul 24, 2018

data view [](start = 64, length = 9)

wondering if it would help to state at the beginning that IDataView and 'data view' are interchangeable, because you give the definition of one, and use the other term for it. #Closed

.ToArray();
}
```
After this code runs, `arr` will contain two `IrisVectorData` objects, each having `Features` filled with the actual values of the features.
@sfilipi (Member), Jul 24, 2018

features [](start = 131, length = 8)

I'd add (the 4 concatenated columns) after features, to make it more explicit. #Closed

```(csharp)
var streamingDv = env.CreateStreamingDataView<IrisData>(dataEnumerable);
```
The only subtle difference is, the resulting `streamingDv` will not support shuffling (a property that's useful to some ML application).
@sfilipi (Member), Jul 24, 2018

shuffling [](start = 76, length = 9)

Maybe link to what data shuffling is.
#Closed

@Zruty0 (Contributor Author)

I'm not sure what to link to...


In reply to: 204823733 [](ancestors = 204823733)

Member

Yeah, the Wikipedia one isn't great..


In reply to: 204855447 [](ancestors = 204855447,204823733)

`IDataView` [type system](IDataViewTypeSystem.md) differs slightly from the C# type system, so a 1-1 mapping between column types and C# types is not always feasible.
Below are the most notable examples of the differences:

* `IDataView` vector columns may have a fixed (and known) size, C# arrays can not. You can use `[VectorType(N)]` attribute to an array field to specify that the column is a vector of fixed size N. This is often necessary: most ML components don't work with variable-size vectors, they require fixed-size ones.
@sfilipi (Member), Jul 24, 2018

C# arrays can not [](start = 64, length = 17)

this might get confusing if you think about initialized arrays. #Closed

@Zruty0 (Contributor Author)

Tried to clarify


In reply to: 204831222 [](ancestors = 204831222)

Below are the most notable examples of the differences:

* `IDataView` vector columns may have a fixed (and known) size, C# arrays can not. You can use `[VectorType(N)]` attribute to an array field to specify that the column is a vector of fixed size N. This is often necessary: most ML components don't work with variable-size vectors, they require fixed-size ones.
* `IDataView`'s **key types** don't have an underlying C# type either. To declare a key-type column, you need to make your field an `uint`, and decorate it with `[KeyType(Min=A, Count=B)]` to denote that the field is a key with the specified range of values.
@sfilipi (Member), Jul 24, 2018

*key types [](start = 16, length = 12)

link maybe #Closed

| `BL` | `DvBool` | `bool`, `bool?` |
| `TS` | `DvTimeSpan` | |
| `DT` | `DvDateTime` | |
| `DZ` | `DvDateTimeZone` | |
@sfilipi (Member), Jul 24, 2018

They don't map to TimeSpan, DateTime and DateTimeZone? #Resolved

@Zruty0 (Contributor Author)

No


In reply to: 204836126 [](ancestors = 204836126)

It was our design decision to not allow these scenarios, thus simplifying the other, more common scenarios.

Here is the list of things that are only possible via the low-level interface:
* Creating or reading a data view, where even column *types* are not known at compile time (so you cannot create a C# class to define the schema)
@sfilipi (Member), Jul 24, 2018

Creating or reading a data view, where even column types are not known at compile time (so you cannot create a C# class to define the schema) [](start = 2, length = 143)

example of scenario when this might occur #Closed

* Creating or reading a data view, where even column *types* are not known at compile time (so you cannot create a C# class to define the schema)
* Reading a different subset of columns on every row: the cursor always populates the entire row object.
* Reading column metadata from the data view.
* Accessing the 'hidden' data view columns by index.
@sfilipi (Member), Jul 24, 2018

'hidden' data view column [](start = 16, length = 25)

link or define "hidden" #Closed


Here is the list of things that are only possible via the low-level interface:
* Creating or reading a data view, where even column *types* are not known at compile time (so you cannot create a C# class to define the schema)
* Reading a different subset of columns on every row: the cursor always populates the entire row object.
@sfilipi (Member), Jul 24, 2018

different subset of columns on every row [](start = 12, length = 40)

what does 'different' mean here? #Closed

@Zruty0 (Contributor Author)

Tried to rephrase


In reply to: 204839399 [](ancestors = 204839399)

@sfilipi (Member) left a comment

:shipit:

@Zruty0 merged commit 8cfa2ed into dotnet:master on Jul 24, 2018
@Zruty0 deleted the feature/554-schema-doc branch on July 24, 2018 19:32
eerhardt pushed a commit to eerhardt/machinelearning that referenced this pull request Jul 27, 2018
* Added a doc for schema comprehension
codemzs pushed a commit to codemzs/machinelearning that referenced this pull request Aug 1, 2018
* Added a doc for schema comprehension
@ghost locked as resolved and limited conversation to collaborators on Mar 29, 2022
4 participants