Add F# samples #36

Merged
merged 5 commits into from
Aug 7, 2018
Changes from 3 commits
8 changes: 8 additions & 0 deletions .vsts-dotnet-ci.yml
Original file line number Diff line number Diff line change
@@ -7,6 +7,14 @@ phases:
    inputs:
      projects: '.\samples\csharp\getting-started\GettingStarted.sln'

- phase: FSharpGettingStarted
  queue: Hosted VS2017
  steps:
  - task: DotNetCoreCLI@2
    displayName: Build F# GettingStarted
    inputs:
      projects: '.\samples\fsharp\getting-started\GettingStarted.sln'

- phase: BinaryClasification_Titanic
  queue: Hosted VS2017
  steps:
5 changes: 3 additions & 2 deletions README.md
@@ -5,9 +5,10 @@
[ML.NET](https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet) is a cross-platform open-source machine learning framework that makes machine learning accessible to .NET developers.

ML.NET samples are divided in three categories:
* **Getting started** - basic "hello world" samples for each ML task.
* **Getting started (C#)** - basic "hello world" samples for each ML task, in C#
* **Getting started (F#)** - basic "hello world" samples for each ML task, in F#
* **Examples** - examples of how you can use various ML.NET components (learners, transforms, ...).
* **End-to-end apps** - real world examples of web, desktop, mobile, and other applications infused with ML solutions via [ML.NET APIs](https://docs.microsoft.com/dotnet/api/?view=ml-dotnet).
* **End-to-end (C#)** - real world examples of web, desktop, mobile, and other applications infused with ML solutions via [ML.NET APIs](https://docs.microsoft.com/dotnet/api/?view=ml-dotnet).

All samples in this repo are using the latest released [Microsoft.ML](https://www.nuget.org/packages/Microsoft.ML/) NuGet package. If you would like to see the examples referencing the source code, check out [scenario tests](https://github.com/dotnet/machinelearning/tree/master/test/Microsoft.ML.Tests/Scenarios) in [ML.NET repository](https://github.com/dotnet/machinelearning).

@@ -0,0 +1,19 @@
<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>netcoreapp2.0</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
    <Compile Include="Program.fs" />
    <Folder Include="datasets\" />
    <None Include="..\..\..\..\datasets\sentiment-imdb-train.txt" Link="datasets\sentiment-imdb-train.txt" />
    <None Include="..\..\..\..\datasets\sentiment-yelp-test.txt" Link="datasets\sentiment-yelp-test.txt" />
  </ItemGroup>

  <ItemGroup>
    <PackageReference Include="Microsoft.ML" Version="$(MicrosoftMLVersion)" />
  </ItemGroup>

</Project>
@@ -0,0 +1,103 @@
module BinaryClassification_SentimentAnalysis

open System
open System.IO
open Microsoft.ML
open Microsoft.ML.Data
open Microsoft.ML.Models
open Microsoft.ML.Runtime.Api
open Microsoft.ML.Trainers
open Microsoft.ML.Transforms

type SentimentData() =
    [<Column("0")>]
    member val SentimentText: string = "" with get, set

    [<Column("1", name="Label")>]
    member val Sentiment : double = 0.0 with get, set

type SentimentPrediction() =
    [<ColumnName("PredictedLabel")>]
    member val Sentiment : bool = false with get, set

let sentiments =
    [| SentimentData(SentimentText = "Contoso's 11 is a wonderful experience", Sentiment = 1.0)
       SentimentData(SentimentText = "The acting in this movie is very bad", Sentiment = 0.0)
       SentimentData(SentimentText = "Joe versus the Volcano Coffee Company is a great film.", Sentiment = 1.0) |]

let AppPath = Path.Combine(__SOURCE_DIRECTORY__, "../../../..")
let TrainDataPath = Path.Combine(AppPath, "datasets", "sentiment-imdb-train.txt")
let TestDataPath = Path.Combine(AppPath, "datasets", "sentiment-yelp-test.txt")
let modelPath = Path.Combine(AppPath, "SentimentModel.zip")

let TrainAsync() =
    // LearningPipeline holds all steps of the learning process: data, transforms, learners.
    let pipeline = LearningPipeline()

    // The TextLoader loads a dataset. The schema of the dataset is specified by passing a class containing
    // all the column names and their types.
    pipeline.Add(TextLoader(TrainDataPath).CreateFrom<SentimentData>())

    // TextFeaturizer is a transform that cleans the input text column and converts it into numeric features.
    pipeline.Add(TextFeaturizer("Features", "SentimentText"))

    // FastTreeBinaryClassifier is an algorithm that will be used to train the model.
    // Three of its hyperparameters are set here to tune decision tree performance.
    pipeline.Add(FastTreeBinaryClassifier(NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2))

    Console.WriteLine("=============== Training model ===============")
    // The pipeline is trained on the dataset that has been loaded and transformed.
    let model = pipeline.Train<SentimentData, SentimentPrediction>()

    // Saving the model as a .zip file.
    model.WriteAsync(modelPath) |> Async.AwaitTask |> Async.RunSynchronously

    Console.WriteLine("=============== End training ===============")
    Console.WriteLine(sprintf "The model is saved to %s" modelPath)

    model

let Evaluate(model: PredictionModel<SentimentData, SentimentPrediction>) =
    // To evaluate how well the model predicts values, the model is run against a new set
    // of data (test data) that was not involved in training.
    let testData = TextLoader(TestDataPath).CreateFrom<SentimentData>()

    // BinaryClassificationEvaluator performs evaluation for Binary Classification type of ML problems.
    let evaluator = BinaryClassificationEvaluator()

    Console.WriteLine("=============== Evaluating model ===============")

    let metrics = evaluator.Evaluate(model, testData)
    // BinaryClassificationMetrics contains the overall metrics computed by binary classification evaluators.
    // The Accuracy metric gets the accuracy of a classifier, which is the proportion
    // of correct predictions in the test set.

    // The Auc metric gets the area under the ROC curve.
    // The area under the ROC curve is equal to the probability that the classifier ranks
    // a randomly chosen positive instance higher than a randomly chosen negative one
    // (assuming 'positive' ranks higher than 'negative').

    // The F1Score metric gets the classifier's F1 score.
    // The F1 score is the harmonic mean of precision and recall:
    // 2 * precision * recall / (precision + recall).

    Console.WriteLine(sprintf "Accuracy: %0.2f" metrics.Accuracy)
    Console.WriteLine(sprintf "Auc: %0.2f" metrics.Auc)
    Console.WriteLine(sprintf "F1Score: %0.2f" metrics.F1Score)
    Console.WriteLine("=============== End evaluating ===============")
    Console.WriteLine()

// STEP 1: Create a model
let model = TrainAsync()

// STEP 2: Test accuracy
Evaluate(model)

// STEP 3: Make a prediction
let predictions = model.Predict(sentiments)

for (sentiment, prediction) in Seq.zip sentiments predictions do
    Console.WriteLine(sprintf "Sentiment: %s | Prediction: %s sentiment" sentiment.SentimentText (if prediction.Sentiment then "Positive" else "Negative"))

Console.ReadLine() |> ignore

@@ -0,0 +1,72 @@
# Sentiment Analysis for User Reviews
In this introductory sample, you'll see how to use [ML.NET](https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet) to predict a sentiment (positive or negative) for customer reviews. In the world of machine learning, this type of prediction is known as **binary classification**.

## Problem
This problem is centered around predicting whether a customer's review has positive or negative sentiment. We will use IMDB and Yelp comments that were labeled by humans; each comment has been assigned one of two labels:
* 0 - negative
* 1 - positive

Using those datasets we will build a model that will analyze a string and predict a sentiment value of 0 or 1.
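
To make the data layout concrete, here is a minimal sketch of reading one labeled review by hand. It assumes a tab-separated line with the review text in column 0 and the label in column 1, mirroring the `Column` attributes on `SentimentData` in this sample. `tryParseLabel` and `parseLabeledReview` are hypothetical helpers for illustration only, not part of ML.NET; in the real pipeline `TextLoader` does this work.

```fsharp
/// Maps the dataset's label text to a boolean: "1" is positive, "0" is negative.
/// Any other value is rejected (assumption: labels are written exactly as 0 or 1).
let tryParseLabel (label: string) : bool option =
    match label.Trim() with
    | "1" -> Some true
    | "0" -> Some false
    | _ -> None

/// Splits a "text<TAB>label" line into the review text and its sentiment.
/// Returns None when the line does not have exactly two columns or the label is invalid.
let parseLabeledReview (line: string) : (string * bool) option =
    match line.Split('\t') with
    | [| text; label |] ->
        tryParseLabel label |> Option.map (fun isPositive -> text, isPositive)
    | _ -> None

// Example: a negative review taken from this sample's test sentences.
printfn "%A" (parseLabeledReview "The acting in this movie is very bad\t0")
```

Returning an `option` keeps malformed lines from silently turning into a label.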

## ML task - Binary classification
The generalized problem of **binary classification** is to classify items into one of two classes (classifying items into more than two classes is called **multiclass classification**). Typical examples:

* predict if an insurance claim is valid or not.
* predict if a plane will be delayed or will arrive on time.
* predict if a face ID (photo) belongs to the owner of a device.

The common feature of all these examples is that the value we want to predict can take only one of two values. In other words, it can be represented by a boolean (`bool`) type.

## Solution
To solve this problem, first we will build an ML model. Then we will train the model on existing data, evaluate how good it is, and lastly we'll consume the model to predict a sentiment for new reviews.

![Build -> Train -> Evaluate -> Consume](https://github.com/dotnet/machinelearning-samples/raw/master/samples/getting-started/shared_content/modelpipeline.png)

### 1. Build model

Building a model includes: loading data (`sentiment-imdb-train.txt` with `TextLoader`), transforming the data so it can be used effectively by an ML algorithm (with `TextFeaturizer`), and choosing a learning algorithm (`FastTreeBinaryClassifier`). All of those steps are stored in a `LearningPipeline`:
```fsharp
// LearningPipeline holds all steps of the learning process: data, transforms, learners.
let pipeline = LearningPipeline()
// The TextLoader loads a dataset. The schema of the dataset is specified by passing a class containing
// all the column names and their types.
pipeline.Add(TextLoader(TrainDataPath).CreateFrom<SentimentData>())
// TextFeaturizer is a transform that will be used to featurize an input column to format and clean the data.
pipeline.Add(TextFeaturizer("Features", "SentimentText"))
// FastTreeBinaryClassifier is an algorithm that will be used to train the model.
// Three of its hyperparameters are set here to tune decision tree performance.
pipeline.Add(FastTreeBinaryClassifier(NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2))
```
### 2. Train model
Training the model is the process of running the chosen algorithm on training data (with known sentiment values) to tune the parameters of the model. It is implemented in the `Train()` API. To perform training, we call the method and provide the types for our data object (`SentimentData`) and prediction object (`SentimentPrediction`).
```fsharp
let model = pipeline.Train<SentimentData, SentimentPrediction>()
```
### 3. Evaluate model
We need this step to determine how accurately our model performs on new data. To do so, the model from the previous step is run against another dataset that was not used in training (`sentiment-yelp-test.txt`). This dataset also contains known sentiments. `BinaryClassificationEvaluator` compares the known sentiments with the values predicted by the model and reports the result as several metrics.
```fsharp
let testData = TextLoader(TestDataPath).CreateFrom<SentimentData>()

let evaluator = BinaryClassificationEvaluator()
let metrics = evaluator.Evaluate(model, testData)
```
>*To learn more on how to understand the metrics, check out the Machine Learning glossary from the [ML.NET Guide](https://docs.microsoft.com/en-us/dotnet/machine-learning/) or use any available materials on data science and machine learning*.
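
As a rough guide to what the reported numbers mean, the sketch below computes accuracy, F1 score, and AUC by hand on a tiny set of labels, using the definitions given in the comments of `Program.fs`. It is plain F# and does not call ML.NET; the type and function names (`ConfusionCounts`, `countOutcomes`, `accuracy`, `f1Score`, `auc`) are hypothetical, and the evaluator's own implementation may differ in details such as tie handling.

```fsharp
/// Counts of each outcome when comparing predicted labels with known labels.
type ConfusionCounts = { TruePos: int; FalsePos: int; TrueNeg: int; FalseNeg: int }

/// Tallies outcomes for a list of (actual, predicted) label pairs.
let countOutcomes (pairs: (bool * bool) list) : ConfusionCounts =
    let tally (c: ConfusionCounts) (actual, predicted) =
        match actual, predicted with
        | true, true -> { c with TruePos = c.TruePos + 1 }
        | false, true -> { c with FalsePos = c.FalsePos + 1 }
        | false, false -> { c with TrueNeg = c.TrueNeg + 1 }
        | true, false -> { c with FalseNeg = c.FalseNeg + 1 }
    List.fold tally { TruePos = 0; FalsePos = 0; TrueNeg = 0; FalseNeg = 0 } pairs

/// Proportion of correct predictions in the test set.
let accuracy (c: ConfusionCounts) : float =
    float (c.TruePos + c.TrueNeg) / float (c.TruePos + c.FalsePos + c.TrueNeg + c.FalseNeg)

/// Harmonic mean of precision and recall: 2 * precision * recall / (precision + recall).
/// Yields NaN when there are no true positives to divide by.
let f1Score (c: ConfusionCounts) : float =
    let precision = float c.TruePos / float (c.TruePos + c.FalsePos)
    let recall = float c.TruePos / float (c.TruePos + c.FalseNeg)
    2.0 * precision * recall / (precision + recall)

/// AUC as the fraction of (positive, negative) pairs in which the positive item
/// receives the higher score; ties count as half (an assumption of this sketch).
let auc (scored: (bool * float) list) : float =
    let pos = scored |> List.filter fst |> List.map snd
    let neg = scored |> List.filter (fst >> not) |> List.map snd
    let wins =
        List.allPairs pos neg
        |> List.sumBy (fun (p, n) -> if p > n then 1.0 elif p = n then 0.5 else 0.0)
    wins / float (pos.Length * neg.Length)

// Made-up labels for illustration: (actual, predicted).
let counts = countOutcomes [ (true, true); (true, false); (false, false); (false, true); (true, true) ]
printfn "Accuracy: %0.2f" (accuracy counts)
printfn "F1Score: %0.2f" (f1Score counts)
// Made-up scores for illustration: (actual label, model score).
printfn "Auc: %0.2f" (auc [ (true, 0.9); (true, 0.4); (false, 0.5); (false, 0.1) ])
```

Accuracy and F1 depend on the final true/false decision, while AUC depends only on how the scores rank positives against negatives, which is why the three can disagree on the same model.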

If you are not satisfied with the quality of the model, there are a variety of ways to improve it, which will be covered in the *examples* category.

>*Keep in mind that for this sample the quality is lower than it could be because the datasets were reduced in size for performance purposes. You can use bigger labeled sentiment datasets available online to significantly improve the quality.*

### 4. Consume model
After the model is trained, we can use the `Predict()` API to predict the sentiment for new reviews.

```fsharp
let predictions = model.Predict(sentiments)
```
Here, `sentiments` contains the new user reviews that we want to analyze:

```fsharp
let sentiments =
    [| SentimentData(SentimentText = "Contoso's 11 is a wonderful experience", Sentiment = 1.0)
       SentimentData(SentimentText = "The acting in this movie is very bad", Sentiment = 0.0)
       SentimentData(SentimentText = "Joe versus the Volcano Coffee Company is a great film.", Sentiment = 1.0) |]
```
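
To display the results, each review can be paired with its prediction, as `Program.fs` does with `Seq.zip`. The sketch below shows that pairing in isolation: it uses stand-in arrays (`reviewTexts`, `predictedPositive`) instead of real `SentimentData` and `SentimentPrediction` objects so that it runs without ML.NET, and `describe` is a hypothetical helper.

```fsharp
// Stand-in values: in the sample these come from `sentiments` and `model.Predict(sentiments)`.
let reviewTexts =
    [| "Contoso's 11 is a wonderful experience"
       "The acting in this movie is very bad" |]
let predictedPositive = [| true; false |]

/// Formats one review and its predicted sentiment for display.
let describe (text: string) (isPositive: bool) : string =
    sprintf "Sentiment: %s | Prediction: %s sentiment" text (if isPositive then "Positive" else "Negative")

// Seq.zip pairs items by position and stops at the end of the shorter sequence.
for (text, isPositive) in Seq.zip reviewTexts predictedPositive do
    printfn "%s" (describe text isPositive)
```

Because `Seq.zip` truncates silently, it is worth checking that the number of predictions matches the number of inputs before relying on the pairing.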
@@ -0,0 +1,18 @@
<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>netcoreapp2.0</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
    <Compile Include="Program.fs" />
    <Folder Include="datasets\" />
    <None Include="..\..\..\..\datasets\iris-full.txt" Link="datasets\iris-full.txt" />
  </ItemGroup>

  <ItemGroup>
    <PackageReference Include="Microsoft.ML" Version="$(MicrosoftMLVersion)" />
  </ItemGroup>

</Project>
94 changes: 94 additions & 0 deletions samples/fsharp/getting-started/Clustering_Iris/Program.fs
@@ -0,0 +1,94 @@
module Clustering_Iris

open System
open System.IO
open Microsoft.ML
open Microsoft.ML.Runtime.Api
open Microsoft.ML.Data
open Microsoft.ML.Trainers
open Microsoft.ML.Transforms

let AppPath = Path.Combine(__SOURCE_DIRECTORY__, "../../../..")
let DataPath = Path.Combine(AppPath, "datasets", "iris-full.txt")
let ModelPath = Path.Combine(AppPath, "IrisClustersModel.zip")

type IrisData() =
    [<Column("0")>]
    member val Label = 0.0 with get, set

    [<Column("1")>]
    member val SepalLength = 0.0 with get, set

    [<Column("2")>]
    member val SepalWidth = 0.0 with get, set

    [<Column("3")>]
    member val PetalLength = 0.0 with get, set

    [<Column("4")>]
    member val PetalWidth = 0.0 with get, set

type ClusterPrediction() =
    [<ColumnName("PredictedLabel")>]
    member val SelectedClusterId = 0 with get, set

    [<ColumnName("Score")>]
    member val Distance : float[] = null with get, set

let Train() =
    // LearningPipeline holds all steps of the learning process: data, transforms, learners.
    let pipeline = LearningPipeline()
    // The TextLoader loads a dataset. The schema of the dataset is specified by passing a class containing
    // all the column names and their types.
    pipeline.Add(TextLoader(DataPath).CreateFrom<IrisData>(useHeader=true))
    // ColumnConcatenator concatenates all columns into Features column
    pipeline.Add(ColumnConcatenator("Features",
                                    "SepalLength",
                                    "SepalWidth",
                                    "PetalLength",
                                    "PetalWidth"))
    // KMeansPlusPlusClusterer is an algorithm that will be used to build clusters. We set the number of clusters to 3.
    pipeline.Add(KMeansPlusPlusClusterer(K = 3))

    Console.WriteLine("=============== Training model ===============")
    let model = pipeline.Train<IrisData, ClusterPrediction>()
    Console.WriteLine("=============== End training ===============")

    // Saving the model as a .zip file.
    model.WriteAsync(ModelPath) |> Async.AwaitTask |> Async.RunSynchronously
    Console.WriteLine("The model is saved to {0}", ModelPath)

    model

module TestIrisData =
    let Setosa1 = IrisData(SepalLength = 5.1, SepalWidth = 3.3, PetalLength = 1.6, PetalWidth = 0.2)
    let Setosa2 = IrisData(SepalLength = 5.1, SepalWidth = 3.5, PetalLength = 1.4, PetalWidth = 0.2)
    let Virginica1 = IrisData(SepalLength = 6.4, SepalWidth = 3.1, PetalLength = 5.5, PetalWidth = 2.2)
    let Virginica2 = IrisData(SepalLength = 6.3, SepalWidth = 3.3, PetalLength = 6.0, PetalWidth = 2.5)
    let Versicolor1 = IrisData(SepalLength = 6.4, SepalWidth = 3.1, PetalLength = 4.5, PetalWidth = 1.5)
    let Versicolor2 = IrisData(SepalLength = 7.0, SepalWidth = 3.2, PetalLength = 4.7, PetalWidth = 1.4)

// STEP 1: Create a model
let model = Train()

// STEP 2: Make a prediction
Console.WriteLine()
let prediction1 = model.Predict(TestIrisData.Setosa1)
let prediction2 = model.Predict(TestIrisData.Setosa2)
Console.WriteLine(sprintf "Clusters assigned for setosa flowers:")
Console.WriteLine(sprintf " {%d}" prediction1.SelectedClusterId)
Console.WriteLine(sprintf " {%d}" prediction2.SelectedClusterId)

let prediction3 = model.Predict(TestIrisData.Virginica1)
let prediction4 = model.Predict(TestIrisData.Virginica2)
Console.WriteLine(sprintf "Clusters assigned for virginica flowers:")
Console.WriteLine(sprintf " {%d}" prediction3.SelectedClusterId)
Console.WriteLine(sprintf " {%d}" prediction4.SelectedClusterId)

let prediction5 = model.Predict(TestIrisData.Versicolor1)
let prediction6 = model.Predict(TestIrisData.Versicolor2)
Console.WriteLine(sprintf "Clusters assigned for versicolor flowers:")
Console.WriteLine(sprintf " {%d}" prediction5.SelectedClusterId)
Console.WriteLine(sprintf " {%d}" prediction6.SelectedClusterId)
Console.ReadLine() |> ignore
