Skip to content

Latest commit

 

History

History
123 lines (119 loc) · 12.1 KB

06. Clustering model with Azure ML designer.MD

File metadata and controls

123 lines (119 loc) · 12.1 KB

Create a clustering model with Azure ML designer

Identify clustering machine learning scenarios

  1. Clustering is a form of ML that is used to group similar items into clusters based on their features. For example, a researcher might take measurements of penguins, and group them based on similarities in their proportions.
  2. Clustering is an example of unsupervised ML, in which you train a model to separate items into clusters based purely on their characteristics, or features. There is no previously known cluster value (or label) from which to train the model.
  3. Scenarios for clustering machine learning models.
    1. Cluster customer attribute data into segments for marketing analysis.
    2. Cluster geographic coordinates into regions of high traffic in a city for a ride-share application.
    3. Cluster written feedback into topics to prioritize customer service changes.

Understand steps for clustering

  1. Prepare data - To train a clustering model, you need a dataset that includes multiple observations of the items you want to cluster, including numeric features that can be used to determine similarities between individual cases that will help separate them into clusters.
  2. Train model - To train a clustering model, you need to apply a clustering algorithm to the data, using only the features that you have selected for clustering. You'll train the model with a subset of the data, and use the rest to test the trained model. The K-Means Clustering algorithm groups items into the number of clusters, or centroids, you specify - a value referred to as K. The K-Means algorithm works by:
    1. Initializing K coordinates as randomly selected points called centroids in n-dimensional space (where n is the number of dimensions in the feature vectors).
    2. Plotting the feature vectors as points in the same space, and assigning each point to its closest centroid.
    3. Moving the centroids to the middle of the points allocated to it (based on the mean distance).
    4. Reassigning the points to their closest centroid after the move.
    5. Repeating steps 3 and 4 until the cluster allocations stabilize or the specified number of iterations has completed.
  3. Evaluate performance. The metrics in each row are:
    1. Average Distance to Other Center: This indicates how close, on average, each point in the cluster is to the centroids of all other clusters.
    2. Average Distance to Cluster Center: This indicates how close, on average, each point in the cluster is to the centroid of the cluster.
    3. Number of Points: The number of points assigned to the cluster.
    4. Maximal Distance to Cluster Center: The maximum of the distances between each point and the centroid of that point’s cluster. If this number is high, the cluster may be widely dispersed. This statistic in combination with the Average Distance to Cluster Center helps you determine the cluster’s spread.
  4. Deploy a predictive service

Exercise - Explore clustering with Azure Machine Learning designer

  1. create a clustering model that identifies the species of a penguin from its body mass, flipper length, and other factors.
  2. https://microsoftlearning.github.io/AI-900-AIFundamentals/instructions/02c-create-clustering-model.html
  3. Create a new pipeline in designer. Draft Details : Train Penguin Clustering. Save
  4. Create a dataset. Data (Assets).
    1. Name: penguin-data | Description: Penguin data | Dataset type: Tabular | Web URL: Web URL: https://aka.ms/penguin-data | Skip data validation: do not select | File format: Delimited | Delimiter: Comma | Encoding: UTF-8 | Column headers: Only first file has headers Skip rows: None | Dataset contains multi-line data: do not select | Include all columns other than Path | Review | Create |
    2. Explore page to see a sample of the data. This data represents measurements of the culmen (bill) length and depth, flipper length, and body mass for multiple observations of penguins. There are three species of penguin represented in the dataset: Adelie, Gentoo, and Chinstrap.
  5. Load data to canvas.
    1. Designer -> select the Train Penguin Clustering
    2. Click on Data -> place the penguin-data dataset onto the canvas -> Preview data
      1. Slect the CulmenLength. column : Length of the penguin’s bill in millimeters.
      2. CulmenDepth: Depth of the penguin’s bill in millimeters.
      3. FlipperLength: Length of the penguin’s flipper in millimeters.
      4. BodyMass: Weight of the penguin in grams.
      5. Species: Species indicator (0:”Adelie”, 1:”Gentoo”, 2:”Chinstrap”)
      6. There are two missing values in the CulmenLength column (the CulmenDepth, FlipperLength, and BodyMass columns also have two missing values).
      7. The measurement values are in different scales (from tens of millimeters to thousands of grams).
  1. Apply transformations -> Asset library -> Component -> Add & Connect
    1. Select Columns in Dataset module -> Edit column -> select the column names CulmenLength, CulmenDepth, FlipperLength, and BodyMass
    1. Clean Missing Data module -> Edit column -> select With rules -> include All columns
    2. With the Clean Missing Data module still selected, in the settings pane: Minimum missing value ratio: 0.0 | Maximum missing value ratio: 1.0 | Cleaning mode: Remove entire row
    3. Normalize Data module -> set the Transformation method to MinMax -> Edit column -> select With rules and include All columns
  2. Run the pipeline -> create new experiment named mslearn-penguin-training -> Submit
  3. View the transformed data -> Completed Job detail -> Normalize Data -> Preview data -> Transformed dataset to view the results -> the Species column has been removed, there are no missing values, and the values for all four features have been normalized to a common scale.
  1. Add training modules - ready to use them to train a clustering model.
    1. Split Data module -> Splitting mode: Split Rows | Fraction of rows in the first output dataset: 0.7 | Randomized split: True | Random seed: 123 | Stratified split: False
    2. Train Clustering Model module -> should assign clusters to the data items by using all of the features you selected from the original dataset -> Edit column -> With rules option to include all columns
    3. K-Means Clustering module -> groups items into the number of clusters you specify - a value referred to as K -> set the Number of centroids parameter to 3.
    4. think of data observations, like the penguin measurements, as being multidimensional vectors. The K-Means algorithm works by:
      1. initializing K coordinates as randomly selected points called centroids in n-dimensional space (where n is the number of dimensions in the feature vectors).
      2. Plotting the feature vectors as points in the same space, and assigning each point to its closest centroid.
      3. Moving the centroids to the middle of the points allocated to it (based on the mean distance).
      4. Reassigning the points to their closest centroid after the move.
      5. Repeating steps 3 and 4 until the cluster allocations stabilize or the specified number of iterations has completed.
    5. After using 70% of the data to train the clustering model, you can use the remaining 30% to test it by using the model to assign the data to clusters.
    6. Assign Data to Clusters module
  2. Run the training pipeline- > using the existing experiment named mslearn-penguin-training -> Submit -> Finished Job detail -> Assign Data to Clusters -> Preview data -> Results dataset to view the results. Assignments column, which contains the cluster (0, 1, or 2) to which each row is assigned. There are also new columns indicating the distance from the point representing this row to the centers of each of the clusters - the cluster to which the point is closest is the one to which it is assigned.
    1. The model is predicting clusters for the penguin observations, but how reliable are its predictions? To assess that, you need to evaluate the model.
    2. Evaluating a clustering model is made difficult by the fact that there are no previously known true values for the cluster assignments. A successful clustering model is one that achieves a good level of separation between the items in each cluster, so we need metrics to help us measure that separation.
  1. Add an Evaluate Model module -> using the existing mslearn-penguin-training experiment-> Submit -> finished Job detail -> Evaluate Model -> Preview data -> Evaluation results. Review the metrics in each row:
    1. Average Distance to Other Center
    2. Average Distance to Cluster Center
    3. Number of Points
    4. Maximal Distance to Cluster Center
  2. Create an inference pipeline -> you have a working clustering model, you can use it to assign new data to clusters in an inference pipeline. The inference pipeline uses the model to assign new data observations to clusters.
    1. Create inference pipeline -> Real-time inference pipeline -> Draft details, rename Predict Penguin Clusters.
    2. Review the pipeline. The transformations and clustering model in your training pipeline are a part of this pipeline. The trained model will be used to score the new data. The pipeline also contains a web service output to return results.
    3. Add a web service input component for new data to be submitted.
    4. Remove the Select Columns in Dataset module, which is now redundant.
    5. The inference pipeline assumes that new data will match the schema of the original training data, so the penguin-data dataset from the training pipeline is included. However, this input data includes a column for the penguin species, which the model does not use. Delete both the penguin-data dataset
    6. Enter Data Manually module -> use the following CSV input, which contains feature values for three new penguin observations (including headers):
        CulmenLength,CulmenDepth,FlipperLength,BodyMass
        39.1,18.7,181,3750
        49.1,14.8,220,5150
        46.6,17.8,193,3800
      
    7. Submit the pipeline -> a new experiment named mslearn-penguin-inference -> finished Job detail -> Assign Data to Clusters -> Preview data -> Results dataset to see the predicted cluster assignments and metrics for the three penguin observations in the input data.
  3. Deploy a service -> Select Job detail -> Deploy -> Name: predict-penguin-clusters | Description: Cluster penguins | Compute type: Azure Container Instance
  4. Test the service -> Endpoints -> open the predict-penguin-clusters real-time endpoint -> Test -> see the output ‘Assignments’. Notice how the assigned cluster is the one with the shortest distance to cluster center.
      {
         "Inputs": {
             "input1": [
                 {
                     "CulmenLength": 49.1,
                     "CulmenDepth": 4.8,
                     "FlipperLength": 1220,
                     "BodyMass": 5150
                 }
             ]
         },
         "GlobalParameters":  {}
     }
    
  5. Clean-up

Quiz

  1. You are using an Azure Machine Learning designer pipeline to train and test a K-Means clustering model. You want your model to assign items to one of three clusters. Which configuration property of the K-Means Clustering module should you set to accomplish this?
  • Set Number of Centroids to 3
  • Set Random number seed to 3
  • Set Iterations to 3
  1. You use Azure Machine Learning designer to create a training pipeline for a clustering model. Now you want to use the model in an inference pipeline. Which module should you use to infer cluster predictions from the model?
  • Score Model
  • Assign Data to Clusters
  • Train Clustering Model