Hands-on lab on building AI and ML models with a visual, no-code interface in Watson Studio and Watson Machine Learning. The tools used are Data Refinery and Modeler Flows, available in Watson Studio Cloud, Local, and Desktop, and in Machine Learning on Z.
-
Download the data from here.
-
Load the data to your project. To do so, click on the plus (+) sign at the top right corner and select "Add data set".
Flows allow you to drag and drop nodes and connect them. Each node could be a dataset, a transformation, or a model, among other things.
-
On the top right, click the data panel (10101) and drag and drop the airline.csv dataset onto the modeler flow. A new node "airline.csv" should appear on the canvas.
-
Under the Outputs Node List, drag and drop the "Data Audit" node.
-
Connect the data node (airline.csv) to the Data Audit node. Right click on the Data Audit node and select "Run". To see the results, go to the top right corner and click the round arrow pointing down. You will see "Data Audit of [29 fields]".
-
Double click on "Data Audit of [29 fields]" and scroll down until you see the column "% Complete" which indicates the percentage of non-missing values per column. Search for columns that are < 30% complete, i.e., more than 70% missing values. These should be:
-
Remove columns that have more than 70% missing values. To do so, drag and drop a Filter node (from the Field Operations list) onto the canvas and connect it to the data node "airline.csv". Double click on the Filter node and then click on Filter in the collapsible menu on the right. Click on "Add Columns" to select the fields to filter out. Click Save.
-
Verify that the fields were filtered out by right clicking the Filter node then Preview. Scroll all the way to the right and make sure the intended fields were actually removed.
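For readers who want to see what the Data Audit and Filter steps amount to, here is a minimal sketch in plain Python. The sample data and the column name `SparseCol` are made up for illustration and are not taken from airline.csv. Treating empty cells and "NA" as missing is also an assumption.

```python
import csv
import io

# Tiny made-up sample standing in for airline.csv; empty cells are missing values.
SAMPLE = """Year,ArrDelay,DepDelay,SparseCol
2008,5,3,
2008,-2,0,
2008,,1,
2008,20,18,7
"""

def percent_complete(rows, columns):
    """Return {column: percentage of non-missing values}."""
    total = len(rows)
    return {
        col: 100.0 * sum(1 for r in rows if r[col] not in ("", "NA")) / total
        for col in columns
    }

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
columns = list(rows[0].keys())
completeness = percent_complete(rows, columns)

# Keep only columns that are at least 30% complete (the threshold used in the lab).
kept = [c for c in columns if completeness[c] >= 30.0]
filtered = [{c: r[c] for c in kept} for r in rows]

print(completeness)
print(kept)
```

In this sample only `SparseCol` falls below the threshold, so it is the only column removed.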
-
Remove cancelled or diverted flights. To do so, drag and drop a Select node (from Record Operations) onto the canvas and connect it to the Filter node. Then double click on it, go to Settings, set the mode to "Discard" so that matching records are removed rather than kept, and create the condition "Cancelled = 1 or Diverted = 1".
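The effect of this step can be sketched in a few lines of Python. The sketch assumes the Select node is meant to discard the records that match the condition. The helper name `discard` and the sample records are hypothetical.

```python
def discard(records, condition):
    """Mimic a Select node that drops every record matching the condition."""
    return [r for r in records if not condition(r)]

# Made-up records; only the two flag columns matter here.
flights = [
    {"FlightNum": 1, "Cancelled": 0, "Diverted": 0},
    {"FlightNum": 2, "Cancelled": 1, "Diverted": 0},
    {"FlightNum": 3, "Cancelled": 0, "Diverted": 1},
    {"FlightNum": 4, "Cancelled": 0, "Diverted": 0},
]

completed = discard(flights, lambda r: r["Cancelled"] == 1 or r["Diverted"] == 1)
print([r["FlightNum"] for r in completed])  # [1, 4]
```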
-
Open the visualization tool. Go to your project, then Assets, then Data Sets or Data Assets, and look for airline.csv. Click on the three vertical dots menu and select "Data Visualization".
-
Create a histogram of the flight arrival delay. Select the Histogram chart type and the column ArrDelay.
-
Plot flights per year using a histogram of the field Year. Unselect "Show distribution curve". Click Apply.
-
Visualize busiest airlines using a barplot of column UniqueCarrier. Click Apply.
-
Visualize busiest airports using a barplot of column Origin. Click Apply.
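The bar plots for carriers and airports are simply counts per category, sorted from busiest to least busy. A rough text-only sketch of the same idea follows. The carrier codes are made-up sample values and `bar_chart` is a hypothetical helper.

```python
from collections import Counter

def bar_chart(values, width=20):
    """Return one text line per category, busiest first, scaled to `width` characters."""
    counts = Counter(values)
    largest = max(counts.values())
    lines = []
    for category, count in counts.most_common():
        bar = "#" * round(width * count / largest)
        lines.append(f"{category:<4}{bar} {count}")
    return lines

# Made-up values standing in for the UniqueCarrier (or Origin) column.
carriers = ["WN", "AA", "WN", "UA", "WN", "AA"]
chart = bar_chart(carriers)
print("\n".join(chart))
```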
-
Visualize busiest times to fly with a histogram of departure time. Set "Bin width" to 2 to get hourly counts. Unselect "Show distribution curve".
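Hourly counts can also be computed directly. The sketch below assumes departure times are stored as integers in hhmm form (for example 1830 for 18:30); that encoding is an assumption about the dataset, and the sample values are made up.

```python
from collections import Counter

def hour_of(dep_time):
    """Extract the hour from an hhmm integer; 2400 is folded into hour 0."""
    return (dep_time // 100) % 24

# Made-up departure times in the assumed hhmm encoding.
dep_times = [5, 630, 655, 1830, 1859, 2400]
hourly = Counter(hour_of(t) for t in dep_times)

for hour, count in sorted(hourly.items()):
    print(f"{hour:02d}:00  {count}")
```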
-
Draw a correlation plot using the Scatterplot matrix chart. Explore with different attributes such as ArrDelay and DepDelay.
Now let's go back to our Modeler Flow to do some more data exploration:
-
Compute correlations between ArrDelay and the rest of the columns. First, drag and drop the Statistics node (under Outputs) and connect it to the Select node. Double click on the Statistics node, go to Settings, and click on "Add Columns" under "Examine". Select the ArrDelay column and unselect all statistics. Under "Correlate", click "Add Columns" and select all columns. Unselect "Show correlation strength labels in output". Click Save.
-
Right click on the Statistics node and click Run. To see the results, click on the round arrow pointing down in the top right corner and double click on Statistics.
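To make the numbers in the Statistics report less opaque, here is a small sketch of a correlation computation, assuming the reported value is the Pearson correlation coefficient. The sample columns and values are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values; only the column names mirror the lab.
data = {
    "ArrDelay": [-5.0, 0.0, 10.0, 25.0, 40.0],
    "DepDelay": [-2.0, 1.0, 8.0, 20.0, 38.0],
    "Distance": [300.0, 1200.0, 450.0, 800.0, 600.0],
}

# Correlate ArrDelay with every other column, as in the Statistics node setup.
correlations = {
    col: pearson(data["ArrDelay"], values)
    for col, values in data.items()
    if col != "ArrDelay"
}
for col, r in correlations.items():
    print(f"ArrDelay vs {col}: {r:.3f}")
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or none.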
-
Create a new column "class" using the Derive node (under Field operations). Double click on it and go to Settings. Set the name for the new column as "class". Then set "Derive as" to "Nominal". Configure values as follows:
-
Early: if ArrDelay < 0.0,
-
Delayed: if ArrDelay > 15.0,
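The two rules above can be sketched as a small function. The steps give no label for delays between 0 and 15 minutes, so this sketch returns `None` there instead of guessing one. The sample delays are made up.

```python
from collections import Counter

def derive_class(arr_delay):
    """Apply the two rules listed above to a single ArrDelay value."""
    if arr_delay < 0.0:
        return "Early"
    if arr_delay > 15.0:
        return "Delayed"
    # 0 to 15 minutes: no label is given in the steps above, so none is assigned here.
    return None

# Made-up arrival delays in minutes.
delays = [-7.0, -1.0, 0.0, 12.0, 15.0, 16.0, 45.0]
labels = [derive_class(d) for d in delays]

print(labels)
print(Counter(labels))
```

Counting the labels, as in the last line, gives the same information as a class distribution plot.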
-
Plot the distribution of "class" using the Distribution Node (under Graph operations).
-
Check the class distribution. Click the round arrow pointing down on the top right and double click the class distribution.
-
Add a Type node (under Field operations) and change the class role to "target".
-
Split data into train (80%) and test (20%) sets using the Partition node (under Field operations).
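A rough sketch of an 80/20 split follows. It assumes each record is assigned to a partition at random, so the resulting sizes are only approximately 80% and 20%. The function name, the seed, and the stand-in records are illustrative choices, not settings from the lab.

```python
import random

def partition(records, train_fraction=0.8, seed=42):
    """Randomly assign each record to the train or test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for record in records:
        if rng.random() < train_fraction:
            train.append(record)
        else:
            test.append(record)
    return train, test

records = list(range(1000))  # stand-in for 1000 flight records
train, test = partition(records)
print(len(train), len(test))
```

Fixing the seed matters when comparing models: every model then sees exactly the same training and test records.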
-
Add a C5.0 model node (under model operations). Right click on it and select Run to train the model.
-
Right click the output model and click the View Model option to get model details. Why do you think one single variable is getting all the importance?
-
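Go back to the Type node created a few steps back and change the role of the ArrDelay and DepDelay columns to None (do the same for all inputs that would not be available in practice when making a prediction). Since the class is derived from ArrDelay, keeping it as an input lets the model simply read off the answer. Run the C5 node again to retrain the model with the remaining inputs.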
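One way to see why a column such as ArrDelay must not be used as an input: the class is derived from it, so a single threshold on that column can reproduce the labels exactly. The sketch below is not C5.0. It only searches for the best one-threshold rule per column, on made-up data with a simplified two-class label.

```python
def best_stump_accuracy(values, labels):
    """Best accuracy of the rule 'value > t -> Delayed, else Early' over all thresholds t."""
    best = 0.0
    for t in sorted(set(values)):
        predicted = ["Delayed" if v > t else "Early" for v in values]
        correct = sum(p == y for p, y in zip(predicted, labels))
        best = max(best, correct / len(labels))
    return best

# Made-up data; the label is derived from arr_delay, which is what causes the leak.
arr_delay = [-10.0, -3.0, 20.0, 35.0, -1.0, 50.0]
distance = [500.0, 1200.0, 700.0, 300.0, 900.0, 1100.0]
labels = ["Early" if d < 0 else "Delayed" for d in arr_delay]

print("ArrDelay:", best_stump_accuracy(arr_delay, labels))  # 1.0
print("Distance:", best_stump_accuracy(distance, labels))   # 0.5
```

A column that predicts the target perfectly but is only known after the flight has landed is of no use for prediction in practice.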
-
Inspect the model's quality by connecting the Analysis node to the yellow node (the trained model). Right click and run to see quality metrics such as the confusion matrix and accuracy on the train and test sets.
-
Open the quality metrics report by clicking the round arrow pointing down on the top right and double clicking the Analysis report (under Output operations).
-
See how the accuracy and the confusion matrices change in train and test datasets.
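For reference, accuracy and a confusion matrix can be computed from actual and predicted labels as in the sketch below. The label lists are made up and the function names are illustrative.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs; off-diagonal pairs are errors."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Fraction of records whose predicted label equals the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Made-up labels for five flights.
actual = ["Early", "Early", "Delayed", "Delayed", "Early"]
predicted = ["Early", "Delayed", "Delayed", "Delayed", "Early"]

matrix = confusion_matrix(actual, predicted)
for (a, p), count in sorted(matrix.items()):
    print(f"actual={a:<8} predicted={p:<8} count={count}")
print("accuracy:", accuracy(actual, predicted))  # 0.8
```

A large gap between train and test accuracy usually means the model has overfitted the training data.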
-
Connect a Table node (under Output operations) to the trained model. Right click on it and select Run.
-
To see the predictions for the class field, click the round arrow pointing down on the top right and double click the most recent Table report.
-
Go back to your project and look for your model. Under Actions, click "Publish".
-
In the list of published models (in the Models tab), under Actions, click "Deploy".
-
Congratulations! Your model has been deployed as a web service. From here, you can check the deployment details, schedule evaluations with new labeled data, test the API and update models once trained with new data.