Tutorial

Table of Contents

CNTK Tutorial: Getting Started
Where to Go Next

CNTK Tutorial: Getting Started

CNTK is a framework for describing learning machines. Although intended for neural networks, the learning machines are arbitrary in that the logic of the machine is described by a series of computational steps in a Computational Network. A Computational Network defines the function to be learned as a directed graph where each leaf node consists of an input value or parameter, and each non-leaf node represents a matrix or tensor operation upon its children. The beauty of CNTK is that once a computational network has been described, all the computation required to learn the network parameters is taken care of automatically. There is no need to derive gradients analytically or to code the interactions between variables for backpropagation.

In this tutorial we will walk through several complete CNTK examples, showing

how to describe a network in our network-description language BrainScript;
how to set the learning parameters; and
how to read in, and write out, data.

We will start with a simple implementation of binary classification using the linear model Logistic Regression. From there, we will expand to multiclass classification towards the end of this tutorial. Our next tutorial will tackle a more complex multiclass classification problem that will greatly benefit from a deep network architecture.

Install CNTK

Before starting, please follow the instructions here to set up CNTK on your machine.

Logistic Regression (LR)

Logistic regression is a simple model for performing binary classification. Given some real-valued d-dimensional example x = (x_1, ..., x_d)', we want to output one of two labels: 0 or 1, true or false, spam or not-spam, etc. So, we want to learn a function y = f(x; w, b) where y is in {0, 1}, and w = (w_1, ..., w_d) and b are the parameters we will learn that describe the relationship between x and y.

Like other regression models, logistic regression models the relationship between the independent variable x and the dependent variable z by a linear combination of x with the parameters w and b. In particular, the evidence is described as a linear regression:

z = b + w_1 x_1 + w_2 x_2 + ...

The model learns the weights w and b and determines that, for example, if a high value of x_1 corresponds to a more likely label of y = 1, then the weight w_1 should also be high. If, on the other hand, the value of x_2 corresponds to a more likely label of y = 0, then w_2 should have a large negative value.

The logistic part of logistic regression comes into play because instead of predicting an unbounded number, we are instead interested in predicting the probability of one of two labels. To output a probability, we transform the evidence using the logistic function: sigma(x) = 1 / (1 + exp(-x)). The logistic function looks like this:

Logistic Function

So, applying the logistic function has the effect of 'squashing' the evidence to fall between 0 and 1. Large positive evidence values become close to 1, and large negative evidence values become close to 0. This allows our model to output probabilities.

Our model can then be seen as y ~ p(y | x; w, b) = sigma(wx+b) = 1 / (1 + exp(-wx-b)). We then predict y = 1 if p(y = 1 | x; w, b) > 0.5 and y = 0 otherwise.
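To make the prediction rule concrete, here is a minimal sketch in plain Python (not CNTK code). The function names `sigmoid` and `predict` and the parameter values are assumptions chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    """Return (probability of y = 1, predicted label) for one example x."""
    z = b + sum(w_i * x_i for w_i, x_i in zip(w, x))  # evidence
    p = sigmoid(z)
    return p, (1 if p > 0.5 else 0)

# Hypothetical parameters, chosen only for illustration.
w = [1.0, -2.0]
b = 0.5
p, y = predict([3.0, 1.0], w, b)
print(p, y)
```

Here the evidence is positive, so the probability is above 0.5 and the predicted label is 1.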

Learning Parameters w and b

Without CNTK, we would need to choose an optimization procedure and work out the derivatives of a cost function with respect to the parameters we want to learn. Seeing logistic regression as a probabilistic model, we can maximize the likelihood of the data. This turns out to be the same as minimizing the cross-entropy cost function, which for logistic regression is also known as the 'logistic loss function'. In general this cannot be done analytically, but we can determine analytic expressions for the gradients and then use gradient descent to converge towards the correct parameters.

CNTK uses the common stochastic gradient descent (SGD) algorithm to learn parameters in a Computational Network. The gradients are determined through automatic differentiation. Before going through how we would set up the problem in CNTK, let's take a look at the data we will use for our first classification problem.

Synthetic Data

We will start with a problem that is easy and simple to visualize. The data is generated by sampling from two 2-dimensional Gaussians with different means. The script used to generate the data is here and pre-generated data from this script is included for training and testing. Plotting the training data looks like this:

Generated 2d data

and our task is to classify any point in the 2D plane as originating from one of the two classes. As we said, it's an easy problem to start with :-)
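The generation script itself is not reproduced here. As a rough sketch of the idea, the following Python samples points from two 2-D Gaussians and prints them as two feature columns followed by a label column. The means, the unit standard deviation, and the function name `make_samples` are assumptions and may differ from the tutorial's script.

```python
import random

def make_samples(n_per_class, mean0, mean1, stddev=1.0, seed=0):
    """Sample n_per_class points from each of two 2-D Gaussians.

    Returns shuffled rows of (x1, x2, label).
    """
    rng = random.Random(seed)
    rows = []
    for label, (m1, m2) in enumerate((mean0, mean1)):
        for _ in range(n_per_class):
            rows.append((rng.gauss(m1, stddev), rng.gauss(m2, stddev), label))
    rng.shuffle(rows)
    return rows

# The means below are made up; the tutorial's script may use other values.
rows = make_samples(5, mean0=(1.0, 1.0), mean1=(4.0, 4.0))
for x1, x2, label in rows:
    # Two feature columns followed by the label column.
    print(f"{x1:.4f} {x2:.4f} {label}")
```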

Setting up Logistic Regression in CNTK

To set up and train the Computational Network, we will write a .cntk configuration file that

describes the network;
specifies the commands we want to perform on the network (train, test, output, etc.);
sets how we want to learn the parameters of the network (SGD and its parameters); and
specifies how CNTK should read and write the data.

We will start with how to describe the Computational Network.

####Describing Logistic Regression as a Computational Network

To describe our Computational Network, we use CNTK's network-description language BrainScript. We will need to define:

the input features
the labels
the parameters to learn
the computational operations, and
the root nodes (outputs)

Let's go through them each in turn. First, though, here is the .cntk block that describes our network:

BrainScriptNetworkBuilder = [

    # sample and label dimensions
    SDim = 2       # sample dimension
    LDim = 1       # label dimension

    features = Input (SDim, 1)
    labels   = Input (LDim, 1)

    # parameters to learn
    b = Parameter (LDim, 1)     # bias
    w = Parameter (LDim, SDim)  # weights

    # operations
    p = Sigmoid (w * features + b)

    lr  = Logistic (labels, p)
    err = SquareError (labels, p)

    # root nodes
    featureNodes  = (features)
    labelNodes    = (labels)
    criteriaNodes = (lr)
    evalNodes     = (err)
    outputNodes   = (p)
]

Input features and labels

The two principal inputs to the network are the features and the labels. Because we have two-dimensional data, we set the features to be of type Input where the dimension of a sample is 2. All objects are matrices so we defined a single sample of our features to be a matrix with SDim (2) rows and 1 column.

We must also define the possible labels. Because we are performing binary classification, we could set this up either as a multi-class classification problem where we will predict the probability of label 0 and the probability of label 1, or we can see the problem as the equivalent true binary classification problem that only wants to predict one of the labels. In the latter case, the probability of the other label is of course p(y=0) = 1 - p(y=1). In this tutorial we match the original definition of LR and go with the one label prediction.

Model parameters

Above we defined the evidence for logistic regression as

z = b + w_1 x_1 + w_2 x_2 + ...

Because our input data has 2 dimensions, and we are only predicting a single label, w will be a row vector (expressed as a [1 x 2] matrix) and b will be a scalar value (expressed as a [1 x 1] matrix).

Computational operations

Next we need to express the logistic-regression function. This is straight-forward in BrainScript:

p = Sigmoid (w * features + b)

Under the hood, this will expand into these 3 successive operations as nodes in the graph:

multiply the weights w with the features,
add the bias term b, and
squash the evidence down to a probability p.
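The three steps above can be traced by hand. The sketch below is plain Python, not CNTK, and uses lists of rows as matrices so that the [1 x 2], [1 x 1] and [2 x 1] shapes stay visible. The helper names `times`, `plus` and `sigmoid` and all numeric values are illustrative assumptions.

```python
import math

def times(a, b):
    """Matrix product of a (n x k) and b (k x m), both lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def plus(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sigmoid(a):
    """Element-wise logistic function."""
    return [[1.0 / (1.0 + math.exp(-x)) for x in row] for row in a]

w = [[0.5, -0.25]]          # [1 x 2] weights (made-up values)
b = [[0.1]]                 # [1 x 1] bias (made-up value)
features = [[2.0], [4.0]]   # [2 x 1] one sample

t = times(w, features)      # [1 x 1] weighted sum
z = plus(t, b)              # [1 x 1] evidence
p = sigmoid(z)              # [1 x 1] probability
print(t, z, p)
```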

Next, we want to describe the criteria that are used to learn the parameters. The line

lr = Logistic (labels, p)

sets up a criterion node that is used for learning the parameters that are involved in any related computation. Here, we are specifying that CNTK should run the Logistic loss function with the correct answers labels and the current predictions p. The function Logistic is built into CNTK and computes the loss as follows:

-sum (labels * log(p) + (1 - labels) * log(1 - p) )

... continue here

Writing this expression directly in the configuration file would have worked just as well (though NDL doesn't provide overloaded operators). This makes it easy to define your own loss functions.
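For reference, the loss formula given above can be written out in a few lines of plain Python. This is only a sketch of the arithmetic, not CNTK's implementation. The name `logistic_loss` and the sample predictions are assumptions, and the predictions must lie strictly between 0 and 1.

```python
import math

def logistic_loss(labels, probs):
    """-sum(y * log(p) + (1 - y) * log(1 - p)) over all samples."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total

labels = [1.0, 0.0, 1.0]
probs = [0.9, 0.2, 0.6]   # made-up predictions
print(logistic_loss(labels, probs))
```

A confident correct prediction contributes very little to the sum, while a hesitant or wrong one contributes much more.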

The lines

err = SquareError (labels, p)
evalNodes = (err)

are used for model evaluation as the training occurs. While the learning is solely concerned with minimizing the logistic loss, in the end we're actually (probably) most interested in accuracy. We can then evaluate the model as it trains by setting up an evaluation node as we do here. Note that we make use of the SquareError function. That means that, though we are minimizing the logistic loss, we want to see how sure our model is, combined with how correct it is, on the test data. For example, if our model predicts that a sample has p(y=1 | x) = 0.99 and the true label is y=1, we've done very well. However, the squared error for this sample is still (1-0.99)^2 = 0.0001. When we talk about multi-class classification with 3 or more classes, we will use the classification error for model evaluation.
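The difference between the two evaluation measures can be seen on a handful of made-up predictions. In this sketch, averaging over the samples is an assumption made to match the per-sample figures CNTK reports, and the function names are illustrative.

```python
def square_error(labels, probs):
    """Mean squared difference between labels and predicted probabilities."""
    return sum((y - p) ** 2 for y, p in zip(labels, probs)) / len(labels)

def classification_error(labels, probs, threshold=0.5):
    """Fraction of samples whose thresholded prediction differs from the label."""
    wrong = sum(1 for y, p in zip(labels, probs)
                if (1 if p > threshold else 0) != y)
    return wrong / len(labels)

labels = [1, 1, 0, 0]
probs = [0.99, 0.60, 0.40, 0.70]   # made-up predictions
print(square_error(labels, probs))          # penalizes low confidence too
print(classification_error(labels, probs))  # counts only wrong decisions
```

Only the last sample is misclassified, yet the second and third samples still add to the squared error because the model is not very sure about them.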

#####Root nodes

Finally, we describe the root nodes of the network. This section is for specifying some special nodes that have a particular meaning in CNTK. For example, the CriteriaNodes are those where an objective is specified that CNTK will try to achieve, the EvalNodes are the nodes used for performing evaluation as the model is trained, and the OutputNodes are the nodes whose values will be output. This final node is important because these are the actual values that the model is predicting for a given input. Again, this node is optional but useful for seeing how the model is performing on its ultimate goal.

####Specify the commands to run

The first line (after the comments) of the .cntk file specifies the commands that CNTK should run. These commands are specified by the user and defined in the configuration file. Our example includes the following:

    command=Train:Output:dumpNodeInfo:Test

As you can see, each of the commands is separated by a : and you can have as many commands as you require for your particular use case. Commands are simply a way to organize functions of your network, and the CNTK program will just execute them one by one. Let's go through the Train command with the details hidden away.

#####Command Train

# training config
Train = [
    action = "train"

    NDLNetworkBuilder = [
        ...
    ]

    SGD = [
        ...
    ]

    reader = [
        ...
    ]
]

We first specify that the action of this command is to "train". There is nothing special about the command name "Train" but there is something special about the action "train". Within the Train command, we describe our network (as discussed in the previous section), we specify how to learn on the computational network with the SGD Learner (next section), and we specify the reader that will import our training data into the network (section below on Data Readers).

#####Command Output

The Output command is defined in our CNTK configuration file as follows:

# output the results
Output=[
	action="write"
	reader=[
		readerType="UCIFastReader"
		file="Test.txt"
		features=[
			dim=$dimension$
			start=0
		]
		labels=[
			start=2
			dim=1
			labelType=regression
		]
	]
	outputPath = "LR.txt"		# dump the output as text
]

Again, there is nothing special about the word Output, but there is something special about the action write. We first say that this command will take the action to "write" something. For that, we specify the reader (details explained below) and the outputPath. The outputPath gives the prefix for the file that will be outputted; the actual file will have the name of the output variable appended to the end.

So, what does this output? It gives the prediction values at our output nodes (remember in our network description we set the output nodes to be s with OutputNodes=(s)). Our Output command will read in our test data "test.txt", run it through the learned model, and output the values for each of the examples into the file LR.txt.s.

#####Command Test

The Test command performs the action test and will make use of the specified EvalNodes from our network description. Just as above with the Output command, we will specify a reader for some test data but this time instead of simply outputting the prediction for each test sample, it will use the function SquareError as specified in the configuration and compute the error for our test set. Here is the description of the Test command:

Test=[
	action="test"
	reader=[
		readerType="UCIFastReader"
		file="Test.txt"
		features=[
			dim=2
			start=0
		]
		labels=[
			start=$dimension$
			dim=1
			labelDim=2
		]
	]
]

While Logistic measured the training loss and guided the model parameter updates, we use SquareError to measure the classification error after each iteration on the test set (this might normally be done with a separate validation set). It takes the matrix s, which is the prediction of our model, and compares it to the correct labels. The final output will give the error per sample of our network on the test/validation data.

Remember that in our CNTK configuration file it looks like this:

      EP = SquareError(labels, s)
      EvalNodes = (EP)

#####Command dumpNodeInfo

The dumpNodeInfo command is defined as follows:

    dumpNodeInfo=[
        action=dumpnode
        printValues=true
    ]

This command simply outputs the values of all parameters in the network. It can be useful for debugging and for doing further processing with what your network learned. For example, for the synthetic data described above, CNTK will learn the parameters W and the bias B. We can use CNTK to perform prediction on a test set (as described below), but we might also be interested in visualizing what decision boundary is described by our model. Let's do that here.

The dumpNodeInfo command outputs a file called LR.dnn.__AllNodes__.txt into the "Models" directory. Following the training of our example, that file contains:

B=LearnableParameter [1,1]   NeedGradient=true 
 -6.67130613 
 #################################################################### 
EP=SquareError ( labels , s ) 
features=InputValue [ 2 x 1 {1,2} ] 
labels=InputValue [ 1 x 1 {1,1} ] 
LR=Logistic ( labels , s ) 
s=Sigmoid ( z ) 
t=Times ( W , features ) 
W=LearnableParameter [1,2]   NeedGradient=true 
 1.23924482 1.59913719 
 #################################################################### 
z=Plus ( t , B )

It describes the network and, most importantly for our current purposes, gives the learned values of the bias B and the parameters W. Let's see how this translates to our decision boundary. Our evidence is E = x_1 W_1 + x_2 W_2 + B. If E > 0, then we predict y = 1. If E < 0, we predict y = 0. Therefore, our decision boundary occurs when E = 0. Then,

0 = x_1 W_1 + x_2 W_2 + B

x_2 = -x_1 (W_1 / W_2) - (B / W_2)

The above is just the standard equation of a line y = mx + b with y = x_2, x = x_1, the slope m = -W_1 / W_2, and the intercept b = -B / W_2. If we then plug in the values from the AllNodes file, B = -6.67130613, W_1 = 1.23924482, and W_2 = 1.59913719, we see:

Decision Boundary

Not bad!
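The same arithmetic can be checked in a few lines of Python using the learned values from the dump. The helper names `boundary_x2` and `evidence` are assumptions introduced for this sketch.

```python
# Learned values copied from the dump above.
B = -6.67130613
W1 = 1.23924482
W2 = 1.59913719

slope = -W1 / W2        # m in x_2 = m * x_1 + c
intercept = -B / W2     # c

def boundary_x2(x1):
    """x_2 coordinate of the decision boundary at a given x_1."""
    return slope * x1 + intercept

def evidence(x1, x2):
    """E = x_1 W_1 + x_2 W_2 + B; positive means predict y = 1."""
    return x1 * W1 + x2 * W2 + B

print(slope, intercept)
print(evidence(1.0, boundary_x2(1.0)))  # approximately 0 on the boundary
```

Points above the line give positive evidence (predict y = 1) and points below it give negative evidence (predict y = 0).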

####Set up the learning algorithm

Finally, to be able to learn the parameters that we used above, CNTK needs to make use of a learning algorithm. That algorithm is stochastic gradient descent (SGD). SGD iteratively looks at some fixed-size subset of the training examples (called a "minibatch") and updates the parameters in the direction of the cost function gradients after every such step. Optimizing parameters for neural nets is in general not a convex optimization problem, so for all but the simplest cases there is not a single global optimum. By looking at a minibatch instead of the full data, every update follows just a rough approximation of the final training target. SGD's "stochastic" nature helps it jump out of local optima and has proven to be both simple and effective for finding good solutions. CNTK's implementation of SGD offers a number of settings to fit the problem at hand. For our simple problem, we set up SGD in the .cntk file as follows:

SGD = [	
	epochSize=0
	minibatchSize=25
	learningRatesPerMB=0.1
	maxEpochs=50
]

The epochSize determines how many examples will be examined per epoch. If it is set to 0, all of the training data will be examined in every epoch (an epoch can also be thought of as an iteration). Within each epoch, mini-batches of points are examined together and their updates are computed. At the end of the mini-batch, the average update is used to alter the parameters. In the case above, 25 samples will be examined at a time before the parameters are updated.
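To illustrate what "average update per mini-batch" means, here is a small plain-Python sketch of minibatch SGD for logistic regression. It is not CNTK's implementation. The data, the function name `sgd_epoch`, and the hand-derived gradient are assumptions for illustration, and only the values 25, 0.1 and 50 are taken from the SGD block above.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_epoch(data, w, b, minibatch_size, learning_rate):
    """One pass over data; data is a list of (x, y) with x a list of floats."""
    for start in range(0, len(data), minibatch_size):
        batch = data[start:start + minibatch_size]
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in batch:
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            # Gradient of the logistic loss w.r.t. the evidence is (p - y).
            for i, xi in enumerate(x):
                grad_w[i] += (p - y) * xi
            grad_b += (p - y)
        # Apply the average update over the minibatch.
        scale = learning_rate / len(batch)
        w = [wi - scale * gi for wi, gi in zip(w, grad_w)]
        b -= scale * grad_b
    return w, b

# Tiny made-up data set: class 1 lies at larger coordinates than class 0.
rng = random.Random(0)
data = [([rng.gauss(0, 1), rng.gauss(0, 1)], 0) for _ in range(50)]
data += [([rng.gauss(4, 1), rng.gauss(4, 1)], 1) for _ in range(50)]
rng.shuffle(data)

w, b = [0.0, 0.0], 0.0
for epoch in range(50):                      # maxEpochs=50
    w, b = sgd_epoch(data, w, b, 25, 0.1)    # minibatchSize=25, rate 0.1
print(w, b)
```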

The learningRatesPerMB gives the SGD learning rate. This controls how big of a step to take in the direction of the gradient for each parameter update. Here, we use a constant learning rate of 0.1, but we can also set up a descending learning rate by separating the values with a :. Here we've used the learning rate per mini-batch, but we can also use learningRatesPerSample. Finally, to configure a complex descending learning rate, we can cascade the rates by using *. For example:

learningRatesPerMB=0.5:0.2*5:0.1

is the same as writing

learningRatesPerMB=0.5:0.2:0.2:0.2:0.2:0.2:0.1

and will slowly descend the learning rate per mini batch until it settles at 0.1.
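The shorthand can be expanded mechanically, as this small Python sketch shows. The function names are hypothetical. Reusing the last value once the list runs out is an assumption that matches the "settles at 0.1" behaviour described above.

```python
def expand_schedule(spec):
    """Expand a spec such as "0.5:0.2*5:0.1" into a list of rates."""
    rates = []
    for item in spec.split(":"):
        if "*" in item:
            value, count = item.split("*")
            rates.extend([float(value)] * int(count))
        else:
            rates.append(float(item))
    return rates

def rate_at(rates, step):
    """Rate for a zero-based step; the last value is reused once the list runs out."""
    return rates[min(step, len(rates) - 1)]

rates = expand_schedule("0.5:0.2*5:0.1")
print(rates)
print(rate_at(rates, 20))
```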

Finally, maxEpochs gives the maximum number of epochs to perform before terminating the algorithm. In this case, we will run SGD for exactly 50 epochs because there are no other stopping criteria. We could include some early stopping criterion that would stop learning once the error reaches some threshold, for example. CNTK also supports other forms of early stopping, momentum (don't diverge too far from a previous update's direction), and other advanced gradient descent features. They will be covered in a later tutorial.

####How to read and write data (Data Readers)

Reading and writing data is performed using Data Readers. A reader is defined as a sub-block in the configuration file, within a command. The following reader definition is the one used in our example:

reader = [
	readerType = "UCIFastReader"
	file = "Train.txt"
	features = [
		start = 0
		dim = 2
	]
	labels = [
		start = 2
		dim = 1
		labelType = regression
	]
]

In our example we use the following parameters:

readerType: In our example we will use UCIFastReader, which is designed to read the UCI data sets format, which is a simple, tabular text format. Each line describes a sample that has a feature vector and label(s). By default the delimiters used to split columns in the data are either tabs or white spaces. However, one can define custom delimiters as well, e.g. customDelimiter = ";"
file: the file that contains the dataset
dim: the dimension of the feature vector or the label vector. Note that each column in the UCI data file represents one dimension of the input data
start: the start column (zero-based) of the feature vector or the label vector
labelDim: the number of possible label values. This parameter is required for categorical labels since the dimension of the label node will be determined by this value
labelMappingFile: the path to a file used to map from the label value to a numerical label identifier. The file typically lists all possible label values, one per line, which might be text or numeric. The zero-based line number is the identifier that will be used by CNTK to identify that label. It is important that the same label mapping file is used for training and evaluation. This can be done by moving the labelMappingFile parameter up so that it can be shared by both the training and evaluation blocks
labelType: if the label type is set to "regression", then there is no need for the label mapping file. In our binary classification problem we use the type "regression" because although we are performing classification, we are only predicting a single label and so in this sense the predictions conform to a regression model (even though we later "squish" the prediction to a probability).

This particular reader is very basic, and most likely your needs will quickly outgrow it. For example, it cannot be used to set up RNN training, as this would require grouping lines into sequences. Please check CNTK documentation for more sophisticated readers. Also stay tuned, since CNTK devs are working hard to add new variants.
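To make the start and dim parameters concrete, the following Python sketch splits one line of a tabular text file into a feature vector and a label vector. It is not the UCIFastReader implementation. The function `parse_line`, its dictionary arguments, and the sample line are assumptions for illustration.

```python
def parse_line(line, features, labels, delimiter=None):
    """Split one text line into (feature values, label values).

    features and labels are dicts with zero-based "start" and "dim" keys,
    mirroring the reader sub-blocks. delimiter=None splits on whitespace.
    """
    columns = line.split(delimiter)

    def take(block):
        start, dim = block["start"], block["dim"]
        return [float(c) for c in columns[start:start + dim]]

    return take(features), take(labels)

# A made-up line: two feature columns followed by one label column.
x, y = parse_line("4.2041 3.8973 1",
                  features={"start": 0, "dim": 2},
                  labels={"start": 2, "dim": 1})
print(x, y)
```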

####Putting it all together

In the previous several sections, we have explained how to describe the computational network, how to read in data, and how to set up the learning that will be done over the network. Here, let's put it all together into a single CNTK configuration file.

The final configuration file is here. (Also ensure that the Test and Train files are present in your working directory.) The order in which the commands are defined is not important. For example, we can have command=Train:Test and then define Test first, followed by Train. What is important is the order of the commands within the command= line. In the example here, Train would happen before Test (and of course it makes sense to train your network before testing it!).

In our CNTK configuration file, the first line reads command=Train:Output:dumpNodeInfo:Test. So, when we run it through CNTK we will train the model, output the predictions, dump the node info, and finally test the network. For all of this magic to happen, we simply run:

cntk configFile=lr_ndl.cntk

And that's it! You have now successfully defined a computational network that performs binary classification, learned the parameters, and tested it. Here is the end of the output from running the above command:

Final Results: Minibatch[1-1]: 
Samples Seen = 500
EP: SquareError/Sample = 0.005790257
LR: Logistic/Sample = 0.05882774
COMPLETED

So on our test data (as we would expect given the decision boundary we plotted above), we did quite well: ~99.4% accuracy on the test set (using square error; a classification error shows 99.2%)! This is about as good as we can possibly do given that the classes slightly overlap.

####Using the GPU

As a side note, the example so far has been described to use CNTK in CPU mode only. To get the full power of CNTK and make use of the GPU, the deviceId parameter must be set. This can be done within the .cntk file, or at the command line as follows:

cntk configFile=lr_ndl.cntk deviceId=auto

This will make CNTK use the GPU for computation (and thus uncover the real power of the kit). See more on the deviceId parameter here (Windows) or here (Linux). Note that any key-value pair submitted as a parameter to the CNTK executable "patches" an existing configuration, with the exception of the key "configFile", which is always required to set a "base" configuration. Thus, the above is equivalent to having a deviceId key in the file itself.

Now let's make things a little more interesting with multiclass classification...

Multi-Class Classification

Let us make our simple problem slightly harder by adding a third class. We can borrow the majority of the stuff we have used for the binary classification problem above. However, we will use Softmax instead of Sigmoid. So, let us start with a quick introduction of the softmax before building the computational network for this problem.

Generated 3 classes data

Softmax

The softmax function is a generalization of the logistic function. It maps a vector of real values to a probability distribution. It is widely used for multi-class classification problems. In this context, if we are trying to predict the class of an instance out of K possible classes, the model would compute a K-dimensional vector whose components represent the confidence scores for each class. Then, the softmax function would map those scores to a probability distribution over the possible classes using the following formula:

softmax formula

where X is our input vector, and W is the weight matrix, which includes both the model parameters and the bias.
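As a small numeric illustration, here is the standard softmax written in plain Python. The scores are made up, and subtracting the maximum before exponentiating is a common numerical-stability choice rather than something the tutorial prescribes.

```python
import math

def softmax(scores):
    """Map a list of real-valued scores to a probability distribution."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 0.5, -1.0])   # made-up scores for K = 3 classes
print(probs, sum(probs))
```

With two classes and scores (z, 0), the first softmax output equals the logistic function of z, which is the sense in which softmax generalizes it.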

The Network Description

We want to define a computational network that looks like the one in the figure below. This network can be understood as a combination of three linear models, each of which will be trained to separate one of the three classes from the other two. Then, on top, we have the softmax layer that squashes the linear model scores into a probability distribution. Thus, for each instance, the network outputs three probability values that sum up to one. For example, if a given instance is of class (1), the model might output the following probabilities (95%, 3%, 2%), which basically means that the probability of the instance belonging to class (1) is 95%.

The network with softmax layer for 3-class problem with two input features

Armed with what we have already learned about CNTK's network builder, let's build the network that solves our 3-class problem. Actually, we are already almost done, as such a network is very similar to the one we already used for our binary classification problem, but with a few differences:

the label dimension has been set to 3 instead of 1;
we took out the Sigmoid node;
we replaced the Logistic learning criterion by CrossEntropyWithSoftmax which will add the softmax layer and use cross entropy as the objective function; and
we replaced the SquareError evaluation node with ClassificationError.

Here's our new network:

NDLNetworkBuilder = [

    run = ndlLR

    ndlLR = [
      # sample and label dimensions
      SDim=2
      LDim=3
    
      features=Input(SDim, 1)
      labels=Input(LDim, 1)
    
      # parameters to learn
      B = Parameter(LDim)
      W = Parameter(LDim, SDim)
    
      # operations
      t = Times(W, features)
      z = Plus(t, B)
    
      MC = CrossEntropyWithSoftmax(labels, z)
      EP = ClassificationError(labels, z)
    
      # root nodes
      FeatureNodes=(features)
      LabelNodes=(labels)
      CriteriaNodes=(MC)
      EvalNodes=(EP)
      OutputNodes=(z)
    ]
]
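To give a feel for what the two nodes MC and EP measure, here is a plain-Python sketch for a single sample. It uses the textbook definitions of softmax cross entropy and classification error and is not CNTK's implementation. The scores and the Python function names are assumptions for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_with_softmax(labels, z):
    """-sum_k labels[k] * log(softmax(z)[k]) for a one-hot labels vector."""
    probs = softmax(z)
    return -sum(y * math.log(p) for y, p in zip(labels, probs) if y > 0)

def classification_error(labels, z):
    """1 if the highest-scoring class differs from the labeled class, else 0."""
    return 0 if z.index(max(z)) == labels.index(max(labels)) else 1

labels = [1.0, 0.0, 0.0]     # one-hot label for class (1)
z = [3.0, 0.5, -0.5]         # made-up scores from the three linear models
print(cross_entropy_with_softmax(labels, z), classification_error(labels, z))
```

Averaged over a test set, the second quantity is the fraction of misclassified samples, so one minus it is the accuracy.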

Let us put it all together

Similarly to the binary classification problem, the final configuration file is here together with Test, Train and Mapping. We run:

cntk configFile=3Classes_ndl.cntk

(This will run CNTK on CPU - see more on this in the previous section)

And that's it! You have now successfully defined a computational network that performs multi-class classification, learned the parameters, and tested it. Here is the end of the output from running the above command:

Final Results: Minibatch[1-1]: 
Samples Seen = 500
EP: ErrorPrediction/Sample = 0.088
MC: CrossEntropyWithSoftmax/Sample = 0.23359343
COMPLETED

So on our test data (as we would expect given the decision boundaries plotted below), we did well: 91.2% accuracy!

3 class decision boundaries

Where to Go Next

Check out Tutorial II to learn how to build more complex models like convolutional neural networks.
