
MLP Forward Propagation on DFE: Structure


Goal

Now that the results of executing the forward propagation on CPU for a simple network are known, the forward propagation is implemented on Maxeler DFEs to compare the execution time.

This page explains the basic elements of the FPGA application (kernels, manager and CPU code to run the program). To understand further optimization steps and the final architecture, head to MLP Forward Propagation on DFE: Results and Optimization.

Kernel

As seen in the Basics of DFE Applications chapter, a Maxeler FPGA application is made of kernel blocks. In the case of deep learning inference, one block can represent one layer of the network.

The following sections explain how the .maxj file of a kernel is constructed and how it works. Some concepts may be hard to grasp from words and code alone; you can always refer to the detailed schema below, which illustrates the complete process of the kernel: Kernel structure schema.

Operations to perform

Each kernel must apply the following operations: take an input vector whose dimension equals the layer input size, compute the dot product between the inputs and each column of weights of the layer, add the bias values and apply the activation function. The outputs are the synaptic weights and the activations, each with a dimension equal to the output size of the layer.

Kernel operations
Basic schema of operations that need to be performed in the kernel of a specific layer.

Here IN is the input size of one image at a particular layer and OUT the output size of this image after the layer. In practice a batch of batch_size images can be streamed to the DFE, making the input and output dimensions batch_size times larger. The weights and biases keep their original IN×OUT dimension, because they are shared by all images in the batch.
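Written out for a single image, each output j of a layer therefore computes (illustrative notation, with w_ji the weight connecting input i to output j and b_j the bias of output j):

s_j = sum over i = 1..IN of (w_ji · in_i) + b_j,    x_j = activation(s_j)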

Counters and controlled streams

A basic MaxCompiler kernel takes an input stream of a given dimension and lets the values flow one after another, one per clock cycle (called a tick), through the pipeline of operations. The output generally has the same size as the input, with each input value mapped to an output.

In this forward propagation application, however, the input does not have the same size as the output. MaxCompiler makes it possible to control, with a boolean value, at which clock cycles an input is read or an output is released.

Another useful tool for our application is the counter chain. The value of a counter is incremented each clock cycle, which keeps track of where we are in the computation. Counter chains create nested loops that count repeatedly up to a defined maximum number of ticks; once a counter reaches its maximum, it wraps around to zero. A counter chain is defined the following way:

CounterChain chain = control.count.makeCounterChain();
DFEVar outer = chain.addCounter(outerMax, outerIncrement);
DFEVar inner = chain.addCounter(innerMax, innerIncrement);

For each value of outer, inner is incremented by innerIncrement every tick until it reaches innerMax; then outer is incremented by outerIncrement. When outer reaches outerMax, it also wraps around to zero. The counter added last is the fastest-changing one; an equivalent software loop nest is sketched below. With these chained counters and controlled inputs/outputs, we can loop over the output and input sizes and choose when to read an input into the pipeline or release an output out of it. For more information about these two topics, the MaxCompiler documentation provides complementary explanations: Controlled Inputs and Outputs and Nested loops.
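As an illustration only, here is a plain-Java sketch of what a chain of two counters with increments of 1 (as used in the kernel below) corresponds to; the names outerMax and innerMax are the generic ones from the snippet above:

// Illustrative software equivalent of a chain of two counters:
// the counter added last (inner) is the fastest-changing one.
static void countLikeChain(int outerMax, int innerMax) {
    for (int outer = 0; outer < outerMax; outer++) {
        for (int inner = 0; inner < innerMax; inner++) {
            // one kernel tick per iteration of this innermost body
        }
    }
}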

The following counter chain loops over the output and input sizes and reads the inputs at the correct moment.

CounterChain chain = control.count.makeCounterChain();
DFEVar outCounter = chain.addCounter(outputDim, 1);
DFEVar inCounter = chain.addCounter(inputDim, 1);

DFEVar input = io.input(IN_NAME, dfeFloat(8,24), outCounter.eq(0));

outputDim and inputDim are scalar input values which can be passed from the manager. They are declared the following way:

DFEVar inputDim = io.scalarInput(INSIZE_NAME, dfeUInt(MathUtils.bitsToAddress(inputSize+1))); 
DFEVar outputDim = io.scalarInput(OUTSIZE_NAME, dfeUInt(MathUtils.bitsToAddress(outputSize+1))); 

MathUtils.bitsToAddress() is a method that returns the number of bits needed to address the number of values passed as argument; passing inputSize+1 and outputSize+1 therefore gives enough bits to represent the values inputSize and outputSize themselves (for example, bitsToAddress(785) returns 10, since 2^10 = 1024 ≥ 785). inputSize and outputSize are Java integers passed to the kernel when it is instantiated in the manager. For more information, head to this section: Manager: Custom Manager.

Dot product computation

The next step consists of computing the dot product between these input values and the weights.

By reading the input only when outCounter equals zero, all input values are released during the first inputDim ticks. On each of these ticks, one input value can therefore be multiplied by a weight and accumulated with the previous multiplications. This running sum must be carried along inputDim ticks before being reset to zero (when inCounter wraps back to zero).

DFEVar mul = input * weight;

DFEVar carriedSum = dfeFloat(8,24).newInstance(this);
DFEVar sum = (inCounter.eq(0)) ? constant.var(0).cast(dfeFloat(8,24)) : carriedSum;
DFEVar newSum = sum + mul;
carriedSum.connect(stream.offset(newSum, -loopLatency));

The code above shows how to reuse a sum that is carried along multiple ticks. It is reset when inCounter equals zero, i.e. when the dot product for the next output starts. The stream.offset() function retrieves the value of a variable from a certain number of ticks before the current one.
The particularity here is the loopLatency variable. The naive implementation would be carriedSum.connect(stream.offset(newSum, -1)); because we want the value of the previous tick. In reality, the floating-point operation that updates the sum does not take a single tick but around 12, and this value may vary between implementations. Fortunately, MaxCompiler provides a tool called AutoLoop Offset, which automatically calculates the lowest valid offset for a cycle in the graph. It is declared the following way:

OffsetExpr loopLatency = stream.makeOffsetAutoLoop("loop_offset_name");
DFEVar loopLatencyVal = loopLatency.getDFEVar(getKernel(), dfeUInt(32));

In addition, a latency counter must be added to our previous counter chain:

CounterChain chain = control.count.makeCounterChain();
DFEVar outCounter = chain.addCounter(outputDim, 1);
DFEVar inCounter = chain.addCounter(inputDim, 1);
DFEVar latCounter = chain.addCounter(loopLatencyVal, 1);

Additional information about loops and automatic offsets can be found in the MaxCompiler documentation: Multi-tick Implementation and Autoloop Offsets

FMEM for multiple dot products

Looking at the implementation so far, an important piece is missing: the ability to compute the dot products of the next outputs. The input values used so far are consumed by the first dot product, but they are also needed for the dot products with the next weights.

To reuse these values, they must be stored in memory. DFEs provide a type of on-chip memory called FMEM (Fast Memory), which stores variables at given memory addresses. The size of the memory must be specified at allocation. A port on the memory can then be configured to write first and read at the same clock cycle, and a boolean controls at which ticks the memory is written.

Memory<DFEVar> ibuf = mem.alloc(dfeFloat(8,24), inputSize);
DFEVar inPort = ibuf.port(inCounter, input, outCounter.eq(0), RamWriteMode.WRITE_FIRST);

With the code above, each input is written to the RAM when it arrives from the CPU, and the inPort variable contains the value stored at address inCounter. The multiplication shown earlier therefore uses inPort rather than the raw input, so the values can be reused for the following dot products.

Bias and activation function

Once a dot product is calculated, its value can be propagated further to the last two operations: addition of the bias and application of the activation function. The variable s is the synaptic weight and x the activation of the layer.

DFEVar s = newSum + bias;
DFEVar x = tanh(s); 

The activation function is the hyperbolic tangent. MaxCompiler provides a class named KernelMath that performs common math operations. Tanh can be computed from the exponential using the identity tanh(s) = 1 − 2 / (e^(2s) + 1):

public DFEVar tanh(DFEVar input) {
    DFEVar x = input;
    DFEVar Exp2xPlus1 = KernelMath.exp(2*x, dfeFloat(8,24)) + 1.0;
    DFEVar DivResult = 2.0 / Exp2xPlus1;
    DFEVar Result = 1.0 - DivResult.cast(dfeFloat(8,24));
    return Result;
}

Output

The values s and x are defined as outputs of the kernel. We must be careful to output them only when all their respective operations have been performed, i.e. when inCounter and latCounter reach their last values, just before outCounter is incremented.

DFEVar outEnable = inCounter.eq(inputDim - 1) & latCounter.eq(loopLatencyVal - 1);
io.output(S_NAME, s, dfeFloat(8,24), outEnable);
io.output(X_NAME, x, dfeFloat(8,24), outEnable);

Store the weights and biases

The sections above explain the flow of all the streams except two: the weights and the biases. The goal is to feed a batch of multiple images to the kernel. Once the inputs of a single image have been processed by the operations described above, the next image must go through the same procedure, while the weights stay the same.

To achieve this, the weights and biases are all written to RAM before the first image is even released.

Params wCounterParams = control.count.makeParams(MathUtils.bitsToAddress(nbWeights) + 1)
    .withMax(nbWeights)
    .withWrapMode(WrapMode.STOP_AT_MAX);
Counter wCounter = control.count.makeCounter(wCounterParams);
DFEVar readingW = wCounter.getCount() < nbWeights ? constant.var(true) : constant.var(false);
DFEVar readingB = wCounter.getCount() < nbBiases ? constant.var(true) : constant.var(false);

DFEVar weights = io.input(W_NAME, dfeFloat(8,24), readingW);
DFEVar biases = io.input(B_NAME, dfeFloat(8,24), readingB);

Memory<DFEVar> wMem = mem.alloc(dfeFloat(8,24), nbWeights);
Memory<DFEVar> bMem = mem.alloc(dfeFloat(8,24), nbBiases);
wMem.write(wCounter.getCount().cast(dfeUInt(MathUtils.bitsToAddress(nbWeights))), weights, readingW);
bMem.write(wCounter.getCount().cast(dfeUInt(MathUtils.bitsToAddress(nbBiases))), biases, readingB);

First, we create a counter that counts for a number of ticks equal to the number of weights. A boolean checks whether the number of elapsed ticks is smaller than the number of weights, and the weight input stream is only read while this condition is true. Finally, the weights are written to the memory at the address given by the counter value. The same process is used for the bias values.

Some of the previous code must be adapted, using the ~readingW condition, so that it only starts once the weights have been written to memory:

CounterChain chain = control.count.makeCounterChain(~readingW);
DFEVar input = io.input(IN_NAME, dfeFloat(8,24), ~readingW & outCounter.eq(0) & latCounter.eq(0));

Each time a multiplication has to be performed between a weight and an input, the memory containing the weights is read at the corresponding address. The address depends on the input number and which output's dot product is being computed:

DFEVar wcount = outCounter * inputDim + inCounter;
DFEVar weight = wMem.read(wcount.cast(dfeUInt(MathUtils.bitsToAddress(nbWeights))));

The bias value is read at the outCounter address.

DFEVar bias = bMem.read(outCounter.cast(dfeUInt(MathUtils.bitsToAddress(nbBiases))));
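Putting the pieces together, the computation the kernel performs for one image can be sketched as ordinary software. The snippet below is an illustrative plain-Java view only, assuming the names used above; the nextInput array stands for the values arriving on the controlled input stream:

// Illustrative software equivalent of one kernel pass over a single image,
// assuming the weights and biases have already been written to wMem and bMem.
static void forwardOneImage(float[] nextInput, float[] wMem, float[] bMem,
                            float[] ibuf, float[] s, float[] x,
                            int inputDim, int outputDim) {
    for (int out = 0; out < outputDim; out++) {
        float sum = 0.0f;
        for (int in = 0; in < inputDim; in++) {
            if (out == 0) {
                ibuf[in] = nextInput[in];                 // inputs are written to FMEM only while outCounter == 0
            }
            sum += ibuf[in] * wMem[out * inputDim + in];  // dot product: one multiply-accumulate per tick
        }
        s[out] = sum + bMem[out];                         // add the bias
        x[out] = (float) Math.tanh(s[out]);               // apply the activation function
    }
}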

Kernel structure schema

The following schema summarizes all the details described above and gives a global view of what happens in the .maxj kernel file.

Kernel Schema MAXJ
Structure and computations of the kernel MAXJ file.

The name of the kernel is FLinearLayerBasicKernel.maxj and can be found here.

Manager

As seen in the Basics of DFE Applications chapter, a manager defines the kernel blocks and the input and output streams from and to the CPU. It is responsible for all the interactions between the application's resulting .max file and the CPU code, and it sets the structure of the application by declaring all needed kernels and their properties.

It is important to understand that one manager produces one .max file after compilation and therefore represents one network architecture with a fixed depth and number of neurons per layer. Here we choose to declare the size of each layer as hard-coded constants.

private static final int SIZE_LAYER_0 = 784; // input layer
private static final int SIZE_LAYER_1 = 64; // hidden layer
private static final int SIZE_LAYER_2 = 10; // output layer

These constants can be used on the CPU side by adding the following lines of code to the custom manager:

addMaxFileConstant("SIZE_LAYER_0", SIZE_LAYER_0);
addMaxFileConstant("SIZE_LAYER_1", SIZE_LAYER_1);
addMaxFileConstant("SIZE_LAYER_2", SIZE_LAYER_2);

The depth of the network (which is equal to 2 for this application) is set by connecting the desired number of kernels in the custom manager (see next section).

The manager file is called ForwardPropBasicMAX5CManager.maxj and can be found here.

Custom manager

Because the desired application architecture involves multiple kernels and a complex structure of input and output streams, we declare a custom manager, named ForwardPropBasicMAX5CManager, in the main() function:

EngineParameters params = new EngineParameters(args);
ForwardPropBasicMAX5CManager mgr = new ForwardPropBasicMAX5CManager(params);

This manager object defines the structure of, and connections between, all the kernel blocks.

The first step in the custom manager is to define the kernels of our network. In our case we want two layers, so two kernels are declared. Java integer values are passed as arguments to the kernels; they set the width of the scalar inputs and the size of the memories on the kernel side, as described in the Kernel sections above.

private final String KERNEL_NAME1 = "FHIDDENLAYER_KERNEL";
private final String KERNEL_NAME2 = "FOUTPUTLAYER_KERNEL";

Kernel kernel1 = new FLinearLayerBasicKernel(makeKernelParameters(KERNEL_NAME1), nbWeights1, inputSize1, outputSize1);
Kernel kernel2 = new FLinearLayerBasicKernel(makeKernelParameters(KERNEL_NAME2), nbWeights2, inputSize2, outputSize2);
KernelBlock kernelBlock1 = addKernel(kernel1);
KernelBlock kernelBlock2 = addKernel(kernel2);

Then we need to set the input streams of the first kernel. As seen in the kernel description, the inputs are the weights, the biases and the input images. The following piece of code creates streams from the CPU and routes them to the first kernel:

kernelBlock1.getInput(FLinearLayerBasicKernel.IN_NAME) <== addStreamFromCPU("input");
kernelBlock1.getInput(FLinearLayerBasicKernel.W_NAME) <== addStreamFromCPU("weights1");
kernelBlock1.getInput(FLinearLayerBasicKernel.B_NAME) <== addStreamFromCPU("biases1");

Then we need to get the synaptic weight s output of the first kernel and connect it to the CPU:

addStreamToCPU("s1") <== kernelBlock1.getOutput(FLinearLayerBasicKernel.S_NAME);

The activation output x is a special case. It needs to be connected to two different streams: one going to the CPU and one feeding the second layer as input. For this, MaxCompiler provides the Fanout block, which creates two connections from one output.

// Create Fanout and get the output of the first layer
Fanout x1Fanout = fanout("x1Fanout"); 
x1Fanout.getInput() <== kernelBlock1.getOutput(FLinearLayerBasicKernel.X_NAME);

// Connect to the CPU and second kernel
addStreamToCPU("x1") <== x1Fanout.addOutput("x11");
kernelBlock2.getInput(FLinearLayerBasicKernel.IN_NAME) <== x1Fanout.addOutput("x12"); 

Finally, we connect the two other inputs and the outputs of the second kernel to the CPU:

kernelBlock2.getInput(FLinearLayerBasicKernel.W_NAME) <== addStreamFromCPU("weights2");
kernelBlock2.getInput(FLinearLayerBasicKernel.B_NAME) <== addStreamFromCPU("biases2");
		
addStreamToCPU("s2") <== kernelBlock2.getOutput(FLinearLayerBasicKernel.S_NAME);
addStreamToCPU("x2") <== kernelBlock2.getOutput(FLinearLayerBasicKernel.X_NAME);

The following picture shows the constructed manager graph provided by MaxCompiler.

Manager graph
Graph of the manager with the different input and output streams and defined kernels.

Engine interface

The next step is to define the engine interface. This SLiC (Simple Live CPU) interface object defines all the interactions with the CPU: it sets the sizes of the streams, the scalar inputs passed from the CPU to each kernel and the number of ticks each kernel runs for.

The first task is to define the parameters that need to be provided by the CPU code. In this case, the batch size (number of images in the input stream) needs to be passed as a parameter from the CPU. This parameter can then be used by the engine interface to set stream sizes and numbers of ticks (see below).

InterfaceParam BS = ei.addParam("BS", CPUTypes.INT64);  // batch size

The values of the AutoLoop offsets (see Dot product computation) must also be known to set the number of ticks. The following piece of code retrieves them as parameters:

InterfaceParam L1 = ei.getAutoLoopOffset(KERNEL_NAME1, FLinearLayerBasicKernel.OFFSET); // automatic offset for layer 1
InterfaceParam L2 = ei.getAutoLoopOffset(KERNEL_NAME2, FLinearLayerBasicKernel.OFFSET); // automatic offset for layer 2

Now that all necessary parameters are known, each kernel can be configured. Each kernel must know how many ticks it needs to run for. For the layer kernels, this is the number of ticks spent writing the weights into the RAM plus the number of clock cycles needed to compute all dot products. The weight-writing part is simply equal to the number of weights. For the dot products, the outer counter of the chain must complete its full loop once per image in the batch, giving inputSize × outputSize × loop latency × batch size ticks.

ei.setTicks(KERNEL_NAME1, SIZE_LAYER_0 * SIZE_LAYER_1 * L1 * BS + SIZE_LAYER_0 * SIZE_LAYER_1);
ei.setTicks(KERNEL_NAME2, SIZE_LAYER_1 * SIZE_LAYER_2 * L2 * BS + SIZE_LAYER_1 * SIZE_LAYER_2);

Next, the kernel scalars are set. Note that we use the hard-coded values here, but these scalars could also have been parameters passed from the CPU.

ei.setScalar(KERNEL_NAME1, FLinearLayerBasicKernel.INSIZE_NAME, SIZE_LAYER_0);
ei.setScalar(KERNEL_NAME1, FLinearLayerBasicKernel.OUTSIZE_NAME, SIZE_LAYER_1);
ei.setScalar(KERNEL_NAME2, FLinearLayerBasicKernel.INSIZE_NAME, SIZE_LAYER_1);
ei.setScalar(KERNEL_NAME2, FLinearLayerBasicKernel.OUTSIZE_NAME, SIZE_LAYER_2);

The streams from and to the CPU must also be registered with the engine interface, which defines their respective sizes. Notice the padding on the biases stream of the second layer: stream sizes must be a multiple of 16 bytes, which is not the case for the 10 bias values (40 bytes) without additional padding.

private final CPUTypes cpuT = CPUTypes.FLOAT;

ei.setStream("input", cpuT, cpuT.sizeInBytes() * SIZE_LAYER_0 * BS);
ei.setStream("weights1", cpuT, cpuT.sizeInBytes() * SIZE_LAYER_0 * SIZE_LAYER_1);
ei.setStream("biases1", cpuT, cpuT.sizeInBytes() * SIZE_LAYER_1);
ei.setStream("s1", cpuT, cpuT.sizeInBytes() * SIZE_LAYER_1 * BS);
ei.setStream("x1", cpuT, cpuT.sizeInBytes() * SIZE_LAYER_1 * BS);
ei.setStream("weights2", cpuT, cpuT.sizeInBytes() * SIZE_LAYER_1 * SIZE_LAYER_2);
ei.setStream("biases2", cpuT, cpuT.sizeInBytes() * (SIZE_LAYER_2 + PADDING));
ei.setStream("s2", cpuT, cpuT.sizeInBytes() * SIZE_LAYER_2 * BS);
ei.setStream("x2", cpuT, cpuT.sizeInBytes() * SIZE_LAYER_2 * BS);

Finally, the engine interface can be told to ignore certain parameters so that no value is requested from the CPU side. This is the case for the AutoLoop offset parameters, which are set from the kernel side.

ei.ignoreAutoLoopOffset(KERNEL_NAME1, FLinearLayerBasicKernel.OFFSET);
ei.ignoreAutoLoopOffset(KERNEL_NAME2, FLinearLayerBasicKernel.OFFSET);

CPU code

The code that runs the compiled .max file from the CPU is very similar to the code used to test the inference performance on the CPU. The main difference is that no network object is defined, because the whole structure is already built on the .max file side. The inputs are loaded from the MNIST dataset in the same way.

Two header files must be included in the application:

#include <MaxSLiCInterface.h>
#include "BasicForwardProp.h"

Note that the name of the forward propagation maxfile depends on the compilation settings. Head to Build with multiple managers for more information.

The function forward_prop_dfe() is specific to this file and runs the maxfile on the DFE. The next sections detail the steps it performs.

Maxfile and engine

The maxfile is declared using a dedicated type and initialized with its init() method. An engine, onto which the max file is loaded, is also declared.

max_file_t *max_file;
max_engine_t *max_engine;

max_file = BasicForwardProp_init();
max_engine = max_load(max_file, "*");

Loading weights and biases

The trained weights and biases are loaded from the model text files into 2D vectors.

vector<vector<float>> allWeights = load_weights();
vector<vector<float>> allBiases = load_biases();

SLiC interface actions

The next step consists of setting the engine interface parameters that we defined on the manager side.

We first need to declare output vectors of the correct size.

vector<float> s1(BasicForwardProp_SIZE_LAYER_1 * batchSize);
vector<float> x1(BasicForwardProp_SIZE_LAYER_1 * batchSize);
vector<float> s2(BasicForwardProp_SIZE_LAYER_2 * batchSize);
vector<float> x2(BasicForwardProp_SIZE_LAYER_2 * batchSize);

Note the use of the constants that were set in the custom manager (see Manager).

An actions object defines all the input and output streams and parameters. Because a fanout block is used, we must tell it from the CPU side which connections to enable. In this case we want all of them enabled, so the routing_string contains both outputs of the fanout.

BasicForwardProp_actions_t actions;

actions.instream_weights1 = allWeights[0].data();
actions.instream_biases1 = allBiases[0].data();
actions.instream_weights2 = allWeights[1].data();
actions.instream_biases2 = allBiases[1].data();

actions.instream_input = flattenInput.data();
actions.outstream_s1 = (float *)s1.data();
actions.outstream_x1 = (float *)x1.data();
actions.outstream_s2 = (float *)s2.data();
actions.outstream_x2 = (float *)x2.data();
actions.routing_string = "x11 -> x1Fanout, x12 -> x1Fanout";

actions.param_BS = batchSize;

Run the DFE application

The following call runs the engine with the actions defined above.

BasicForwardProp_run(max_engine, &actions);

When the execution terminates, the max file can be unloaded from the engine. The output vectors now contain the data from the DFE.

max_unload(max_engine);

The name of the CPU code is basic-forward-test-DFE.cpp and can be found here.

Optimization and results

This page explains the basic concepts and working principles of the DFE adaptation of the forward propagation. However, this basic implementation is not optimized and performs poorly. For detailed results and the optimization process, head to this page: Forward Propagation on DFE: Results and Optimization.
