
Pipelining: Device to Device


System Level Pipelining

Before v1.2.0, only separable kernels could be pipelined concurrently across multiple GPUs, so projects built around multiple non-separable kernels could not benefit from that feature. A possible bug was also noticed in the "kernel level" pipelining feature. In general, hardware-dependent in-kernel commands such as atomic functions, which may touch any global memory address at any time, make partitioning impossible.

To bypass the difficulty of partitioning non-partitionable kernels, version 1.2.0 of the Cekirdekler API adds a "device to device" pipelining feature. This technique binds a different device to each stage of a multi-stage pipeline and runs all stages concurrently. Each stage is also double buffered, so the latency of inter-stage data movement is hidden behind the compute time of the stages (as long as the compute time is greater).

This is how it works:

Creating a pipeline starts with creating its stages. A pipeline stage is built from kernels, input and output buffers (plus hidden buffers for sequential logic), and a device.

Pipeline stage that adds 1 to every array element:

Hardware.ClDevices gpu1 = Hardware.ClPlatforms.all().devicesWithMostComputeUnits()[0];
Pipeline.ClPipelineStage add = new Pipeline.ClPipelineStage();
add.addDevices(gpu1);
ClArray<float> inputArrayGPU1 = new ClArray<float>(1024);
ClArray<float> outputArrayGPU1 = new ClArray<float>(1024);
add.addInputBuffers(inputArrayGPU1);
add.addOutputBuffers(outputArrayGPU1);
add.addKernels(@"
__kernel void vecAdd(__global float * input, __global float * output)
{ 
    int id=get_global_id(0); 
    output[id]=input[id]+1.0f; 
}", "vecAdd", new int[] { 1024/*global range*/ }, new int[] { 256/* local range*/ }); 
add.enqueueMode=true; // optional, reduces API overhead but changes behavior(v1.2.7+)

Pipeline stage that multiplies every array element by 5:

Hardware.ClDevices gpu2 = Hardware.ClPlatforms.all().devicesWithMostComputeUnits()[1];
Pipeline.ClPipelineStage mul = new Pipeline.ClPipelineStage();
mul.addDevices(gpu2);
ClArray<float> inputArrayGPU2 = new ClArray<float>(1024);
ClArray<float> outputArrayGPU2 = new ClArray<float>(1024);
mul.addInputBuffers(inputArrayGPU2);
mul.addOutputBuffers(outputArrayGPU2);
mul.addKernels(@"
__kernel void vecMul(__global float * input, __global float * output)
{ 
    int id=get_global_id(0); 
    output[id]=input[id]*5.0f; 
}", "vecMul", new int[] { 1024/*global range*/ }, new int[] { 256/* local range*/ }); 
mul.enqueueMode=true; // optional, reduces API overhead but changes behavior(v1.2.7+)

Setting the initializer parameters for a stage:

stage.initializerKernel("kernelName1 kernelName2;kernelName3,kernelName4", new int[] { N,N,N/2,N/4 }, new int[] { 256,256,256,256 });

The initializer kernel names refer to functions defined in the addKernels() method. N is the global range of each kernel; 256 is a trivial example of a local range.
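
As a concrete illustration (a sketch only; the stage, the initState kernel and the ranges below are hypothetical, not from the original example), an initializer kernel is defined in the same addKernels() source as the stage's main kernel and then registered with initializerKernel(), so makePipeline() can run it once before the pipeline starts:

Hardware.ClDevices gpu = Hardware.ClPlatforms.all().devicesWithMostComputeUnits()[0];
Pipeline.ClPipelineStage stage = new Pipeline.ClPipelineStage();
stage.addDevices(gpu);
ClArray<float> stageInput = new ClArray<float>(1024);
ClArray<float> stageOutput = new ClArray<float>(1024);
stage.addInputBuffers(stageInput);
stage.addOutputBuffers(stageOutput);
stage.addKernels(@"
__kernel void compute(__global float * input, __global float * output)
{
    int id=get_global_id(0);
    output[id]=input[id]*2.0f;
}

__kernel void initState(__global float * input, __global float * output)
{
    int id=get_global_id(0);
    output[id]=0.0f; // start the stage from a known state
}", "compute", new int[] { 1024 }, new int[] { 256 });
// initState is not in the per-iteration kernel list ("compute"), but since it is defined in
// addKernels() it can be named as an initializer; makePipeline() runs it once at build time.
stage.initializerKernel("initState", new int[] { 1024 }, new int[] { 256 });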

Building a pipeline from the add and mul stages:

add.prependToStage(mul); // prepends add stage to mul stage
var pipeline = add.makePipeline(); // creates a 2-stage pipeline out of add and mul.

The makePipeline() method also initializes all stages by running the initializer kernels (with the range values) specified in each stage's initializerKernel() call.
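
Presumably the same pattern extends to pipelines with more than two stages by chaining prependToStage() calls and calling makePipeline() on the first stage. This is an assumption inferred from the two-stage example above; the div stage below is hypothetical:

// Hypothetical 3-stage chain: add -> mul -> div (div is built the same way as the other stages).
add.prependToStage(mul);            // add feeds mul
mul.prependToStage(div);            // mul feeds div
var pipeline3 = add.makePipeline(); // build the pipeline starting from its first stage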

Pushing data to the pipeline is what actually runs it:

pipeline.pushData(new object[] { arrayToGetData }, new object[] { arrayToReceiveResult });

Each pushData() call moves the pushed data one stage further through the pipeline until it reaches the target (result) array.

The pushData() method returns a boolean indicating whether a result is ready. If true, the first pushed input has been computed by all pipeline stages and has already been copied to the output array (or whichever result array the client code supplied), so it can be used directly:

if(pipeline.pushData(new object[] { arrayToGetData }, new object[] { arrayToReceiveResult }))
{
    Console.WriteLine("First result is ready!");
    // use the result for other tasks once it is ready (instead of the garbage results produced during the first fill-up iterations)
}

arrayToGetData: given inside an object[]; it must have exactly the same length and element type as the first stage's input array. If the first stage has multiple input arrays, the object[] must contain one matching array for each of them.

arrayToReceiveResult: the target that receives copies of the last stage's output arrays. As with the inputs, each output array must have the same element type and length as its receiver array.
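
Putting these pieces together, a typical driver loop (a sketch; the frame arrays and iteration count are illustrative, and plain float[] client arrays are assumed to satisfy the type and length rules just described) pushes one input per call and starts consuming results once pushData() returns true:

float[] frameIn = new float[1024];   // must match the first stage's input array in element type and length
float[] frameOut = new float[1024];  // must match the last stage's output array in element type and length

for (int i = 0; i < 100; i++)
{
    // ... fill frameIn with the i-th input here ...
    bool resultReady = pipeline.pushData(new object[] { frameIn }, new object[] { frameOut });
    if (resultReady)
    {
        // frameOut now holds a completed result; with client arrays at both ends the first
        // true return arrives after M*2+1 calls (see the examples below)
        Console.WriteLine("result for an earlier frame is ready at iteration " + i);
    }
}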

To get results sooner (without changing throughput), data does not have to be pushed and popped through client arrays; the input and output arrays of the first (and last) stages can be accessed directly instead. Omitting a pushData() parameter keeps the pipeline from flipping the double buffers at that end, so a result can be obtained up to 2 iterations earlier.

Examples:

M is the number of stages in the pipeline.

if(pipeline.pushData(new object[] { arrayToGetData }, new object[] { arrayToReceiveResult }))
{
    Console.WriteLine("Extra client arrays for both inputs and outputs of pipeline");
    Console.WriteLine("Needs M*2+1 iterations");  
    Console.WriteLine("arrayToGetData  and arrayToReceiveResult are used");  
}

if(pipeline.pushData(null, new object[] { arrayToReceiveResult }))
{
    Console.WriteLine("Extra client arrays for only outputs of pipeline");
    Console.WriteLine("Needs M*2 iterations"); 
    Console.WriteLine("input arrays of first stage are directly accessed");   
}

if(pipeline.pushData(new object[] { arrayToGetData }))
{
    Console.WriteLine("Extra client arrays for only inputs pipeline");
    Console.WriteLine("Needs M*2 iterations");  
    Console.WriteLine("output arrays of last stage are directly accessed");
}

if(pipeline.pushData())
{
    Console.WriteLine("Needs M*2-1 iterations"); 
    Console.WriteLine("input arrays of first stage and output arrays of last stage are directly accessed"); 
}
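
For the no-parameter form, the client works directly on the first stage's input array and the last stage's output array. A minimal sketch using the add/mul stages from above, assuming ClArray<float> elements can be read and written from the host through an indexer (if not, its backing array would be used instead):

// write the next input directly into the first stage's input array
for (int i = 0; i < 1024; i++)
    inputArrayGPU1[i] = (float)i;

// no client arrays are passed, so the double buffers at both ends are not flipped (M*2-1 iterations)
if (pipeline.pushData())
{
    // read the result directly from the last stage's output array
    Console.WriteLine("first element of the result: " + outputArrayGPU2[0]);
}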

Note: only horizontal (and linear) pipelines are supported for now. Future versions will include multi-layered stages (which run concurrently with each other).


Revisiting rules:

  • Input arrays given as parameters to the pushData method must match the first stage's input arrays in type, number, and order.
  • Output arrays given as parameters to the pushData method must match the last stage's output arrays in type, number, and order.
  • Each intermediate stage's inputs and outputs must match its neighbors' in type (and length).
  • If a pushData parameter is given a null value, the result for an input is computed 1 iteration sooner, but that input or output must then be read/written by accessing the stage's arrays directly.
  • The "numberOfArrayElementsPerWorkItem" field of each per-stage input/output array matters when copying the whole array to/from a device. For example, a byte array with this field set to 12 is a proper input for a float3 array (4 bytes per float times 3 components = 12 bytes per float3 element); a sketch appears after the enqueue-mode notes below.
  • Inputs, hidden arrays (for sequential logic), and outputs must be added in the same order as the kernel functions' arguments. Kernels must order their parameters as: inputs, hiddens, outputs.
  • Hidden arrays do not move data to other arrays, so PCI-e is not used for any read/write on them. They have no duplicates and serve per-stage sequential logic.
  • Multiple kernels per stage (non-pipelined) (v1.2.3+): kernel names can be separated with " ", ",", ";" and the newline character. They run in the order they are written and take their global/local ranges from the int array parameters of the same method call. When some of the names are separated with the "@" character, that group of kernels is run at once, without extra sync points between them, which reduces both the number of array copies and the number of redundant sync points and therefore the total kernel latency. When kernel names are separated without "@", the first kernel should use the input array (and optionally a hidden array), intermediate kernels should use hidden arrays, and the last kernel should use the output array (and optionally a hidden array), because separators other than "@" re-read the same input arrays and re-write the same output arrays repeatedly (device-to-device pipelining is meant for a single kernel per device). Kernel names separated by "@" share the same global and local ranges, so only 1 global and 1 local range value is given for the whole group. Example (the code block after this list):
  • (v1.2.7+) Enabling enqueue mode for a pipeline stage makes it synchronize with the host only once: inputs are read once before the first kernel, all kernels are executed, and outputs are written once after the last kernel. This behavior differs from non-enqueue mode and reduces API overhead by a good margin.
mulFunctionStage.addKernels(@"
__kernel void init(__global double * input,__global double * sequentialLogic, __global double * output)
{
     int threadId=get_global_id(0);
     sequentialLogic[threadId]+=0.1;
}

__kernel void sqrtFunction(__global double * input,__global double * sequentialLogic, __global double * output)
{
     int threadId=get_global_id(0);
     output[threadId]=sqrt(input[threadId])+sequentialLogic[threadId];
}", "sqrtFunction@init@init@init@init", new int[] { N}, new int[] { 256});

Here one sqrtFunction kernel and four init kernels all use the same N as the global range and 256 as the local range. Different separators can also be mixed:

"sqrtFunction@init@init@init@init someMoreKernels", new int[] { N,M}, new int[] { 256,128}
"a@b c@d,e@e;f f", new int[]{K/*@*/,L/*@*/,M/*@*/,N,N},new int[]{64/*@*/,64/*@*/,256/*@*/,128,256}

Separating kernel names with " ", ",", ";" or the newline character makes each kernel execution re-read the input values (so there is no harm in earlier kernels changing them). "@" separation makes the whole group read the input only once and write the result only once, so changing the input array in the first kernel affects the later kernels of the same "@" group. Similarly, with "@", additive changes to the output arrays accumulate across all kernels into the final output. With " ", ",", ";" or newline separation, additive changes to the output arrays are lost: each kernel's output overwrites the previous one and only the last kernel's output survives, so earlier kernels may freely modify the output.

These behaviors change when enqueue mode is enabled for a stage: all inputs are read only once, all kernels execute, all outputs are written only once, and the host synchronizes only once at the end. This reduces API overhead by at least 30x for thousands of lightweight kernels.

TL;DR;

                     "kernel1          kernel2@kernel3@kernel4          kernel5           kernel6@kernel7"
                    read  write sync read                  write sync read  write sync read           write sync
enqueue mode on:    read                                                                              write sync
(v1.2.7+)

read pro: each input read renews the inputs on the device, so a kernel that changes them cannot harm the next kernel group.

read con: excessive reads become a bottleneck.

write pro: output data is preserved across all kernels, so the output array can also serve as another hidden array for sequential logic.

write con: excessive writes become a bottleneck, and the last kernel overwrites the previous data.

(v1.2.7+) enqueue mode: reads only once, computes all kernels, writes only once, synchronizes only once; can use only a single device per stage.
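
Returning to the numberOfArrayElementsPerWorkItem rule from the list above, a hedged sketch (the field name is taken from that rule and the stage is hypothetical; verify the exact member name against the ClArray API):

// a byte buffer carrying one float3 (12 bytes) per work item, for 1024 work items
ClArray<byte> float3Bytes = new ClArray<byte>(1024 * 12);
float3Bytes.numberOfArrayElementsPerWorkItem = 12; // 4 bytes per float * 3 components = 12 bytes
stage.addInputBuffers(float3Bytes); // now a valid backing store for a float3 kernel argument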

For now, kernel definitions must follow this parameter pattern:

__kernel void test(input1,input2,hidden1,hidden2,hidden3,output1,output2){}

and the inputs must appear in the same order they were added on the host side; the same holds for the hidden and output arrays.

stage.addInputBuffers(input3);
stage.addInputBuffers(input2);
stage.addInputBuffers(input1);

must map to

__kernel void test(input3,input2,input1,...){}

stage.addOutputBuffers(output3);
stage.addOutputBuffers(output2);
stage.addOutputBuffers(output1);

must map to

__kernel void test(..., output3,output2,output1){}