Pipelining: Device to Device

For versions before v1.2.0, only separable kernels had the option to be pipelined concurrently for multiple GPUs. But some projects having multiple non-separable kernels could not get any help from that feature. Generally more hardware-dependent in-kernel commands like "atomic functions" which use any global memory address whenever needed, make the "partitioning" impossible.

To bypass difficulties of partitioning of multiple non-partitionable kernels, version 1.2.0 of Cekirdekler API adds "device to device" pipelining feature. This technique binds a different device for each stage of a multi-stage pipeline and runs all stages concurrently. Concurrency of stages is also expanded to include double buffering that makes inter-stage data movements' latencies hidden behind the compute time of stages (if its greater).

This is how it works:

Creating a pipeline needs the stages to be created first. A pipeline stage is created with kernels, input-output(and hidden for sequential logic) buffers and a device.

Pipeline stage for addition of "1" to array elements:

Hardware.ClDevices gpu1 = Hardware.ClPlatforms.all().devicesWithMostComputeUnits()[0];
Pipeline.ClPipelineStage add = new Pipeline.ClPipelineStage();
add.addDevices( gpu1);
ClArray<float> inputArrayGPU1 = new ClArray<float>(1024);
ClArray<float> outputArrayGPU1 = new ClArray<float>(1024);
add.addInputBuffers(inputArrayGPU1);
add.addOutputBuffers(outputArrayGPU1);
add.addKernels(@"
__kernel void vecAdd(__global float * input, __global float * output)
{ 
    int id=get_global_id(0); 
    output[id]=input[id]+1.0f; 
}", "vecAdd", new int[] { 1024/*global range*/ }, new int[] { 256/* local range*/ });

pipeline stage for multiplication of array elements by 5:

Hardware.ClDevices gpu2 = Hardware.ClPlatforms.all().devicesWithMostComputeUnits()[1];
Pipeline.ClPipelineStage mul = new Pipeline.ClPipelineStage();
mul .addDevices( gpu2);
ClArray<float> inputArrayGPU2 = new ClArray<float>(1024);
ClArray<float> outputArrayGPU2 = new ClArray<float>(1024);
mul .addInputBuffers(inputArrayGPU2);
mul .addOutputBuffers(outputArrayGPU2);
mul .addKernels(@"
__kernel void vecMul(__global float * input, __global float * output)
{ 
    int id=get_global_id(0); 
    output[id]=input[id]*5.0f; 
}", "vecMul", new int[] { 1024/*global range*/ }, new int[] { 256/* local range*/ });

building a pipeline from add+mul stages:

add.prependToStage(mul); // prepends add stage to mul stage
var pipeline = add.makePipeline(); // creates a 2-stage pipeline out of add and mul.

pushing data to pipeline:

pipeline.pushData(new object[] { arrayToGetData }, new object[] { arrayToReceiveResult })

pushData() method returns a boolean value indicating that the result is ready or not. If true, the first input values has been computed on all pipeline stages and already copied to output array(or any result array given by client code).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pipelining: Device to Device

Clone this wiki locally