http://i.imgur.com/7uc6yr1.png

The OpenCL block allows a user to interface with an OpenCL compatible device, like a GPU. This block handles most of the complications of using the OpenCL API. All the user has to do is feed the block a .cl file with the kernel source and click run! This block makes use of GRAS's special buffer model so memory allocated from OpenCL can be directly written by upstream blocks and read by downstream blocks.

http://i.imgur.com/mPaOh0d.jpg

The first step is to install an OpenCL development environment. This part is specific to the hardware or GPU in question, so please refer to your vendor's OpenCL SDK installation instructions. Here are a few examples from my personal experience:

Ubuntu with Nvidia GPU

The Nvidia OpenCL packages are available in the default Ubuntu repositories:

sudo apt-get install nvidia-opencl-dev opencl-headers

Ubuntu with Intel OpenCL SDK

An RPM for the Intel SDK can be downloaded from Intel's website and converted for installation with alien:

wget http://registrationcenter.intel.com/irc_nas/2563/intel_sdk_for_ocl_applications_2012_x64.tgz
tar -xzvf intel_sdk_for_ocl_applications_2012_x64.tgz
sudo apt-get install alien
sudo alien -i intel_ocl_sdk_2012_x64.rpm
sudo ln -s /usr/lib64/libOpenCL.so /usr/lib/libOpenCL.so

After installing the OpenCL development environment, you should install GRAS according to the build instructions here:

During the cmake configuration step, you should see output similar to this:

Found OpenCL: /usr/lib/libOpenCL.so

If the cmake configuration cannot find the OpenCL development files, the directories for the OpenCL headers and libraries can be set manually via the following cmake variables (see the example invocation after this list):

  • OPENCL_LIBRARIES
  • OPENCL_INCLUDE_DIRS
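
For example, the configure step could be pointed at a manual OpenCL installation like this (the paths below are placeholders for wherever your vendor's SDK installed the files):

cmake -DOPENCL_INCLUDE_DIRS=/path/to/opencl/include -DOPENCL_LIBRARIES=/path/to/libOpenCL.so ../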

The following unit test can help to diagnose OpenCL issues: enumerating devices, allocating buffers, etc... If this test passes, it is a good indication that OpenCL is set up properly and development of your application can continue.

grex/jit/opencl_test_test.sh

Created default thread pool with 8 threads.
Number of platforms: 1, selecting 0 ...
    name: NVIDIA CUDA
    vendor: NVIDIA Corporation
    version: OpenCL 1.1 CUDA 4.2.1
Number of devices: 1, selecting 0 ...
Making WO buffers for output port 0...
Making RO buffers for input port 1...
Making RO buffers for input port 0...

http://i.imgur.com/WWeapXl.png

The OpenCL block can be used in C++, Python, or GNU Radio Companion environments. The user has to know surprisingly little about the OpenCL API. Most of the OpenCL API revolves around buffer allocation, kernel compilation, device detection, and so on; fortunately, this OpenCL block wraps all of that for you. The user only needs to be concerned with implementing a kernel.

The internet is swarming with examples of how to write kernels in OpenCL, so I will not go into detail about how to write a kernel. However, this guide will tell you how to set up the kernel's parameters so that it can interface with the OpenCL block. Simply put, each parameter is a pointer that corresponds to an input or output port, with input ports enumerated first, followed by output ports.

Suppose a two-port adder; the kernel definition would look like this:

__kernel void add_2x_float32(
    __global const float* in0,
    __global const float* in1,
    __global float* out
)
{
    //sum one pair of input samples per work item
    const size_t i = get_global_id(0);
    out[i] = in0[i] + in1[i];
}

Some very simple examples of kernels can be found in the unit test code:

There are several configuration parameters responsible for controlling how the kernel gets executed, specifically, how work dimensions get sliced up based on global and local sizes. There are also configuration parameters that control how items are produced and consumed by the block.

From the documentation in https://github.com/guruofquality/grex/blob/master/jit/opencl_block.xml

The global size factor - a kernel execution parameter.
This factor controls how get_global_size() is calculated
based on the minimum number of items at each input port.

Local size - a kernel execution parameter.
Local size controls the local work size dimension.

The production factor - a production/consumption parameter.
This factor controls how items get produced from this block
based on the minimum number of items at each input port.

The consumption offset - a production/consumption parameter.
This value controls how items get consumed from this block
based on the minimum number of items at each input port.
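
As a concrete sketch (not from the grex sources; the kernel name and parameter values here are illustrative), consider a decimate-by-2 kernel where each work item consumes two input items and produces one output item. Under the semantics described above, a global size factor of 0.5 and a production factor of 0.5 would be plausible settings:

__kernel void decim_2x_float32(
    __global const float* in0,
    __global float* out
)
{
    //average each pair of input samples into one output sample
    const size_t i = get_global_id(0);
    out[i] = (in0[2*i] + in0[2*i + 1]) / 2.0f;
}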

The OpenCL API encompasses a great deal, and it would be impossible for this block to cover all of it. So far, this block handles linear arrays of data in and out, and exposes hooks to control linear work groups and work dimensions. I think this makes sense for GNU Radio applications, which are often based on processing buffers of linear samples. Please offer suggestions if you feel something important is missing!


The OpenCL block makes use of GRAS's advanced buffering model. Using the GRAS API, the OpenCL block swaps out its input and output buffer queues, and replaces these with a custom queue that uses OpenCL's buffer allocators. Therefore, blocks upstream of the OpenCL block write into memory allocated by OpenCL, and blocks downstream of the OpenCL block read from memory allocated by OpenCL. See actual implementation for buffer API usage:

The OpenCL buffers are allocated with the CL_MEM_ALLOC_HOST_PTR flag. On a PCI Express graphics card, buffers can be located in pinned memory and DMA'd over the PCIe interface when reads and writes are enqueued. Nvidia recommends this method for best performance in its best practices guide.
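
For reference, here is a minimal standalone sketch of this allocation pattern using the standard OpenCL 1.x C API. This is not the GRAS implementation, just an illustration of the CL_MEM_ALLOC_HOST_PTR and map/unmap calls it builds on:

#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    //ask the implementation to allocate host-accessible (often pinned) memory
    const size_t len = 4096;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, len, NULL, &err);

    //map the buffer to get a host pointer; upstream/downstream code can
    //read and write through this pointer without an extra copy
    void *host_ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE,
        CL_MAP_READ | CL_MAP_WRITE, 0, len, 0, NULL, NULL, &err);
    printf("mapped host pointer: %p\n", host_ptr);

    clEnqueueUnmapMemObject(queue, buf, host_ptr, 0, NULL, NULL);
    clReleaseMemObject(buf);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}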
