Basic CUDA Syntax
Before we can start launching our first GPU program, there are a few terms in the CUDA programming model that we'll need to be familiar with:
Kernel - A kernel is a block of code that is intended to run on the GPU. Kernel definitions are similar to serial function definitions in nature, as are kernel launches and function calls.
Threads - Threads represent the work assigned to a single execution core on the GPU. Each thread can be uniquely identified from within the kernel to determine the specific task or set of data it is allowed to work on.
Blocks - Blocks represent groups of threads, and thus groups of execution cores, that are assigned to run a given kernel. A block always runs on a single Streaming Multiprocessor (SM), so a block can only be as large as a single SM can support. Because each block is guaranteed to reside on one SM, threads within a block can share data through that SM's fast on-chip shared memory.
Grid - Each kernel runs on a single grid, which will always contain at least one block. Grids can contain a very large number of blocks, many more than could run on the GPU at once. Because a grid can span many blocks and thus many SMs, the only memory that all of its threads can rely on sharing is the device-wide global memory. If a grid contains more blocks than the GPU can run at once, the remaining blocks are scheduled as soon as enough resources become available.
Device - Anything denoted as "device" is related to the GPU - this can include device functions (kernels), device memory, etc. It is common practice to denote device variables (i.e., those stored in device memory) with the prefix "d_".
Host - Similar to the device designation, labeling something as "host" means that it lives on the main computer, off of the GPU. The prefix "h_" is used to distinguish host variables from device variables.
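As a small illustration of this naming convention (the variable names below are hypothetical and not part of the tutorial's program), the same data often exists in both places:
int *h_data; // points to an array in host (CPU) memory
int *d_data; // points to a copy of that array in device (GPU) memory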
For our first CUDA program, we are going to write a simple kernel that fills an array with 1s. Let's start by looking at the signature of our kernel definition:
__global__ void fillWithOnes(int *array, int size)
The __global__ keyword in the kernel definition is a specifier that instructs the compiler of three things:
- This is a kernel and not a normal function. This is important, because kernels, unlike functions, have to be launched with a set configuration that specifies the number of blocks and threads they will run on.
- This kernel should be visible to the host and will be called from the host side.
- The contents of this kernel will be executed on the GPU.
There are two other declaration specifiers related to GPU programming: __device__ and __host__. Adding one of these specifiers to a function instructs the CUDA compiler that it is intended to run on the device (GPU) or the host (CPU), respectively.
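As a quick, hedged sketch (these helper functions are hypothetical and not part of the tutorial's example), a __device__ function can only be called from GPU code, while a function marked with both __host__ and __device__ is compiled for both sides:
// Callable only from device code (kernels or other __device__ functions)
__device__ int squareOnDevice(int x) {
    return x * x;
}
// Compiled for both the CPU and the GPU, so it can be called from either side
__host__ __device__ int addOne(int x) {
    return x + 1;
}
__global__ void exampleKernel(int *array, int size) {
    if (threadIdx.x < size)
        array[threadIdx.x] = addOne(squareOnDevice(threadIdx.x));
}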
After the domain specifier comes our return type, which is void. All __global__ functions must have a return type of void.
The kernel takes in two parameters: an integer array, and an integer representing the size of that array. These are the same as you would use for any host function, with one catch: any addresses (i.e., pointers) passed to a kernel must point to GPU memory. Thus we must ensure that the value of array we use when launching the kernel is a device address, which we will do later on.
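For instance (a hypothetical misuse, not part of the tutorial's code), passing an ordinary host array would compile, but the kernel would try to dereference an address the GPU cannot access:
int h_array[256];
fillWithOnes<<<1, 256>>>(h_array, 256); // WRONG: h_array is a host address, not device memory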
Now we're ready to look at the whole kernel:
__global__ void fillWithOnes(int *array, int size) {
    if (threadIdx.x < size)
        array[threadIdx.x] = 1;
}
In the first line of the kernel, we compare threadIdx.x to the size of the array we want to fill. threadIdx is a special built-in variable that indicates where a thread is in relation to the other threads in its block. In this example, the x-axis of the thread index serves as a unique identifier for each of our threads, which allows us to:
- Make sure that all of our threads stay within the bounds of the array (i.e., threadIdx.x < size; see the sketch below)
- Access each unique element within the array and set it to 1

The specifics of how threads and blocks can be organized will be covered a bit later on in the thread and block scheduling tutorial.
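To see why the bounds check matters, consider a hedged sketch (this launch configuration is hypothetical and not used in the tutorial's program) in which the kernel is launched with more threads than there are array elements:
// 512 threads, but only 256 array elements:
// threads with threadIdx.x >= 256 fail the check and write nothing
fillWithOnes<<<1, 512>>>(d_array, 256);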
To launch a kernel, you need two sets of values:
- The launch configuration, which determines the number of threads in each block and the number of blocks in the kernel's grid
- The kernel's parameters, which work just like the parameters of a normal function
The configuration of a kernel is set on the same line as the kernel launch, using what is known as triple-bracket notation <<<>>> (example below). The first value in the triple brackets represents the configuration of thread blocks within the kernel grid, and the second value represents the configuration of threads within each of those blocks.
kernel<<<blocksInGrid, threadsPerBlock>>>();
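Both configuration values can also describe multi-dimensional layouts. As a hedged aside (the dim3 type isn't used anywhere else in this tutorial), CUDA's dim3 type lets you spell out 2D or 3D arrangements of blocks and threads:
dim3 blocksInGrid(4, 4);      // a 4x4 grid of blocks
dim3 threadsPerBlock(16, 16); // each block is a 16x16 tile of threads
kernel<<<blocksInGrid, threadsPerBlock>>>();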
Parameters act much the same way in kernels as they do for normal functions. Parameters of simple types, like integers, characters, and floating-point numbers, can all be passed by value, meaning that a copy of the variable's value is passed rather than the variable itself. Any parameter that can be passed by value requires no special preparation and can be passed to the kernel as normal.
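In fillWithOnes, the size parameter is already an example of this. As another hedged sketch (this kernel is hypothetical), a by-value scalar can also carry data for the threads to use:
__global__ void scaleElements(float *array, int size, float factor) {
    if (threadIdx.x < size)
        array[threadIdx.x] *= factor; // factor arrived as a plain copy; no device allocation needed
}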
If we want to modify a parameter's values within the kernel, then passing by value is not enough. Instead we have to pass the parameter by reference, meaning that we send the kernel (or function) the address of the variable we want to work with. Any variable that is going to be passed by reference to a kernel has to reside on the GPU, since GPU threads can't access CPU memory.
The most common way to do this is to use the cudaMalloc function. Much like normal malloc, cudaMalloc's purpose is to allocate a block of memory on the GPU. Unlike malloc, however, cudaMalloc doesn't return the address of the allocated memory (its return value is an error code). Instead, cudaMalloc takes the address of the pointer we want to set and fills that pointer in once the memory has been allocated. A typical usage of cudaMalloc looks something like this:
int *d_pointer;
// Allocate space on the GPU for 8 integers for d_pointer
cudaMalloc(&d_pointer, sizeof(int)*8);
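Because cudaMalloc uses its return value to report errors (a cudaError_t), it's worth checking that the allocation actually succeeded. A minimal sketch, not part of the tutorial's program:
cudaError_t err = cudaMalloc(&d_pointer, sizeof(int)*8);
if (err != cudaSuccess) {
    // cudaGetErrorString turns the error code into a readable message
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
}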
Now that we've covered all of the basics of launching a kernel, we can put them all together to launch our fillWithOnes kernel.
First, we declare our array and assign it a GPU memory location, and then we launch our kernel using a 1-block grid containing 256 threads.
The kernel will then run and eventually return execution control to the host.
int main(int argc, char *argv[]) {
    int arraySize = 256;
    int *d_array;
    // Allocate memory for the array on the GPU
    cudaMalloc(&d_array, sizeof(int)*arraySize);
    // Launch the kernel using 1 block containing arraySize threads
    fillWithOnes<<<1, arraySize>>>(d_array, arraySize);
}
You can download and run this code for yourself if you'd like to try it out.
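If you have the CUDA toolkit installed, the source file can be compiled with the CUDA compiler, for example with a command like nvcc fillWithOnes.cu -o fillWithOnes (the filename here is just a placeholder).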
You may notice that the program has no output; that is because we can't print the contents of our array while it is stored on the GPU. If you want reassurance that the kernel is working, you can also check out the version that prints the contents of the array before and after the kernel runs.
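For reference, here is a minimal sketch of what such a check might look like (an illustration only, not the exact linked program, and it assumes the fillWithOnes kernel from above is defined in the same file). After the kernel launch, cudaMemcpy copies the array back into host memory - and, because it runs on the default stream, it waits for the kernel to finish before copying - so the host can then print the values:
#include <cstdio>

int main() {
    const int arraySize = 256;
    int h_array[arraySize];
    int *d_array;

    // Allocate device memory and fill it on the GPU
    cudaMalloc(&d_array, sizeof(int) * arraySize);
    fillWithOnes<<<1, arraySize>>>(d_array, arraySize);

    // Copy the results back into host memory so the CPU can read them
    cudaMemcpy(h_array, d_array, sizeof(int) * arraySize, cudaMemcpyDeviceToHost);

    for (int i = 0; i < arraySize; i++)
        printf("%d ", h_array[i]);
    printf("\n");

    cudaFree(d_array);
    return 0;
}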