Thread and Block Scheduling
Previous: Performance Experiment: On GPU vs Off GPU Bandwidth
You should be able to recall from the CUDA Basics tutorial that each thread represents the work done by a single execution core.
In that article and the preceding one on GPU architecture, the ideas of threads and execution cores were slightly oversimplified, giving the impression that each thread can be individually controlled and that execution cores operate entirely independently of one another. This is not the case.
Instead, execution cores and threads are executed in groups called warps. A warp typically consists of 32 execution cores / threads, all of which operate in perfect synchronization with one another, in lockstep. Threads within a warp all follow the same instructions: even if you write a kernel that excludes certain threads from a section of code (which we will learn to do shortly), the exclusion only affects memory access instructions. Arithmetic and other functional operations run on all threads of the warp regardless of whether each thread actually needs their results.
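To make this concrete, here is a minimal sketch (the kernel name and the branch condition are invented for illustration) of a kernel in which every warp contains threads on both sides of a branch, so the hardware runs both paths for the whole warp, masking off the threads not on the active path:
__global__ void divergentKernel(int *out)
{
    int i = threadIdx.x;
    // Even- and odd-numbered threads take different branches, so every warp
    // contains threads on both paths. The warp executes the first path with
    // the odd threads masked off, then the second path with the even threads
    // masked off - only the memory writes are suppressed for masked threads.
    if (i % 2 == 0) {
        out[i] = i * 2;
    } else {
        out[i] = i * 3;
    }
}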
In addition to being automatically grouped into warps by the hardware, you as the programmer can also logically organize collections of threads into groups called blocks. Blocks can have any number of threads in them, but the common convention is to use a multiple of 32 as the block size, since a block whose thread count is not divisible by 32 will contain an underutilized warp.
So far, our early GPU programs have only configured kernels to run with one-dimensional blocks, but CUDA kernels can also be configured with two- or three-dimensional block sizes. The main factor that determines how many dimensions your kernel should use is your problem's domain. Two-dimensional blocks are commonly used for linear algebra problems, because the two-dimensional thread indexing they create naturally lines up with the row-column indexing of matrices from the problem domain, while one-dimensional configurations are suitable for simpler problems.
Similarly, grids can also contain blocks arranged in one-, two-, or three-dimensional patterns, as determined by the needs of the problem, but it often makes sense for a kernel's grid and block configurations to have the same dimensionality.
In this section, we are going to outline how to configure kernels with one-, two-, and three-dimensional block and grid configurations, as well as common patterns for organizing and uniquely identifying threads at each level of dimensionality.
Kernels that only have a single dimension for their blocks and threads are useful for operating on vectors / arrays as well as for general use. Even though you may not have thought of them as one-dimensional at the time, all of the kernels we have dealt with up to this point are one-dimensional.
A one-dimensional kernel can be configured much like you are already used to: by simply providing a number for both the amount of blocks in the kernel grid and the number of threads per block:
// kernelLaunch<<<blocksPerGrid, threadsPerBlock>>>()
mykernel1D<<<2, 32>>>(); // This kernel has 2 blocks with 32 threads each (for a total of 64 threads)
There is a limit on the number of threads that can go in each block, which depends on the GPU itself. Modern NVIDIA GPUs, including datacenter GPUs like the A100, allow up to 1024 threads per block. The number of blocks in a grid, however, is not nearly as limited and can even surpass the number of execution cores physically available on the device, in which case excess blocks are scheduled as soon as the compute hardware becomes free.
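If you want to know the exact limits for your own GPU, you can query them at runtime. Here is a minimal sketch using the CUDA runtime API's cudaGetDeviceProperties:
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // query the properties of device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}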
To index, or uniquely identify, the threads running a kernel, we can use a number of kernel configuration attributes that are built in to each thread running the kernel.
- threadIdx is a data structure that contains a thread's index along each axis within its block.
- blockDim contains the dimensions of each block along each of the axes.
- blockIdx contains a block's index within the grid along each axis.
- gridDim contains the number of blocks in the grid in each direction.
Using these four attributes, which are available within every thread, we can assign every thread a unique identifier and use these identifiers to regulate access to sections of the kernel that should not be executed by all threads.
Let's look at each of these four values using a simple one-dimensional kernel that contains only 2 blocks, each with 4 threads:
In one-dimensional kernels, each of our kernel configuration attributes only has data for a single axis, which is the x-axis by default. Thus, when accessing each of the attributes, we will be accessing the .x value stored in each one.
In the image above, we can see all of the attributes for the kernel launched with myKernel<<<2,4>>>();
- threadIdx.x ranges from 0 to 3, depending on each thread's position in its block. Thread index values stop at 3 for this kernel because thread indexing starts at 0 at the beginning of each block, and there are only 4 threads in each block.
- blockDim.x is 4, since there are 4 threads per block.
- blockIdx.x starts at 0 and counts up to gridDim.x - 1. In this case there are only two blocks, so 1 is the highest block index.
- gridDim.x is 2, since there are 2 blocks in the grid.
In one-dimensional kernels, producing a unique ID for each thread is relatively simple: all we are really trying to do is count the number of threads along the x-axis that come before the thread we are indexing, which we can do using the formula: uniqueID = blockDim.x * blockIdx.x + threadIdx.x
If you are having trouble seeing why this formula works, consider the visual example for the thread with threadIdx.x = 2 and blockIdx.x = 1:
The purple block represents the sum of 4 threads that results from multiplying blockIdx.x (1) by blockDim.x (4), and the blue block represents the addition of threadIdx.x (2) to that sum, which produces the unique identifier 6. This is the same value you would get by simply counting up the threads starting at 0.
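To see the formula in action, here is a minimal sketch (the kernel name is invented for this example) using the same 2-block, 4-thread configuration as the walkthrough, in which each thread prints its own unique ID with device-side printf:
#include <cstdio>

__global__ void printUniqueId1D()
{
    // Count all threads in earlier blocks, then add this thread's offset.
    int uniqueID = blockDim.x * blockIdx.x + threadIdx.x;
    printf("block %d, thread %d -> unique ID %d\n", blockIdx.x, threadIdx.x, uniqueID);
}

int main()
{
    printUniqueId1D<<<2, 4>>>(); // 2 blocks of 4 threads, as in the walkthrough
    cudaDeviceSynchronize();     // wait for the kernel so its output is flushed
    return 0;
}
Note that the lines may print in any order, since blocks and warps are free to be scheduled in any order; the unique IDs themselves are what matter.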
As mentioned earlier, two-dimensional kernels are especially useful when solving problems based in linear algebra: just as a matrix is indexed by a row and column, the kernel grid is indexed along two independent axes, the x and y axes by convention, essentially allowing you to align each matrix element with a thread.
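As a quick sketch of that alignment (the kernel name, the matrix pointer, and the size parameter n are all invented for this example), a 2D kernel can compute a matrix row and column directly from its thread and block indices:
__global__ void scaleMatrix(float *mat, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y; // y-axis indexing maps to matrix rows
    int col = blockIdx.x * blockDim.x + threadIdx.x; // x-axis indexing maps to matrix columns
    if (row < n && col < n)          // skip threads that fall outside the n x n matrix
        mat[row * n + col] *= 2.0f;  // each thread scales exactly one element
}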
To launch a 2D kernel, we have to introduce a variable type that we have not yet used: dim3. Technically, we have been using the dim3 type all along, since both of the parameters passed to the kernel on launch are coerced to the dim3 type, but this marks the first time we will use it explicitly.
In the snippet below, we construct two dim3 objects, one for each aspect of the kernel configuration, before passing them to the kernel launch call in order to create our 2D kernel. Note that we only pass the dim3 constructor 2 values - it fills in a 1 for the z-direction by default if a third argument is not provided.
dim3 threadsPerBlock2D = dim3(4, 4);
dim3 blocksPerGrid2D = dim3(2, 2);
// This kernel has a 2x2 grid in which each block has 4 threads along each axis (64 threads total)
myKernel2D<<<blocksPerGrid2D, threadsPerBlock2D>>>();
To get an idea of what a fully 2D kernel grid looks like, take a look at the image to the right, which is a representation of a kernel with 2 threads per block axis and 2 blocks per grid axis. The resulting kernel has (2 x 2 x 2 x 2) = 16 threads. The model also highlights how each thread can be indexed within its respective block using the x and y axes.
Using the same four kernel configuration attributes as in the one-dimensional walkthrough, we can produce a unique identifier for each thread by counting the threads that come before it in the kernel grid, just as in the 1D example. The primary difference is that we now have to account for threads along two axes instead of one.
We can calculate a thread's ID in four steps, each represented by a different color block in the image to the right.
- The pink block represents all of the threads in the grid rows before our current block's row.
a. gridDim.x * blockDim.x * blockDim.y is the total number of threads in one row of the grid.
b. Multiplying that number by blockIdx.y gives the number of threads in all grid rows prior to the one this block is in.
- The green block represents all of the threads in blocks to the left of the current block within its row.
a. blockDim.x * blockDim.y is the total number of threads in a block.
b. Threads per block is multiplied by the number of blocks prior to the current block, represented by blockIdx.x.
- The purple block represents all of the threads in the rows before the target thread's row, within the same block as the target thread.
a. blockDim.x is the number of threads per row.
b. threadIdx.y is the number of rows prior to the current row in the target thread's block.
- The blue block represents all of the threads that precede the target thread within the same block and row as the target: threadIdx.x.
Determining the unique ID of a target thread is as simple as calculating each of these four components and adding them together, as shown in the example.
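Here is a minimal sketch of the full calculation inside a kernel (the kernel name is invented for this example), with one variable per color-coded component:
#include <cstdio>

__global__ void printUniqueId2D()
{
    int pink   = gridDim.x * blockDim.x * blockDim.y * blockIdx.y; // threads in full grid rows above this block's row
    int green  = blockDim.x * blockDim.y * blockIdx.x;             // threads in blocks to the left within this row
    int purple = blockDim.x * threadIdx.y;                         // threads in rows above, within this block
    int blue   = threadIdx.x;                                      // threads to the left, within this row
    printf("unique ID: %d\n", pink + green + purple + blue);
}
Launching it with the 2x2 grid of 4x4 blocks from above (printUniqueId2D<<<blocksPerGrid2D, threadsPerBlock2D>>>();) prints each of the IDs 0 through 63 exactly once.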
Three-dimensional kernels are more specialized than two-dimensional kernels, but still find use in many three-dimensional modeling and simulation applications in areas such as aerodynamics, thermodynamics, and electromagnetics, as well as other fields in chemistry, physics, engineering, and more.
To launch a 3D kernel, we will once again make two objects of the dim3 type, but this time we will provide the constructor with all 3 arguments, because we want to specify an explicit value for all 3 axes.
dim3 threadsPerBlock3D = dim3(2, 2, 2);
dim3 blocksPerGrid3D = dim3(2, 2, 2);
// This kernel has a 2x2x2 grid in which each block has 2 threads along each axis (64 threads total)
myKernel3D<<<blocksPerGrid3D, threadsPerBlock3D>>>();
The diagram below is a representation of the kernel grid created in the code above. As you can see, it consists of a 2x2x2 grid of blocks (each block is marked by the red lines in the image), each of which contains a 2x2x2 group of threads. Each thread is marked with the coordinates of its block in blue (B(x, y, z)) and its coordinates within its block (T(x, y, z)), which together uniquely identify every single thread.
In three dimensions, producing a unique identifier becomes a bit more complicated, but still isn't overwhelming. By being smart about the order in which we count threads along the various axes, we can arrive at a unique ID while adding only two more steps to the process we used in two dimensions.
To do this, we are essentially going to break up our 3D grid into 2D slices, account for all of the threads in the slices before our target thread's slice, and then account for the threads within that same slice.
Take a look at the color-coded image below for an exact breakdown of how each unique identifier is obtained. In this example, we are focused on the thread at B(1,1,1), T(1,1,1), or unique ID 63, because it highlights how each part of our ID equation plays into the thread sum, which is the overall thread ID.
In the image, the pink, green, purple, and blue blocks represent the exact same parts of the formula that they did in the 2D example, because they account for all of the threads prior to the target thread within the same 2D plane that the thread is on, just as in the 2D example.
The yellow block in the image corresponds to the expression blockDim.x * blockDim.y * blockDim.z * gridDim.x * gridDim.y * blockIdx.z, which accounts for all of the threads in blocks whose z-index is less than that of the target thread's block. In the image, it covers the first two slices of the cube, or IDs 0-31, because the target thread is in a block with a z-value of 1, so this term sums all of the threads in blocks with a z-value of 0.
The orange block serves a similar purpose, but instead of representing whole blocks with a block z-value less than the target thread's, it accounts for threads within the same block layer that have a lower thread z-value. As you can see in the image, the third slice of the cube, colored orange, only contains threads in blocks with index B(X, X, 1). Thus, this expression only has to account for the threads in one z-slice of the entire block layer (blockDim.x * blockDim.y * gridDim.x * gridDim.y) multiplied by the number of slices prior to the target thread's within its block - threadIdx.z.
After completing those two calculations, we have accounted for all of the threads prior to the 2D slice of the grid that the target thread is part of, and can thus treat the remainder like a 2D problem. Hence we add all of the values used for generating the unique 2D identifier to our z-axis sum and get the total number of threads prior to the target - AKA its unique ID.
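As a final sketch (kernel name invented for this example), the complete 3D calculation simply stacks the yellow and orange terms on top of the four 2D terms:
#include <cstdio>

__global__ void printUniqueId3D()
{
    int yellow = blockDim.x * blockDim.y * blockDim.z * gridDim.x * gridDim.y * blockIdx.z; // whole block layers below this block's layer
    int orange = blockDim.x * blockDim.y * gridDim.x * gridDim.y * threadIdx.z;             // thread slices below this thread, within this block layer
    int pink   = gridDim.x * blockDim.x * blockDim.y * blockIdx.y;                          // the same four 2D terms as before,
    int green  = blockDim.x * blockDim.y * blockIdx.x;                                      // applied within the target thread's 2D slice
    int purple = blockDim.x * threadIdx.y;
    int blue   = threadIdx.x;
    printf("unique ID: %d\n", yellow + orange + pink + green + purple + blue);
}
For the 2x2x2 grid of 2x2x2 blocks above, the thread at B(1,1,1), T(1,1,1) computes 32 + 16 + 8 + 4 + 2 + 1 = 63, matching the image.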
If you've read and understood this far, congrats! You've made it through one of the longer and tougher lessons in this series. By contrast, the next article on some of the common applications of parallelism ought to be a breeze.
To see the configuration and indexing of 1D, 2D, and 3D kernels in action, check out the example program made to accompany this tutorial.