
CUDA Memory Types


Previous: Basic CUDA Syntax

GPU Memory

Just as the GPU has three separate types of physical memory (registers, cache, and shared memory), CUDA provides three primitive memory types, shown together in the kernel sketch after this list:

  • Local variables or thread-local variables are those stored in the registers dedicated to each thread and can only be accessed by the thread that "owns" those registers. Because they are stored in the register banks, thread-local variables have the fastest memory access time*. Within a kernel, thread-local variables are declared like any other variable, with just a datatype and a variable name.
    (Ex: float local_value;)

    • * Caution should be taken when using local variables to store arrays or other large data types, as these values are usually too large to be held in registers and are instead spilled to local memory, which resides in slower off-chip device memory (and is only cached on the SM), which can significantly affect access times.
  • Shared variables are variables that can be shared by an entire thread block, and are stored in the on-chip shared memory of the SM the block is running on. Access times for this type of variable are still very fast, but not quite as fast as local variable accesses.
    (Ex: __shared__ float shared_value;)

  • Global variables are stored in device-wide global memory and are the slowest to access by far, with each access requiring hundreds of clock cycles to complete.
    (Ex: __device__ float global_value;)

    • Constant variables are a form of global variable that is read-only from device code. Because they are immutable during kernel execution, the values of constant variables are cached by each SM, making them much faster to access than true global variables.
      (Ex: __constant__ float constant_value = 1.0f;)
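
The four declarations above can appear together in a single kernel. Here is a minimal sketch (the memoryTypesDemo kernel and its variable names are hypothetical, used only for illustration):

// Hypothetical example combining the four variable types above
__device__ float global_value;              // device-wide, slowest to access
__constant__ float constant_value = 1.0f;   // read-only, cached per SM

__global__ void memoryTypesDemo() {
    float local_value = constant_value;     // thread-local, typically held in a register
    __shared__ float shared_value;          // one copy per block, in on-chip shared memory

    if (threadIdx.x == 0)
        shared_value = local_value;         // only thread 0 writes the block's copy
    __syncthreads();                        // make the write visible to the whole block

    if (threadIdx.x == 0)
        global_value = shared_value;        // store the result in device-wide memory
}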

In addition to these primitive types, CUDA also provides device pointers, which fill the same role as ordinary pointers and can be used in much the same way. The primary difference is that the address a device pointer stores is located on the GPU, so its contents are inaccessible to the host. Device pointers do not have a special keyword to denote them as such; instead, they are distinguished by the functions used to allocate and deallocate them, shown below.

// Allocate an 8-element array of integers on a device pointer
int arraySize = 8 * sizeof(int);
int * d_array;
cudaMalloc(&d_array, arraySize);

// Free the device pointer
cudaFree(d_array);

All of these data types are only accessible from the GPU, and attempting to read their values from the host side will cause an error because the host is attempting to access an address it can't read. In order to allow GPU data to be read by the host, we have to use one of CUDA's host variable types.
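
For example, continuing the d_array allocation above, a host-side dereference of the device pointer is invalid:

// INVALID: d_array holds a GPU address, so host code cannot dereference it
// int first = d_array[0];   // typically crashes (segmentation fault) on the host side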

CPU Memory

It is important to note that all of the host memory types covered below are concerned with complex data structures of some sort (e.g. pointers, structs, arrays, etc.) because those types directly interact with memory addresses. In most cases we aren't concerned with CPU/GPU interaction when it comes to primitive variables (int, char, float, etc.) because those can be passed to a kernel without referencing the actual memory addresses they are stored at.
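
For example (the kernels below are hypothetical and shown only to illustrate the point), a primitive value such as a float can simply be passed to a kernel by value, while array data must be handed over through a device pointer:

// A scalar argument is copied by value into the kernel, so no host address is involved
__global__ void scaleByValue(float factor) { /* use factor directly */ }

// Array data, by contrast, must be reached through a device pointer
__global__ void scaleArray(float *d_data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] *= factor;
}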

  • Paged Memory is the term used in CUDA for ordinary host-side variables that you are likely already familiar with. The name comes from the concept of paging in operating systems, which allows blocks of memory to be relocated in physical RAM or temporarily swapped out to disk. Because the physical location of a paged variable can change at any time, the GPU cannot transfer data to or from it directly; the CUDA driver must first stage the data through a temporary pinned buffer, which slows copies between the host and device.
// Allocating an 8-element array using paged memory and malloc
int *paged_array = (int *) malloc(arraySize);

// Use allocated memory...

// Reclaim memory allocated for paged_array at program end
free(paged_array);
  • Page-locked or pinned memory is host memory that the operating system is not allowed to move or swap out the way it can with paged memory. As a result, the GPU can access the addresses associated with pinned memory directly, resulting in much higher CPU-GPU transfer bandwidth. New pinned host memory can be allocated using cuMemAllocHost (from the CUDA Driver API) or cudaMallocHost (from the CUDA Runtime API); both achieve the same result and differ only in which API they belong to. Pinned memory is a limited resource, however: every pinned page is removed from the pool the operating system can page, so it should be used sparingly, as over-allocating it can significantly degrade overall system performance.
// Allocating an 8-element array using pinned memory (CUDA runtime)
int *pinned_array;
cudaMallocHost(&pinned_array, arraySize);

// Free page-locked host array (CUDA runtime)
cudaFreeHost(pinned_array);

Unified Memory

Unified memory is a more recent addition to CUDA that allows memory to be allocated once and remain accessible to both the CPU and GPU, without the need for separate host-only or device-only copies of the data. Because the data is exposed to both sides and either side can update it at any time, accesses must be synchronized so that the CPU and GPU do not touch the data in the wrong order; for example, the host must wait for a kernel to finish before reading managed memory that the kernel writes.

// How to allocate / deallocate unified memory
int *unified;
cudaMallocManaged(&unified, arraySize);

cudaFree(unified);
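
Here is a minimal usage sketch showing that synchronization point (the incrementAll kernel is hypothetical; cudaMallocManaged, cudaDeviceSynchronize, and cudaFree are the real CUDA Runtime calls). The host writes the managed array, the GPU updates it, and the host waits for the kernel to finish before reading it back:

#include <cstdio>

// Hypothetical kernel that adds 1 to each element of a managed array
__global__ void incrementAll(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

int main() {
    int n = 8;
    int *unified;
    cudaMallocManaged(&unified, n * sizeof(int));

    for (int i = 0; i < n; i++)      // CPU writes the initial values
        unified[i] = i;

    incrementAll<<<1, n>>>(unified, n);
    cudaDeviceSynchronize();         // wait for the GPU before the CPU reads again

    for (int i = 0; i < n; i++)
        printf("%d ", unified[i]);   // now safe to read on the CPU

    cudaFree(unified);
    return 0;
}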

In the next tutorial we'll put some of these memory types to work and demonstrate how data can be moved between the CPU and the GPU.

Next: Using CUDA Memory