CUDA Graphs
Previous: Performance Experiment: Synchronization Methods
So far in the second half of this tutorial series, we have learned about asynchronous execution and how to achieve it using streams, how to avoid data hazards using various methods of synchronization, and how to set up a chain of dependencies between operations using events. We have also seen when each of these is appropriate and what benefits each can offer. Now we are ready to introduce another form of GPU computing that builds on all of these concepts while taking a noticeably different approach: graphs.
CUDA 10 introduced the graph model to GPU programming with one goal in mind: speed. While streams and events allow programmers to leverage concurrency in their programs to great effect, the folks at NVIDIA sought opportunity for further improvement and eventually found it by eliminating overhead in the kernel launch process.
We haven't really discussed it so far, but there is more going on when we launch a kernel than meets the eye - the CPU has to spend time and resources putting together the launch (the kernel, its arguments, and its launch configuration) before it is sent to the GPU for execution. This overhead is fairly small for any single launch, but it is always present, and it matters more as the kernels being run get smaller and smaller, to the point that more time can be spent launching kernels than actually running them.
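To get a rough feel for this overhead, the sketch below launches a deliberately tiny kernel many times and reports the average wall-clock time per launch. The kernel name `tinyKernel` and the launch count are assumptions made up for this illustration, and the number printed will vary with the GPU, driver, and system, so treat it as an estimate rather than a benchmark.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that does almost no work, so launch cost dominates.
__global__ void tinyKernel(float *data) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        data[0] += 1.0f;
    }
}

int main() {
    const int numLaunches = 10000;  // arbitrary count chosen for illustration

    float *d_data = nullptr;
    cudaMalloc(&d_data, sizeof(float));
    cudaMemset(d_data, 0, sizeof(float));

    // Warm-up launch so one-time initialization is not included in the timing.
    tinyKernel<<<1, 1>>>(d_data);
    cudaDeviceSynchronize();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < numLaunches; ++i) {
        tinyKernel<<<1, 1>>>(d_data);  // each launch pays the CPU-side setup cost
    }
    cudaDeviceSynchronize();  // wait for all launched kernels to finish
    auto stop = std::chrono::steady_clock::now();

    double totalUs =
        std::chrono::duration<double, std::micro>(stop - start).count();
    printf("Average time per launch (including the tiny kernel itself): %.2f us\n",
           totalUs / numLaunches);

    cudaFree(d_data);
    return 0;
}
```

Because the kernel body is nearly empty, most of the measured time comes from launching rather than computing, which is exactly the situation where graphs tend to help.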
To reduce this overhead, NVIDIA introduced graphs. Graphs allow a GPU workflow to be defined once, in a manner very similar to a computational graph, and then launched as a single unit, potentially many times. The result is usually a performance increase for small kernels, because most of the setup work is done once, when the graph is instantiated ahead of any kernel execution, rather than every time a new task is given to a stream.
Graphs can be defined in two ways: stream capture and explicit definition. Stream capture lets developers adopt graphs in their programs with minimal changes: it records the operations submitted to a stream and reproduces them as a graph for future runs. The alternative is a set of API functions that allow developers to build their own graphs node by node, which can be necessary for operations that cannot be recorded using stream capture.
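As a preview, here is a minimal sketch of both approaches side by side. The kernel `addOne`, the problem size, and the number of replays are assumptions chosen for illustration, and error checking is omitted for brevity. The five-argument form of `cudaGraphInstantiate` is the one used in CUDA 10 and 11; CUDA 12 changed the primary signature to take a flags argument, so you may need to adjust that call for your toolkit version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only for this illustration.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    int n = 1024;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Method 1: stream capture. The launches between Begin and End are
    // recorded into a graph instead of being executed.
    cudaGraph_t capturedGraph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    addOne<<<blocks, threads, 0, stream>>>(d_data, n);
    addOne<<<blocks, threads, 0, stream>>>(d_data, n);
    cudaStreamEndCapture(stream, &capturedGraph);

    // Method 2: explicit definition. The same two-kernel chain is built
    // node by node, with the second node depending on the first.
    cudaGraph_t explicitGraph;
    cudaGraphCreate(&explicitGraph, 0);

    void *kernelArgs[] = { &d_data, &n };
    cudaKernelNodeParams params = {};
    params.func = (void *)addOne;
    params.gridDim = dim3(blocks);
    params.blockDim = dim3(threads);
    params.sharedMemBytes = 0;
    params.kernelParams = kernelArgs;
    params.extra = nullptr;

    cudaGraphNode_t first, second;
    cudaGraphAddKernelNode(&first, explicitGraph, nullptr, 0, &params);
    cudaGraphAddKernelNode(&second, explicitGraph, &first, 1, &params);

    // Instantiate each graph once (CUDA 10/11 signature; see note above).
    cudaGraphExec_t capturedExec, explicitExec;
    cudaGraphInstantiate(&capturedExec, capturedGraph, nullptr, nullptr, 0);
    cudaGraphInstantiate(&explicitExec, explicitGraph, nullptr, nullptr, 0);

    // Replay each instantiated graph several times with a single call each.
    const int replays = 3;  // arbitrary count chosen for illustration
    for (int i = 0; i < replays; ++i) {
        cudaGraphLaunch(capturedExec, stream);
        cudaGraphLaunch(explicitExec, stream);
    }
    cudaStreamSynchronize(stream);

    // Each graph launch runs addOne twice, so every element is incremented
    // 2 * 2 * replays times in total.
    float first_value = 0.0f;
    cudaMemcpy(&first_value, d_data, sizeof(float), cudaMemcpyDeviceToHost);
    printf("data[0] = %.1f (expected %.1f)\n", first_value, 4.0f * replays);

    cudaGraphExecDestroy(capturedExec);
    cudaGraphExecDestroy(explicitExec);
    cudaGraphDestroy(capturedGraph);
    cudaGraphDestroy(explicitGraph);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```

Note how the capture version reuses ordinary kernel launch syntax, while the explicit version spells out each node and its dependencies. In both cases the graph is instantiated once and then replayed with `cudaGraphLaunch`.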
We will take a look at both of these methods and more in our upcoming articles, with plenty of examples along the way.