A clean and simple template for developing CUDA C++ kernels and testing them in Python/PyTorch.
Tested on Ubuntu 20.04.
```
.
├── README.md
├── benchmark.py        // use this script to benchmark your kernels
├── csrc                // C/C++ CUDA files
│   ├── api.cpp         // define the Python interface here
│   ├── matmul.cu       // a sample CUDA kernel
│   └── square.cu       // another sample CUDA kernel
├── requirements.txt
├── setup.py            // your code is compiled through this script
└── tests               // test the correctness of your kernels here
    ├── test_matmul.py
    └── test_square.py
```
First, install CUDA and PyTorch. The preferred way to install CUDA is through Conda. Also note that you will need an NVIDIA GPU to run this.
```
conda create -n cuda-kernels                                                    # create a new Conda environment
conda activate cuda-kernels                                                     # activate the environment
conda install cuda -c nvidia/label/cuda-12.4.0                                  # choose the desired CUDA version (here we use 12.4)
conda install pytorch pytorch-cuda=12.4 -c pytorch -c nvidia/label/cuda-12.4.0  # install PyTorch built against the same CUDA version
```
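Before moving on, it's worth verifying that PyTorch was built with CUDA support and can see your GPU. A quick sanity check:

```python
# Quick sanity check that PyTorch can see the GPU
import torch

print(torch.cuda.is_available())      # should print True
print(torch.version.cuda)             # CUDA version PyTorch was built against (e.g. 12.4)
print(torch.cuda.get_device_name(0))  # your GPU model
```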
Finally, install the remaining dependencies:
```
pip install -r requirements.txt
```
This repo contains two sample CUDA kernels that you can use as a starting point: `csrc/square.cu` and `csrc/matmul.cu`.

The first step is to compile the kernels, which is done by running `setup.py`:

```
python setup.py install
```
This will automatically compile the source files found in `csrc`. Note that every time you change something in `csrc`, you need to recompile using the above command. For faster compilation, specify the compute capability of your GPU by setting the `COMPUTE_CAPABILITY` variable in `setup.py`.
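For orientation, a typical `setup.py` for this kind of template builds the extension via `torch.utils.cpp_extension` and passes the architecture flag to `nvcc`. The following is a minimal sketch under that assumption; the `COMPUTE_CAPABILITY` variable comes from this repo, but the module name and exact structure here are illustrative, so check the actual `setup.py`:

```python
# setup.py -- minimal sketch, assuming the standard torch.utils.cpp_extension build;
# the repo's actual setup.py may differ in details.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

COMPUTE_CAPABILITY = 86  # e.g. 86 for an RTX 30xx (Ampere) GPU

setup(
    name="cuda_kernels",  # module name assumed for illustration
    ext_modules=[
        CUDAExtension(
            name="cuda_kernels",
            sources=["csrc/api.cpp", "csrc/matmul.cu", "csrc/square.cu"],
            extra_compile_args={
                "cxx": [],
                "nvcc": [
                    # compile only for your GPU's architecture instead of many
                    f"-gencode=arch=compute_{COMPUTE_CAPABILITY},code=sm_{COMPUTE_CAPABILITY}",
                ],
            },
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```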
Once you've compiled, you can use the provided scripts to:

- Test your kernels: `pytest -v -s`
- Benchmark your kernels: `python benchmark.py`
These two commands should work out of the box for the two kernels mentioned above.
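As a reference point, a correctness test typically compares the kernel's output against a PyTorch baseline. A sketch, assuming the compiled extension is importable as `cuda_kernels` and exposes a `square` binding (check `csrc/api.cpp` and `setup.py` for the actual names):

```python
# tests/test_square.py -- illustrative sketch; module and function names are assumed
import torch
import cuda_kernels  # the extension module built by setup.py (name assumed)

def test_square_matches_pytorch():
    x = torch.randn(1024, device="cuda")
    out = cuda_kernels.square(x)
    torch.testing.assert_close(out, x * x)  # compare against the PyTorch reference
```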
That's it! Now you can start hacking away and creating your own CUDA kernels.
Once you start writing more serious kernels, you will probably want to do more precise benchmarking. The `benchmark.py` script is a simple timing script, but it is not as precise as using a profiler. If you want detailed information about the performance bottlenecks of your kernels, consider using the `ncu` profiler. For example:

```
ncu -k square_kernel python benchmark.py -i 1
```

The `-k` flag makes sure that only the `square_kernel` function is profiled.
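If you just need more trustworthy timings from Python without a full profiler run, CUDA events are a common alternative to host-side wall-clock timers, since they measure on the GPU itself. A generic sketch (this is not the repo's `benchmark.py`):

```python
# Generic CUDA-event timing sketch (illustrative; not the repo's benchmark.py)
import torch

def time_kernel(fn, *args, warmup=10, iters=100):
    """Return the average runtime of fn(*args) in milliseconds."""
    for _ in range(warmup):  # warm-up runs: stabilize clocks and caches
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()  # wait for the recorded events to complete
    return start.elapsed_time(end) / iters  # elapsed_time returns milliseconds
```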
Note: this will not work on most cloud GPU instances out of the box. See the "Running the `ncu` profiler on a cloud GPU instance" section below to fix this.
The Nsight Compute profiler (`ncu`) is a very useful tool for profiling CUDA kernels. However, it will not run out of the box on cloud GPUs. If you run `ncu`, you might get output like this:

```
$ ncu ./benchmark
==PROF== Connected to process 2258 (/mnt/tobias/benchmark)
==ERROR== ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM
```
To fix this, you can run `ncu` with `sudo`. Note, however, that `sudo` changes your environment variables, which means that `ncu` may no longer be on the PATH. This can be fixed by specifying the full path to `ncu`, e.g.:

```
which ncu                                  # check the ncu path
sudo /opt/conda/envs/cuda-kernels/bin/ncu  # pass the full ncu path
```
In my case, `ncu` is provided through Conda. To make running `ncu` more convenient, you can add your Conda path directly to the sudoers file. Do this as follows:

```
sudo visudo
```

Add your Conda environment's `bin` directory to the `Defaults secure_path` line:

```
Defaults secure_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/path/to/conda/env/bin"
```

Replace `/path/to/conda/env/bin` with the actual path to your Conda environment's `bin` directory. You can now run `ncu` simply by prepending `sudo`:

```
sudo ncu
```