A small demo of how to perform CUDA compute and why this is not always the optimal choice on small amounts of data.
TL, DR: If you can split your work in less than number of cores (threads) on your CPU - do it on your CPU. If your work can be split in more than 1.25*cores(threads) (it can be highly parallelized) on our CPU - it’s worth it do on CUDA.
Feel free to use this code as an example or a starting point in CUDA compute.
WARNING! This code is not optimized. This code only meant to show BASIC acceleration using CUDA compared to CPU. By doing further optimizations, both CUDA and CPU can be optimized. CPU code is purposefully NOT multithreaded. By implementing multithreading its compute time can be decreased by the factor of time/cores * 90% or if you CPU supports multithreading (most of them do) by about time/threads * 70% depending on code implementation.
Sample Data: calculation of 16k and 64k samples of Y[i] = A[i]B[i] + A[i] A[i]* A[i]*C[i]/B[i] on CPU and on CUDA.
CPU: Intel Core i5-8500, 3.90 GHz (6 cores, 6 threads) RAM: 16 Gb DDR4 2666 MHz GPU: nVidia GeForce GTX 970 • CUDA version: 5.2 • CUDA cores: 1664 • VRAM: 4095 MB • Max Block Size: 1024102464 • Max Warp Size: 32 • Max Grid Size: 21474836476553565535 OS Win10 Pro 1903
Vertical axis - time to compute in ms. Horizontal axis - block size on CUDA.