This package is an attempt to reproduce NVIDIA's CUDA Runtime API [1], i.e. to let the user write device kernels and launch them in a quasi-grid structure on NEC's SX-Aurora TSUBASA vector engine.
To that end, we wrap NEC's VE Offload [2] and UDMA [3] APIs such that the usage mimics CUDA's runtime API.
Installation is a breeze! The dependencies on the target system are:
- python (>= 3.5)
- cmake (>= 3.10)
- reasonably recent gcc/g++ (e.g. from scl devtoolset-8)
- NEC Aurora SDK (ncc, libs), installed under /opt/nec
- LLVM-VE (llvm/clang) from https://sx-aurora.com/repos/veos/ef_extra, installed under /opt/nec
To install:
- Clone this repository:
$ git clone https://github.com/dthuerck/aurora_runtime.git
- Download and build dependencies:
$ cd aurora_runtime
$ chmod +x init.sh
$ ./init.sh
That's it! Now we can build an example application featuring GEMA (256x256 batched matrix addition) and GEMM (256x256 batched matrix multiplication):
$ mkdir build && cd build
$ cmake ..
$ make
Finally, run the example with ./app-test and watch your Aurora hard at work!
The runtime API functions are listed in .runtime/include/aurora_runtime.h; their usage is demonstrated in the example (see app-test.cc).
The runtime centers around the concept of a (virtual) processing group: we write kernels, and each kernel is then executed in a batch of size n via offload and OpenMP. Roughly speaking (for people familiar with CUDA), each processing group is a block and the batch corresponds to a grid of size n.
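For readers coming from CUDA, the mapping can be pictured with a small host-only emulation in plain C. This is a sketch, not the runtime's implementation: launch_batch, pg_kernel_t and record_index are hypothetical names made up for illustration, and the loop merely stands in for the offload-plus-OpenMP dispatch described above. In this picture, pg_ix plays the role of CUDA's blockIdx.x and num_pgs that of gridDim.x.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel signature used only in this sketch. */
typedef void (*pg_kernel_t)(int pg_ix, int num_pgs, void *args);

/* Emulates launching a batch of `num_pgs` processing groups on the host.
 * Each iteration corresponds to one processing group (one CUDA block). */
void launch_batch(pg_kernel_t kernel, int num_pgs, void *args)
{
    assert(kernel != NULL && num_pgs > 0);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int pg_ix = 0; pg_ix < num_pgs; ++pg_ix) {
        kernel(pg_ix, num_pgs, args);
    }
}

/* Example kernel: every processing group writes its own index into its
 * own slot of the output array, so no two groups touch the same element. */
void record_index(int pg_ix, int num_pgs, void *args)
{
    int *out = (int *)args;
    (void)num_pgs; /* unused in this example */
    out[pg_ix] = pg_ix;
}
```

Calling launch_batch(record_index, 8, out) on an int out[8] fills it with the values 0 to 7, one entry per processing group.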
The runtime offers the following variables that are set in kernel functions:
- __pg__ix: the index of the processing group (index in the batch)
- __num_pgs: the batch size / number of processing groups
- __pe__ix / __pg_size: reserved for future use
Lastly, the most important part: kernels are conventional C functions with the annotation ve_kernel, saved in files with a .cve extension.
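To make this concrete, here is a rough sketch of what a GEMA-style kernel body could look like. It is written so that it compiles as ordinary C without the runtime, so several things are assumptions: ve_kernel is stubbed as an empty macro and placed before the return type (the actual placement is not confirmed here; gema.cve is the authoritative example), gema_sketch and its parameters are hypothetical, and the group index and batch size are passed as arguments pg_ix and num_pgs where a real kernel would read the runtime-provided variables.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in so this compiles as ordinary C; in a .cve file the annotation
 * is assumed to be handled by the runtime's build step. */
#define ve_kernel

/* Batched element-wise addition C = A + B on dim x dim matrices that are
 * stored back to back. Each processing group handles exactly one matrix. */
ve_kernel void gema_sketch(int pg_ix, int num_pgs, size_t dim,
                           const double *a, const double *b, double *c)
{
    assert(pg_ix >= 0 && pg_ix < num_pgs);

    /* Offset of this processing group's matrix within the batch. */
    const size_t offset = (size_t)pg_ix * dim * dim;

    for (size_t i = 0; i < dim * dim; ++i) {
        c[offset + i] = a[offset + i] + b[offset + i];
    }
}
```

Because every processing group only writes to its own matrix, the groups are independent and can run in any order or in parallel.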
The build process is fully automated and supported by CMake. For details, please refer to CMakeLists.txt.
Ideally, use this repository as a scaffolding:
- Clone this repository and run init.sh.
- Replace gema.cve and gemm.cve with your kernels.
- Replace app-test.cc with your application's source.
- Change CMakeLists.txt accordingly.
That's it!
This project uses the following packages:
- VE Offload [2]
- VE UDMA [3]
- NEC's LLVM
- pycparser