Revised documentation (#709)
Revised documentation

* Structured OpenCL backend/libsmm to match CUDA/HIP documentation.
* Split folder specific documentation into separate READMEs.
* Introduction paragraph in User Guide's GPU section.
* Reduced overly structured documentation (sections).
* Reduced clutter in some section titles.
* Removed non-ASCII character.
hfp authored Sep 19, 2023
1 parent 5c596c1 commit 05d4004
Showing 13 changed files with 172 additions and 180 deletions.
14 changes: 11 additions & 3 deletions docs/guide/2-user-guide/4-gpu/index.md
@@ -1,10 +1,18 @@
title: GPUs

-# CUDA/HIP backend and LIBSMM_ACC
+# Introduction

-Users interested to tune kernels for the CUDA/HIP backend, can take a look at the [Developer Guide](../../3-developer-guide/3-programming/2-accelerator-backend/2-libsmm_acc/3-tune.html). Following the guide, [tuned parameters](https://github.com/cp2k/dbcsr/tree/develop/src/acc/libsmm_acc/parameters) can be collected for the desired GPU and potentially submitted for the benefit of others.
+[CP2K](https://github.com/cp2k/cp2k/) was initially enabled for GPUs by means of the DBCSR library. The original development focused on scalability and assumed a `1:1`-relationship between CPUs and GPUs (one CPU-socket drives one GPU). Multi-GPU operation asks for associating CPU-ranks with the closest GPU (affinity), and is usually a departure in terms of algorithms as well (GPU-to-GPU communication). DBCSR associates ranks with GPUs based on a round-robin scheme using the rank-ID, i.e., GPU-affinity is only achieved with the help of the underlying MPI implementation or support from other runtimes. Aggregating GPU acceleration in as few systems as possible is contrary to the original design of DBCSR (and CP2K at that time). CP2K is a versatile toolbox covering a variety of workloads (input language), which imposes several hotspots beyond DBCSR ([status](https://www.cp2k.org/gpu)).

-# OpenCL Backend and OpenCL based LIBSMM
+CP2K or DBCSR can scale to thousands of nodes and further benefit from thread-scalability once communication starts to dominate (due to higher total rank-counts). Thread-scalability (OpenMP) in DBCSR, if not CP2K, is less developed than process-scalability (MPI), i.e., higher rank-counts tend to yield better performance on a smaller number of systems or nodes. With multiple ranks per GPU, context switches and other overhead can negatively impact performance. However, more ranks are needed to best drive the CPU-dominated portion of the code, and hence GPU and in particular multi-GPU acceleration poses a challenge.
+
+CP2K almost exclusively uses double-precision calculations on CPUs and GPUs (along with DBCSR's need for atomic update instructions on GPUs). Consumer-focused GPU offerings often deliver a FLOP-rate ratio between single and double precision of up to `SP:DP = 64:1`, which renders them unsuitable for CP2K, i.e., not beneficial when compared to modestly many CPU cores. Further, GPU acceleration hinges on memory bandwidth rather than compute, which further limits the benefit.
+
+# CUDA/HIP Backend
+
+Users interested in tuning kernels for the CUDA/HIP backend and LIBSMM_ACC can take a look at the [Developer Guide](../../3-developer-guide/3-programming/2-accelerator-backend/2-libsmm_acc/3-tune.html). Following the guide, [tuned parameters](https://github.com/cp2k/dbcsr/tree/develop/src/acc/libsmm_acc/parameters) can be collected for the desired GPU and potentially submitted for the benefit of others.
+
+# OpenCL Backend

This section shows how to auto-tune a kernel for the OpenCL based LIBSMM library. The process builds a stand-alone driver program which is then driven by an [OpenTuner](https://opentuner.org/) based script guiding the auto-tuning of the desired kernel. The [Developer Guide](../../3-developer-guide/3-programming/2-accelerator-backend/4-opencl-libsmm.html) provides more information, e.g., about constraining execution time or parallelizing the tuning-process as well as how to select and tune an entire set of kernels.
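A minimal end-to-end sketch of this workflow, assuming LIBXSMM is prebuilt next to the `dbcsr` directory and an OpenCL runtime is available (the 13x5x7-kernel is just an example; see the auto-tuning documentation for details):

```bash
cd src/acc/opencl && make        # build the backend and the stand-alone drivers
cd smm && pip install -r requirements.txt
./tune_multiply.py 13x5x7        # auto-tune a single kernel (M=13, N=5, K=7)
```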

2 changes: 1 addition & 1 deletion docs/guide/3-developer-guide/2-documentation/index.md
@@ -2,7 +2,7 @@ title: Documentation

# Documentation

## Build

To build the documentation you need [FORD](https://github.com/Fortran-FOSS-Programmers/ford).
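FORD is distributed as a Python package; a minimal sketch of installing it (the FORD invocation itself is covered by the remainder of this guide and not repeated here):

```bash
pip install ford
```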

@@ -1,3 +1,3 @@
-title: LIBSMM (CUDA/HIP)
+title: CUDA/HIP

{!./src/acc/libsmm_acc/README.md!}
@@ -0,0 +1,3 @@
title: Autotune

{!./src/acc/opencl/smm/README-autotune.md!}
@@ -0,0 +1,3 @@
title: Parameters

{!./src/acc/opencl/smm/README-bulktune.md!}
@@ -0,0 +1,5 @@
title: OpenCL

{!./src/acc/opencl/README.md!}

{!./src/acc/opencl/smm/README.md!}

This file was deleted.

This file was deleted.

39 changes: 15 additions & 24 deletions src/acc/README.md
@@ -1,25 +1,18 @@
-# ACCelerator Interfaces
+# ACCelerator Interface

-## Overview
+## Backends

-This folder contains the ISO_C_BINDING based Fortran code of DBCSR's [ACC-backend interface](https://github.com/cp2k/dbcsr/blob/develop/src/acc/acc.h) and [LIBSMM/ACC-interface](https://github.com/cp2k/dbcsr/blob/develop/src/acc/acc_libsmm.h). It also contains the CUDA (for Nvidia GPUs), the HIP (for AMD GPUs), and the OpenCL accelerator backends.
+The accelerator interface (ACC) consists of ISO_C_BINDING based Fortran code for DBCSR's [ACC-backend interface](https://github.com/cp2k/dbcsr/blob/develop/src/acc/acc.h) and [LIBSMM/ACC-interface](https://github.com/cp2k/dbcsr/blob/develop/src/acc/acc_libsmm.h). The interface is implemented by the CUDA (for Nvidia GPUs), HIP (for AMD GPUs), and OpenCL accelerator backends.

-Further, two stand-alone sample codes are given exercising both interfaces (benchmarks).
+The code for both the CUDA and the HIP backend is unified and can be found in the `cuda` directory. At compile-time, either one or the other backend is chosen per macro (`__CUDA` or `__HIP`). Similarly, the code for the OpenCL backend is activated by a build-time macro (`__OPENCL`).

-## CUDA and HIP backends
+## Drivers

-The code for both the CUDA and HIP backends is unified, and can be found in the `cuda` directory.
-At compile-time either one or the other backend is chosen per macro (`__CUDA` or `__HIP`).
+There are two stand-alone sample codes or drivers exercising the ACC-interface. The driver code (depending only on the above-mentioned interfaces) can be built locally and in a rather self-contained fashion, i.e., no DBCSR library is needed (except runtime libraries such as CUDA, HIP, or OpenCL). For OpenCL, the LIBXSMM library is mandatory.

-## OpenCL backend
+To build the driver code, a folder `libxsmm` parallel to DBCSR's root directory (`dbcsr`) is expected to be present and prebuilt (`make GNU=1` in LIBXSMM's root directory). Then change into the respective backend folder (`cuda` or `opencl`) and invoke `make` (`DBG=0|1|2` is supported among other optional key-value pairs).

-The code for both the OpenCL backends is enabled with a build-time macro (`__OPENCL`).
-
-## Benchmarks
-
-Two stand-alone drivers (only depending on above mentioned interfaces) can be built locally and in a rather self-contained fashion, i.e., no DBCSR library is needed (except runtime libraries such as CUDA, HIP, OpenCL/LIBXSMM). For OpenCL, a folder `libxsmm` parallel to DBCSR's root directory (`dbcsr`) is expected to be present and prebuilt (`make` in LIBXSMM's root directory is enough). To build the driver code, change into the respective backend folder (`cuda` or `opencl`), and invoke `make` (`DBG=0|1|2`, and a few other key-value pairs are optional). When building the code is completed, change back into the parent folder and invoke either `acc_bench_trans` or `acc_bench_smm`.
-
-**NOTE**: To activate a certain device, an environment variable `DEVICE` can be used. For example, `DEVICE=1 ./acc_bench_trans` activates the second device (at least two devices must be discovered).
+**NOTE**: To activate a certain device, the drivers consider an environment variable called `DEVICE`. For example, `DEVICE=1 ./acc_bench_trans` activates the second device (at least two devices must be discovered).

The drivers support a few command line options (_nrepeat_, _stack_size_, _m_, _n_, ...). Command line arguments are positional but allow `0` as placeholder to access the default value (`acc_bench_smm 0 0 5 13 5` performs the default number of repetitions with the default stacksize when running the 5x13x5-kernel). For example, running the transpose benchmark may look like:

@@ -36,19 +29,17 @@ errors: 0
For timing, comparison (host code), and validation, LIBXSMM is required. The drivers exercise the respective backend. For example with the CUDA backend:

```bash
-cd cuda
-make DBG=0 WITH_GPU=P100
-cd ..
+cd src/acc/cuda
+make WITH_GPU=P100
+../acc_bench_smm
```

For the OpenCL backend:

```bash
-cd opencl
-make DBG=0
-cd ..
+cd src/acc/opencl
+make
+../acc_bench_smm
```

-In either of the above cases, `acc_bench_trans` and `acc_bench_smm` are built using the respective backends.
-Both driver codes can be instantiated for at least double- and single-precision using a build-time macro (`ELEM_TYPE`).
-Several build-time settings can be made on the build-line (`-D`) or inside of the source files (`acc_bench_trans.c` or `acc_bench_smm.c`).
+In the above cases, `acc_bench_trans` and `acc_bench_smm` are built using the respective backend. Both driver codes can be built for double-precision (default) or single-precision using a build-time macro (`make ELEM_TYPE=float` or `-DELEM_TYPE=float` in general).
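For reference, the steps above combined into one session; a sketch assuming LIBXSMM has been cloned next to the `dbcsr` directory and the OpenCL backend is used:

```bash
# prebuild LIBXSMM in a folder parallel to DBCSR's root directory
make -C libxsmm GNU=1
# build the driver code (DBG and ELEM_TYPE are optional key-value pairs)
cd dbcsr/src/acc/opencl && make DBG=0 ELEM_TYPE=float
# run from the parent folder: default repetitions/stacksize (0 is the
# placeholder), 13x5x7-kernel, second device (DEVICE is zero-based)
cd .. && DEVICE=1 ./acc_bench_smm 0 0 13 5 7
```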
14 changes: 8 additions & 6 deletions src/acc/opencl/README.md
@@ -1,12 +1,14 @@
-# OpenCL Backend
+# Backend

-## Overview
+The OpenCL backend implements the [ACC interface](https://github.com/cp2k/dbcsr/blob/develop/src/acc/acc.h), which is exposed in Fortran and used throughout DBCSR's code base to drive (GPU-)acceleration based on ACC's device enumeration, data movement, and synchronization functionality. By design, DBCSR activates one device per rank (process). For instance, multiple GPUs can be used by means of multiple ranks per system or at least one rank per device. The LIBSMM library complements the backend and implements the [ACC LIBSMM interface](https://github.com/cp2k/dbcsr/blob/develop/src/acc/acc_libsmm.h).

-The OpenCL backend implements the [ACC interface](https://github.com/cp2k/dbcsr/blob/develop/src/acc/acc.h), which is exposed in Fortran and used throughout DBCSR's code base to drive (GPU-)acceleration based on ACC's device enumeration, data movement, and synchronization functionality. By design, DBCSR activates one device per rank (process). For instance, multiple GPUs can be used by the means of multiple ranks per system or at least one rank per device.
+All major GPU vendors support OpenCL even if the vendor-preferred programming model suggests otherwise. On Nvidia GPUs, the OpenCL backend can be used with CUDA based GPU-code in other portions of CP2K. The OpenCL based backend provides the following benefits:

-## Customization
+* Code portability between GPU vendors (if not performance portability). For instance, performance of the OpenCL backend matches or exceeds that of the CUDA backend.
+* Acceptable performance for kernels not covered by specifically tuned parameters, and the ability to run on a GPU even if no tuned parameters are present.
+* Auto-tuning kernels within an acceptable time limit, along with handy scripts to retune parameters and to carry forward an existing set (e.g., to a new GPU).

-Compile-time settings are (implicitly) documented and can be adjusted by editing [acc_opencl.h](https://github.com/cp2k/dbcsr/blob/develop/src/acc/opencl/acc_opencl.h). Runtime settings are made by the means of environment variables. The OpenCL backend provides `acc_getenv.sh` to list all occurrences of `getenv` categorized into "OpenCL Backend environment variables" and "OpenCL LIBSMM environment variables". Common backend related settings are:
+Runtime settings are made by means of environment variables. The OpenCL backend provides `acc_getenv.sh` to list all occurrences of `getenv` categorized into "OpenCL Backend environment variables" and "OpenCL LIBSMM environment variables". Common backend-related settings are:

* `ACC_OPENCL_DEVSPLIT`: integer enabling devices to be split into subdevices (non-zero/default: subdevices, zero: aggregated).
* `ACC_OPENCL_DEVTYPE`: character string selecting "cpu", "gpu", "all" (unfiltered), or any other string (neither CPU nor GPU).
@@ -20,4 +22,4 @@ Compile-time settings are (implicitly) documented and can be adjusted by editing
* `ACC_OPENCL_DUMP=1`: dump preprocessed kernel source code and use it for JIT compilation. Instantiates the original source code using preprocessor definitions (`-D`) and collapses the code accordingly.
* `ACC_OPENCL_DUMP=2`: dump compiled OpenCL kernels (depends on OpenCL implementation), e.g., PTX code on Nvidia.

-The OpenCL backend enumerates and orders devices by device-kind, i.e., GPU, CPU, and "other" (primary criterion) and by memory capacity (secondary criterion). Device IDs are zero-based as defined by the ACC interface (and less than what is permitted/returned by `acc_get_ndevices`).
+The OpenCL backend enumerates and orders devices by kind, i.e., GPU, CPU, and "other" (primary criterion), and by memory capacity (secondary criterion). Device IDs are zero-based as defined by the ACC interface (and less than what is permitted by `acc_get_ndevices`).
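To illustrate the runtime settings listed above, a small sketch (the values are examples; `acc_bench_smm` stands for any application built against the backend):

```bash
export ACC_OPENCL_DEVTYPE=gpu   # consider GPU devices only
export ACC_OPENCL_DEVSPLIT=0    # zero: keep devices aggregated (no subdevices)
ACC_OPENCL_DUMP=1 ./acc_bench_smm   # dump preprocessed kernels used for JIT
```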
52 changes: 52 additions & 0 deletions src/acc/opencl/smm/README-autotune.md
@@ -0,0 +1,52 @@
# Auto Tuning

Auto-tuning code for performance is a practical way to find the "best" setting for parameterized code (e.g., GPU kernels). Introducing effective parameters is a prerequisite, and exploring the (potentially) high-dimensional parameter space in an efficient way is an art. It is desirable to have reasonable defaults even without auto-tuning the parameters. It would be even better to avoid auto-tuning if best performance were possible right away.

For the OpenCL based LIBSMM, a variety of parameters is explored using [OpenTuner](http://opentuner.org/). The script [tune_multiply.py](https://github.com/cp2k/dbcsr/blob/develop/src/acc/opencl/smm/tune_multiply.py) (or tune_multiply.sh) leverages `acc_bench_smm` by parsing its console output (timing, data type, etc.). This way, the tuning is implemented without being intermingled with the subject being tuned. The "communication" between the tuner and the executable is solely based on environment variables.

**NOTE**: If `tune_multiply.py` (or `tune_multiply.sh`) is called with an environment variable already set, the respective parameter (e.g., `OPENCL_LIBSMM_SMM_BM` or `OPENCL_LIBSMM_SMM_BN`) is considered fixed (and not tuned automatically). This way, the parameter space is reduced in size and effort can be directed more intensely towards the remaining parameters.
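For example, a sketch of fixing one parameter via the environment (the value is an arbitrary example):

```bash
# OPENCL_LIBSMM_SMM_BM is fixed (not tuned); remaining parameters are explored
OPENCL_LIBSMM_SMM_BM=8 ./tune_multiply.py 13x5x7
```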

To toggle the benchmarks between tuning single precision (SP) and double precision (DP), `make ELEM_TYPE=float` can be used when building the benchmark drivers (`ELEM_TYPE` can be also directly edited in [acc_bench_smm.c](https://github.com/cp2k/dbcsr/blob/develop/src/acc/acc_bench_smm.c#L26)). Auto-tuned parameters for SP and DP can be embedded into the same final application and are considered correctly at runtime.

To build the benchmarks in double precision (`ELEM_TYPE=double` is default):

```bash
cd src/acc/opencl
make
```

To build the benchmarks in single precision (SP):

```bash
cd src/acc/opencl
make ELEM_TYPE=float
```

To auto-tune, please install the Python `wheel` and `opentuner` packages:

```bash
cd src/acc/opencl/smm
pip install -r requirements.txt
```

The OpenTuner script supports several command line arguments (`tune_multiply.py --help`). For example, `--stop-after=300` can be of interest to finish in five minutes (without limit, OpenTuner decides when the auto-tuning process is finished). A single kernel can be selected by the M, N, and K parameters (GEMM), e.g., `M=13`, `N=5`, and `K=7`:

```bash
./tune_multiply.py 13x5x7
```
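The options mentioned in this guide can be combined; a sketch (parameter values are arbitrary examples):

```bash
# finish after five minutes, starting from specific parameters
# (-bs/-bm/-bn correspond to OPENCL_LIBSMM_SMM_BS/BM/BN)
./tune_multiply.py 13x5x7 --stop-after=300 -bs 64 -bm 13 -bn 1
```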

**NOTE**: If multiple different kernels are tuned using `tune_multiply.py`, it is advisable to delete the `opentuner.db` directory prior to tuning a different kernel since otherwise auto-tuning is potentially (mis-)guided by information which was collected for a different kernel (`tune_multiply.sh` does this automatically).

The OpenTuner script implements multiple objectives ("cost"), primarily "accuracy" (maximized) and a secondary objective "size" (minimized). The former represents the achieved performance (GFLOP/s) while the latter represents an artificial kernel requirement (just to prefer one parameter set over another in case of similar performance). The console output looks like ("accuracy" denotes performance in GFLOP/s):

```text
[ 15s] INFO opentuner.search.plugin.DisplayPlugin: tests=8, best {'BS': 32, 'BM': 6, 'BN': 1}, cost accuracy=28.80000000, size=1.0, found by UniformGreedyMutation
[ 27s] INFO opentuner.search.plugin.DisplayPlugin: tests=19, best {'BS': 48, 'BM': 8, 'BN': 1}, cost accuracy=32.20000000, size=1.0, found by UniformGreedyMutation
[ 40s] INFO opentuner.search.plugin.DisplayPlugin: tests=31, best {'BS': 48, 'BM': 8, 'BN': 1}, cost accuracy=32.20000000, size=1.0, found by UniformGreedyMutation
[ 54s] INFO opentuner.search.plugin.DisplayPlugin: tests=43, best {'BS': 48, 'BM': 8, 'BN': 1}, cost accuracy=32.20000000, size=1.0, found by UniformGreedyMutation
[ 67s] INFO opentuner.search.plugin.DisplayPlugin: tests=53, best {'BS': 48, 'BM': 8, 'BN': 1}, cost accuracy=32.20000000, size=1.0, found by UniformGreedyMutation
```

The script finally writes a JSON-file with a filename like `tune_multiply-float-12x12x12-s15-60gflops.json`, which encodes the benchmark ("multiply"), the precision ("float"), the kernel ("12x12x12"), the number of bits necessary to represent the size of the problem, i.e., log2 of the problem-size ("s15"), and the achieved performance ("60gflops"). The script handles SIGINT (like Ctrl-C), and output is still written despite terminating abnormally (which can be exploited to tune interactively). Tuning starts from an internal default that is supposed to match LIBSMM's internal default parameters. However, tuning can be (re-)started with specific parameters (e.g., `-bs 64`, `-bm 13`, `-bn 1` for `OPENCL_LIBSMM_SMM_BS`, `OPENCL_LIBSMM_SMM_BM`, and `OPENCL_LIBSMM_SMM_BN`, respectively), or partially fixed for a subset of parameters.

**NOTE**: The `acc_bench_smm` executable is potentially started many times when auto-tuning parameters; therefore, it is advisable to keep the state of the GPU driver stack persistent (if the setup would otherwise unload the driver configuration), e.g., `nvidia-smi -pm ENABLED`. This can happen in cases where the GPU is used only for compute and not for graphics (no X-Window system, e.g., a "headless" system). The time needed for tuning parameters is impacted not only by accessing and readying the device, but also by the time needed to compile a kernel at runtime, a.k.a. just-in-time (JIT) compilation.