Commit: Update readme

tom91136 committed Aug 19, 2024
1 parent d940286 commit c8209b1

Showing 3 changed files with 177 additions and 31 deletions.
6 changes: 3 additions & 3 deletions CHANGELOG.md
# Changelog
All notable changes to this project will be documented in this file.

## [v2.0] - 2024-08-19

### Added
- CI via GitHub Actions
- Flag for machine-readable CSV output
- Flag for toggling optional energy output to file
- Executable now embeds compile commands used at build-time
- New models: C++ std, C++20 std, Intel TBB, RAJA, Thrust, serial
- Context allocation and transfer (if required) is now measured
- Added optional `cpu_feature` library, can be disabled at build-time

### Changed
- Human-readable output now uses YAML format
- Renamed parameter `NUM_TD_PER_THREAD` to `PPWI` for all implementations
- Consolidated builds to use a shared CMake script, Makefiles removed
- All implementations now share a common C++ driver with device selection based on index or name substrings
18 changes: 0 additions & 18 deletions CITATION.cff

This file was deleted.

184 changes: 174 additions & 10 deletions README.md
# miniBUDE

This mini-app is an implementation of the core computation of the Bristol University Docking
Engine (BUDE) in different HPC programming models.
The benchmark is a virtual screening run of the NDM-1 protein and runs the energy evaluation for a
single generation of poses repeatedly, for a configurable number of iterations.
Increasing the iteration count has similar performance effects to docking multiple ligands
back-to-back in a production BUDE docking run.

> [!NOTE]
> The miniBUDE version 20210901 used
> in [OpenBenchmarking](https://openbenchmarking.org/test/pts/minibude)
> for multiple Phoronix articles and in
> Intel [slides](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Intel-Ponte-Vecchio-Performance-miniBUDE.jpg)
> is based on the v1 branch. The main branch contains an identical kernel but with a unified driver and
> improved build system.

## Structure

The top-level `data` directory contains the input common to all implementations.
The top-level `makedeck` directory contains an input deck generation program and a set of mol2/bhff
input files.
Each other subdirectory in `src` contains a separate C/C++ implementation.

## Building

Drivers, compilers, and software applicable to whichever implementation you would like to build
against are required.
The only build-system requirement is CMake; no other software dependencies are required.

### CMake

The project supports building with CMake >= 3.14.0, which can be installed without root via
the [official script](https://cmake.org/download/).
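
As a rough sketch of how such a root-less install can look with one of the portable installer scripts from that page (the version, paths, and flags below are illustrative assumptions, not project requirements):

```shell
# Illustrative only: fetch a portable installer and install under $HOME without root
mkdir -p "$HOME/.local"
wget https://github.com/Kitware/CMake/releases/download/v3.27.9/cmake-3.27.9-linux-x86_64.sh
sh cmake-3.27.9-linux-x86_64.sh --prefix="$HOME/.local" --exclude-subdir --skip-license
export PATH="$HOME/.local/bin:$PATH"   # make the newly installed cmake visible
```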

Each miniBUDE implementation (programming model) is built as follows:
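
A minimal sketch of the usual out-of-tree configure-and-build flow, using `omp` purely as an example model (extra per-model options may also be needed; see the per-model notes later in this README):

```shell
# Sketch: configure one model, then build it
cmake -Bbuild -H. -DMODEL=omp
cmake --build build -j
```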

The `MODEL` option selects one implementation of miniBUDE to build.
The source for each model's implementation is located in `./src/<model>`.

Currently available models are:

```
omp;ocl;std-indices;std-ranges;hip;cuda;kokkos;sycl;acc;raja;tbb;thrust
```

## Running

By default, the following PPWI sizes are compiled: `1,2,4,8,16,32,64,128`.
This is a templated compile-time size so the virtual screen kernel is compiled and unrolled for each
PPWI value.
Certain sizes, such as 64, exploit wide vector lengths on platforms that support them (e.g. AVX-512).

To run with the default options, run the binary without any flags.
To adjust the run time, use `-i` to set the number of iterations.
For very short runs, e.g. for simulation, use `-n 1024` to reduce the number of poses.
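
For instance, a quick sanity-check run with the OpenMP binary could look like the following (the iteration and pose counts are illustrative only):

```shell
# Sketch: a single iteration over a reduced pose count for fast turnaround
./omp-bude -i 1 -n 1024
```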

More than one `ppwi` and `wgsize` may be specified on models that support this.
When given, miniBUDE will auto-tune across all combinations of `ppwi` and `wgsize` and print the
best solution at the end.
A heatmap may be generated from this output; see `heatmap.py`.
Currently, the following models support this scenario: Kokkos, RAJA, CUDA/HIP, OpenCL, SYCL,
OpenMP target, and OpenACC.
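
As a sketch, an auto-tuning sweep over all compiled PPWI values with machine-readable output might be launched like this (the binary name is just the OpenMP example used elsewhere in this README):

```shell
# Sketch: try every compiled PPWI value and emit CSV, e.g. for heatmap.py
./omp-bude -p all --csv
```

A default single-configuration run prints a YAML report like the following: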

```shell
> ./omp-bude
miniBUDE:
  compile_commands:
    - ...
  vcs:
    ...
  host_cpu:
    ~
  time: ...
  deck:
    path: "../data/bm1"
    poses: 65536
    proteins: 938
    ligands: 26
    forcefields: 34
  config:
    iterations: 8
    poses: 65536
    ppwi:
      available: [1,2,4,8,16,32,64,128]
      selected: [1]
    wgsize: [1]
  device: { index: 0, name: "OMP CPU" }
  # (ppwi=1,wgsize=1,valid=1)
  results:
    - outcome: { valid: true, max_diff_%: 0.002 }
      param: { ppwi: 1, wgsize: 1 }
      raw_iterations: [410.365,467.623,498.332,416.583,465.063,469.426,473.833,440.093,461.504,455.871]
      context_ms: 6.184589
      sum_ms: 3680.705
      avg_ms: 460.088
      min_ms: 416.583
      max_ms: 498.332
      stddev_ms: 22.571
      giga_interactions/s: 3.474
      gflop/s: 139.033
      gfinst/s: 86.847
      energies:
        - 865.52
        - 25.07
        - 368.43
        - 14.67
        - 574.99
        - 707.35
        - 33.95
        - 135.59
  best: { min_ms: 416.58, max_ms: 498.33, sum_ms: 3680.71, avg_ms: 460.09, ppwi: 1, wgsize: 1 }
```

For reference, the available command-line options are:

```shell
> ./omp-bude --help
Usage: ./bude [COMMAND|OPTIONS]

Commands:
  help  -h --help     Print this message
  list  -l --list     List available devices

Options:
  -d --device INDEX   Select device at INDEX from output of --list, performs a substring match of device names if INDEX is not an integer
                      [optional] default=0
  -i --iter   I       Repeat kernel I times
                      [optional] default=8
  -n --poses  N       Compute energies for only N poses, use 0 for deck max
                      [optional] default=0
  -p --ppwi   PPWI    A CSV list of poses per work-item for the kernel, use `all` for everything
                      [optional] default=1; available=1,2,4,8,16,32,64,128
  -w --wgsize WGSIZE  A CSV list of work-group sizes, not all implementations support this parameter
                      [optional] default=1
     --deck   DIR     Use the DIR directory as input deck
                      [optional] default=`../data/bm1`
  -o --out    PATH    Save resulting energies to PATH (no-op if more than one PPWI/WGSIZE specified)
                      [optional]
  -r --rows   N       Output first N row(s) of energy values as part of the on-screen result
                      [optional] default=8
     --csv            Output results in CSV format
                      [optional] default=false

```
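
For example, listing devices and then selecting one by name substring on an accelerator model might look like the sketch below (the `cuda-bude` binary name and the `A100` substring are illustrative assumptions):

```shell
# Sketch: enumerate devices, then select one by substring match
./cuda-bude --list
./cuda-bude -d "A100" -p 2
```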
#### Overriding default flags

By default, we have defined a set of optimal flags for known HPC compilers.
These are assigned to `RELEASE_FLAGS`, and you can override them if required.
To find out what flags each model supports or requires, simply configure while only specifying the
model.
For example:
```shell
> cd miniBUDE
> cmake -Bbuild -H. -DMODEL=omp
No CMAKE_BUILD_TYPE specified, defaulting to 'Release'
```

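As a hypothetical sketch, assuming `RELEASE_FLAGS` behaves as a regular semicolon-separated CMake cache variable that can be set at configure time (the flag values here are illustrative), an override could look like:

```shell
# Sketch only: replace the default optimisation flags at configure time
cmake -Bbuild -H. -DMODEL=omp "-DRELEASE_FLAGS=-O2;-march=native"
cmake --build build -j
```
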
Two input decks are included in this repository:
* `bm1` is a short benchmark (~100 ms/iteration on a 64-core ThunderX2 node) based on a small
  ligand (26 atoms)
* `bm2` is a long benchmark (~25 s/iteration on a 64-core ThunderX2 node) based on a big
  ligand (2672 atoms)
* `bm2_long` is a very long benchmark based on `bm2` but with 1048576 poses instead of 65536
They are located in the [`data`](data/) directory, and `bm1` is run by default.
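
For example, running the longer `bm2` deck instead of the default might look like this (a sketch; the relative path assumes the binary is launched from a build directory next to `data`, as in the sample output above):

```shell
# Sketch: point the benchmark at the bm2 input deck
./omp-bude --deck ../data/bm2 -i 8
```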
See [`makedeck`](makedeck/) for how to generate additional input decks.
Please cite miniBUDE using the following reference:
> Andrei Poenaru, Wei-Chen Lin, and Simon McIntosh-Smith. 2021. A Performance Analysis of Modern
> Parallel Programming Models Using a Compute-Bound Application. In High Performance Computing: 36th
> International Conference, ISC High Performance 2021, Virtual Event, June 24 – July 2, 2021,
> Proceedings. Springer-Verlag, Berlin, Heidelberg,
> 332–350. https://doi.org/10.1007/978-3-030-78713-4_18

```latex
@inproceedings{poenaru2021performance,
  title={A performance analysis of modern parallel programming models using a compute-bound application},
  author={Poenaru, Andrei and Lin, Wei-Chen and McIntosh-Smith, Simon},
  booktitle={International Conference on High Performance Computing},
  pages={332--350},
  year={2021},
  organization={Springer}
}
```
For the Julia port specifically: https://doi.org/10.1109/PMBS54543.2021.00016
```latex
@inproceedings{lin2021julia,
  title={Comparing {Julia} to performance portable parallel programming models for {HPC}},
  author={Lin, Wei-Chen and McIntosh-Smith, Simon},
  booktitle={2021 International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)},
  pages={94--105},
  year={2021},
  organization={IEEE}
}
```
For the ISO C++ port specifically: https://doi.org/10.1109/PMBS56514.2022.00009
```latex
@inproceedings{lin2022cpp,
  title={Evaluating {ISO} {C++} parallel algorithms on heterogeneous {HPC} systems},
  author={Lin, Wei-Chen and Deakin, Tom and McIntosh-Smith, Simon},
  booktitle={2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)},
  pages={36--47},
  year={2022},
  organization={IEEE}
}
```
