From c8209b16178990a449179ee53dafad1776d57978 Mon Sep 17 00:00:00 2001
From: Tom Lin
Date: Mon, 19 Aug 2024 12:15:43 +0100
Subject: [PATCH] Update readme

---
 CHANGELOG.md |   6 +-
 CITATION.cff |  18 -----
 README.md    | 184 ++++++++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 177 insertions(+), 31 deletions(-)
 delete mode 100644 CITATION.cff

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 58c3de7..619f896 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,7 +1,7 @@
 # Changelog
 All notable changes to this project will be documented in this file.

-## [v2.0] - 2022-??-??
+## [v2.0] - 2024-08-19

 ### Added
 - CI via GitHub Actions
@@ -12,12 +12,12 @@ All notable changes to this project will be documented in this file.
 - Flag for machine-readable CSV output
 - Flag for toggling optional energy output to file
 - Executable now embeds compile commands used at build-time
-- New models: C++ std, C++20 std, Intel TBB
+- New models: C++ std, C++20 std, Intel TBB, RAJA, Thrust, serial
 - Context allocation and transfer (if required) is now measured
 - Added optional `cpu_feature` library, can be disabled at build-time

 ### Changed
-- Human readable output now uses YAML format
+- Human-readable output now uses YAML format
 - Renamed parameter `NUM_TD_PER_THREAD` to `PPWI` for all implementations
 - Consolidated builds to use a shared CMake script, Makefiles removed
 - All implementations now share a common C++ driver with device selection based on index or name substrings

diff --git a/CITATION.cff b/CITATION.cff
deleted file mode 100644
index a774da6..0000000
--- a/CITATION.cff
+++ /dev/null
@@ -1,18 +0,0 @@
-cff-version: 1.1.0
-message: If you use this software, please cite it as below.
-authors:
-  - family-names: Poenaru
-    given-names: Andrei
-    affiliation: University of Bristol
-    website: https://github.com/andreipoe
-  - family-names: Lin
-    given-names: Wei-Chen
-    affiliation: University of Bristol
-    website: https://github.com/tom91136
-  - family-names: McIntosh-Smith
-    given-names: Simon
-    affiliation: University of Bristol
-    website: https://uob-hpc.github.io
-title: miniBUDE
-version: 2.0
-date-released: 2022-02-16

diff --git a/README.md b/README.md
index ea3e66f..49cc4cc 100644
--- a/README.md
+++ b/README.md
@@ -1,22 +1,38 @@
# miniBUDE

This mini-app is an implementation of the core computation of the Bristol University Docking Engine (BUDE) in different HPC programming models.
The benchmark is a virtual screening run of the NDM-1 protein and runs the energy evaluation for a single generation of poses repeatedly, for a configurable number of iterations.
Increasing the iteration count has similar performance effects to docking multiple ligands back-to-back in a production BUDE docking run.

> [!NOTE]
> miniBUDE version 20210901, used in [OpenBenchmarking](https://openbenchmarking.org/test/pts/minibude)
> for multiple Phoronix articles and in
> Intel [slides](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Intel-Ponte-Vecchio-Performance-miniBUDE.jpg),
> is on the v1 branch.
> The main branch contains an identical kernel but with a unified driver and improved build system.
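If you want to reproduce those published results, the sketch below shows one way to check out the matching branch. It assumes the repository lives at `github.com/UoB-HPC/miniBUDE`; that URL is not stated in this patch, so substitute wherever your copy is hosted.

```shell
# Assumed repository location; substitute your own fork or mirror if needed.
git clone https://github.com/UoB-HPC/miniBUDE.git
cd miniBUDE

# v1 holds the 20210901 benchmark used by OpenBenchmarking;
# main carries the same kernel with the unified driver and CMake build.
git checkout v1
```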
## Structure

The top-level `data` directory contains the input common to all implementations.
The top-level `makedeck` directory contains an input deck generation program and a set of mol2/bhff input files.
Each other subdirectory in `src` contains a separate C/C++ implementation.

## Building

Drivers, compilers, and software applicable to whichever implementation you would like to build against are required.
The only build-system requirement is CMake; no other software dependencies are needed.

### CMake

The project supports building with CMake >= 3.14.0, which can be installed without root via the [official script](https://cmake.org/download/).

Each miniBUDE implementation (programming model) is built as follows:

@@ -38,16 +54,119 @@ The `MODEL` option selects one implementation of miniBUDE to build.
The source for each model's implementation is located in `./src/`.
Currently available models are:

```
omp;ocl;std-indices;std-ranges;hip;cuda;kokkos;sycl;acc;raja;tbb;thrust
```

## Running

By default, the following PPWI sizes are compiled: `1,2,4,8,16,32,64,128`.
PPWI is a templated compile-time size, so the virtual screening kernel is compiled and unrolled for each PPWI value.
Certain sizes, such as 64, exploit wide vector lengths on platforms that support them (e.g. AVX-512).

To run with the default options, run the binary without any flags.
To adjust the run time, use `-i` to set the number of iterations.
For very short runs, e.g. for simulation, use `-n 1024` to reduce the number of poses.

More than one `ppwi` and `wgsize` may be specified on models that support this.
When given, miniBUDE will auto-tune all combinations of `ppwi` and `wgsize` and print the best combination at the end.
A heatmap may be generated from this output; see `heatmap.py`.
Currently, the following models support this scenario: Kokkos, RAJA, CUDA, HIP, OpenCL, SYCL, OpenMP target, and OpenACC.
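As an illustration only (this invocation does not appear in the patch), an auto-tuning sweep might look like the following. The `cuda-bude` binary name is assumed from the `<model>-bude` pattern used elsewhere in this README, and the work-group sizes are arbitrary example values; the `--ppwi`, `--wgsize`, and `-i` flags are documented in the help output further down.

```shell
# Sweep every compiled PPWI value against a few candidate work-group sizes.
# miniBUDE prints one result entry per combination and reports the fastest
# combination in the final `best:` line of the YAML output.
> ./cuda-bude --ppwi all --wgsize 64,128,256 -i 8
```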
For example, a default run of the OpenMP implementation produces YAML output similar to the following:

```shell
> ./omp-bude
miniBUDE:
compile_commands:
  - ...
vcs:
  ...
host_cpu:
  ~
time: ...
deck:
  path: "../data/bm1"
  poses: 65536
  proteins: 938
  ligands: 26
  forcefields: 34
config:
  iterations: 8
  poses: 65536
  ppwi:
    available: [1,2,4,8,16,32,64,128]
    selected: [1]
  wgsize: [1]
device: { index: 0, name: "OMP CPU" }
# (ppwi=1,wgsize=1,valid=1)
results:
  - outcome: { valid: true, max_diff_%: 0.002 }
    param: { ppwi: 1, wgsize: 1 }
    raw_iterations: [410.365,467.623,498.332,416.583,465.063,469.426,473.833,440.093,461.504,455.871]
    context_ms: 6.184589
    sum_ms: 3680.705
    avg_ms: 460.088
    min_ms: 416.583
    max_ms: 498.332
    stddev_ms: 22.571
    giga_interactions/s: 3.474
    gflop/s: 139.033
    gfinst/s: 86.847
    energies:
      - 865.52
      - 25.07
      - 368.43
      - 14.67
      - 574.99
      - 707.35
      - 33.95
      - 135.59
best: { min_ms: 416.58, max_ms: 498.33, sum_ms: 3680.71, avg_ms: 460.09, ppwi: 1, wgsize: 1 }
```

For reference, the available command-line options are:

```shell
> ./omp-bude --help
Usage: ./bude [COMMAND|OPTIONS]

Commands:
  help    -h --help      Print this message
  list    -l --list      List available devices
Options:
  -d --device INDEX      Select device at INDEX from output of --list, performs a substring match of device names if INDEX is not an integer
                         [optional] default=0
  -i --iter   I          Repeat kernel I times
                         [optional] default=8
  -n --poses  N          Compute energies for only N poses, use 0 for deck max
                         [optional] default=0
  -p --ppwi   PPWI       A CSV list of poses per work-item for the kernel, use `all` for everything
                         [optional] default=1; available=1,2,4,8,16,32,64,128
  -w --wgsize WGSIZE     A CSV list of work-group sizes, not all implementations support this parameter
                         [optional] default=1
     --deck   DIR        Use the DIR directory as input deck
                         [optional] default=`../data/bm1`
  -o --out    PATH       Save resulting energies to PATH (no-op if more than one PPWI/WGSIZE specified)
                         [optional]
  -r --rows   N          Output first N row(s) of energy values as part of the on-screen result
                         [optional] default=8
     --csv               Output results in CSV format
                         [optional] default=false
```

#### Overriding default flags

By default, we have defined a set of optimal flags for known HPC compilers.
These are assigned to `RELEASE_FLAGS`, and you can override them if required.

To find out what flags each model supports or requires, simply configure while specifying only the model.
For example:

```shell
> cd miniBUDE
> cmake -Bbuild -H. -DMODEL=omp
```

@@ -91,8 +210,11 @@ No CMAKE_BUILD_TYPE specified, defaulting to 'Release'

Three input decks are included in this repository:

* `bm1` is a short benchmark (~100 ms/iteration on a 64-core ThunderX2 node) based on a small ligand (26 atoms)
* `bm2` is a long benchmark (~25 s/iteration on a 64-core ThunderX2 node) based on a big ligand (2672 atoms)
* `bm2_long` is a very long benchmark based on `bm2` but with 1048576 poses instead of 65536

They are located in the [`data`](data/) directory, and `bm1` is run by default.
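For example (the deck path and output file name below are illustrative), `--deck` selects a different deck and `-o` saves the computed energies:

```shell
# Run 4 iterations of the larger bm2 deck and save the resulting energies.
# The relative path assumes the same working directory as the default ../data/bm1.
> ./omp-bude --deck ../data/bm2 -i 4 -o energies.out
```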
@@ -103,4 +225,46 @@ See [`makedeck`](makedeck/) for how to generate additional input decks.

Please cite miniBUDE using the following reference:

-> Andrei Poenaru, Wei-Chen Lin and Simon McIntosh-Smith. ‘A Performance Analysis of Modern Parallel Programming Models Using a Compute-Bound Application’. In: 36th International Conference, ISC High Performance 2021. Frankfurt, Germany, 2021. In press.

```latex
@inproceedings{poenaru2021performance,
  title={A performance analysis of modern parallel programming models using a compute-bound application},
  author={Poenaru, Andrei and Lin, Wei-Chen and McIntosh-Smith, Simon},
  booktitle={International Conference on High Performance Computing},
  pages={332--350},
  year={2021},
  organization={Springer}
}
```

> Andrei Poenaru, Wei-Chen Lin, and Simon McIntosh-Smith. 2021. A Performance Analysis of Modern
> Parallel Programming Models Using a Compute-Bound Application. In High Performance Computing: 36th
> International Conference, ISC High Performance 2021, Virtual Event, June 24 – July 2, 2021,
> Proceedings. Springer-Verlag, Berlin, Heidelberg,
> 332–350. https://doi.org/10.1007/978-3-030-78713-4_18

For the Julia port specifically: https://doi.org/10.1109/PMBS54543.2021.00016

```latex
@inproceedings{lin2021julia,
  title={Comparing julia to performance portable parallel programming models for hpc},
  author={Lin, Wei-Chen and McIntosh-Smith, Simon},
  booktitle={2021 International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)},
  pages={94--105},
  year={2021},
  organization={IEEE}
}
```

For the ISO C++ port specifically: https://doi.org/10.1109/PMBS56514.2022.00009

```latex
@inproceedings{lin2022cpp,
  title={Evaluating iso c++ parallel algorithms on heterogeneous hpc systems},
  author={Lin, Wei-Chen and Deakin, Tom and McIntosh-Smith, Simon},
  booktitle={2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)},
  pages={36--47},
  year={2022},
  organization={IEEE}
}
```