Commit: Update readme

tom91136 committed Aug 19, 2024
1 parent d940286 commit c8209b1

Showing 3 changed files with 177 additions and 31 deletions.
6 changes: 3 additions & 3 deletions CHANGELOG.md
# Changelog
All notable changes to this project will be documented in this file.

## [v2.0] - 2024-08-19

### Added
- CI via GitHub Actions
- Flag for machine-readable CSV output
- Flag for toggling optional energy output to file
- Executable now embeds compile commands used at build-time
- New models: C++ std, C++20 std, Intel TBB, RAJA, Thrust, serial
- Context allocation and transfer (if required) is now measured
- Added optional `cpu_feature` library, can be disabled at build-time

### Changed
- Human-readable output now uses YAML format
- Renamed parameter `NUM_TD_PER_THREAD` to `PPWI` for all implementations
- Consolidated builds to use a shared CMake script, Makefiles removed
- All implementations now share a common C++ driver with device selection based on index or name substrings
18 changes: 0 additions & 18 deletions CITATION.cff

This file was deleted.

184 changes: 174 additions & 10 deletions README.md
# miniBUDE

This mini-app is an implementation of the core computation of the Bristol University Docking
Engine (BUDE) in different HPC programming models.
The benchmark is a virtual screening run of the NDM-1 protein and runs the energy evaluation for a
single generation of poses repeatedly, for a configurable number of iterations.
Increasing the iteration count has similar performance effects to docking multiple ligands
back-to-back in a production BUDE docking run.

> [!NOTE]
> The miniBUDE version 20210901 used
> in [OpenBenchmarking](https://openbenchmarking.org/test/pts/minibude)
> for multiple Phoronix articles and in
> Intel [slides](https://www.servethehome.com/wp-content/uploads/2022/08/HC34-Intel-Ponte-Vecchio-Performance-miniBUDE.jpg)
> is based on the v1 branch. The main branch contains an identical kernel but with a unified driver and
> improved build system.

## Structure

The top-level `data` directory contains the input common to all implementations.
The top-level `makedeck` directory contains an input deck generation program and a set of mol2/bhff
input files.
Each other subdirectory in `src` contains a separate C/C++ implementation.

## Building

Drivers, compilers, and software applicable to whichever implementation you would like to build
against are required.
The only build-system requirement is CMake; no other software dependencies are required.

### CMake

The project supports building with CMake >= 3.14.0, which can be installed without root via
the [official script](https://cmake.org/download/).
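
As a rough sketch of how such a root-less install can look with one of the portable installer scripts from that page (the version, paths, and flags below are illustrative assumptions, not project requirements):

```shell
# Illustrative only: fetch a portable installer and install under $HOME without root
mkdir -p "$HOME/.local"
wget https://github.com/Kitware/CMake/releases/download/v3.27.9/cmake-3.27.9-linux-x86_64.sh
sh cmake-3.27.9-linux-x86_64.sh --prefix="$HOME/.local" --exclude-subdir --skip-license
export PATH="$HOME/.local/bin:$PATH"   # make the newly installed cmake visible
```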

Each miniBUDE implementation (programming model) is built as follows:
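
A minimal sketch of the usual out-of-tree configure-and-build flow, using `omp` purely as an example model (extra per-model options may also be needed; see the per-model notes later in this README):

```shell
# Sketch: configure one model, then build it
cmake -Bbuild -H. -DMODEL=omp
cmake --build build -j
```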

The `MODEL` option selects one implementation of miniBUDE to build.
The source for each model's implementation is located in `./src/<model>`.

Currently available models are:

```
omp;ocl;std-indices;std-ranges;hip;cuda;kokkos;sycl;acc;raja;tbb;thrust
```

## Running

By default, the following PPWI sizes are compiled: `1,2,4,8,16,32,64,128`.
This is a templated compile-time size so the virtual screen kernel is compiled and unrolled for each
PPWI value.
Certain sizes, such as 64, exploit wide vector lengths on platforms that support them (e.g. AVX-512).

To run with the default options, run the binary without any flags.
To adjust the run time, use `-i` to set the number of iterations.
For very short runs, e.g. for simulation, use `-n 1024` to reduce the number of poses.
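
For instance, a quick sanity-check run with the OpenMP binary could look like the following (the iteration and pose counts are illustrative only):

```shell
# Sketch: a single iteration over a reduced pose count for fast turnaround
./omp-bude -i 1 -n 1024
```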

More than one `ppwi` and `wgsize` may be specified on models that support this.
When given, miniBUDE will auto-tune across all combinations of `ppwi` and `wgsize` and print the
best solution at the end.
A heatmap may be generated from this output; see `heatmap.py`.
Currently, the following models support this scenario: Kokkos, RAJA, CUDA/HIP, OpenCL, SYCL,
OpenMP target, and OpenACC.
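
As a sketch, an auto-tuning sweep over all compiled PPWI values with machine-readable output might be launched like this (the binary name is just the OpenMP example used elsewhere in this README):

```shell
# Sketch: try every compiled PPWI value and emit CSV, e.g. for heatmap.py
./omp-bude -p all --csv
```

A default single-configuration run prints a YAML report like the following: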

```shell
> ./omp-bude
miniBUDE:
  compile_commands:
    - ...
  vcs:
    ...
  host_cpu:
    ~
  time: ...
  deck:
    path: "../data/bm1"
    poses: 65536
    proteins: 938
    ligands: 26
    forcefields: 34
  config:
    iterations: 8
    poses: 65536
    ppwi:
      available: [1,2,4,8,16,32,64,128]
      selected: [1]
    wgsize: [1]
  device: { index: 0, name: "OMP CPU" }
  # (ppwi=1,wgsize=1,valid=1)
  results:
    - outcome: { valid: true, max_diff_%: 0.002 }
      param: { ppwi: 1, wgsize: 1 }
      raw_iterations: [410.365,467.623,498.332,416.583,465.063,469.426,473.833,440.093,461.504,455.871]
      context_ms: 6.184589
      sum_ms: 3680.705
      avg_ms: 460.088
      min_ms: 416.583
      max_ms: 498.332
      stddev_ms: 22.571
      giga_interactions/s: 3.474
      gflop/s: 139.033
      gfinst/s: 86.847
      energies:
        - 865.52
        - 25.07
        - 368.43
        - 14.67
        - 574.99
        - 707.35
        - 33.95
        - 135.59
  best: { min_ms: 416.58, max_ms: 498.33, sum_ms: 3680.71, avg_ms: 460.09, ppwi: 1, wgsize: 1 }
```

For reference, the available command-line options are:

```shell
> ./omp-bude --help
Usage: ./bude [COMMAND|OPTIONS]

Commands:
  help  -h --help     Print this message
  list  -l --list     List available devices

Options:
  -d --device INDEX   Select device at INDEX from output of --list, performs a substring match of device names if INDEX is not an integer
                      [optional] default=0
  -i --iter   I       Repeat kernel I times
                      [optional] default=8
  -n --poses  N       Compute energies for only N poses, use 0 for deck max
                      [optional] default=0
  -p --ppwi   PPWI    A CSV list of poses per work-item for the kernel, use `all` for everything
                      [optional] default=1; available=1,2,4,8,16,32,64,128
  -w --wgsize WGSIZE  A CSV list of work-group sizes, not all implementations support this parameter
                      [optional] default=1
     --deck   DIR     Use the DIR directory as input deck
                      [optional] default=`../data/bm1`
  -o --out    PATH    Save resulting energies to PATH (no-op if more than one PPWI/WGSIZE specified)
                      [optional]
  -r --rows   N       Output first N row(s) of energy values as part of the on-screen result
                      [optional] default=8
     --csv            Output results in CSV format
                      [optional] default=false

```
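
For example, listing devices and then selecting one by name substring on an accelerator model might look like the sketch below (the `cuda-bude` binary name and the `A100` substring are illustrative assumptions):

```shell
# Sketch: enumerate devices, then select one by substring match
./cuda-bude --list
./cuda-bude -d "A100" -p 2
```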
#### Overriding default flags

By default, we have defined a set of optimal flags for known HPC compilers.
These are assigned to `RELEASE_FLAGS`, and you can override them if required.
To find out what flags each model supports or requires, simply configure while only specifying the
model.
For example:
```shell
> cd miniBUDE
> cmake -Bbuild -H. -DMODEL=omp
No CMAKE_BUILD_TYPE specified, defaulting to 'Release'
```

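As a hypothetical sketch, assuming `RELEASE_FLAGS` behaves as a regular semicolon-separated CMake cache variable that can be set at configure time (the flag values here are illustrative), an override could look like:

```shell
# Sketch only: replace the default optimisation flags at configure time
cmake -Bbuild -H. -DMODEL=omp "-DRELEASE_FLAGS=-O2;-march=native"
cmake --build build -j
```
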
Two input decks are included in this repository:
* `bm1` is a short benchmark (~100 ms/iteration on a 64-core ThunderX2 node) based on a small
  ligand (26 atoms)
* `bm2` is a long benchmark (~25 s/iteration on a 64-core ThunderX2 node) based on a big
  ligand (2672 atoms)
* `bm2_long` is a very long benchmark based on `bm2` but with 1048576 poses instead of 65536
They are located in the [`data`](data/) directory, and `bm1` is run by default.
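
For example, running the longer `bm2` deck instead of the default might look like this (a sketch; the relative path assumes the binary is launched from a build directory next to `data`, as in the sample output above):

```shell
# Sketch: point the benchmark at the bm2 input deck
./omp-bude --deck ../data/bm2 -i 8
```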
See [`makedeck`](makedeck/) for how to generate additional input decks.
Please cite miniBUDE using the following reference:
> Andrei Poenaru, Wei-Chen Lin, and Simon McIntosh-Smith. 2021. A Performance Analysis of Modern
> Parallel Programming Models Using a Compute-Bound Application. In High Performance Computing: 36th
> International Conference, ISC High Performance 2021, Virtual Event, June 24 – July 2, 2021,
> Proceedings. Springer-Verlag, Berlin, Heidelberg,
> 332–350. https://doi.org/10.1007/978-3-030-78713-4_18

```latex
@inproceedings{poenaru2021performance,
  title={A performance analysis of modern parallel programming models using a compute-bound application},
  author={Poenaru, Andrei and Lin, Wei-Chen and McIntosh-Smith, Simon},
  booktitle={International Conference on High Performance Computing},
  pages={332--350},
  year={2021},
  organization={Springer}
}
```
For the Julia port specifically: https://doi.org/10.1109/PMBS54543.2021.00016
```latex
@inproceedings{lin2021julia,
  title={Comparing {Julia} to performance portable parallel programming models for {HPC}},
  author={Lin, Wei-Chen and McIntosh-Smith, Simon},
  booktitle={2021 International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)},
  pages={94--105},
  year={2021},
  organization={IEEE}
}
```
For the ISO C++ port specifically: https://doi.org/10.1109/PMBS56514.2022.00009
```latex
@inproceedings{lin2022cpp,
  title={Evaluating {ISO} {C++} parallel algorithms on heterogeneous {HPC} systems},
  author={Lin, Wei-Chen and Deakin, Tom and McIntosh-Smith, Simon},
  booktitle={2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)},
  pages={36--47},
  year={2022},
  organization={IEEE}
}
```
