This document outlines building and installing ExaGO on three clusters:
- Newell (Power9 PNNL system)
- Marianas (Intel PNNL system)
- Ascent (Power9 ORNL system)
This document goes into great detail about building on Newell, but most of the instructions apply to all clusters.
If you are on one of the aforementioned systems, the following commands should yield an ExaGO installation with most or all options enabled, including GPU computation.
$ # Set this variable to one of newell, marianas, or ascent
$ export MY_CLUSTER=newell
$ git clone https://gitlab.pnnl.gov/exasgd/frameworks/exago.git
$ cd exago
$ mkdir build install
$ # Load all the modules needed to build/run ExaGO
$ source ./buildsystem/gcc-cuda/${MY_CLUSTER}Variables.sh
$ cd build
$ # Use the initial CMake cache we use for CI
$ cmake -C ../buildsystem/gcc-cuda/cache.cmake ..
$ make -j 12 install
$ # The tests may take a while to run
$ make test
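The make test target drives CTest under the hood, so you can also invoke CTest directly from the build directory; the standard --output-on-failure flag prints the log of any failing test, which is useful when reporting problems:
$ # Equivalent to `make test`, but prints the output of any failing test
$ ctest --output-on-failure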
Newell is a cluster of five IBM AC922 Power9 servers with four NVIDIA Volta GPUs per system. It gives researchers unfettered access to the same architecture found in the Summit supercomputer at Oak Ridge, which is currently ranked No. 1 on the Top500 list.
Specs:
- 2 Power9 CPUs with a total of 128 logical cores per node
- 4 NVIDIA V100 GPUs with NVLINK (16GB per GPU)
- 1TB of system memory per node
- EDR InfiniBand internal network
- 10Gb/s connections to the PNNL network
- Nodes have 2 physical InfiniBand ports, each of which has 2 virtual ports
More information is available on the Research Computing confluence page.
More details on setting up your environment to run on the Newell cluster are given below, but the current Spack packages can be used to build ExaGO and run its test suite with just a few commands.
The build scripts currently only work in the bash shell. If you are not using bash, type bash immediately after logging into Newell. Then cd into the root source directory of ExaGO.
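For example, assuming you cloned the repository into a directory named exago, as in the quick-start above, those two steps look like this:
$ bash
$ cd exago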
Source the shell script we use to load the required modules and set the environment variables needed to build on Newell:
source buildsystem/gcc-cuda/newellVariables.sh
This will set all environment variables needed by ExaGO.
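To confirm the environment was set up, you can list the modules the script loaded (this assumes the standard module command from Environment Modules or Lmod is available on Newell):
$ module list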
Next, request an interactive session on one of the compute nodes:
srun -A exasgd -t 20:00 --gres=gpu:1 -p newell -n 2 --pty bash
The -A exasgd flag indicates that you are using the ExaSGD allocation, -t 20:00 requests 20 minutes for your interactive session, --gres=gpu:1 requests one GPU, -p newell selects the newell SLURM partition, -n 2 specifies the number of MPI tasks you are requesting, and --pty bash means your interactive session will use the bash shell.
One or two tests use more than one MPI task, with a maximum of three tasks, so you should request at least three tasks in the interactive shell. Using the newell partition guarantees that all your environment variables and binary files match the hardware you are running on.
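For example, a session suitable for the full test suite differs from the command above only in the number of tasks requested:
$ srun -A exasgd -t 20:00 --gres=gpu:1 -p newell -n 3 --pty bash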
To build ExaGO, just type ./buildsystem/build.sh --job=gcc-cuda --build-only in the top-level directory while in the interactive session. This will create a build directory underneath the ExaGO directory containing all program executables, makefiles, and test directories. In addition to building ExaGO, the script will also run the test suite when the --build-only argument is omitted. If you wish to run the test suite after running the build.sh script with --build-only, you can either rerun it without that argument, or go into the build directory and type either ctest or make test.
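CTest's standard options can also be used to inspect or filter the suite; for example, -N lists the registered tests without running them, and -R runs only those whose names match a regular expression (the pattern below is just a placeholder):
$ cd build
$ # List the registered tests without running them
$ ctest -N
$ # Run only the tests whose names match <pattern>
$ ctest -R <pattern>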
Currently, running OpenMPI on Newell requires some nonstandard flags to be passed to the mpirun command. An MPI run command such as
mpirun -n 4
may need to be replaced with something like this:
mpirun -mca pml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=mlx5_1:1,mlx5_3:1 -n 4
Without these flags, OpenMPI may report warnings or even fail when running multi-node jobs.
You can save yourself some typing by setting environment variables like this:
export OMPI_MCA_pml="ucx"
export OMPI_MCA_btl="^vader,tcp,openib,uct"
export UCX_NET_DEVICES=mlx5_1:1,mlx5_3:1
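With these variables exported, the plain form of the run command should work again; <exe> below is just a placeholder for whichever ExaGO executable you are launching:
$ mpirun -n 4 <exe>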
For more information, take a look at the OpenMPI FAQ.
The currently tested (and recommended) toolchain for building ExaGO consists of GCC 7.4 and OpenMPI 3.1.5. You will also need a recent version of CMake. ExaGO depends on PETSc >= 3.13 and, optionally, on the Ipopt and HiOp optimization libraries. To get the needed modules, refer to the CI scripts under buildsystem/gcc-cuda/.
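If you want to check that the modules you loaded match this toolchain, the usual version queries work (standard commands, nothing ExaGO-specific):
$ gcc --version
$ mpirun --version
$ cmake --version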
ExaGO can be installed on Newell easily with cmake.
First create a build directory outside the ExaGO source directory. For example:
$ mkdir build
$ ls
build exago
$
Then configure ExaGO from the build directory using cmake:
$ cd build
$ cmake ../exago
$ make install
If you would like to configure ExaGO to use all the options used in continuous integration, invoke CMake like so:
$ cd build
$ cmake ../exago -C ../exago/buildsystem/gcc-cuda/cache.cmake
$ make install
The ExaGO library and its applications are installed in the default installation directory. To change the installation directory, run CMake with the CMAKE_INSTALL_PREFIX flag:
$ cmake ../exago -DCMAKE_INSTALL_PREFIX=<your_exago_install_dir>
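For example, to install into a directory under your home (the exact path here is just an illustration):
$ cmake ../exago -DCMAKE_INSTALL_PREFIX=$HOME/exago-install
$ make install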
ExaGO assumes PETSc is built with MPI support. If it is not, it is recommended you configure ExaGO not to use MPI:
$ cmake -DEXAGO_ENABLE_MPI=Off ../exago
In case the PETSc dependency is not found automatically, you can specify it using the ccmake interactive shell or by adding command line options like this:
$ cmake ../exago -DPETSC_DIR=<petsc_install_dir> -DPETSC_ARCH=<petsc_arch>
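If you prefer the interactive route mentioned above, ccmake can be pointed at the source tree from the build directory in the same way, and PETSC_DIR and PETSC_ARCH can then be edited in its interface:
$ ccmake ../exago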
To use Ipopt with ExaGO, set:
cmake ../exago -DEXAGO_ENABLE_IPOPT=ON
ExaGO will find the Ipopt module you loaded on Newell. If you want to use your own Ipopt build, you will most likely need to specify its location like this:
cmake ../exago -DEXAGO_ENABLE_IPOPT=ON -DIPOPT_DIR=<ipopt_install_dir>
Similar to Ipopt, the corresponding flags for HiOp are EXAGO_ENABLE_HIOP and HIOP_DIR.
cmake ../exago -DEXAGO_ENABLE_HIOP=ON -DHIOP_DIR=<hiop_install_dir>
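If you want both solvers enabled, the flags can be combined in a single configure line; the two install locations are placeholders for wherever your own builds live:
cmake ../exago \
  -DEXAGO_ENABLE_IPOPT=ON -DIPOPT_DIR=<ipopt_install_dir> \
  -DEXAGO_ENABLE_HIOP=ON -DHIOP_DIR=<hiop_install_dir>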
The CMake installer should set the dynamic run paths (RPATH) for you. Even so, to run ExaGO you may need to load all the modules you used to build it.
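On Newell, that usually just means sourcing the same environment script used for the build before launching anything:
$ source ./buildsystem/gcc-cuda/newellVariables.sh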
Ascent uses the LSF scheduler for job submission, so if you would like to run the tests, you'll have to pass some additional options.
$ export MY_CLUSTER=ascent
$ git clone https://gitlab.pnnl.gov/exasgd/frameworks/exago.git
$ mkdir build install
$ cd exago
$ # Load all the modules needed to build/run ExaGO
$ source ./buildsystem/gcc-cuda/${MY_CLUSTER}Variables.sh
$ cd ../build
$ # The EXAGO_CTEST_LAUNCH_COMMAND option lets the user customize the options
$ # passed to the job scheduler when running CTest.
$ cmake \
-C ../exago/buildsystem/gcc-cuda/cache.cmake \
-DEXAGO_CTEST_LAUNCH_COMMAND="jsrun -g 1" \
-DCMAKE_INSTALL_PREFIX=$PWD/../install \
../exago
$ make -j 12 install
$ # Request an allocation for 15 minutes
$ bsub -P csc359 -W 15 -nnodes 1 -Is /bin/bash
$ # The tests may take a while to run
$ make test
A workflow on Marianas should look almost exactly like one on Newell, except you should request an allocation in the dl or dl_shared partition, and source buildsystem/gcc-cuda/marianasVariables.sh instead.
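A minimal sketch of the Marianas variant, assuming the srun options otherwise mirror the Newell example above (adjust the time limit, GPU count, and task count to your needs):
$ export MY_CLUSTER=marianas
$ source buildsystem/gcc-cuda/${MY_CLUSTER}Variables.sh
$ srun -A exasgd -t 20:00 --gres=gpu:1 -p dl -n 3 --pty bash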