-
Notifications
You must be signed in to change notification settings - Fork 102
Multigrid Solver
The feature/multigrid branch contains the present work in progress on implementing the adaptive multigrid multigrid algorithm into QUDA. Once you've checked out this branch, you should enable this with the QUDA_MULTIGRID
cmake option (or with the configure build using --enable-multigrid
). It is highly recommended to CUDA 8.0 or above; while it works with CUDA 7.0, performance is much improved with CUDA 7.5 and above, with CUDA 8.0 being optimal. These instructions are primarily intended for the those who are using the QUDA-multigrid interface directly; an alternative approach is to use Chroma which includes a QUDA-multigrid interface for solving both Wilson and Wilson-clover systems.
Note the QUDA Multigrid interface is experimental, and is evolving, and is thus subject to change. For the most upto date interface semantics we refer to the canonical multigrid_invert_test example in the tests directory.
Before we can use the multigrid solver, we first must run the adaptive setup. Here we describe the various parameters involved.
The interface functions that are responsible for managing an instance of the multigrid solver are defined quda.h.
/**
* Setup the multigrid solver, according to the parameters set in param. It
* is assumed that the gauge field has already been loaded via
* loadGaugeQuda().
* @param param Contains all metadata regarding host and device
* storage and solver parameters
*/
void* newMultigridQuda(QudaMultigridParam *param);
/**
* @brief Free resources allocated by the multigrid solver
* @param mg_instance Pointer to instance of multigrid_solver
*/
void destroyMultigridQuda(void *mg_instance);
/**
* @brief Updates the multigrid preconditioner for the new gauge / clover field
* @param mg_instance Pointer to instance of multigrid_solver
*/
void updateMultigridQuda(void *mg_instance, QudaMultigridParam *param);
We create the multigrid instance with newMultigridQuda
, which is driven by the QudaMultigridParam
struct. This struct sets all the metadata responsible for the multigrid solver parameters. newMultigridQuda
returns a pointer to the instance of the multigrid solver we will subsequently pass as a preconditioner to invertQuda
.
When newMultigridQuda
is called, the adaptive multigrid setup will be invoked creating the multigrid preconditioner. On first invocation this can take some time to call due to the autotuning overhead of each individual kernel. The setup phase essentially consists of the following steps on each level (excluding the coarsest level):
- Compute a set of null-space vectors
- Block orthonormalize the null-space vectors to define the prolongation and restriction operators
- Compute the coarse-grid operator from the Galerkin product R A P There are variations on this exact process depending on the nature of the setup parameters, e.g., whether we perform multiple cycles of the setup process, etc.
Once the setup has completed, if the run_verify
option is enabled, the algorithm will run various verification tests to ensure basic sanity. E.g.,
- (1 - P R) v = 0, where v is a null-space vectors (orthonormality)
- (1 - R P) eta = 0, where eta is a random vector
- (Ac - R A P) eta = 0, where eta is a random vector (Galerkin condition) If these tests do not pass then the parameters are incorrect or the user has come across a bug. In either case, please file an issue on GitHub.
In the below table we describe each parameter and also note the matching command-line parameter that sets this parameter in multigrid_invert_test (where applicable). For the parameters that appear in an array of size QUDA_MAX_MG_LEVEL
, each parameter can be set for each level independently. For some parameters, it only makes sense to set these [0, QUDA_MAX_MG_LEVEL-1)
, e.g., the geo_block_size
since they concern the connection to the next coarsest level, where as others potentially need to be set for every level [0, QUDA_MAX_MG_LEVEL-1]
, e.g., --mg-schwarz-type
.
Name | Description | Comment | Command-line parameter |
---|---|---|---|
QudaInvertParam *invert_param |
Set the MG solver meta data such as precision, quark mass, Dirac operator type, etc. | ||
int n_level |
Number of multigrid levels | 3 is usually optimal | --mg-levels <2+> |
int geo_block_size[QUDA_MAX_MG_LEVEL][QUDA_MAX_DIM] |
Geometric block sizes to use on each level | Generally favor more aggressive coarsening for first level, rule of thumb is 4^4 | --mg-block-size <level x y z t> |
int spin_block_size[QUDA_MAX_MG_LEVEL] |
Spin block sizes to use on each level | 2 for level 0, and 1 thereafter | N/A |
int n_vec[QUDA_MAX_MG_LEVEL] |
Number of null-space vectors to use on each level | 24 or 32 is supported presently | --mg-nvec <level nvec> |
QudaPrecision precision_null[QUDA_MAX_MG_LEVEL] |
Precision to store the null-space vectors and preconditioned coarse-link variables | Use QUDA_HALF_PRECISION for optimal performance |
--prec-null <double/single/half> |
QudaVerbosity verbosity[QUDA_MAX_MG_LEVEL] |
Verbosity on each level of the multigrid | --mg-verbosity <level verb> |
|
QudaInverterType setup_inv_type[QUDA_MAX_MG_LEVEL] |
Inverter to use in the setup phase |
QUDA_BICGSTAB_INVERTER or QUDA_CG_INVERTER generally preferred |
--mg-setup-type <null/test> |
int num_setup_iter[QUDA_MAX_MG_LEVEL] |
Number of setup iterations | Experimental - keep this set to 1 | --mg-setup-iters <level iter> |
double setup_tol[QUDA_MAX_MG_LEVEL] |
Tolerance to use in the setup phase | Usually around 5e-6 is good. | --mg-setup-tol <level tol> |
int setup_maxiter[QUDA_MAX_MG_LEVEL] |
Maximum number of iterations for each setup solver | 500-1000 should work for most systems | --mg-setup-maxiter <level iter> |
QudaSetupType setup_type |
Null-space type to use in the setup phase | Suggest setting to QUDA_NULL_VECTOR_SETUP
|
--mg-setup-type <null/test> |
QudaBoolean pre_orthonormalize |
Pre orthonormalize vectors in the setup phase | Suggest setting to QUDA_BOOLEAN_FALSE
|
--mg-pre-orth <true/false> |
QudaBoolean post_orthonormalize |
Post orthonormalize vectors in the setup phase | Suggest setting to QUDA_BOOLEAN_TRUE
|
--mg-post-orth <true/false> |
QudaInverterType coarse_solver[QUDA_MAX_MG_LEVEL] |
The solver that wraps around the coarse grid correction and smoother | Set for all levels except 0. Suggest using QUDA_GCR_INVERTER on all intermediate grids and QUDA_CA_GCR_INVERTER on the bottom. |
--mg-coarse-solver <level gcr/etc.> |
double coarse_solver_tol[QUDA_MAX_MG_LEVEL] |
Tolerance for the solver that wraps around the coarse grid correction and smoother | Suggest setting each level to 0.25 | --mg-coarse-solver-tol <level gcr/etc.> |
double coarse_solver_maxiter[QUDA_MAX_MG_LEVEL] |
Tolerance for the solver that wraps around the coarse grid correction and smoother | Suggest setting in the range 8-100 | --mg-coarse-solver-maxiter <level n> |
QudaInverterType smoother[QUDA_MAX_MG_LEVEL] |
Smoother to use on each level | Set to QUDA_CA_GCR_INVERTER for each level |
--mg-smoother <level mr/ca-gcr/etc.> |
double smoother_tol[QUDA_MAX_MG_LEVEL] |
Tolerance to use for the smoother / solver on each level | Suggest setting each level to 0.25 | --mg-smoother-tol <level resid_tol> |
int nu_pre[QUDA_MAX_MG_LEVEL] |
Number of pre-smoother applications on each level | Suggest setting to 0 | --mg-nu-pre <level 1-20> |
int nu_post[QUDA_MAX_MG_LEVEL] |
Number of post-smoother applications on each level | Suggest setting to 8 | --mg-nu-post <level 1-20> |
double omega[QUDA_MAX_MG_LEVEL] |
Over/under relaxation factor for the smoother at each level | Set to 0.8-1.0 | --mg-omega |
QudaSchwarzType smoother_schwarz_type[QUDA_MAX_MG_LEVEL] |
Whether to use additive or multiplicative Schwarz preconditioning in the smoother | Experimental, set to QUDA_INVALID_SCHWARZ for each level unless you know what you're doing |
--mg-schwarz-type <level false/add/mul> |
int smoother_schwarz_cycle[QUDA_MAX_MG_LEVEL] |
Number of Schwarz cycles to apply | Experimental, set to 1 for each level | --mg-schwarz-cycle <level cycle> |
QudaSolutionType coarse_grid_solution_type[QUDA_MAX_MG_LEVEL] |
The type of residual to send to the next coarse grid, and thus the type of solution to receive back from this coarse grid | Use QUDA_MATPC_SOLUTION for all levels; if solving the full unpreconditioned system on level 0 then set coarse_grid_solution_type[0]=QUDA_MAT_SOLUTION
|
Implicitly depending on --mg-solve-type and --solve-type
|
QudaSolveType smoother_solve_type[QUDA_MAX_MG_LEVEL] |
The type of smoother solve to do on each grid (e/o preconditioning or not) | Suggest setting to QUDA_DIRECT_PC_SOLVE for all levels |
--mg-solve-type <level solve> |
QudaMultigridCycleType cycle_type[QUDA_MAX_MG_LEVEL] |
The type of multigrid cycle to perform at each level | Always set to QUDA_MG_CYCLE_RECURSIVE (this sets the MG cycles to be a K-cycle which is generally superior to a V-cycle for non-Hermitian systems) |
|
QudaBoolean global_reduction[QUDA_MAX_MG_LEVEL] |
Whether to use global reductions or not for the smoother / solver at each level | Experimental, keep set to QUDA_BOOLEAN_YES for all levels unless using a Schwarz smoother |
|
QudaFieldLocation location[QUDA_MAX_MG_LEVEL] |
Location where each level should be done | Set to QUDA_CUDA_FIELD_LOCATION for all levels |
N/A |
QudaComputeNullVector compute_null_vector |
Whether to compute the null vectors or reload them (or use the free field vectors, see * below) | Typically set to QUDA_COMPUTE_NULL_VECTOR_YES
|
--mg-generate-nullspace <true/false> |
QudaBoolean generate_all_levels |
Whether to generate on all levels or just on level 0 | Set to QUDA_BOOLEAN_TRUE for all levels (required if nvec is changed between levels) |
--mg-generate-all-levels <true/false> |
double mu_factor[QUDA_MAX_MG_LEVEL] |
Multiplicative factor for the mu parameter | Only applicable for twisted-mass and twisted-clover fermions | --mg-mu-factor <level factor> |
QudaBoolean run_verify |
Whether to run the verification checks once set up is complete | Keep set to QUDA_BOOLEAN_TRUE unless in final production |
--verify <true/false> |
char vec_infile[256] |
Filename prefix where to load the null-space vectors | Level suffix appended onto for each level | --mg-load-vec file |
char vec_outfile[256] |
Filename prefix for where to save the null-space vectors | Level suffix appended onto for each level | --mg-save-vec file |
double gflops |
The Gflops rate of the multigrid solver setup | ||
double sec |
The time taken by the multigrid solver setup |
* To have the MG setup generate and use the exact free-field null vectors, set QudaComputeNullVector compute_null_vector
to QUDA_COMPUTE_NULL_VECTOR_NO
(equivalently, specify --mg-generate-nullspace false
for tests on the command line) and char vec_infile[256]
to the null string (equivalently, do not specify --mg-load-vec
for tests on the command line). For Wilson-type fermions, this requires setting int n_vec[QUDA_MAX_MG_LEVEL]
to 6 (equivalently, specify --mg-nvec <level> 6
) for each refinement level, where the 6 corresponds to 12 spin-color degrees of freedom divided by a chirality factor of 2.
The instance of QudaInvertParam
sets general metadata that needs to be set when driving a QUDA linear solver. Namely the precision, Dirac operator parameters, etc. Note that regardless of the final solve being done, e.g. whether that of the regular linear system or the even-odd preconditioned system, QudaMultigridParam::invert_param.solve_type
should be set to QUDA_DIRECT_SOLVE
since the preconditioner is setup based on the full linear operator. With respect to precision, the general advice here is to set cuda_prec = QUDA_DOUBLE_PRECISION
, cuda_prec_sloppy = QUDA_SINGLE_PRECISION
and cuda_prec_precondition = QUDA_HALF_PRECISION
(in addition to setting QudaMultigridParam::prec_null = QUDA_HALF_PRECISION
).
With the adaptive setup complete we can run the multigrid-accelerated solver. Instrumenting the invertQuda
solver interface to use the multigrid preconditioner is straightforward. At present, the only Krylov solver that can be used with the multigrid preconditioner is the GCR solver, and the precision settings of the final solver and preconditioner should match.
The following parameters in QudaInvertParam
should be set:
QudaInvertParam inv_param;
//...set general solver parameters
inv_param.inv_type_precondtion = QUDA_MG_INVERTER; // set the preconditioner type to multigrid
QudaMultigridParam mg_param = newQudaMultigridParam();
//...fill out MG parameter...
void *mg_preconditioner = newMultigridQuda(&mg_param); // create the MG instance
inv_param.preconditioner = mg_preconditioner; // set pointer to the relevenat MG preconditioner instance
Suggested parameters for the GCR solver that wraps the MG preconditioner include
inv_param.inv_type = QUDA_GCR_INVERTER;
inv_param.gcrNKrylov = 20;
inv_param.reliable_delta = 1e-5;
The tests/multigrid_invert_test
contains a simple example for how to use the multigrid solver, and is intended to be instructional on how to interface the multigrid solver into other applications. This test support loading gauge fields into it (require QIO support to be enabled), loading previously generated null-space vectors as well as dumping these to disk for future use. A list of the relevant command-line options can be gleamed with the --help
options:
$ ./multigrid_invert_test --help
Usage: tests/multigrid_invert_test [options]
Common options:
--verbosity <silent/summurize/verbose> # The the verbosity on the top level of QUDA( default summarize)
--prec <double/single/half> # Precision in GPU
--prec-sloppy <double/single/half> # Sloppy precision in GPU
--prec-precondition <double/single/half> # Preconditioner precision in GPU
--prec-ritz <double/single/half> # Eigenvector precision in GPU
--recon <8/9/12/13/18> # Link reconstruction type
--recon-sloppy <8/9/12/13/18> # Sloppy link reconstruction type
--recon-precondition <8/9/12/13/18> # Preconditioner link reconstruction type
--dagger # Set the dagger to 1 (default 0)
--dim <n> # Set space-time dimension (X Y Z T)
--sdim <n> # Set space dimension(X/Y/Z) size
--xdim <n> # Set X dimension size(default 24)
--ydim <n> # Set X dimension size(default 24)
--zdim <n> # Set X dimension size(default 24)
--tdim <n> # Set T dimension size(default 24)
--Lsdim <n> # Set Ls dimension size(default 16)
--gridsize <x y z t> # Set the grid size in all four dimension (default 1 1 1 1)
--xgridsize <n> # Set grid size in X dimension (default 1)
--ygridsize <n> # Set grid size in Y dimension (default 1)
--zgridsize <n> # Set grid size in Z dimension (default 1)
--tgridsize <n> # Set grid size in T dimension (default 1)
--partition <mask> # Set the communication topology (X=1, Y=2, Z=4, T=8, and combinations of these)
--rank-order <col/row> # Set the [t][z][y][x] rank order as either column major (t fastest, default) or row major (x fastest)
--kernel-pack-t # Set T dimension kernel packing to be true (default false)
--dslash-type <type> # Set the dslash type, the following values are valid
wilson/clover/twisted-mass/twisted-clover/staggered
/asqtad/domain-wall/domain-wall-4d/mobius/laplace
--flavor <type> # Set the twisted mass flavor type (singlet (default), deg-doublet, nondeg-doublet)
--load-gauge file # Load gauge field "file" for the test (requires QIO)
--save-gauge file # Save gauge field "file" for the test (requires QIO, heatbath test only)
--niter <n> # The number of iterations to perform (default 10)
--ngcrkrylov <n> # The number of inner iterations to use for GCR, BiCGstab-l (default 10)
--pipeline <n> # The pipeline length for fused operations in GCR, BiCGstab-l (default 0, no pipelining)
--solution-pipeline <n> # The pipeline length for fused solution accumulation (default 0, no pipelining)
--inv-type <cg/bicgstab/gcr> # The type of solver to use (default cg)
--precon-type <mr/ (unspecified)> # The type of solver to use (default none (=unspecified)).
--multishift <true/false> # Whether to do a multi-shift solver test or not (default false)
--mass # Mass of Dirac operator (default 0.1)
--kappa # Kappa of Dirac operator (default 0.12195122... [equiv to mass])
--mu # Twisted-Mass of Dirac operator (default 0.1)
--epsilon-naik # Epsilon factor on Naik term (default 0.0, suggested non-zero -0.1)
--compute-clover # Compute the clover field or use random numbers (default false)
--clover-coeff # Clover coefficient (default 1.0)
--anisotropy # Temporal anisotropy factor (default 1.0)
--mass-normalization # Mass normalization (kappa (default) / mass / asym-mass)
--matpc # Matrix preconditioning type (even-even, odd-odd, even-even-asym, odd-odd-asym)
--solve-type # The type of solve to do (direct, direct-pc, normop, normop-pc, normerr, normerr-pc)
--tol <resid_tol> # Set L2 residual tolerance
--tolhq <resid_hq_tol> # Set heavy-quark residual tolerance
--reliable-delta <delta> # Set reliable update delta factor
--test # Test method (different for each test)
--verify <true/false> # Verify the GPU results using CPU results (default true)
--mg-nvec <level nvec> # Number of null-space vectors to define the multigrid transfer operator on a given level
--mg-gpu-prolongate <true/false> # Whether to do the multigrid transfer operators on the GPU (default false)
--mg-levels <2+> # The number of multigrid levels to do (default 2)
--mg-nu-pre <level 1-20> # The number of pre-smoother applications to do at a given multigrid level (default 2)
--mg-nu-post <level 1-20> # The number of post-smoother applications to do at a given multigrid level (default 2)
--mg-coarse-solve-type <level solve> # The type of solve to do on each level (direct, direct-pc) (default = solve_type)
--mg-smoother-solve-type <level solve> # The type of solve to do in smoother (direct, direct-pc (default) )
--mg-solve-location <level cpu/cuda> # The location where the multigrid solver will run (default cuda)
--mg-setup-location <level cpu/cuda> # The location where the multigrid setup will be computed (default cuda)
--mg-setup-inv <level inv> # The inverter to use for the setup of multigrid (default bicgstab)
--mg-setup-maxiter <level iter> # The maximum number of solver iterations to use when relaxing on a null space vector (default 500)
--mg-setup-maxiter-refresh <level iter> # The maximum number of solver iterations to use when refreshing the pre-existing null space vectors (default 100)
--mg-setup-iters <level iter> # The number of setup iterations to use for the multigrid (default 1)
--mg-setup-tol <level tol> # The tolerance to use for the setup of multigrid (default 5e-6)
--mg-setup-type <null/test> # The type of setup to use for the multigrid (default null)
--mg-pre-orth <true/false> # If orthonormalize the vector before inverting in the setup of multigrid (default false)
--mg-post-orth <true/false> # If orthonormalize the vector after inverting in the setup of multigrid (default true)
--mg-omega # The over/under relaxation factor for the smoother of multigrid (default 0.85)
--mg-coarse-solver <level gcr/etc.> # The solver to wrap the V cycle on each level (default gcr, only for levels 1+)
--mg-coarse-solver-tol <level gcr/etc.> # The coarse solver tolerance for each level (default 0.25, only for levels 1+)
--mg-coarse-solver-maxiter <level n> # The coarse solver maxiter for each level (default 100)
--mg-smoother <level mr/etc.> # The smoother to use for multigrid (default mr)
--mg-smoother-tol <level resid_tol> # The smoother tolerance to use for each multigrid (default 0.25)
--mg-smoother-halo-prec # The smoother halo precision (applies to all levels - defaults to null_precision)
--mg-schwarz-type <level false/add/mul> # Whether to use Schwarz preconditioning (requires MR smoother and GCR setup solver) (default false)
--mg-schwarz-cycle <level cycle> # The number of Schwarz cycles to apply per smoother application (default=1)
--mg-block-size <level x y z t> # Set the geometric block size for the each multigrid level's transfer operator (default 4 4 4 4)
--mg-mu-factor <level factor> # Set the multiplicative factor for the twisted mass mu parameter on each level (default 1)
--mg-generate-nullspace <true/false> # Generate the null-space vector dynamically (default true, if set false and mg-load-vec isn't set, creates free-field null vectors)
--mg-generate-all-levels <true/talse> # true=generate null-space on all levels, false=generate on level 0 and create other levels from that (default true)
--mg-load-vec file # Load the vectors "file" for the multigrid_test (requires QIO)
--mg-save-vec file # Save the generated null-space vectors "file" from the multigrid_test (requires QIO)
--mg-verbosity <level verb> # The verbosity to use on each level of the multigrid (default summarize)
--df-nev <nev> # Set number of eigenvectors computed within a single solve cycle (default 8)
--df-max-search-dim <dim> # Set the size of eigenvector search space (default 64)
--df-deflation-grid <n> # Set maximum number of cycles needed to compute eigenvectors(default 1)
--df-eigcg-max-restarts <n> # Set how many iterative refinement cycles will be solved with eigCG within a single physical right hand site solve (default 4)
--df-tol-restart <tol> # Set tolerance for the first restart in the initCG solver(default 5e-5)
--df-tol-inc <tol> # Set tolerance for the subsequent restarts in the initCG solver (default 1e-2)
--df-max-restart-num <n> # Set maximum number of the initCG restarts in the deflation stage (default 3)
--df-tol-eigenval <tol> # Set maximum eigenvalue residual norm (default 1e-1)
--solver-ext-lib-type <eigen/magma> # Set external library for the solvers (default Eigen library)
--df-ext-lib-type <eigen/magma> # Set external library for the deflation methods (default Eigen library)
--df-location-ritz <host/cuda> # Set memory location for the ritz vectors (default cuda memory location)
--df-mem-type-ritz <device/pinned/mapped> # Set memory type for the ritz vectors (default device memory type)
--nsrc <n> # How many spinors to apply the dslash to simultaneusly (experimental for staggered only)
--msrc <n> # Used for testing non-square block blas routines where nsrc defines the other dimension
--heatbath-beta <beta> # Beta value used in heatbath test (default 6.2)
--heatbath-warmup-steps <n> # Number of warmup steps in heatbath test (default 10)
--heatbath-num-steps <n> # Number of measurement steps in heatbath test (default 10)
--heatbath-num-hb-per-step <n> # Number of heatbath hits per heatbath step (default 5)
--heatbath-num-or-per-step <n> # Number of overrelaxation hits per heatbath step (default 5)
--heatbath-coldstart <true/false> # Whether to use a cold or hot start in heatbath test (default false)
--help # Print out this message
Note: this program is multi GPU build
For example, say we have a lattice contained in the file ~/lattices/wl_5p5_x2p38_um0p4086_cfg_1000.lime
of size 16^3x64, that we wish to use, and we want to run the multigrid setup to generate the null space vectors, dump them to disk, and then solve a linear system with mass parameter mass=-0.42. To run this is on two GPUs (two processes) we might do something like this:
export META="--dim 16 16 16 32 --gridsize 1 1 1 2 --mass -0.42 --anisotropy 2.38 --load-gauge wl_5p5_x2p38_um0p4086_cfg_1000.lime"
export SOLVER_PARAM="--prec double --prec-sloppy single --prec-precondition single --tol 1e-8"
export MG_PARAM="--mg-levels 3 --mg-generate-nullspace true --mg-save-vec /tmp/null"
export LEVEL0_PARAM="--mg-nu-pre 0 0 --mg-nu-post 0 8 --mg-block-size 0 4 4 4 4 --mg-nvec 0 24 --mg-setup-inv 0 cg --mg-smoother 0 ca-gcr"
export LEVEL1_PARAM="--mg-nu-pre 1 0 --mg-nu-post 1 8 --mg-block-size 1 2 2 2 2 --mg-nvec 1 32 --mg-setup-inv 1 cg --mg-smoother 1 ca-gcr"
export LEVEL2_PARAM="--mg-coarse-solver ca-gcr --mg-coarse-solver-maxiter 8"
mpirun -np 2 tests/multigrid_invert_test $META $SOLVER_PARAM $MG_PARAM $LEVEL0_PARAM $LEVEL1_PARAM $LEVEL2_PARAM
This will run a 3-level multigrid process, with 16^3x32 volume on each GPU at the fine grid. The first coarse grid will have 4^3x8 geometry per GPU, and the second will have 2^3x4. The parameters to the different multigrid levels are set in the LEVEL0_PARAM
and LEVEL1_PARAM
variables, e.g., number of null-space vectors per level, block size, solver to use in the dynamic setup, and the solver to use as a smoother. The parameter --mg-smoother-tol
parameter is the tolerance used for each level of the solver (excluding the fine grid) when a multigrid K-cycle is used (QudaMultigridParam::cycle_type = QUDA_MG_CYCLE_RECURSIVE
), as well as determining the tolerance of the coarsest grid solve.
If we want to reuse these same null-space vectors in a subsequent run, and not rerun the null-space generation, we would use
export META="--dim 16 16 16 32 --gridsize 1 1 1 2 --mass -0.42 --anisotropy 2.38 --load-gauge wl_5p5_x2p38_um0p4086_cfg_1000.lime"
export SOLVER_PARAM="--prec double --prec-sloppy single --prec-precondition single --tol 1e-8"
export MG_PARAM="--mg-levels 3 --mg-generate-nullspace false --mg-load-vec /tmp/null"
export LEVEL0_PARAM="--mg-nu-pre 0 0 --mg-nu-post 0 8 --mg-block-size 0 4 4 4 4 --mg-nvec 0 24 --mg-setup-inv 0 cg --mg-smoother 0 ca-gcr"
export LEVEL1_PARAM="--mg-nu-pre 1 0 --mg-nu-post 1 8 --mg-block-size 1 2 2 2 2 --mg-nvec 1 32 --mg-setup-inv 1 cg --mg-smoother 1 ca-gcr"
export LEVEL2_PARAM="--mg-coarse-solver ca-gcr --mg-coarse-solver-maxiter 8"
mpirun -np 2 tests/multigrid_invert_test $META $SOLVER_PARAM $MG_PARAM $LEVEL0_PARAM $LEVEL1_PARAM $LEVEL2_PARAM
QUDA has the ability to use Schwarz preconditioning as a smoother, as well as a preconditioner for the coarsest-grid solver. There are two parameters that are used to control this
QudaMultigridParam::smoother_schwarz_type[QUDA_MAX_MG_LEVEL];
QudaMultigridParam::smoother_schwarz_cycle[QUDA_MAX_MG_LEVEL];
The first parameter is used to enable Schwarz preconditioning on a given level, valid parameters are QUDA_INVALID_SCHWARZ
(default), QUDA_ADDITIVE_SCHWARZ
and QUDA_MULTIPLICATIVE_SCHWARZ
. These equate to --mg-schwarz-type <level false/add/mul>
and --mg-schwarz-cycle <level cycle>
command-line parameters for multigrid_invert_test.
QUDA's Schwarz preconditioning applies the domain decomposition at the node process interfaces, so this means that each Schwarz block is exactly the local volume, and hence dependent on the number of processes and process topology. The second parameter is used to determine the number of cycles of the preconditioner that are applied per step (default 1). When doing multiplicative preconditioning, the number of cycles must be even so that the entire lattice is updated. The number of iterations of each Schwarz cycle is set by the nu_pre
and nu_post
parameters. Hence for every cycle iteration, there are nu iterations of the domain solver.
At present, the Schwarz preconditioner is applied to both the setup solver and to actual smoother in the final multigrid algorithm. This has the implication that the setup solver must support Schwarz preconditioning: in practice (at present) this means that GCR must be used as the setup solver. Similarly, the actual solver used for solving the Schwarz system must be the MR.
For example, in the below example we use additive Schwarz as a smoother for the fine and intermediate levels, and use multiplicative Schwarz preconditioning on the coarsest grid.
export META="--dim 16 16 16 32 --gridsize 1 1 1 2 --mass -0.42 --anisotropy 2.38 --load-gauge gauge.conf"
export SOLVER_PARAM="--prec double --prec-sloppy single --prec-precondition single --tol 1e-8"
export MG_PARAM="--mg-levels 3 --mg-generate-nullspace true --mg-save-vec /tmp/null"
export LEVEL0_PARAM="--mg-nu-pre 0 0 --mg-nu-post 0 8 --mg-block-size 0 4 4 4 4 --mg-nvec 0 24 --mg-setup-inv 0 gcr --mg-schwarz-type 0 add --mg-smoother 0 mr"
export LEVEL1_PARAM="--mg-nu-pre 0 0 --mg-nu-post 0 8 --mg-block-size 1 2 2 2 2 --mg-nvec 1 32 --mg-setup-inv 1 gcr --mg-schwarz-type 1 add --mg-smoother 1 mr"
export LEVEL2_PARAM="--mg-schwarz-type 2 mul --mg-schwarz-cycle 2 2 --mg-smoother 2 mr --mg-coarse-solver 2 gcr"
mpirun -np 2 tests/multigrid_invert_test $META $SOLVER_PARAM $MG_PARAM $LEVEL0_PARAM $LEVEL1_PARAM $LEVEL2_PARAM
The Multigrid Solver has many important elements that you need to consider for your problem. You may find that many of them have little to no effect if solving heavy quark or unphysical problems, but for large and physical lattices they must be given proper attention. The tuning is done in two stages.
Typical problems will not require more than a 3-level solve, and 3-level solves offer good consistency over different configurations, but you may find that a 2-level solve is better. On each level you may choose to use either 24 or 32 null vectors to generate the null space that creates the coarse grid. On each level there is a blocking scheme (or aggregation scheme) that dictates the coarseness of the coarse-grid solver. For example, you wish to perform a 2-level solve on an L^3xT=16^3x32 lattice, with aggregates of size 4^3x8 for a coarse-grid problem of size of 4^4.
In general, QUDA's multigrid parameters can all be requested at runtime, the main exception to this is the aggregate size: the restrictor requires that the aggregate size size declared at compile time in order to generate efficient code. If you receive an error block size not instantiated
simply edit lib/restrictor.cu
to instantiate the correct template accordingly: the CUDA thread block size should be set to half the value of the geometric aggregate size (this is the checkerboarded aggregate size - the parity is handled separately in the thread y dimension. Hardware constraints mean that only 1024 threads can be launched per CUDA block, so the largest aggregation scheme allowed on any MG level is 4x4x8x8.
Once you have decided on the range of null vectors, levels, and aggregation schemes to test, launch tests/multigrid_invert_test
(or your own test code) for each combination. QUDA's internal hardware tuning routines will then profile these combinations so that further jobs are performed as quickly as possible. This will take some time, especially for large lattices, but need be done only once. If you're testing 2- and 3-level solves, and your coarse 2-level set-up is the same as the intermediate set-up on 3-level (say 2LVL MG=4444 Null=24, and 3LVL MG=4444-2222 Null=24-24) then do all the 2-level tuning first. The profiles for the intermediate level will be picked up and used, saving a lot of time.
Now that QUDA has profiled each MG combination, you can move on to numerical parameters. One of the most important parameters is the the tolerance of the inner GCR smoother. We have found that setting this to a value of 0.25
usually achieves good results, but you may wish to fine tune this number. Variations of up to 100% in outer solve times can be seen for smoother tolerance values of 0.1-0.3
.
When using twisted-mass or twisted-clover fermions, changing the value of mu
on the coarsest grid can have a significant effect on the outer solve time. Use the command line option --mg-mu-factor <level> <mu-factor>
where <mu-factor>
is the multiplicative factor by which mu is increased on the coarse levels. The number of (pre and post) smoother iterations is another variable to consider.
It is best to use the CG solver to generate the null space --mg-setup-inv <N level> cg
for twisted fermions as it's more stable. For precisions, make sure you pass --prec-sloppy single
, --prec-precondition half
and --prec-null half
. Note that with twisted-clover fermions, half precision is only recommended if DYNAMIC_CLOVER
enabled.
Note that all MG command line options refer to the Nth multigrid level with the integer N-1. For example, for a 2-level solve one would pass --mg-block-size 0 4 4 4 8
to set the blocking scheme on the first level, and --mg-mu-factor 1 16.0
to set the mu value of on the second level.
There are two ways in which one can use Lanczos to accelerate the multigrid inverter. We construct a QudaEigParam
for each level via this exhaustive run-down of the options:
--mg-eig <level> <true/false> # Use the eigensolver on this level (default false)
--mg-eig-nEv <level> <n> # The size of eigenvector search space in the eigensolver
--mg-eig-nKr <level> <n> # The size of the Krylov subspace to use in the eigensolver
--mg-eig-check-interval <level> <n> # Perform a convergence check every nth restart/iteration (only used in Implicit Restart types)
--mg-eig-max-restarts <level> <n> # Perform a maximun of n restarts in eigensolver (default 100)
--mg-eig-use-normop <level> <true/false> # Solve the MdagM problem instead of M (MMdag if eig-use-dagger == true) (default false)
--mg-eig-use-dagger <level> <true/false> # Solve the MMdag problem instead of M (MMdag if eig-use-normop == true) (default false)
--mg-eig-tol <level> <tol> # The tolerance to use in the eigensolver (default 1e-6)
--mg-eig-use-poly-acc <level> <true/false> # Use Chebyshev polynomial acceleration in the eigensolver (default true)
--mg-eig-poly-deg <level> <n> # Set the degree of the Chebyshev polynomial (default 100)
--mg-eig-amin <level> <Float> # The minimum in the polynomial acceleration (default 0.1)
--mg-eig-amax <level> <Float> # The maximum in the polynomial acceleration (default 4.0)
--mg-eig-spectrum <level> <SR/LR/SM/LM/SI/LI> # The spectrum part to be calulated. S=smallest L=largest R=real M=modulus I=imaginary (default SR)
--mg-eig-type <level> <eigensolver> # The type of eigensolver to use (default trlm)
--mg-eig-coarse-guess <true/false> # If deflating on the coarse grid, optionaly use an initial guess (default false)
We always use the regular coarse operator for constructing the null space, rather than the preconditioned operator, regardless of what operator is being coarsened.
You should experiment with the Chebyshev acceleration parameters as documented here and here. As a rule of thumb, one can optimise the eigensolver for fine operator, and use these values on each level of multigrid as a good starting set. You will likely need to make minor adjustments for optimal performance though.
One may use the eigenvectors of the normal operator MdagM
as the null vectors for the prolongation and restriction of multigrid levels. To enable this feature, use the flag,
--mg-eig <level> <true/false> # Use the eigensolver on this level (default false)
This will instruct the QUDA test routine to create a set of eigensolver parameters for the specified level, and will call the eigensolver on that level construction. Be sure to select appropriate eigensolver parameters for your problem. If calling the MG routine from inside you application code, see multigrid_invert_test.cpp
for an example of how QUDA parameters are populated.
One may also deflate the coarsest grid solve. This is enabled via the same flag,
--mg-eig <coarsest level> <true/false> # Use the eigensolver on this level (default false)
The same rules apply when it comes to passing eigensolver parameters.
We here give a series of example commands that demonstrate the usage of the two MG+Lanczos features. We shall work with an L=16, T=32
random lattice throughout, use a three level solve with 2222
and 2222
blocking on levels 0
and 1
, and use the appropriate solver types and parameters for the given feature. We give first a command that performs the solve using MG only as a reference.
./multigrid_invert_test --prec double --prec-sloppy single --prec-null half --prec-precondition half --solve-type direct-pc --dim 16 16 16 32 --mg-levels 3 --mass 0.1 --mg-nvec 0 24 --mg-block-size 0 2 2 2 2 --mg-nvec 1 24 --mg-block-size 1 2 2 2 2 --nsrc 1 --verify false --mg-nu-pre 0 0 --mg-nu-post 0 8 --mg-smoother 0 ca-gcr --mg-nu-pre 1 0 --mg-nu-post 1 8 --mg-smoother 1 ca-gcr --tol 1e-10 --niter 100 --mg-coarse-solver 2 ca-gcr --mg-coarse-solver-maxiter 2 8 --mg-coarse-solver-tol 2 0.25 --nsrc 10 --mg-setup-inv 0 cg --mg-setup-inv 1 cg
This is a fairly typical type of MG solve. Note ca-gcr
is the favoured solver on the coarse level, and the tolerance is very loose. This is (for Wilson type solves) the recommended set-up.
./multigrid_invert_test --prec double --prec-sloppy single --prec-null half --prec-precondition half --solve-type direct-pc --dim 16 16 16 32 --mg-levels 3 --mass 0.1 --mg-nvec 0 24 --mg-block-size 0 2 2 2 2 --mg-nvec 1 24 --mg-block-size 1 2 2 2 2 --nsrc 1 --verify false --mg-nu-pre 0 0 --mg-nu-post 0 8 --mg-smoother 0 ca-gcr --mg-nu-pre 1 0 --mg-nu-post 1 8 --mg-smoother 1 ca-gcr --tol 1e-10 --niter 100 --mg-coarse-solver 2 ca-gcr --mg-coarse-solver-maxiter 2 8 --mg-coarse-solver-tol 2 0.25 --mg-eig 0 true --mg-eig 1 true --mg-eig 2 true --mg-low-mode-check false --mg-nvec 2 32 --mg-eig-nEv 0 34 --mg-eig-nKr 0 102 --mg-eig-tol 0 1e-4 --mg-eig-poly-deg 0 50 --mg-eig-amin 0 1e-1 --mg-eig-amax 0 5.0 --mg-eig-nEv 1 32 --mg-eig-nKr 1 64 --mg-eig-tol 1 1e-6 --mg-eig-poly-deg 1 50 --mg-eig-amin 1 1e-1 --mg-eig-amax 1 5.0 --mg-eig-nEv 2 32 --mg-eig-nKr 2 64 --mg-eig-tol 2 1e-6 --mg-eig-poly-deg 2 50 --mg-eig-amin 2 1e-0 --mg-eig-amax 2 5.0 --mg-eig-max-restarts 0 100 --mg-eig-max-restarts 1 100 --mg-eig-max-restarts 2 100 --nsrc 10 --mg-eig-use-dagger 0 false --mg-eig-use-dagger 1 false --mg-eig-use-dagger 2 false --mg-eig-use-normop 0 true --mg-eig-use-normop 1 true --mg-eig-use-normop 2 true --nsrc 10 --verbosity verbose --mg-verbosity 2 verbose --mg-verbosity 1 verbose --mg-verbosity 0 verbose
This solves the same problem, with the same MG blocking scheme, but with the addition of Lanczos at every step. Good luck!
- Amalgamate the eigensolver code with the EigCG code as much as possible. E.g., use shared parameters, use a single deflation code path, check EicCG vectors with low prec Lanczos solves, make the
deflation.cpp
host routines faster with template instantiated sizes ofMatrixXcd
, such as 24, 32, 48, 64. - Ensure the solver variables
nEv
,nKr
, andnConv
are used sensibly throughout the eigensolver. E.g., only save/loadnConv
vectors, trim vector arrays where appropriate. - Make a staggered/Laplace eigensolve test script.
The following is an incomplete list of what needs to be done:
- Optimize CPU coarse grid operator - at present no attempt has been made to optimize the CPU coarse grid operator despite it being on the critical path in a multi-level implementation since we have the expectation that the coarsest grid will be on the CPU. Optimization here will almost certainly include OpenMP parallelisation as well as potentially extending the QUDA autotuner to encompass CPU kernels, as well as vectorization.
- Optimize CPU BLAS kernels
- Implement reduced precision for halo exchange, 8-bit and 16-bit halo exchange
- Improve how precision is exposed in the interface
- understand james Osborn's set up process
- "For the setup I ran several passes of ~100 bicgstab iterations on a random vector until I determined that the rayleigh quotient wasn't being reduced much, then used that vector. I then started with the vector I had before the previous bicgstab pass and orthogonalized it against the current set of null vectors, then ran another set of bicgstab passes on that, and so on until I have enough vectors."
- block solvers as smoothers
- multi rhs interface
- add support to overlap halo exchange with interior compute
- fix the profiling to correctly report the GCR wrapper solver
- merge the multi-grid/generic ghost zone with the dslash ghost zones - this will enable peer-to-peer for MG
- look at how BiCGstab works as an outer solver and coarse-grid solver
Possible strategies for further memory reductions
- Support storing the B vectors in half precision.
- There is redundant information having both of B and V arrays since V is just the block orthogonalized version of B. This is complicated by the fact that B is an array of vectors and V is a block packed vector. Note we could just store the coarsened vectors B_c since we can reconstruct the B vectors from
B[i] = P * B_c[i]
- Use half precision for storing the coarse links. To do this we'd need to introduce Y and X temporaries to accumulate into and then copy into the low precision fields. Memory saving here since we loop over direction, so the Y tmp need only be a scalar matrix field. We'd want to form Xinv from X temp before discarding it.
- There are multiples places throughout the solver were have persistent fields allocated but we do not simultaneously use them, e.g., temporaries is MG vs Transfer vs temporaries used in the solvers. If we can share the temporaries between these then we would be able to reduce the overall memory footprint.