ice_dyn_vp: use 'global_sum_prod' for bit-for-bit output reproducibility
The VP solver uses a linear solver, FGMRES, as part of the non-linear
iteration. The FGMRES algorithm involves computing the norm of a
distributed vector field, thus performing global sums.

These norms are computed by first summing the squared X and Y components
of a vector field in subroutine 'calc_L2norm_squared', summing these
over the local blocks, and then doing a global (MPI) sum using
'global_sum'.
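
Schematically, the removed pattern looked like this (excerpt of the deleted
lines in the diff below; 'norm_squared' holds one partial sum per block):

    !$OMP PARALLEL DO PRIVATE(iblk)
    do iblk = 1, nblocks
       ! per-block sum of the squared X and Y components (ice points only)
       call calc_L2norm_squared(nx_block , ny_block , &
                                icellu (iblk), &
                                indxui (:,iblk), indxuj(:,iblk), &
                                bx(:,:,iblk) , &
                                by(:,:,iblk) , &
                                norm_squared(iblk))
    enddo
    !$OMP END PARALLEL DO
    norm_rhs = sqrt(global_sum(sum(norm_squared), distrb_info))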

This approach does not lead to reproducible results when the MPI
distribution, or the number of local blocks, is changed, for reasons
explained in the "Reproducible sums" section of the Developer Guide
(mostly, floating point addition is not associative). This was partly
pointed out in [1] but I failed to realize it at the time.
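
As a minimal, standalone illustration (not CICE code) of the
non-associativity, the same three numbers summed with a different grouping
give different results:

    program float_sum_order
       ! Demonstrates that floating point addition is not associative:
       ! regrouping the same three numbers changes the computed sum.
       implicit none
       integer, parameter :: dbl = selected_real_kind(13, 307)
       real (kind=dbl) :: a, b, c
       a =  1.0e16_dbl
       b = -1.0e16_dbl
       c =  1.0_dbl
       print *, '(a + b) + c = ', (a + b) + c   ! prints 1.0
       print *, 'a + (b + c) = ', a + (b + c)   ! prints 0.0 (b + c rounds to -1.0e16)
    end program float_sum_order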

Make the results of the VP solver more reproducible by using two calls to
'global_sum_prod' to separately compute the global sums of the squared X and
Y components when computing norms, and then summing these two reproducible
scalars.
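
For example, the norm of the FGMRES right-hand side becomes (see the diff
below):

    norm_rhs = sqrt(global_sum_prod(bx(:,:,:), bx(:,:,:), distrb_info, field_loc_NEcorner) + &
                    global_sum_prod(by(:,:,:), by(:,:,:), distrb_info, field_loc_NEcorner))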

The same pattern appears in the FGMRES solver (subroutine 'fgmres'), in the
preconditioner 'pgmres', which uses the same algorithm, and in the Classical
and Modified Gram-Schmidt algorithms in 'orthogonalize'.

These changes result in twice the number of global sums for fgmres,
pgmres and the MGS algorithm. For the CGS algorithm, the performance
impact is higher as 'global_sum_prod' is called inside the loop, whereas
previously we called 'global_allreduce_sum' after the loop to compute
all 'initer' sums at the same time.

To keep that optimization, we would have to implement a new interface
'global_allreduce_sum_prod' which would take two arrays of shape
(nx_block,ny_block,max_blocks,k) and sum these over their first three
dimensions before performing the global reduction over the k dimension.
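
A sketch of what such an interface could look like (hypothetical and not
implemented; argument names are illustrative and mirror the existing
'global_sum_prod' and 'global_allreduce_sum' interfaces):

    ! hypothetical interface, for illustration only (not implemented);
    ! kinds and the 'distrb' type come from the usual CICE modules
    function global_allreduce_sum_prod (array1, array2, dist, field_loc) &
             result(globalSums)
       real (kind=dbl_kind), dimension(:,:,:,:), intent(in) :: &
          array1, array2 ! local arrays to multiply and sum, (nx_block,ny_block,max_blocks,k)
       type (distrb), intent(in) :: &
          dist           ! block distribution
       integer (int_kind), intent(in), optional :: &
          field_loc      ! grid location of the fields
       real (kind=dbl_kind), dimension(size(array1,4)) :: &
          globalSums     ! one reproducible global sum per index along the k dimension
    end function global_allreduce_sum_prod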

We choose not to go that route for now, mostly because the CGS algorithm is
(by default) only used for the PGMRES preconditioner, so the cost should be
relatively low: 'initer' corresponds to 'dim_pgmres' in the namelist, which
should be kept low for efficiency (default 5).

These changes lead to bit-for-bit reproducibility (the decomp_suite
passes) when using 'precond=ident' and 'precond=diag' along with
'bfbflag=reprosum'. 'precond=pgmres' is still not bit-for-bit because
some halo updates are skipped for efficiency. This will be addressed in
a following commit.

Note that 'calc_bvec' loops only over ice points to compute b[xy], so we
zero-initialize b[xy] since 'global_sum_prod' sums over the whole array.
The 'arnoldi_basis_[xy]' arrays are already zero-initialized in 'fgmres'
and 'pgmres'.

[1] CICE-Consortium#491 (comment)
phil-blain committed Aug 25, 2022
1 parent dea537a commit a2dad16
Showing 1 changed file with 31 additions and 96 deletions.
cicecore/cicedynB/dynamics/ice_dyn_vp.F90
@@ -51,7 +51,7 @@ module ice_dyn_vp
seabed_stress, Ktens, stack_fields, unstack_fields
use ice_fileunits, only: nu_diag
use ice_flux, only: fmU
use ice_global_reductions, only: global_sum, global_allreduce_sum
use ice_global_reductions, only: global_sum, global_sum_prod
use ice_grid, only: dxT, dyT, dxhy, dyhx, cxp, cyp, cxm, cym, uarear
use ice_exit, only: abort_ice
use icepack_intfc, only: icepack_warnings_flush, icepack_warnings_aborted
@@ -1941,6 +1941,10 @@ subroutine calc_bvec (nx_block, ny_block, &
if (icepack_warnings_aborted()) call abort_ice(error_message=subname, &
file=__FILE__, line=__LINE__)

! Initialize to zero
bx = c0
by = c0

do ij = 1, icellu
i = indxui(ij)
j = indxuj(ij)
@@ -2821,18 +2825,8 @@ subroutine fgmres (zetax2 , etax2 , &
arnoldi_basis_y = c0

! solution is zero if RHS is zero
!$OMP PARALLEL DO PRIVATE(iblk)
do iblk = 1, nblocks
call calc_L2norm_squared(nx_block , ny_block , &
icellu (iblk), &
indxui (:,iblk), indxuj(:,iblk), &
bx(:,:,iblk) , &
by(:,:,iblk) , &
norm_squared(iblk))

enddo
!$OMP END PARALLEL DO
norm_rhs = sqrt(global_sum(sum(norm_squared), distrb_info))
norm_rhs = sqrt(global_sum_prod(bx(:,:,:), bx(:,:,:), distrb_info, field_loc_NEcorner) + &
global_sum_prod(by(:,:,:), by(:,:,:), distrb_info, field_loc_NEcorner))
if (norm_rhs == c0) then
solx = bx
soly = by
@@ -2870,18 +2864,11 @@ subroutine fgmres (zetax2 , etax2 , &
! Start outer (restarts) loop
do
! Compute norm of initial residual
!$OMP PARALLEL DO PRIVATE(iblk)
do iblk = 1, nblocks
call calc_L2norm_squared(nx_block , ny_block , &
icellu (iblk), &
indxui (:,iblk), indxuj(:,iblk), &
arnoldi_basis_x(:,:,iblk, 1) , &
arnoldi_basis_y(:,:,iblk, 1) , &
norm_squared(iblk))

enddo
!$OMP END PARALLEL DO
norm_residual = sqrt(global_sum(sum(norm_squared), distrb_info))
norm_residual = sqrt(global_sum_prod(arnoldi_basis_x(:,:,:,1), &
arnoldi_basis_x(:,:,:,1), distrb_info, field_loc_NEcorner) + &
global_sum_prod(arnoldi_basis_y(:,:,:,1), &
arnoldi_basis_y(:,:,:,1), distrb_info, field_loc_NEcorner))

if (my_task == master_task .and. monitor_fgmres) then
write(nu_diag, '(a,i4,a,d26.16)') "monitor_fgmres: iter_fgmres= ", nbiter, &
@@ -2975,17 +2962,10 @@ subroutine fgmres (zetax2 , etax2 , &
hessenberg)

! Compute norm of new Arnoldi vector and update Hessenberg matrix
!$OMP PARALLEL DO PRIVATE(iblk)
do iblk = 1, nblocks
call calc_L2norm_squared(nx_block , ny_block , &
icellu (iblk), &
indxui (:,iblk), indxuj(:, iblk) , &
arnoldi_basis_x(:,:,iblk, nextit), &
arnoldi_basis_y(:,:,iblk, nextit), &
norm_squared(iblk))
enddo
!$OMP END PARALLEL DO
hessenberg(nextit,initer) = sqrt(global_sum(sum(norm_squared), distrb_info))
hessenberg(nextit,initer) = sqrt(global_sum_prod(arnoldi_basis_x(:,:,:,nextit), &
arnoldi_basis_x(:,:,:,nextit), distrb_info, field_loc_NEcorner) + &
global_sum_prod(arnoldi_basis_y(:,:,:,nextit), &
arnoldi_basis_y(:,:,:,nextit), distrb_info, field_loc_NEcorner))

! Watch out for happy breakdown
if (.not. almost_zero( hessenberg(nextit,initer) ) ) then
@@ -3263,18 +3243,10 @@ subroutine pgmres (zetax2 , etax2 , &
! Start outer (restarts) loop
do
! Compute norm of initial residual
!$OMP PARALLEL DO PRIVATE(iblk)
do iblk = 1, nblocks
call calc_L2norm_squared(nx_block , ny_block , &
icellu (iblk), &
indxui (:,iblk), indxuj(:, iblk), &
arnoldi_basis_x(:,:,iblk, 1), &
arnoldi_basis_y(:,:,iblk, 1), &
norm_squared(iblk))

enddo
!$OMP END PARALLEL DO
norm_residual = sqrt(global_sum(sum(norm_squared), distrb_info))
norm_residual = sqrt(global_sum_prod(arnoldi_basis_x(:,:,:,1), &
arnoldi_basis_x(:,:,:,1), distrb_info, field_loc_NEcorner) + &
global_sum_prod(arnoldi_basis_y(:,:,:,1), &
arnoldi_basis_y(:,:,:,1), distrb_info, field_loc_NEcorner))

if (my_task == master_task .and. monitor_pgmres) then
write(nu_diag, '(a,i4,a,d26.16)') "monitor_pgmres: iter_pgmres= ", nbiter, &
@@ -3357,17 +3329,10 @@ subroutine pgmres (zetax2 , etax2 , &
hessenberg)

! Compute norm of new Arnoldi vector and update Hessenberg matrix
!$OMP PARALLEL DO PRIVATE(iblk)
do iblk = 1, nblocks
call calc_L2norm_squared(nx_block , ny_block , &
icellu (iblk), &
indxui (:,iblk), indxuj(:, iblk) , &
arnoldi_basis_x(:,:,iblk, nextit), &
arnoldi_basis_y(:,:,iblk, nextit), &
norm_squared(iblk))
enddo
!$OMP END PARALLEL DO
hessenberg(nextit,initer) = sqrt(global_sum(sum(norm_squared), distrb_info))
hessenberg(nextit,initer) = sqrt(global_sum_prod(arnoldi_basis_x(:,:,:,nextit), &
arnoldi_basis_x(:,:,:,nextit), distrb_info, field_loc_NEcorner) + &
global_sum_prod(arnoldi_basis_y(:,:,:,nextit), &
arnoldi_basis_y(:,:,:,nextit), distrb_info, field_loc_NEcorner))

! Watch out for happy breakdown
if (.not. almost_zero( hessenberg(nextit,initer) ) ) then
@@ -3646,39 +3611,20 @@ subroutine orthogonalize(ortho_type , initer , &
ij , & ! compressed index
i, j ! grid indices

real (kind=dbl_kind), dimension (max_blocks) :: &
local_dot ! local array value to accumulate dot product of grid function over blocks

real (kind=dbl_kind), dimension(maxinner) :: &
dotprod_local ! local array to accumulate several dot product computations

character(len=*), parameter :: subname = '(orthogonalize)'

if (trim(ortho_type) == 'cgs') then ! Classical Gram-Schmidt
! Classical Gram-Schmidt orthogonalisation process
! First loop of Gram-Schmidt (compute coefficients)
dotprod_local = c0
do it = 1, initer
local_dot = c0

!$OMP PARALLEL DO PRIVATE(iblk,ij,i,j)
do iblk = 1, nblocks
do ij = 1, icellu(iblk)
i = indxui(ij, iblk)
j = indxuj(ij, iblk)

local_dot(iblk) = local_dot(iblk) + &
(arnoldi_basis_x(i, j, iblk, it) * arnoldi_basis_x(i, j, iblk, nextit)) + &
(arnoldi_basis_y(i, j, iblk, it) * arnoldi_basis_y(i, j, iblk, nextit))
enddo ! ij
enddo
!$OMP END PARALLEL DO
hessenberg(it, initer) = global_sum_prod(arnoldi_basis_x(:,:,:,it), &
arnoldi_basis_x(:,:,:,nextit), distrb_info, field_loc_NEcorner) + &
global_sum_prod(arnoldi_basis_y(:,:,:,it), &
arnoldi_basis_y(:,:,:,nextit), distrb_info, field_loc_NEcorner)

dotprod_local(it) = sum(local_dot)
end do

hessenberg(1:initer, initer) = global_allreduce_sum(dotprod_local(1:initer), distrb_info)

! Second loop of Gram-Schmidt (orthonormalize)
do it = 1, initer
!$OMP PARALLEL DO PRIVATE(iblk,ij,i,j)
@@ -3698,22 +3644,11 @@ subroutine orthogonalize(ortho_type , initer , &
elseif (trim(ortho_type) == 'mgs') then ! Modified Gram-Schmidt
! Modified Gram-Schmidt orthogonalisation process
do it = 1, initer
local_dot = c0

!$OMP PARALLEL DO PRIVATE(iblk,ij,i,j)
do iblk = 1, nblocks
do ij = 1, icellu(iblk)
i = indxui(ij, iblk)
j = indxuj(ij, iblk)

local_dot(iblk) = local_dot(iblk) + &
(arnoldi_basis_x(i, j, iblk, it) * arnoldi_basis_x(i, j, iblk, nextit)) + &
(arnoldi_basis_y(i, j, iblk, it) * arnoldi_basis_y(i, j, iblk, nextit))
enddo ! ij
enddo
!$OMP END PARALLEL DO

hessenberg(it,initer) = global_sum(sum(local_dot), distrb_info)
hessenberg(it, initer) = global_sum_prod(arnoldi_basis_x(:,:,:,it), &
arnoldi_basis_x(:,:,:,nextit), distrb_info, field_loc_NEcorner) + &
global_sum_prod(arnoldi_basis_y(:,:,:,it), &
arnoldi_basis_y(:,:,:,nextit), distrb_info, field_loc_NEcorner)

!$OMP PARALLEL DO PRIVATE(iblk,ij,i,j)
do iblk = 1, nblocks
