ice_dyn_vp: allow for bit-for-bit reproducibility under bfbflag (#774)
* doc: fix typo in index (bfbflag)

* doc: correct default value of 'maxits_nonlin'

The "Table of namelist options" in the user guide lists 'maxits_nonlin'
as having a default value of 1000, whereas its actual default is 4, both
in the namelist and in 'ice_init.F90'. This has been the case since the
original implementation of the implicit solver in f7fd063 (dynamics: add
implicit VP solver (#491), 2020-09-22).

Fix the documentation.

* doc: VP solver is validated with OpenMP

When the implicit VP solver was added in f7fd063 (dynamics: add implicit
VP solver (#491), 2020-09-22), it had not yet been tested with OpenMP
enabled.

The OpenMP implementation was carefully reviewed and then fixed in
d1e972a (Update OMP (#680), 2022-02-18), which led to all runs of the
'decomp' suite completing and all restart tests passing. The 'bfbcomp'
tests are still failing, but this is due to the code not using the CICE
global sum implementation correctly, which will be fixed in the next
commits.

Update the documentation accordingly.

* ice_dyn_vp: activate OpenMP in 'dyn_prep2' loop

When the OpenMP implementation was reviewed and fixed in d1e972a (Update
OMP (#680), 2022-02-18), the 'PRIVATE' clause of the OpenMP directive
for the loop where 'dyn_prep2' is called in 'implicit_solver' was
corrected in line with what was done in 'ice_dyn_evp', but OpenMP was
left unactivated for this loop (the 'TCXOMP' was not changed to a real
'OMP' directive).

Activate OpenMP for this loop. All runs and restart tests of the
'decomp_suite' still pass with this change.
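
As a rough illustration only (stand-in names, not the actual 'ice_dyn_vp' loop or its full PRIVATE list), the pattern is simply a real '!$OMP' sentinel around the per-block loop, with the loop-local variables declared PRIVATE:

    program omp_block_loop_sketch
       implicit none
       integer, parameter :: nblocks = 8          ! stand-in for the local block count
       integer :: iblk, ilo, ihi, jlo, jhi        ! per-block bounds, must be PRIVATE
       real    :: work(nblocks)

       !$OMP PARALLEL DO PRIVATE(iblk, ilo, ihi, jlo, jhi)
       do iblk = 1, nblocks
          ilo = 1 ; ihi = 10                      ! stand-ins for this_block%ilo, %ihi
          jlo = 1 ; jhi = 10
          work(iblk) = real((ihi-ilo+1)*(jhi-jlo+1))  ! stands in for 'call dyn_prep2(...)'
       enddo
       !$OMP END PARALLEL DO

       print *, 'per-block work:', work
    end program omp_block_loop_sketch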

* machines: eccc: add ICE_MACHINE_MAXRUNLENGTH to ppp[56]

* machines: eccc: use PBS-enabled OpenMPI for 'ppp6_gnu'

The system installation of OpenMPI at /usr/mpi/gcc/openmpi-4.1.2a1/ is
not compiled with support for PBS. This leads to failures as the MPI
runtime does not have the same view of the number of available processors
as the job scheduler.

Use our own build of OpenMPI, compiled with PBS support, for the
'ppp6_gnu' environment.

* machines: eccc: set I_MPI_FABRICS=ofi

Intel MPI 2021.5.1, which comes with oneAPI 2022.1.2, seems to have an
intermittent bug where a call to 'MPI_Waitall' fails with:

    Abort(17) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Waitall: See the MPI_ERROR field in MPI_Status for the error code

and no core dump is produced. This affects at least these cases of the
'decomp' suite:

- *_*_restart_gx3_16x2x1x1x800_droundrobin
- *_*_restart_gx3_16x2x2x2x200_droundrobin

This was reported to Intel and they suggested setting the variable
'I_MPI_FABRICS' to 'ofi' (the default being 'shm:ofi' [1]). This
disables shared memory transport and indeed fixes the failures.

Set this variable for all ECCC machine files using Intel MPI.

[1] https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference/environment-variables-for-fabrics-control/communication-fabrics-control.html

* machines: eccc: set I_MPI_CBWR for BASEGEN/BASECOM runs

Intel MPI, in contrast to OpenMPI (as far as I was able to test, and see
[1], [2]), does not (by default) guarantee that repeated runs of the same
code on the same machine with the same number of MPI ranks yield the
same results when collective operations (e.g. 'MPI_ALLREDUCE') are used.

Since the VP solver uses MPI_ALLREDUCE in its algorithm, this leads to
repeated runs of the code giving different answers, and to baseline
comparisons between runs built from the same commit failing.

When generating a baseline or comparing against an existing baseline,
set the environment variable 'I_MPI_CBWR' to 1 for ECCC machine files
using Intel MPI [3], so that (processor) topology-aware collective
algorithms are not used and results are reproducible.

Note that we do not need to set this variable on robert or underhill, on
which jobs have exclusive node access and thus job placement (on
processors) is guaranteed to be reproducible.

[1] https://stackoverflow.com/a/45916859/
[2] https://scicomp.stackexchange.com/a/2386/
[3] https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference/i-mpi-adjust-family-environment-variables.html#i-mpi-adjust-family-environment-variables_GUID-A5119508-5588-4CF5-9979-8D60831D1411

* ice_dyn_vp: fgmres: exit early if right-hand-side vector is zero

If starting a run with "ice_ic='none'" (no ice), the linearized
problem for the ice velocity A x = b will have b = 0, since all terms in
the right hand side vector will be zero:

- strint[xy] is zero because the velocity is zero
- tau[xy] is zero because the ocean velocity is also zero
- [uv]vel_init is zero
- strair[xy] is zero because the concentration is zero
- strtlt[xy] is zero because the ocean velocity is zero

We thus have a linear system A x = b with b=0, so we
must have x=0.

In the FGMRES linear solver, this special case is not taken into
account, and so we end up with an all-zero initial residual since
workspace_[xy] is also zero because of the all-zero initial guess
'sol[xy]', which corresponds to the initial ice velocity. This then
leads to a division by zero when normalizing the first Arnoldi vector.

Fix this special case by computing the norm of the right-hand-side
vector before starting the iterations, and exiting early if it is zero.
This is in line with the GMRES implementation in SciPy [1].

[1] https://github.com/scipy/scipy/blob/651a9b717deb68adde9416072c1e1d5aa14a58a1/scipy/sparse/linalg/_isolve/iterative.py#L620-L628

Close: phil-blain#42
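
A minimal, hedged sketch of this early-exit guard (stand-in names, not the actual 'fgmres' routine): with b = 0 the solution is x = 0, and normalizing the first Arnoldi vector would otherwise divide by zero.

    program fgmres_zero_rhs_guard
       implicit none
       integer, parameter :: dbl_kind = selected_real_kind(13)
       integer, parameter :: n = 4
       real(dbl_kind) :: bx(n), by(n), solx(n), soly(n), rhs_norm

       bx = 0.0_dbl_kind ; by = 0.0_dbl_kind         ! no ice: all forcing terms vanish
       rhs_norm = sqrt(sum(bx**2) + sum(by**2))      ! stands in for the global norm of b

       if (rhs_norm == 0.0_dbl_kind) then
          solx = 0.0_dbl_kind ; soly = 0.0_dbl_kind  ! exact solution of A x = 0
          print *, 'zero RHS: returning x = 0 without iterating'
       else
          print *, 'nonzero RHS: would start the FGMRES iterations here'
       endif
    end program fgmres_zero_rhs_guard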

* ice_dyn_vp: add global_norm, global_dot_product functions

The VP solver uses a linear solver, FGMRES, as part of the non-linear
iteration. The FGMRES algorithm involves computing the norm of a
distributed vector field, thus performing global sums.

These norms are computed by first summing the squared X and Y components
of a vector field in subroutine 'calc_L2norm_squared', summing these
over the local blocks, and then doing a global (MPI) sum using
'global_sum'.

This approach does not lead to reproducible results when the MPI
distribution, or the number of local blocks, is changed, for reasons
explained in the "Reproducible sums" section of the Developer Guide
(mostly, floating point addition is not associative). This was partly
pointed out in [1] but I failed to realize it at the time.
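
A tiny standalone example of the non-associativity at play (not CICE code): the order in which partial sums are accumulated changes the result, which is why the block / MPI layout affects non-reproducible global sums.

    program fp_assoc
       implicit none
       integer, parameter :: dbl_kind = selected_real_kind(13)
       real(dbl_kind) :: a, b, c
       a = 1.0e16_dbl_kind
       b = -1.0e16_dbl_kind
       c = 1.0_dbl_kind
       print *, '(a+b)+c =', (a+b)+c   ! 1.0
       print *, 'a+(b+c) =', a+(b+c)   ! 0.0 (c is absorbed when added to b first)
    end program fp_assoc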

Introduce a new function, 'global_dot_product', to encapsulate the
computation of the dot product of two grid vectors, each split into two
arrays (for the X and Y components).

Compute the reduction locally as is done in 'calc_L2norm_squared', but
when 'bfbflag' is active, throw away that result and instead use the
existing 'global_sum' function, passing it the temporary array used to
compute the element-by-element product.

This approach avoids a performance regression from the added work done
in 'global_sum', such that non-bfbflag runs are as fast as before.

Note that since 'global_sum' loops over the whole array (and not just
over ice points, as 'global_dot_product' does), we make sure to
zero-initialize the 'prod' local array.

Also add a 'global_norm' function implemented using
'global_dot_product'. Both functions will be used in subsequent commits
to ensure bit-for-bit reproducibility.
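
A hedged, serial sketch of this logic (the mask and the plain 'sum' calls are stand-ins for the CICE ice mask and the reproducible 'global_sum'; names are illustrative): 'prod' is zero-initialized, the local sum over blocks is always computed, and the reproducible path discards it in favour of summing 'prod' itself.

    program global_dot_product_sketch
       implicit none
       integer, parameter :: dbl_kind = selected_real_kind(13)
       integer, parameter :: nx = 4, ny = 4, nblocks = 2
       real(dbl_kind) :: u1(nx,ny,nblocks), v1(nx,ny,nblocks)
       real(dbl_kind) :: u2(nx,ny,nblocks), v2(nx,ny,nblocks)
       real(dbl_kind) :: prod(nx,ny,nblocks), local_sum, dot
       logical        :: icemask(nx,ny,nblocks), bfb
       integer        :: iblk

       call random_number(u1) ; call random_number(v1)
       call random_number(u2) ; call random_number(v2)
       icemask = .true. ; icemask(1,1,:) = .false.   ! pretend some points are ice-free
       bfb = .true.                                  ! i.e. 'bfbflag' is active

       prod      = 0.0_dbl_kind   ! zero-init: the global sum loops over ALL points
       local_sum = 0.0_dbl_kind
       do iblk = 1, nblocks
          where (icemask(:,:,iblk))
             prod(:,:,iblk) = u1(:,:,iblk)*u2(:,:,iblk) + v1(:,:,iblk)*v2(:,:,iblk)
          end where
          local_sum = local_sum + sum(prod(:,:,iblk))   ! local reduction over blocks
       enddo

       if (bfb) then
          dot = sum(prod)      ! stands in for the reproducible 'global_sum(prod, ...)'
       else
          dot = local_sum      ! stands in for the fast sum of per-task partial sums
       endif
       print *, 'dot product =', dot, '  norm =', sqrt(dot)
    end program global_dot_product_sketch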

* ice_dyn_vp: use global_{norm,dot_product} for bit-for-bit output reproducibility

Make the results of the VP solver reproducible if desired by refactoring
the code to use the subroutines 'global_norm' and 'global_dot_product'
added in the previous commit.

The same pattern appears in the FGMRES solver (subroutine 'fgmres'), the
preconditioner 'pgmres' which uses the same algorithm, and the
Classical and Modified Gram-Schmidt algorithms in 'orthogonalize'.

These modifications do not change the number of global sums in the
fgmres and pgmres solvers or in the MGS algorithm. For the CGS algorithm, there is
(in theory) a slight performance impact as 'global_dot_product' is
called inside the loop, whereas previously we called
'global_allreduce_sum' after the loop to compute all 'initer' sums at
the same time.

To keep that optimization, we would have to implement a new interface
'global_allreduce_sum' which would take an array of shape
(nx_block,ny_block,max_blocks,k) and sum over their first three
dimensions before performing the global reduction over the k dimension.

We choose not to go that route for now, mostly because the CGS
algorithm is (by default) only used for the PGMRES preconditioner, so
the cost should be relatively low: 'initer' corresponds to 'dim_pgmres'
in the namelist, which should be kept low for efficiency (default 5).

These changes lead to bit-for-bit reproducibility (the decomp_suite
passes) when using 'precond=ident' and 'precond=diag' along with
'bfbflag=reprosum'. 'precond=pgmres' is still not bit-for-bit because
some halo updates are skipped for efficiency. This will be addressed in
a following commit.

[1] #491 (comment)
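
A hedged illustration of the CGS trade-off discussed above (serial stand-in; 'dot_product' and 'matmul' play the role of the global reductions, names are not the CICE ones): one reduction per basis vector inside the loop, versus batching all 'initer' reductions into a single call after the loop.

    program cgs_reduction_sketch
       implicit none
       integer, parameter :: dbl_kind = selected_real_kind(13)
       integer, parameter :: n = 16, initer = 5      ! 'initer' ~ dim_pgmres (default 5)
       real(dbl_kind) :: basis(n,initer), w(n), coeffs(initer)
       integer :: it

       call random_number(basis) ; call random_number(w)

       ! Refactored CGS: one 'global_dot_product'-style reduction per iteration.
       do it = 1, initer
          coeffs(it) = dot_product(basis(:,it), w)   ! would be one global (MPI) sum each
       enddo
       print *, 'per-iteration reductions:', coeffs

       ! Previous pattern: local partial sums for all vectors, then one batched
       ! reduction of a length-'initer' vector (the removed 'global_allreduce_sum').
       coeffs = matmul(transpose(basis), w)          ! one reduction over the k dimension
       print *, 'batched reduction:      ', coeffs
    end program cgs_reduction_sketch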

* ice_dyn_vp: do not skip halo updates in 'pgmres' under 'bfbflag'

The 'pgmres' subroutine implements a separate GMRES solver and is used
as a preconditioner for the FGMRES linear solver. Since it is only a
preconditioner, it was decided to skip the halo updates after computing
the matrix-vector product (in 'matvec'), for efficiency.

This leads to non-reproducibility since the content of the non-updated
halos depends on the block / MPI distribution.

Add the required halo updates, but only perform them when we are
explicitly asking for bit-for-bit global sums, i.e. when 'bfbflag' is
set to something other than 'not'.

Adjust the interfaces of 'pgmres' and 'precondition' (from which
'pgmres' is called) to accept 'halo_info_mask', since it is needed for
masked updates.

Closes #518
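
A rough, runnable sketch of the gating only (all names are hypothetical stubs, not the actual 'pgmres'/'precondition' interfaces): the extra halo update after the matrix-vector product happens only when reproducible sums are requested, so the default fast path is unchanged.

    program pgmres_halo_gate
       implicit none
       character(len=16) :: bfbflag
       bfbflag = 'reprosum'                     ! anything other than 'not'
       call matvec_stub()                       ! matrix-vector product (stub)
       if (trim(bfbflag) /= 'not') then
          call halo_update_stub()               ! extra update for bit-for-bit runs
       endif
    contains
       subroutine matvec_stub()
          print *, 'computed A*x on local blocks'
       end subroutine matvec_stub
       subroutine halo_update_stub()
          print *, 'updated halo cells (masked update via a halo mask)'
       end subroutine halo_update_stub
    end program pgmres_halo_gate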

* ice_dyn_vp: use global_{norm,dot_product} for bit-for-bit log reproducibility

In the previous commits we ensured bit-for-bit reproducibility of the
outputs when using the VP solver.

Some global norms computed during the nonlinear iteration still use the
same non-reproducible pattern of summing over blocks locally before
performing the reduction. However, these norms are used only to monitor
the convergence in the log file, as well as to exit the iteration when
the required convergence level is reached ('nlres_norm'). Only
'nlres_norm' could (in theory) influence the output, but it is unlikely
that a difference due to floating point errors would influence the 'if
(nlres_norm < tol_nl)' condition used to exit the nonlinear iteration.

Change these remaining cases to also use 'global_norm', leading to
bit-for-bit log reproducibility.

* ice_dyn_vp: remove unused subroutine and cleanup interfaces

The previous commit removed the last caller of 'calc_L2norm_squared'.
Remove the subroutine.

Also, do not compute 'sum_squared' in 'residual_vec', since the
variable 'L2norm' which receives this value has been unused in
'anderson_solver' since the previous commit. Remove that variable, and
adjust the interface of 'residual_vec' accordingly.

* ice_global_reductions: remove 'global_allreduce_sum'

In a previous commit, we removed the sole caller of
'global_allreduce_sum' (in ice_dyn_vp::orthogonalize). We do not
anticipate that this function will be used elsewhere in the code, so
remove it from ice_global_reductions. Update the 'sumchk' unit test
accordingly.

* doc: mention VP solver is only reproducible using 'bfbflag'

The previous commits made sure that the model outputs as well as the log
file output are bit-for-bit reproducible when using the VP solver by
refactoring the code to use the existing 'global_sum' subroutine.

Add a note in the documentation mentioning that 'bfbflag' is required to
get bit-for-bit reproducible results under different decompositions /
MPI counts when using the VP solver.

Also, adjust the documentation stating that 'bfbflag=lsum8' is the same
as 'bfbflag=off', since this is not the case for the VP solver: in the
first case we use the scalar version of 'global_sum', in the second
case we use the array version.

* ice_dyn_vp: improve default parameters for VP solver

During QC testing of the previous commit, the 5-year QC test with the
updated VP solver failed twice with "bad departure points" after a few
years of simulation. Simply bumping the number of nonlinear iterations
(maxits_nonlin) from 4 to 5 makes these failures disappear and allows
the simulations to run to completion, suggesting the solution is not
converged enough with 4 iterations.

We also noticed that in these failing cases, the relative tolerance for
the linear solver (reltol_fgmres = 1E-2) is too small to be reached in
less than 50 iterations (maxits_fgmres), and this is the case at each
nonlinear iteration. Other papers mention a relative tolerance of 1E-1
for the linear solver, and using this value also allows both cases to
run to completion (even without changing maxits_nonlin).

Let's set the default tolerance for the linear solver to 1E-1, and let's
be conservative and bump the number of nonlinear iterations to 10. This
should give us a more converged solution and add robustness to the
default settings.
phil-blain authored Oct 20, 2022
1 parent 2435fa7 commit 16b78da
Showing 17 changed files with 240 additions and 373 deletions.
351 changes: 194 additions & 157 deletions cicecore/cicedynB/dynamics/ice_dyn_vp.F90

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions cicecore/cicedynB/general/ice_init.F90
@@ -419,7 +419,7 @@ subroutine input_data
deltaminEVP = 1e-11_dbl_kind ! minimum delta for viscosities (EVP, Hunke 2001)
deltaminVP = 2e-9_dbl_kind ! minimum delta for viscosities (VP, Hibler 1979)
capping_method = 'max' ! method for capping of viscosities (max=Hibler 1979,sum=Kreyscher2000)
maxits_nonlin = 4 ! max nb of iteration for nonlinear solver
maxits_nonlin = 10 ! max nb of iteration for nonlinear solver
precond = 'pgmres' ! preconditioner for fgmres: 'ident' (identity), 'diag' (diagonal),
! 'pgmres' (Jacobi-preconditioned GMRES)
dim_fgmres = 50 ! size of fgmres Krylov subspace
@@ -431,7 +431,7 @@ subroutine input_data
monitor_pgmres = .false. ! print pgmres residual norm
ortho_type = 'mgs' ! orthogonalization procedure 'cgs' or 'mgs'
reltol_nonlin = 1e-8_dbl_kind ! nonlinear stopping criterion: reltol_nonlin*res(k=0)
reltol_fgmres = 1e-2_dbl_kind ! fgmres stopping criterion: reltol_fgmres*res(k)
reltol_fgmres = 1e-1_dbl_kind ! fgmres stopping criterion: reltol_fgmres*res(k)
reltol_pgmres = 1e-6_dbl_kind ! pgmres stopping criterion: reltol_pgmres*res(k)
algo_nonlin = 'picard' ! nonlinear algorithm: 'picard' (Picard iteration), 'anderson' (Anderson acceleration)
fpfunc_andacc = 1 ! fixed point function for Anderson acceleration:
@@ -36,7 +36,6 @@ module ice_global_reductions
private

public :: global_sum, &
global_allreduce_sum, &
global_sum_prod, &
global_maxval, &
global_minval
@@ -56,12 +55,6 @@ module ice_global_reductions
global_sum_scalar_int
end interface

interface global_allreduce_sum
module procedure global_allreduce_sum_vector_dbl!, &
! module procedure global_allreduce_sum_vector_real, & ! not yet implemented
! module procedure global_allreduce_sum_vector_int ! not yet implemented
end interface

interface global_sum_prod
module procedure global_sum_prod_dbl, &
global_sum_prod_real, &
@@ -707,68 +700,6 @@ function global_sum_scalar_int(scalar, dist) &

end function global_sum_scalar_int

!***********************************************************************

function global_allreduce_sum_vector_dbl(vector, dist) &
result(globalSums)

! Computes the global sums of sets of scalars (elements of 'vector')
! distributed across a parallel machine.
!
! This is actually the specific interface for the generic global_allreduce_sum
! function corresponding to double precision vectors. The generic
! interface is identical but will handle real and integer vectors.

real (dbl_kind), dimension(:), intent(in) :: &
vector ! vector whose components are to be summed

type (distrb), intent(in) :: &
dist ! block distribution

real (dbl_kind), dimension(size(vector)) :: &
globalSums ! resulting array of global sums

!-----------------------------------------------------------------------
!
! local variables
!
!-----------------------------------------------------------------------

integer (int_kind) :: &
numProcs, &! number of processor participating
numBlocks, &! number of local blocks
communicator, &! communicator for this distribution
numElem ! number of elements in vector

real (dbl_kind), dimension(:,:), allocatable :: &
work ! temporary local array

character(len=*), parameter :: subname = '(global_allreduce_sum_vector_dbl)'

!-----------------------------------------------------------------------
!
! get communicator for MPI calls
!
!-----------------------------------------------------------------------

call ice_distributionGet(dist, &
numLocalBlocks = numBlocks, &
nprocs = numProcs, &
communicator = communicator)

numElem = size(vector)
allocate(work(1,numElem))
work(1,:) = vector
globalSums = c0

call compute_sums_dbl(work,globalSums,communicator,numProcs)

deallocate(work)

!-----------------------------------------------------------------------

end function global_allreduce_sum_vector_dbl

!***********************************************************************

function global_sum_prod_dbl (array1, array2, dist, field_loc, &
@@ -37,7 +37,6 @@ module ice_global_reductions
private

public :: global_sum, &
global_allreduce_sum, &
global_sum_prod, &
global_maxval, &
global_minval
@@ -57,12 +56,6 @@ module ice_global_reductions
global_sum_scalar_int
end interface

interface global_allreduce_sum
module procedure global_allreduce_sum_vector_dbl!, &
! module procedure global_allreduce_sum_vector_real, & ! not yet implemented
! module procedure global_allreduce_sum_vector_int ! not yet implemented
end interface

interface global_sum_prod
module procedure global_sum_prod_dbl, &
global_sum_prod_real, &
@@ -708,68 +701,6 @@ function global_sum_scalar_int(scalar, dist) &

end function global_sum_scalar_int

!***********************************************************************

function global_allreduce_sum_vector_dbl(vector, dist) &
result(globalSums)

! Computes the global sums of sets of scalars (elements of 'vector')
! distributed across a parallel machine.
!
! This is actually the specific interface for the generic global_allreduce_sum
! function corresponding to double precision vectors. The generic
! interface is identical but will handle real and integer vectors.

real (dbl_kind), dimension(:), intent(in) :: &
vector ! vector whose components are to be summed

type (distrb), intent(in) :: &
dist ! block distribution

real (dbl_kind), dimension(size(vector)) :: &
globalSums ! resulting array of global sums

!-----------------------------------------------------------------------
!
! local variables
!
!-----------------------------------------------------------------------

integer (int_kind) :: &
numProcs, &! number of processor participating
numBlocks, &! number of local blocks
communicator, &! communicator for this distribution
numElem ! number of elements in vector

real (dbl_kind), dimension(:,:), allocatable :: &
work ! temporary local array

character(len=*), parameter :: subname = '(global_allreduce_sum_vector_dbl)'

!-----------------------------------------------------------------------
!
! get communicator for MPI calls
!
!-----------------------------------------------------------------------

call ice_distributionGet(dist, &
numLocalBlocks = numBlocks, &
nprocs = numProcs, &
communicator = communicator)

numElem = size(vector)
allocate(work(1,numElem))
work(1,:) = vector
globalSums = c0

call compute_sums_dbl(work,globalSums,communicator,numProcs)

deallocate(work)

!-----------------------------------------------------------------------

end function global_allreduce_sum_vector_dbl

!***********************************************************************

function global_sum_prod_dbl (array1, array2, dist, field_loc, &
64 changes: 0 additions & 64 deletions cicecore/drivers/unittest/sumchk/sumchk.F90
@@ -58,9 +58,6 @@ program sumchk
integer(int_kind),parameter :: ntests3 = 3
character(len=8) :: errorflag3(ntests3)
character(len=32) :: stringflag3(ntests3)
integer(int_kind),parameter :: ntests4 = 1
character(len=8) :: errorflag4(ntests4)
character(len=32) :: stringflag4(ntests4)

integer(int_kind) :: npes, ierr, ntask

@@ -100,7 +97,6 @@ program sumchk
errorflag1 = passflag
errorflag2 = passflag
errorflag3 = passflag
errorflag4 = passflag
npes = get_num_procs()

if (my_task == master_task) then
@@ -600,63 +596,6 @@ program sumchk
endif
enddo

! ---------------------------
! Test Vector Reductions
! ---------------------------

if (my_task == master_task) write(6,*) ' '

n = 1 ; stringflag4(n) = 'dble sum vector'
allocate(vec8(3))
allocate(sum8(3))

minval = -5.
maxval = 8.

vec8(1) = 1.

! fill one gridcell with a min and max value
ntask = max(npes-1,1)-1
if (my_task == ntask) then
vec8(1) = minval
endif
ntask = min(npes,2)-1
if (my_task == ntask) then
vec8(1) = maxval
endif
vec8(2) = 2. * vec8(1)
vec8(3) = 3. * vec8(1)

! compute correct results
if (npes == 1) then
minval = maxval
corval = maxval
else
corval = (npes - 2) * 1.0 + minval + maxval
endif

do k = 1,ntests4
string = stringflag4(k)
sum8 = -888e12
if (k == 1) then
sum8 = global_allreduce_sum(vec8, distrb_info)
else
call abort_ice(subname//' illegal k vector',file=__FILE__,line=__LINE__)
endif

if (my_task == master_task) then
write(6,'(1x,a,3g16.8)') string, sum8(1),sum8(2),sum8(3)
endif

if (sum8(1) /= corval .or. sum8(2) /= 2.*corval .or. sum8(3) /= 3.*corval) then
errorflag4(k) = failflag
errorflag0 = failflag
if (my_task == master_task) then
write(6,*) '**** ERROR ', sum8(1),sum8(2),sum8(3),corval
endif
endif
enddo

! ---------------------------

if (my_task == master_task) then
@@ -670,9 +609,6 @@ program sumchk
do k = 1,ntests3
write(6,*) errorflag3(k),stringflag3(k)
enddo
do k = 1,ntests4
write(6,*) errorflag4(k),stringflag4(k)
enddo
write(6,*) ' '
write(6,*) 'SUMCHK COMPLETED SUCCESSFULLY'
if (errorflag0 == passflag) then
4 changes: 2 additions & 2 deletions configuration/scripts/ice_in
@@ -167,7 +167,7 @@
kridge = 1
ktransport = 1
ssh_stress = 'geostrophic'
maxits_nonlin = 4
maxits_nonlin = 10
precond = 'pgmres'
dim_fgmres = 50
dim_pgmres = 5
@@ -178,7 +178,7 @@
monitor_pgmres = .false.
ortho_type = 'mgs'
reltol_nonlin = 1e-8
reltol_fgmres = 1e-2
reltol_fgmres = 1e-1
reltol_pgmres = 1e-6
algo_nonlin = 'picard'
use_mean_vrel = .true.
7 changes: 7 additions & 0 deletions configuration/scripts/machines/env.ppp5_intel
@@ -18,6 +18,12 @@ source $ssmuse -d /fs/ssm/main/opt/intelcomp/inteloneapi-2022.1.2/intelcomp+mpi+
# module load -s icc mpi
setenv FOR_DUMP_CORE_FILE 1
setenv I_MPI_DEBUG_COREDUMP 1
# Reproducible collectives
if (${ICE_BASEGEN} != ${ICE_SPVAL} || ${ICE_BASECOM} != ${ICE_SPVAL}) then
setenv I_MPI_CBWR 1
endif
# Stop being buggy
setenv I_MPI_FABRICS ofi
# NetCDF
source $ssmuse -d main/opt/hdf5-netcdf4/serial/shared/inteloneapi-2022.1.2/01

@@ -32,6 +38,7 @@ setenv ICE_MACHINE_MAKE make
setenv ICE_MACHINE_WKDIR ~/data/ppp5/cice/runs/
setenv ICE_MACHINE_INPUTDATA /space/hall5/sitestore/eccc/cmd/e/sice500/
setenv ICE_MACHINE_BASELINE ~/data/ppp5/cice/baselines/
setenv ICE_MACHINE_MAXRUNLENGTH 6
setenv ICE_MACHINE_SUBMIT qsub
setenv ICE_MACHINE_TPNODE 80
setenv ICE_MACHINE_ACCT unused
3 changes: 2 additions & 1 deletion configuration/scripts/machines/env.ppp6_gnu
@@ -8,7 +8,7 @@ endif
if ("$inp" != "-nomodules") then

# OpenMPI
source /usr/mpi/gcc/openmpi-4.1.2a1/bin/mpivars.csh
setenv PATH "/home/phb001/.local_rhel-8-icelake-64_gcc/bin:$PATH"

# OpenMP
setenv OMP_STACKSIZE 64M
@@ -21,6 +21,7 @@ setenv ICE_MACHINE_MAKE make
setenv ICE_MACHINE_WKDIR ~/data/site6/cice/runs/
setenv ICE_MACHINE_INPUTDATA /space/hall6/sitestore/eccc/cmd/e/sice500/
setenv ICE_MACHINE_BASELINE ~/data/site6/cice/baselines/
setenv ICE_MACHINE_MAXRUNLENGTH 6
setenv ICE_MACHINE_SUBMIT qsub
setenv ICE_MACHINE_TPNODE 80
setenv ICE_MACHINE_ACCT unused
7 changes: 7 additions & 0 deletions configuration/scripts/machines/env.ppp6_gnu-impi
@@ -18,6 +18,12 @@ setenv I_MPI_F90 gfortran
setenv I_MPI_FC gfortran
setenv I_MPI_CC gcc
setenv I_MPI_CXX g++
# Reproducible collectives
if (${ICE_BASEGEN} != ${ICE_SPVAL} || ${ICE_BASECOM} != ${ICE_SPVAL}) then
setenv I_MPI_CBWR 1
endif
# Stop being buggy
setenv I_MPI_FABRICS ofi

# OpenMP
setenv OMP_STACKSIZE 64M
@@ -30,6 +36,7 @@ setenv ICE_MACHINE_MAKE make
setenv ICE_MACHINE_WKDIR ~/data/site6/cice/runs/
setenv ICE_MACHINE_INPUTDATA /space/hall6/sitestore/eccc/cmd/e/sice500/
setenv ICE_MACHINE_BASELINE ~/data/site6/cice/baselines/
setenv ICE_MACHINE_MAXRUNLENGTH 6
setenv ICE_MACHINE_SUBMIT qsub
setenv ICE_MACHINE_TPNODE 80
setenv ICE_MACHINE_ACCT unused
7 changes: 7 additions & 0 deletions configuration/scripts/machines/env.ppp6_intel
@@ -18,6 +18,12 @@ source $ssmuse -d /fs/ssm/main/opt/intelcomp/inteloneapi-2022.1.2/intelcomp+mpi+
# module load -s icc mpi
setenv FOR_DUMP_CORE_FILE 1
setenv I_MPI_DEBUG_COREDUMP 1
# Reproducible collectives
if (${ICE_BASEGEN} != ${ICE_SPVAL} || ${ICE_BASECOM} != ${ICE_SPVAL}) then
setenv I_MPI_CBWR 1
endif
# Stop being buggy
setenv I_MPI_FABRICS ofi
# NetCDF
source $ssmuse -d main/opt/hdf5-netcdf4/serial/shared/inteloneapi-2022.1.2/01

@@ -32,6 +38,7 @@ setenv ICE_MACHINE_MAKE make
setenv ICE_MACHINE_WKDIR ~/data/ppp6/cice/runs/
setenv ICE_MACHINE_INPUTDATA /space/hall6/sitestore/eccc/cmd/e/sice500/
setenv ICE_MACHINE_BASELINE ~/data/ppp6/cice/baselines/
setenv ICE_MACHINE_MAXRUNLENGTH 6
setenv ICE_MACHINE_SUBMIT qsub
setenv ICE_MACHINE_TPNODE 80
setenv ICE_MACHINE_ACCT unused