From a39a94bcdc61f7b0b1cb763f1579271e9bc79320 Mon Sep 17 00:00:00 2001
From: Marc Honnorat
Date: Thu, 16 Dec 2021 20:00:55 +0100
Subject: [PATCH] Actually fix MPI synchronization in real.exe #1268 (#1600)

TYPE: [bug fix]

KEYWORDS: real.exe, MPI, bug fix

SOURCE: Marc Honnorat (EXWEXs)

DESCRIPTION OF CHANGES:
Problem:
The communicator `mpi_comm_allcompute`, created by subroutine `split_communicator` called by `init_modules(1)`, is not explicitly activated for the call to `wrf_dm_bcast_bytes( configbuf, nbytes )` in real.exe. On some platforms, this may prevent broadcast of namelist configuration (put in `configbuf` after the call to `get_config_as_buffer()`) across the MPI processes _before_ the call to `setup_physics_suite()`.

An example of a problematic platform: a cluster of Intel Xeon E5-2650 v4 running on CentOS Linux release 7.6.1810, with Intel Parallel Studio XE (various versions, including 2018u3 and 2020u4) and Intel MPI Library (same version).

Solution:
The initialization step used in the WRF executable never triggers a failure as described in issue #1267. This PR reuses the temporary MPI context switch from WRF code.

ISSUE:
Fixes #1267

LIST OF MODIFIED FILES:
M main/real_em.F

TESTS CONDUCTED:
1. The modification systematically solves the problem on the noted cluster.
2. Jenkins tests are all passing.

RELEASE NOTE: A fix for an MPI synchronization bug related to (not used) split communicators in the real program provides a solution to issue #1267. For users that have had no troubles with the real program running MPI, this will have no impact.
---
 main/real_em.F | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/main/real_em.F b/main/real_em.F
index 6ae526f00c..704e372bb5 100644
--- a/main/real_em.F
+++ b/main/real_em.F
@@ -4,7 +4,7 @@ PROGRAM real_data
 
    USE module_machine
 #ifdef DM_PARALLEL
-   USE module_dm, ONLY : wrf_dm_initialize
+   USE module_dm, ONLY : wrf_dm_initialize, mpi_comm_allcompute
 #endif
    USE module_domain, ONLY : domain, alloc_and_configure_domain, &
         domain_clock_set, head_grid, program_name, domain_clockprint, &
@@ -56,9 +56,9 @@ END SUBROUTINE med_read_wrf_chem_bioemiss
    INTEGER :: max_dom, domain_id , grid_id , parent_id , parent_id1 , id
    INTEGER :: e_we , e_sn , i_parent_start , j_parent_start
 
-   INTEGER :: idum1, idum2
+   INTEGER :: idum1, idum2
 #ifdef DM_PARALLEL
-   INTEGER :: nbytes
+   INTEGER :: nbytes, save_comm
    INTEGER, PARAMETER :: configbuflen = 4* CONFIG_BUF_LEN
    INTEGER :: configbuf( configbuflen )
    LOGICAL , EXTERNAL :: wrf_dm_on_monitor
@@ -119,6 +119,8 @@ END SUBROUTINE Setup_Timekeeping
    !  The configuration switches mostly come from the NAMELIST input.
 
 #ifdef DM_PARALLEL
+   CALL wrf_get_dm_communicator( save_comm )
+   CALL wrf_set_dm_communicator( mpi_comm_allcompute )
    IF ( wrf_dm_on_monitor() ) THEN
       CALL initial_config
    END IF
@@ -126,6 +128,7 @@ END SUBROUTINE Setup_Timekeeping
    CALL wrf_dm_bcast_bytes( configbuf, nbytes )
    CALL set_config_as_buffer( configbuf, configbuflen )
    CALL wrf_dm_initialize
+   CALL wrf_set_dm_communicator( save_comm )
 #else
    CALL initial_config
 #endif