Actually fix MPI synchronization in real.exe #1268 (#1600)
TYPE: [bug fix]

KEYWORDS: real.exe, MPI, bug fix

SOURCE: Marc Honnorat (EXWEXs)

DESCRIPTION OF CHANGES:
Problem:
The communicator `mpi_comm_allcompute`, created by subroutine `split_communicator` called by `init_modules(1)`, 
is not explicitly activated for the call to `wrf_dm_bcast_bytes( configbuf, nbytes )` in real.exe. On some platforms, 
this may prevent broadcast of namelist configuration (put in `configbuf` after the call to `get_config_as_buffer()`) 
across the MPI processes _before_ the call to `setup_physics_suite()`.

An example of a problematic platform: a cluster of Intel Xeon E5-2650 v4 running on CentOS Linux release 7.6.1810, 
with Intel Parallel Studio XE (various versions, including 2018u3 and 2020u4) and Intel MPI Library (same version).

Solution:
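The initialization step in the WRF executable never triggers the failure described in issue #1267. This PR reuses 
the temporary MPI communicator switch from the WRF code: real.exe now saves the current communicator, activates 
`mpi_comm_allcompute` for the configuration broadcast, and restores the saved communicator afterwards.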
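
The save / switch / restore pattern can be sketched without MPI as follows. This is an illustrative stand-in only, not WRF code: the module `comm_state_sketch`, its routines `get_comm`, `set_comm` and `bcast_config`, and the integer handle values are all hypothetical. In real.exe the corresponding calls are `wrf_get_dm_communicator`, `wrf_set_dm_communicator` and `wrf_dm_bcast_bytes`.

```fortran
! Illustrative sketch only: models the save / switch / restore pattern.
! Every name in this file is hypothetical and nothing here calls MPI.
MODULE comm_state_sketch
   IMPLICIT NONE
   INTEGER, PARAMETER :: comm_default    = 1   ! hypothetical handle values
   INTEGER, PARAMETER :: comm_allcompute = 2
   INTEGER, SAVE      :: active_comm = comm_default
CONTAINS
   SUBROUTINE get_comm( comm )
      ! Return the communicator that is currently active.
      INTEGER, INTENT(OUT) :: comm
      comm = active_comm
   END SUBROUTINE get_comm

   SUBROUTINE set_comm( comm )
      ! Make the given communicator the active one.
      INTEGER, INTENT(IN) :: comm
      active_comm = comm
   END SUBROUTINE set_comm

   SUBROUTINE bcast_config( )
      ! Stand-in for the broadcast: report which communicator is active.
      PRINT '(A,I0)', 'broadcast on communicator ', active_comm
   END SUBROUTINE bcast_config
END MODULE comm_state_sketch

PROGRAM save_restore_demo
   USE comm_state_sketch
   IMPLICIT NONE
   INTEGER :: save_comm

   CALL get_comm( save_comm )          ! remember the caller's communicator
   CALL set_comm( comm_allcompute )    ! activate the compute communicator
   CALL bcast_config( )                ! work that depends on the active communicator
   CALL set_comm( save_comm )          ! restore the original communicator
   CALL bcast_config( )                ! later work sees the original communicator again
END PROGRAM save_restore_demo
```

Restoring the saved handle at the end keeps the switch local to the configuration step, so code that runs afterwards sees the communicator it had before.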

ISSUE: 
Fixes #1267

LIST OF MODIFIED FILES:
M       main/real_em.F

TESTS CONDUCTED: 
1. The modification consistently resolves the problem on the cluster described above.
2. Jenkins tests are all passing.

RELEASE NOTE: A fix for an MPI synchronization bug in the real program, related to (unused) split communicators, resolves issue #1267. Users who have had no trouble running the real program with MPI will see no impact.
honnorat authored Dec 16, 2021
1 parent abff5aa commit a39a94b
Showing 1 changed file with 6 additions and 3 deletions.
9 changes: 6 additions & 3 deletions main/real_em.F
@@ -4,7 +4,7 @@ PROGRAM real_data

 USE module_machine
 #ifdef DM_PARALLEL
-USE module_dm, ONLY : wrf_dm_initialize
+USE module_dm, ONLY : wrf_dm_initialize, mpi_comm_allcompute
 #endif
 USE module_domain, ONLY : domain, alloc_and_configure_domain, &
 domain_clock_set, head_grid, program_name, domain_clockprint, &
@@ -56,9 +56,9 @@ END SUBROUTINE med_read_wrf_chem_bioemiss

 INTEGER :: max_dom, domain_id , grid_id , parent_id , parent_id1 , id
 INTEGER :: e_we , e_sn , i_parent_start , j_parent_start
-INTEGER :: idum1, idum2
+INTEGER :: idum1, idum2
 #ifdef DM_PARALLEL
-INTEGER :: nbytes
+INTEGER :: nbytes, save_comm
 INTEGER, PARAMETER :: configbuflen = 4* CONFIG_BUF_LEN
 INTEGER :: configbuf( configbuflen )
 LOGICAL , EXTERNAL :: wrf_dm_on_monitor
@@ -119,13 +119,16 @@ END SUBROUTINE Setup_Timekeeping
 ! The configuration switches mostly come from the NAMELIST input.

 #ifdef DM_PARALLEL
+CALL wrf_get_dm_communicator( save_comm )
+CALL wrf_set_dm_communicator( mpi_comm_allcompute )
 IF ( wrf_dm_on_monitor() ) THEN
 CALL initial_config
 END IF
 CALL get_config_as_buffer( configbuf, configbuflen, nbytes )
 CALL wrf_dm_bcast_bytes( configbuf, nbytes )
 CALL set_config_as_buffer( configbuf, configbuflen )
 CALL wrf_dm_initialize
+CALL wrf_set_dm_communicator( save_comm )
 #else
 CALL initial_config
 #endif