Compute petlist bounds for each subcomponent from number of tasks. Update CICE #1200

Merged (18 commits) on May 10, 2022
2 changes: 1 addition & 1 deletion CICE-interface/CICE
Submodule CICE updated 114 files
240 changes: 120 additions & 120 deletions tests/RegressionTests_hera.gnu.log

Large diffs are not rendered by default.

798 changes: 399 additions & 399 deletions tests/RegressionTests_hera.intel.log

Large diffs are not rendered by default.

690 changes: 345 additions & 345 deletions tests/RegressionTests_jet.intel.log

Large diffs are not rendered by default.

800 changes: 400 additions & 400 deletions tests/RegressionTests_orion.intel.log

Large diffs are not rendered by default.

534 changes: 267 additions & 267 deletions tests/RegressionTests_wcoss_cray.log

Large diffs are not rendered by default.

1,025 changes: 608 additions & 417 deletions tests/RegressionTests_wcoss_dell_p3.log

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion tests/compile.sh
@@ -47,7 +47,7 @@ BUILD_DIR=$(pwd)/build_${BUILD_NAME}
if [[ $MACHINE_ID == cheyenne.* ]] ; then
BUILD_JOBS=${BUILD_JOBS:-3}
elif [[ $MACHINE_ID == wcoss_dell_p3 ]] ; then
BUILD_JOBS=${BUILD_JOBS:-4}
BUILD_JOBS=${BUILD_JOBS:-2}
source $PATHTR/tests/module-setup.sh
fi

1,073 changes: 532 additions & 541 deletions tests/default_vars.sh

Large diffs are not rendered by default.

14 changes: 7 additions & 7 deletions tests/opnReqTests/dcp.sh
@@ -14,22 +14,22 @@ elif [[ $application == 'regional' ]]; then
if [[ $CI_TEST == 'true' ]]; then
INPES=10
JNPES=3
TASKS=$((INPES*JNPES + WRITE_GROUP*WRTTASK_PER_GROUP))
NTILES=1
TASKS=$((INPES*JNPES*NTILES + WRITE_GROUP*WRTTASK_PER_GROUP))
NODES=$(((TASKS+TPN-1)/TPN))
else
INPES=5
JNPES=12
NTILES=1
fi
elif [[ $application == 'cpld' ]]; then
if [[ $CI_TEST == 'true' ]]; then
INPES=3
JNPES=1
NPROC_ICE=6
med_petlist_bounds="0 17"
atm_petlist_bounds="0 23"
ocn_petlist_bounds="24 33"
ice_petlist_bounds="34 39"
TASKS=$((INPES*JNPES*6 + WRITE_GROUP*WRTTASK_PER_GROUP + 10 + 6))
OCN_tasks=10
ICE_tasks=6
NPROC_ICE=$ICE_tasks
TASKS=$((INPES*JNPES*NTILES + WRITE_GROUP*WRTTASK_PER_GROUP + OCN_tasks + ICE_tasks))
else
temp=$INPES
INPES=$JNPES
2 changes: 1 addition & 1 deletion tests/opnReqTests/fhz.sh
@@ -11,7 +11,7 @@ if [[ $application == 'global' ]]; then
| sed -E "s/GFSPRS.GrbF24 ?//g" \
| sed -e "s/^ *//" -e "s/ *$//")
elif [[ $application == 'cpld' ]]; then
if [[ $TEST_NAME == 'cpld_control_c96_p8' ]] || [[ $TEST_NAME == 'cpld_control_p8' ]]; then
if [[ $TEST_NAME =~ 'cpld_control_c96_p8' ]] || [[ $TEST_NAME =~ 'cpld_control_p8' ]] || [[ $TEST_NAME =~ 'cpld_control_c96_noaero_p8' ]]; then
FHZERO=3
LIST_FILES=$(echo -n $LIST_FILES | sed -E "s/sfcf024.tile[1-6].nc ?//g" \
| sed -E "s/atmf024.tile[1-6].nc ?//g" \
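
The change from `==` to `=~` in the test-name checks switches from an exact comparison to a substring match (in bash, a quoted right-hand side of `=~` is matched literally). The diff does not state the motivation, so the following is only a minimal sketch of the behavioral difference; the suffixed test name is a hypothetical example, not a name taken from this PR.

```bash
#!/bin/bash
# Hypothetical test name carrying a suffix (assumption for illustration only).
TEST_NAME='cpld_control_c96_p8_intel'

# '==' with a quoted right-hand side is an exact string comparison.
if [[ $TEST_NAME == 'cpld_control_c96_p8' ]]; then exact=yes; else exact=no; fi

# '=~' with a quoted right-hand side matches the literal text anywhere in the name.
if [[ $TEST_NAME =~ 'cpld_control_c96_p8' ]]; then substr=yes; else substr=no; fi

# 'cpld_control_p8' is not a substring of the name, so this does not match.
if [[ $TEST_NAME =~ 'cpld_control_p8' ]]; then other=yes; else other=no; fi

echo "exact=$exact substring=$substr other=$other"
```

One consequence worth keeping in mind: with substring matching, the order of `elif` branches matters whenever one test name is contained in another.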
2 changes: 1 addition & 1 deletion tests/opnReqTests/mpi.sh
@@ -10,7 +10,7 @@ if [[ $application == 'global' ]]; then
fi
WRITE_GROUP=2
WRTTASK_PER_GROUP=12
TASKS=$(( INPES*JNPES*6 + WRITE_GROUP*WRTTASK_PER_GROUP ))
TASKS=$(( INPES*JNPES*NTILES + WRITE_GROUP*WRTTASK_PER_GROUP ))
NODES=$(((TASKS+TPN-1)/TPN))
elif [[ $application == 'regional' ]]; then
echo "Regional application not yet implemented for mpi"
15 changes: 7 additions & 8 deletions tests/opnReqTests/std.sh
@@ -6,27 +6,26 @@ if [[ $application == 'global' ]]; then
JNPES=2
WRITE_GROUP=1
WRTTASK_PER_GROUP=12
TASKS=$((INPES*JNPES*6 + WRITE_GROUP*WRTTASK_PER_GROUP))
TASKS=$((INPES*JNPES*NTILES + WRITE_GROUP*WRTTASK_PER_GROUP))
fi
RESTART_N=$(( FHMAX/2 ))
RESTART_INTERVAL="${RESTART_N} -1"
elif [[ $application == 'regional' ]]; then
if [[ $CI_TEST == 'true' ]]; then
INPES=4
JNPES=6
NTILES=1
WRTTASK_PER_GROUP=8
TASKS=$((INPES*JNPES + WRITE_GROUP*WRTTASK_PER_GROUP))
TASKS=$((INPES*JNPES*NTILES + WRITE_GROUP*WRTTASK_PER_GROUP))
fi
elif [[ $application == 'cpld' ]]; then
if [ $CI_TEST == 'true' ]; then
INPES=2
JNPES=2
NPROC_ICE=6
med_petlist_bounds="0 23"
atm_petlist_bounds="0 29"
ocn_petlist_bounds="30 39"
ice_petlist_bounds="40 45"
TASKS=$((INPES*JNPES*6 + WRITE_GROUP*WRTTASK_PER_GROUP + 10 + 6))
OCN_tasks=10
ICE_tasks=6
NPROC_ICE=$ICE_tasks
TASKS=$((INPES*JNPES*NTILES + WRITE_GROUP*WRTTASK_PER_GROUP + OCN_tasks + ICE_tasks))
fi
RESTART_N=$(( FHMAX/2 ))
RESTART_INTERVAL="${RESTART_N} -1"
42 changes: 19 additions & 23 deletions tests/opnReqTests/thr.sh
@@ -5,46 +5,42 @@ THRD=2
TPN=$((TPN/THRD))
if [[ $application == 'global' ]]; then
JNPES=$((JNPES/THRD))
TASKS=$((INPES*JNPES*6 + WRITE_GROUP*WRTTASK_PER_GROUP))
TASKS=$((INPES*JNPES*NTILES + WRITE_GROUP*WRTTASK_PER_GROUP))
NODES=$(((TASKS+TPN-1)/TPN))
elif [[ $application == 'regional' ]]; then
if [[ $CI_TEST == 'true' ]]; then
INPES=4
JNPES=4
TASKS=$((INPES*JNPES + WRITE_GROUP*WRTTASK_PER_GROUP))
NTILES=1
TASKS=$((INPES*JNPES*NTILES + WRITE_GROUP*WRTTASK_PER_GROUP))
fi
NODES=$(((TASKS+TPN-1)/TPN))
elif [[ $application == 'cpld' ]]; then
if [[ $CI_TEST != 'true' ]]; then
if [[ $TEST_NAME == 'cpld_control_c96_p8' ]]; then
if [[ $TEST_NAME =~ 'cpld_control_c96_p8' ]]; then
INPES=3
JNPES=4
med_petlist_bounds="0 71"
chm_petlist_bounds="0 71"
atm_petlist_bounds="0 77"
ocn_petlist_bounds="78 107"
ice_petlist_bounds="108 119"
TASKS=$((INPES*JNPES*6 + WRITE_GROUP*WRTTASK_PER_GROUP + 30 + 12))
OCN_tasks=30
ICE_tasks=12
NPROC_ICE=$ICE_tasks
TASKS=$((INPES*JNPES*NTILES + WRITE_GROUP*WRTTASK_PER_GROUP + OCN_tasks + ICE_tasks))
NODES=$(((TASKS+TPN-1)/TPN))
elif [[ $TEST_NAME == 'cpld_control_c96_noaero_p8' ]]; then
elif [[ $TEST_NAME =~ 'cpld_control_c96_noaero_p8' ]]; then
INPES=3
JNPES=4
med_petlist_bounds="0 71"
atm_petlist_bounds="0 77"
ocn_petlist_bounds="78 107"
ice_petlist_bounds="108 119"
TASKS=$((INPES*JNPES*6 + WRITE_GROUP*WRTTASK_PER_GROUP + 30 + 12))
OCN_tasks=30
ICE_tasks=12
NPROC_ICE=$ICE_tasks
TASKS=$((INPES*JNPES*NTILES + WRITE_GROUP*WRTTASK_PER_GROUP + OCN_tasks + ICE_tasks))
NODES=$(((TASKS+TPN-1)/TPN))
elif [[ $TEST_NAME == 'cpld_control_p8' ]]; then
elif [[ $TEST_NAME =~ 'cpld_control_p8' ]]; then
INPES=3
JNPES=4
med_petlist_bounds="0 71"
chm_petlist_bounds="0 71"
atm_petlist_bounds="0 77"
ocn_petlist_bounds="78 97"
ice_petlist_bounds="98 107"
wav_petlist_bounds="108 119"
TASKS=$((INPES*JNPES*6 + WRITE_GROUP*WRTTASK_PER_GROUP + 20 + 10 + 12))
OCN_tasks=20
ICE_tasks=10
WAV_tasks=12
NPROC_ICE=$ICE_tasks
TASKS=$((INPES*JNPES*NTILES + WRITE_GROUP*WRTTASK_PER_GROUP + OCN_tasks + ICE_tasks + WAV_tasks))
NODES=$(((TASKS+TPN-1)/TPN))
elif [[ $TEST_NAME == 'cpld_bmark_p8' ]]; then
NODES=$(((TASKS+TPN-1)/TPN))
10 changes: 4 additions & 6 deletions tests/opnReqTests/wrt_env.sh
@@ -8,15 +8,13 @@ export RESTART_N=${RESTART_N:-}
export RESTART_INTERVAL="${RESTART_INTERVAL:-}"
export INPES=${INPES}
export JNPES=${JNPES}
export NTILES=${NTILES:-}
export OCN_tasks=${OCN_tasks:-}
export ICE_tasks=${ICE_tasks:-}
export WAV_tasks=${WAV_tasks:-}
export WRITE_GROUP=${WRITE_GROUP}
export WRTTASK_PER_GROUP=${WRTTASK_PER_GROUP}
export NPROC_ICE=${NPROC_ICE:-}
export med_petlist_bounds="${med_petlist_bounds:-}"
export chm_petlist_bounds="${chm_petlist_bounds:-}"
export atm_petlist_bounds="${atm_petlist_bounds:-}"
export ocn_petlist_bounds="${ocn_petlist_bounds:-}"
export ice_petlist_bounds="${ice_petlist_bounds:-}"
export wav_petlist_bounds="${wav_petlist_bounds:-}"
export THRD=${THRD}
export TASKS=${TASKS}
export TPN=${TPN}
1 change: 1 addition & 0 deletions tests/rt.sh
@@ -777,6 +777,7 @@ EOF
(
source ${PATHRT}/tests/$TEST_NAME

TPN=$(( TPN / THRD ))
NODES=$(( TASKS / TPN ))
if (( NODES * TPN < TASKS )); then
NODES=$(( NODES + 1 ))
2 changes: 1 addition & 1 deletion tests/rt_utils.sh
@@ -548,7 +548,7 @@ ecflow_run() {
module load ecflow
echo "Using special Jet ECFLOW start procedure"
MYCOMM="bash -l -c \"module load ecflow && ${ECFLOW_START} -d ${RUNDIR_ROOT}/ecflow_server\""
ssh $ECF_HOST "${MYCOMM}"
ssh $ECF_HOST "${MYCOMM}"
else
${ECFLOW_START} -p ${ECF_PORT} -d ${RUNDIR_ROOT}/ecflow_server
fi
2 changes: 1 addition & 1 deletion tests/run_compile.sh
@@ -89,7 +89,7 @@ cat ${RUNDIR}/job_timestamp.txt >> ${LOG_DIR}/job_${JOB_NR}_timestamp.txt
# End compile job
################################################################################

echo " $( date +%s )" >> ${LOG_DIR}/job_${JOB_NR}_timestamp.txt
echo " $( date +%s ), 1" >> ${LOG_DIR}/job_${JOB_NR}_timestamp.txt

elapsed=$SECONDS
echo "Elapsed time $elapsed seconds. Compile ${COMPILE_NR}"
97 changes: 81 additions & 16 deletions tests/run_test.sh
@@ -23,6 +23,60 @@ write_fail_test() {
exit 1
}

function compute_petbounds() {

# Each test MUST define a ${COMPONENT}_tasks variable for every component it uses.
# For components it does not use, the variable must be left undefined or set to 0.

# ATM is a special case since it runs on the sum of compute and I/O tasks.
# The CHM component and the mediator run on the ATM compute tasks only.

local n=0
unset atm_petlist_bounds ocn_petlist_bounds ice_petlist_bounds wav_petlist_bounds chm_petlist_bounds med_petlist_bounds

# ATM
ATM_io_tasks=${ATM_io_tasks:-0}
if [[ $((ATM_compute_tasks + ATM_io_tasks)) -gt 0 ]]; then
atm_petlist_bounds="${n} $((n + ATM_compute_tasks + ATM_io_tasks -1))"
n=$((n + ATM_compute_tasks + ATM_io_tasks))
fi

# OCN
if [[ ${OCN_tasks:-0} -gt 0 ]]; then
ocn_petlist_bounds="${n} $((n + OCN_tasks - 1))"
n=$((n + OCN_tasks))
fi

# ICE
if [[ ${ICE_tasks:-0} -gt 0 ]]; then
ice_petlist_bounds="${n} $((n + ICE_tasks - 1))"
n=$((n + ICE_tasks))
fi

# WAV
if [[ ${WAV_tasks:-0} -gt 0 ]]; then
wav_petlist_bounds="${n} $((n + WAV_tasks - 1))"
n=$((n + WAV_tasks))
fi

# CHM
chm_petlist_bounds="0 $((ATM_compute_tasks - 1))"

# MED
med_petlist_bounds="0 $((ATM_compute_tasks - 1))"
Review comment from @DeniseWorthen (Collaborator), May 5, 2022:
There seems to be an issue, which Moorthi first raised, with running the mediator on more than 300 tasks. He found execution much slower when too many PEs were given to the MED, so we've been keeping the MED tasks equal to the smaller of the ATM tasks or 288 (PE list 0:287). In your tests, have you seen an impact on runtime when the MED is allowed the same number of tasks as ATM, for the bmrk_aero, for example?

Reply from the PR author (Collaborator):
I didn't look carefully at the runtime of each test, but I ran the full set of tests on gaea, jet and wcoss yesterday and all tests finished successfully. Do you know which tests have MED tasks limited to a subset of ATM tasks? Does this slowdown happen only on a particular machine or on all machines?

Reply (Collaborator):
I tested on cheyenne.intel at 41bf709 and all tests passed. It doesn't seem the number of mediator tasks had any real impact. The three tests where I know we limit the MED tasks to <300 are cpld_bmark_p8, cpld_control_c384 and cpld_control_c192. They were all +/- 20s of the current wall clock times.
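
For reference, the cap discussed in this thread (MED on the smaller of the ATM tasks or 288) is not what the function in this diff does; it places MED on all ATM compute tasks. If such a cap were wanted, it could be sketched roughly as follows. The helper name `med_bounds` and its optional second argument are hypothetical and not part of the PR; 288 is simply the figure quoted in the comment above.

```bash
# Hypothetical helper (not in the PR): print MED petlist bounds, capped.
#   $1 = number of ATM compute tasks
#   $2 = maximum number of MED tasks (assumed default: 288, from the thread)
med_bounds() {
  atm_compute=$1
  med_max=${2:-288}
  if [ "$atm_compute" -lt "$med_max" ]; then
    med=$atm_compute
  else
    med=$med_max
  fi
  echo "0 $(( med - 1 ))"
}

med_bounds 24    # small case: MED uses all 24 ATM compute tasks
med_bounds 384   # larger case: MED is capped at 288 tasks
```

According to the last reply in the thread, removing the cap made no measurable difference on cheyenne.intel, so this is shown only to make the earlier convention concrete.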


UFS_tasks=${n}

echo "ATM_petlist_bounds: ${atm_petlist_bounds:-}"
echo "OCN_petlist_bounds: ${ocn_petlist_bounds:-}"
echo "ICE_petlist_bounds: ${ice_petlist_bounds:-}"
echo "WAV_petlist_bounds: ${wav_petlist_bounds:-}"
echo "CHM_petlist_bounds: ${chm_petlist_bounds:-}"
echo "MED_petlist_bounds: ${med_petlist_bounds:-}"
echo "UFS_tasks : ${UFS_tasks:-}"

}
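
As a sanity check on the layout logic above, the following minimal sketch re-derives the bounds for the CI `cpld` case in std.sh (INPES=2, JNPES=2, OCN_tasks=10, ICE_tasks=6). Two values are assumptions that do not appear in this diff: NTILES=6 and 6 ATM I/O tasks (WRITE_GROUP*WRTTASK_PER_GROUP). Both are inferred from the removed hard-coded bounds "0 23" and "0 29". The helper name `bounds` is hypothetical.

```bash
# Hypothetical helper: print "first last" for a component that starts
# at PET $1 and occupies $2 tasks.
bounds() {
  echo "$1 $(( $1 + $2 - 1 ))"
}

# Values from std.sh (CI cpld case) plus two assumed values, see lead-in.
INPES=2; JNPES=2
NTILES=6            # assumption
ATM_io_tasks=6      # assumption: WRITE_GROUP * WRTTASK_PER_GROUP
OCN_tasks=10
ICE_tasks=6

ATM_compute_tasks=$(( INPES * JNPES * NTILES ))

# Components are laid out back to back, in the order ATM, OCN, ICE.
n=0
atm_petlist_bounds=$(bounds "$n" $(( ATM_compute_tasks + ATM_io_tasks )))
n=$(( n + ATM_compute_tasks + ATM_io_tasks ))
ocn_petlist_bounds=$(bounds "$n" "$OCN_tasks")
n=$(( n + OCN_tasks ))
ice_petlist_bounds=$(bounds "$n" "$ICE_tasks")
n=$(( n + ICE_tasks ))

# MED shares the ATM compute tasks and does not advance the counter.
med_petlist_bounds=$(bounds 0 "$ATM_compute_tasks")
UFS_tasks=$n

echo "atm: $atm_petlist_bounds"
echo "ocn: $ocn_petlist_bounds"
echo "ice: $ice_petlist_bounds"
echo "med: $med_petlist_bounds"
echo "UFS_tasks: $UFS_tasks"
```

Under these assumptions the computed bounds agree with the values that std.sh previously hard-coded (med "0 23", atm "0 29", ocn "30 39", ice "40 45").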

if [[ $# != 5 ]]; then
echo "Usage: $0 PATHRT RUNDIR_ROOT TEST_NAME TEST_NR COMPILE_NR"
exit 1
@@ -107,8 +161,25 @@ fi

atparse < ${PATHRT}/parm/${MODEL_CONFIGURE:-model_configure.IN} > model_configure

if [[ $DATM_CDEPS = 'false' ]]; then
if [[ ${ATM_compute_tasks:-0} -eq 0 ]]; then
ATM_compute_tasks=$((INPES * JNPES * NTILES))
fi
if [[ $QUILTING = '.true.' ]]; then
ATM_io_tasks=$((WRITE_GROUP * WRTTASK_PER_GROUP))
fi
fi

compute_petbounds

atparse < ${PATHRT}/parm/${NEMS_CONFIGURE:-nems.configure} > nems.configure

# remove after all tests pass
if [[ $TASKS -ne $UFS_tasks ]]; then
echo "$TASKS -ne $UFS_tasks "
exit 1
fi

if [[ "Q${INPUT_NEST02_NML:-}" != Q ]] ; then
INPES_NEST=$INPES_NEST02; JNPES_NEST=$JNPES_NEST02
NPX_NEST=$NPX_NEST02; NPY_NEST=$NPY_NEST02
@@ -213,26 +284,20 @@ if [[ $DOCN_CDEPS = 'true' ]]; then
atparse < ${PATHRT}/parm/${DOCN_STREAM_CONFIGURE:-docn.streams.IN} > docn.streams
fi

TPN=$(( TPN / THRD ))
if (( TASKS < TPN )); then
TPN=${TASKS}
fi
NODES=$(( TASKS / TPN ))
if (( NODES * TPN < TASKS )); then
NODES=$(( NODES + 1 ))
fi

if [[ $SCHEDULER = 'pbs' ]]; then
NODES=$(( TASKS / TPN ))
if (( NODES * TPN < TASKS )); then
NODES=$(( NODES + 1 ))
fi
atparse < $PATHRT/fv3_conf/fv3_qsub.IN > job_card
elif [[ $SCHEDULER = 'slurm' ]]; then
NODES=$(( TASKS / TPN ))
if (( NODES * TPN < TASKS )); then
NODES=$(( NODES + 1 ))
fi
atparse < $PATHRT/fv3_conf/fv3_slurm.IN > job_card
elif [[ $SCHEDULER = 'lsf' ]]; then
if (( TASKS < TPN )); then
TPN=${TASKS}
fi
NODES=$(( TASKS / TPN ))
if (( NODES * TPN < TASKS )); then
NODES=$(( NODES + 1 ))
fi
atparse < $PATHRT/fv3_conf/fv3_bsub.IN > job_card
fi
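
The node-count arithmetic in this hunk (divide TPN by the thread count, clamp TPN to TASKS, then round the node count up) can be tried in isolation. This is a minimal sketch only: the helper name `node_layout` and the input values are hypothetical and are not taken from any machine configuration in the repository.

```bash
# Hypothetical helper mirroring the arithmetic in the hunk above.
#   $1 = TASKS, $2 = TPN (before threading), $3 = THRD
# Prints "TPN NODES".
node_layout() {
  tasks=$1
  tpn=$2
  thrd=$3
  tpn=$(( tpn / thrd ))             # fewer tasks fit per node when threaded
  if [ "$tasks" -lt "$tpn" ]; then
    tpn=$tasks                      # never request more slots than tasks
  fi
  nodes=$(( tasks / tpn ))
  if [ $(( nodes * tpn )) -lt "$tasks" ]; then
    nodes=$(( nodes + 1 ))          # round up for a partially filled node
  fi
  echo "$tpn $nodes"
}

node_layout 46 40 2   # 46 tasks, 40 slots per node, 2 threads
node_layout 12 40 1   # job smaller than one node
```

The round-up step gives the same result as the `(TASKS+TPN-1)/TPN` form used in the opnReqTests scripts.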

@@ -281,7 +346,7 @@ fi
# End test
################################################################################

echo " $( date +%s )" >> ${LOG_DIR}/job_${JOB_NR}_timestamp.txt
echo " $( date +%s ), ${NODES}" >> ${LOG_DIR}/job_${JOB_NR}_timestamp.txt

################################################################################
# Remove RUN_DIRs if they are no longer needed by other tests
1 change: 0 additions & 1 deletion tests/tests/control_2threads
@@ -65,7 +65,6 @@ export IOVR=3

export THRD=2
export TASKS=$TASKS_thrd
export TPN=$TPN_thrd
export INPES=$INPES_thrd
export JNPES=$JNPES_thrd
export WRTTASK_PER_GROUP=6
1 change: 0 additions & 1 deletion tests/tests/control_2threads_debug
@@ -30,7 +30,6 @@ export OUTPUT_FH="0 1"

export THRD=2
export TASKS=$TASKS_thrd
export TPN=$TPN_thrd
export INPES=$INPES_thrd
export JNPES=$JNPES_thrd
export WRTTASK_PER_GROUP=6
1 change: 0 additions & 1 deletion tests/tests/control_2threads_p8
@@ -62,7 +62,6 @@ export WRITE_DOPOST=.true.

export THRD=2
export TASKS=$TASKS_thrd
export TPN=$TPN_thrd
export INPES=$INPES_thrd
export JNPES=$JNPES_thrd
export WRTTASK_PER_GROUP=6
2 changes: 0 additions & 2 deletions tests/tests/control_atm_aerosols
@@ -73,9 +73,7 @@ export DNATS=3
export CPL=.true.
export CPLCHM=.true.
export atm_model='fv3'
export atm_petlist_bounds="0 149"
export chm_model='gocart'
export chm_petlist_bounds="0 143"
export coupling_interval_sec=${DT_ATMOS}
export NEMS_CONFIGURE="nems.configure.atm_aerosols.IN"

4 changes: 1 addition & 3 deletions tests/tests/control_atmwav
@@ -70,14 +70,12 @@ export WW3RSTDTHR=3
export DT_2_RST="$(printf "%02d" $(( ${WW3RSTDTHR}*3600 )))"

export TASKS=$TASKS_cpl_atmw
export TPN=$TPN_cpl_atmw
export INPES=$INPES_cpl_atmw
export JNPES=$JNPES_cpl_atmw
export THRD=$THRD_cpl_atmw
export WRTTASK_PER_GROUP=$WPG_cpl_atmw

export atm_petlist_bounds=$APB_cpl_atmw
export wav_petlist_bounds=$WPB_cpl_atmw
WAV_tasks=${WAV_tasks_atmw}

export CPL=.true.
export CPLWAV=.true.
2 changes: 1 addition & 1 deletion tests/tests/control_c384
@@ -19,7 +19,7 @@ export LIST_FILES="sfcf000.nc \

export_fv3
export TASKS=${TASKS_c384}
export TPN=${TPN_c384}
export THRD=${THRD_c384}
export INPES=${INPES_c384}
export JNPES=${JNPES_c384}
export WRITE_GROUP=1