Resource updates to support WCOSS2 #1070

@KateFriedman-NOAA KateFriedman-NOAA commented Oct 13, 2022

Description

This PR primarily brings back resource updates from testing on WCOSS2. These updates have been tested on Orion and further adjusted as needed. Testing on Hera continues but is ok so far. There are also changes to add BASE_CPLIC for WCOSS2.

Changes:

  1. pull in some GFSv16.3 adjustments (mainly from NCO reverting memory increases)
  2. add WCOSS2 BASE_CPLIC to config.coupled_ic
  3. add make_nsstbufr and make_nsstbufr to wcoss2.yaml hosts file to match S4 updates to other hosts files
  4. remove errant ) in wcoss2.yaml hosts file
  5. updates to config.fv3
  • add WCOSS2 npe_node_max
  • change WRTTASK_PER_GROUP from $npe_node_max to 64 for every resolution (half of node size on WCOSS2)
  • add check for WRTTASK_PER_GROUP to set it back to $npe_node_max if 64 is greater than $npe_node_max on machine (Hera/Orion)
  • change C192 nth_fv3 from 2 to 1 (2 fails on WCOSS2 but 1 works everywhere, further examine this on WCOSS2 during optimization work)
  • change C384 WRITE_GROUP from 1 to 2 (from WCOSS2 GFSv16 port work, works everywhere)
  • move C384 from netcdf_parallel to netcdf (serial) case check at bottom; need to run C384 with serial netcdf on WCOSS2 (needs further investigation)
  6. updates to config.resources
  • add WCOSS2 npe_node_max
  • add {} around variables
  • replace --exclusive with is_exclusive=True
  • update prep job to run exclusively on WCOSS2 but set 40GB memory elsewhere
  • pull in WCOSS2 GFSv16 port resource updates:
    • add _gfs versions of some job resource variables where needed to define gdas and gfs suite versions of jobs differently
    • adjust some walltimes
    • add npe_node_cycle where needed for WCOSS2 APRUN_CYCLE launcher commands
    • add memory settings for non-exclusive jobs (setting is_exclusive=True otherwise)
    • update some npe_node_* values to be either npe_node_max / thread count or, if the result is greater than npe_node_max, npe_node_max itself (for R&Ds)
    • reduce npe_wavepostbndpnt from 280 to 240
    • increase npe_wavepostbndpntbll from 280 to 448
    • reduce npe_wavepostpnt from 280 to 200
    • change npe_wavegempak from npe_node_max to 1
    • change npe_waveawipsbulls from npe_node_max to 1
    • change npe_waveawipsgridded from npe_node_max to 1
    • adjust C768 and C384 analysis job resources
    • reduce npe_analdiag from 112 to 96 and add note # Should be at least twice npe_ediag
    • increase npe_gldas from 96 to 112 and change npe_node_gaussian to be npe_node_max / nth_gaussian
    • increase npe_post from 112 to 126 and set npe_node_post* values to npe_post
    • change npe_wafsgrib2 from 1 to 18
    • change npe_wafsgrib20p25 from 1 to 11
    • set memory_echgres to 200GB on WCOSS2
    • reduce npe_ediag from 56 to 48
    • add WCOSS2 blocks to eupd section, use ops resources for C768 and C384; optimization is needed; also update Hera settings to stop using 40 threads (npe_node_max)
    • reduce nth_ecen from 6 to 4 on WCOSS2 and Orion
    • reduce nth_epos from 6 to 4 on WCOSS2 and Orion
    • update postsnd resources everywhere
    • reduce awips resources to 1 npe, 1 npe_node, and 1 thread
    • update gempak resources

Resource changes in generated xmls on R&Ds:

  • all
    • post job nodes reduced from 10 to 4
    • some ppn values reduced from 40 to a smaller non-npe_node_max value (no node # changes for these jobs though)
    • gdasfcst nodes reduced from 10 to 5
    • gdasepos nodes reduced from 14 to 8 on Orion
  • coupled C384
    • gfswavepostpnt nodes reduced from 7 to 4
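
The node counts above follow from task counts and tasks-per-node values. A minimal sketch, assuming the generated node count is simply the task count divided by tasks per node, rounded up (`nodes_needed` is a hypothetical helper, not part of the workflow):

```bash
#!/usr/bin/env bash
# Hypothetical helper: nodes required for a job, rounding up.
nodes_needed() {
  local npe=$1   # total tasks for the job
  local ppn=$2   # tasks placed on each node
  echo $(( (npe + ppn - 1) / ppn ))
}

# 126 tasks at 40 tasks per node (values mentioned in this PR for post and the R&D node size).
echo "nodes: $(nodes_needed 126 40)"
```

With these inputs the sketch prints `nodes: 4`, which matches the post job node count listed above.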

Notes:

  1. non-atmos-only and non-coupled jobs were not tested on WCOSS2; downstream ops-only jobs (e.g. gempak/awips) were also not tested for this PR
  2. work is still needed for METplus and fit2obs on WCOSS2
  3. S2SWA is not supported on WCOSS2 (waiting for mapl)
  4. ocnpost does not work on WCOSS2

Type of change

Updates for supporting WCOSS2.

How Has This Been Tested?

  • Clone and Build tests on WCOSS2, Hera, and Orion
  • Cycled atmos-only C768C384L127 tests on WCOSS2
  • Cycled atmos-only C384C192L127 and C192C96L127 tests on WCOSS2, Hera, Orion
  • Forecast-only S2SW on WCOSS2 and Orion

Some continued testing is happening on Hera, Orion, and WCOSS2.

Refs #419

- WCOSS2 BASE_CPLIC is /lfs/h2/emc/global/noscrub/emc.global/IC/COUPLED

Refs NOAA-EMC#419
- Add npe_node_max=128 for WCOSS2 in machine if-block.
- Change WRTTASK_PER_GROUP and WRTTASK_PER_GROUP_GFS to be 64 for all
supported resolutions (good value for WCOSS2) but then add check for
whether WRTTASK_PER_GROUP* variables are greater than npe_node_max
value (for R&Ds) and if so, set WRTTASK_PER_GROUP* values to equal the
npe_node_max value (prior setting).
- Change C384 DELTIM to 200 (current GFSv16 ops value).
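
The WRTTASK_PER_GROUP check described in this commit can be sketched roughly as follows. This is an illustration only, not the code in config.fv3; `cap_to_node_max` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: return the requested value, capped at the node size.
cap_to_node_max() {
  local requested=$1
  local node_max=$2
  if (( requested > node_max )); then
    echo "${node_max}"
  else
    echo "${requested}"
  fi
}

# 128 is the WCOSS2 node size from this commit; 40 is the R&D value mentioned elsewhere in the PR.
npe_node_max=40
WRTTASK_PER_GROUP=$(cap_to_node_max 64 "${npe_node_max}")
export WRTTASK_PER_GROUP
echo "WRTTASK_PER_GROUP=${WRTTASK_PER_GROUP}"
```

With `npe_node_max=40` the value falls back to 40 (the prior setting); with `npe_node_max=128` it stays at 64.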

Refs NOAA-EMC#419
- Rename the resource variables in config.defaults.s2sw to include
"_s2sw" in the names.
- Update the C384 block of config.fv3 to use the "_s2sw" variables for
the relevant resource variable values (when set).
- When not running coupled S2SW the "_s2sw" variables will not be set
and the C384 block of config.fv3 will not use them but instead use the
appropriate values for non-coupled.
- Resolves issue where S2SW resource values were being forced onto the
C384 atmos-only enkf forecast jobs.

Refs NOAA-EMC#419
- When job should run exclusively, set is_exclusive=True.
- Replace native_*="--exclusive" with is_exclusive=True.
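
As a rough illustration of the pattern, the prep job from the PR description (exclusive on WCOSS2, 40GB memory elsewhere) might be expressed like this. The function name `set_prep_resources` and the variable name `memory_prep` are assumptions made for this sketch; only `is_exclusive=True` and the 40GB figure come from the PR text.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: request a whole node on WCOSS2, a memory amount elsewhere.
set_prep_resources() {
  local machine=$1
  unset is_exclusive memory_prep
  if [[ "${machine}" == "WCOSS2" ]]; then
    export is_exclusive=True      # replaces a native "--exclusive" scheduler flag
  else
    export memory_prep="40GB"     # assumed variable name for the non-exclusive case
  fi
}

set_prep_resources "WCOSS2"
echo "is_exclusive=${is_exclusive:-unset} memory_prep=${memory_prep:-unset}"
```

Setting a flag rather than a scheduler-specific option leaves it to the XML generator to translate the request for each batch system.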

Refs NOAA-EMC#419
- Pull in wave job resources from GFSv16 port.
- Adjust walltimes, tasks, add memory, and add "_gfs" versions of
resource variables as needed.

Refs NOAA-EMC#419
- Based on GFSv16 port to WCOSS2.
- Add memory settings for both jobs.
- Adjust tasks and threads for both jobs.
- Add "_gfs" variable versions for gempak job.

Refs NOAA-EMC#419
- Pull in WCOSS2 GFSv16 ops port resource updates into config.resources.

Refs NOAA-EMC#419
- Set $npe_node_esfc to $npe_esfc instead of $npe_node_max; then check if
$npe_node_esfc is greater than $npe_node_max and set to $npe_node_max if
so.
- Set $npe_node_cycle based on $npe_node_max divided by $nth_cycle.
- Add memory setting.
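
A minimal sketch of the tasks-per-node arithmetic described here, using bash integer arithmetic instead of the `bc` pipeline that appears in the PR diff. `tasks_per_node` is a hypothetical helper, and the values of `npe_esfc` and `nth_cycle` below are chosen only for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: tasks that fit on a node when each task uses several threads.
tasks_per_node() {
  local node_max=$1
  local threads=$2
  echo $(( node_max / threads ))   # integer division, remainder discarded
}

npe_node_max=128   # WCOSS2 node size from this PR
nth_cycle=4        # illustrative thread count
npe_esfc=80        # illustrative task count

npe_node_cycle=$(tasks_per_node "${npe_node_max}" "${nth_cycle}")

# Start from the job's task count, then cap at the node size.
npe_node_esfc=${npe_esfc}
if (( npe_node_esfc > npe_node_max )); then
  npe_node_esfc=${npe_node_max}
fi

export npe_node_cycle npe_node_esfc
echo "npe_node_cycle=${npe_node_cycle} npe_node_esfc=${npe_node_esfc}"
```

On a 40-core node the same inputs would give `npe_node_cycle=10` and cap `npe_node_esfc` at 40.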

Refs NOAA-EMC#419
- Change nth_ecen from 6 to 4.
- Add setting for $npe_node_cycle based on $npe_node_max divided by
$nth_cycle.

Refs NOAA-EMC#419
- Remove HERA block for C768.
- Add WCOSS2 block for C768; use GFSv16 ops settings.
- Adjust other resolution HERA blocks to remove npe_eupd settings (use
defaults for all machines) and reduce nth_eupd values to more
appropriate values other than 40 (npe_node_max on R&Ds).

Refs NOAA-EMC#419
- Reduce npe_ediag from 56 to 48.
- Add memory setting.

Refs NOAA-EMC#419
- Add memory settings for wafs jobs.
- Adjust task values based on GFSv16 ops port.
- Set $npe_node_wafs* to $npe_wafs* values.

Refs NOAA-EMC#419
- Add "_gfs" variables.
- Reduce walltimes.
- Increase npe_post from 112 to 126 (GFSv16 ops port value).
- Check if $npe_node_post[_gfs] is greater than $npe_node_max and set it
to $npe_node_max if so.

Refs NOAA-EMC#419
- Increase $npe_gldas from 96 to 112.
- Set $npe_node_gldas to $npe_gldas instead of $npe_node_max.
- Set $npe_node_gaussian based on $npe_node_max divided by
$nth_gaussian.

Refs NOAA-EMC#419
- Reduce $npe_analdiag from 112 to 96; add note that $npe_analdiag should
be at least twice npe_ediag.
- Set $npe_node_analdiag to $npe_analdiag.
- Check if $npe_node_analdiag is greater than $npe_node_max and set to
$npe_node_max if so.
- Add memory setting.

Refs NOAA-EMC#419
- If $npe_node_gldas is greater than $npe_node_max then set it to
$npe_node_max.

Refs NOAA-EMC#419
- Bring in GFSv16 ops port values for C768.
- Also adjust C384 values.
- Add "_gfs" versions of variables so they can be set separately.
- Set $npe_node_cycle based on $npe_node_max divided by $nth_cycle.

Refs NOAA-EMC#419
…s2-resources

* upstream/develop:
  Fix companion ocean resolution for C48 (NOAA-EMC#1066)
  Add trailing slash for gldas topo path (NOAA-EMC#1064)
  Limit number of CPU for post (NOAA-EMC#1061)
  Fix eupd trace (NOAA-EMC#1057)
  Port to S4 (NOAA-EMC#1023)
  Update to obsproc.v1.0.2 and prepobs.v1.0.1 (NOAA-EMC#1049)
  Add GDAS to the partial build list (NOAA-EMC#1050)
  Fix group number being treated as octal in gdas arch (NOAA-EMC#1053)
  Remove trace from link script (NOAA-EMC#1046)

@github-advanced-security github-advanced-security bot left a comment

shellcheck found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

- Update C384 nth_fv3 from 2 to 1.
- Remove added "_s2sw" text to S2SW and C384 resource variables.

Refs NOAA-EMC#419

@WalterKolczynski-NOAA WalterKolczynski-NOAA left a comment

Confused about these npe_node_* changes in general. Values should never be larger than the core size, otherwise you might put multiple ranks on the same CPU.

parm/config/config.resources (outdated)

@aerorahul aerorahul left a comment

most of the changes look good.
a couple of questions.

parm/config/config.resources
- Pull in GFSv16.3 updates from NCO and into ecf scripts.
- Adjust npe_node variables for some jobs to better adjust to different
npe_node_max values.
- Address some linter warnings about {}.

Refs NOAA-EMC#419

@aerorahul aerorahul left a comment

I think I need more explanation of the hardwiring of 64 across all platforms.

parm/config/config.fv3
KateFriedman-NOAA and others added 3 commits October 20, 2022 09:36
- Received an OOM kill when testing 60GB. Increased back up to 80GB and the job completed.
- This is the only job still encountering a "Cgroup mem" warning however. Further investigation is needed.

Refs NOAA-EMC#419
export nth_ediag=1
export npe_node_ediag=${npe_node_max}
export npe_node_ediag=$(echo "${npe_node_max} / ${nth_ediag}" | bc)

Check warning (Code scanning / shellcheck): Declare and assign separately to avoid masking return values.
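
What this warning means can be seen in a small sketch. When `export` and a command substitution share one line, the exit status that survives is that of `export`, so a failure inside the substitution goes unnoticed. `failing_calc` below is a hypothetical stand-in for any command that can fail, such as the `bc` pipeline flagged above.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for a calculation that fails.
failing_calc() { return 1; }

# Combined form: $? reflects export, which succeeds, so the failure is masked.
export masked=$(failing_calc)
status_masked=$?

# Separate form: the assignment carries the exit status of the substitution.
status_visible=0
visible=$(failing_calc) || status_visible=$?
export visible

echo "combined: ${status_masked}, separate: ${status_visible}"
```

This prints `combined: 0, separate: 1`. Applied to the line flagged above, the separated form would assign `npe_node_ediag` first and then run `export npe_node_ediag` on its own line.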
if [[ ${CASE} = "C384" || ${CASE} = "C192" || ${CASE} = "C96" || ${CASE} = "C48" ]]; then export nth_ecen=2; fi
export npe_node_ecen=$(echo "${npe_node_max} / ${nth_ecen}" | bc)
export nth_cycle=${nth_ecen}
export npe_node_cycle=$(echo "${npe_node_max} / ${nth_cycle}" | bc)

Check warning (Code scanning / shellcheck): Declare and assign separately to avoid masking return values.
export nth_esfc=1
export npe_node_esfc=$(echo "${npe_node_max} / ${nth_esfc}" | bc)

Check warning (Code scanning / shellcheck): Declare and assign separately to avoid masking return values.
export nth_cycle=${nth_esfc}
export npe_node_cycle=$(echo "${npe_node_max} / ${nth_cycle}" | bc)

Check warning (Code scanning / shellcheck): Declare and assign separately to avoid masking return values.
if [ ${OUTPUT_FILE} == "nemsio" ]; then
export npe_postsnd=13
export npe_node_postsnd=4
fi
if [[ ${machine} = "HERA" ]]; then export npe_node_postsnd=2; fi
if [[ "$(echo "${npe_node_postsnd} * ${nth_postsnd}" | bc)" -gt "${npe_node_max}" ]]; then
export npe_node_postsnd=$(echo "${npe_node_max} / ${nth_postsnd}" | bc)

Check warning (Code scanning / shellcheck): Declare and assign separately to avoid masking return values.
Matches updates made to other hosts files recently.

Refs NOAA-EMC#419
- Aerosols not yet supported on WCOSS2 so can't build with S2SWA app there.
- Force UFS to build with the S2SW app on WCOSS2 for now.

Refs NOAA-EMC#419
- Update WCOSS2 hosts yaml file to set `hpssarch` to "NO" by default.
- Bandwidth between WCOSS2 and HPSS is limited, so users should not archive to HPSS unless necessary.

Refs NOAA-EMC#419
@KateFriedman-NOAA KateFriedman-NOAA merged commit 5c03697 into NOAA-EMC:develop Oct 24, 2022
KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this pull request Jan 30, 2023
* develop:
  Correct issue in linking final restart files (NOAA-EMC#1285)
  Remove execute permissions from config files (NOAA-EMC#1281)
  Make needed updates to run forecast from GEFS (NOAA-EMC#1203)
  Remove unnecessary variables which reference to nemsio (NOAA-EMC#1259)
  Create analysis files for early-cycle EnKF by default (NOAA-EMC#1237)
  Don't wipe $DATA before running ocean bmat (NOAA-EMC#1280)
  More marine DA j-jobs (NOAA-EMC#1270)
  Update UFS-DA atmospheric prep script to be consistent with GDASApp update (NOAA-EMC#1265)
  Add new jjob for ocean analysis bmat (NOAA-EMC#1239)
  Retire ecf/versions in develop (NOAA-EMC#1267)
  Deploy documentation to RTD (NOAA-EMC#1264)
  Temporarily disable failing pytest (NOAA-EMC#1263)
  Remove incorrect/misleading comments in config.base (NOAA-EMC#1261)
  Add initial Sphinx documentation (NOAA-EMC#1258)
  Remove nemsio support (NOAA-EMC#1255)
  Increase wallclock for diag jobs (NOAA-EMC#1216)
  Use correct resources for GFS gempak (NOAA-EMC#1214)
  Abstract common j-job tasks (NOAA-EMC#1230)
  Add missing mkgfsawps.x link (NOAA-EMC#1218)
  Fix post sounding job (NOAA-EMC#1212)
  Revert "Use fracoro data for all new UFS applications (NOAA-EMC#1182)" (NOAA-EMC#1240)
  Use fracoro data for all new UFS applications (NOAA-EMC#1182)
  Revert "Merge GFS v16.3 operational GSI changes into develop branch. (NOAA-EMC#1158)" (NOAA-EMC#1238)
  Add more user defined parameters for the marine DA (NOAA-EMC#1235)
  Update pytests action version and run sequentially (NOAA-EMC#1236)
  Add utility to compare Fortran namelists (NOAA-EMC#1234)
  Updates for pygw (NOAA-EMC#1231)
  Merge GFS v16.3 operational GSI changes into develop branch. (NOAA-EMC#1158)
  Move member up in directory hierarchy (NOAA-EMC#1201)
  Enable staging ics for cycled experiments. (NOAA-EMC#1199)
  Add tests for configuration.py (NOAA-EMC#1192)
  Replace ocnanal_${CDATE}} with ${RUN}ocnanal_${cyc} (NOAA-EMC#1191)
  define NET and RUN in the Rocoto XML to accurately mimic the ecf in ecflow (NOAA-EMC#1193)
  Fix checking for restart files (NOAA-EMC#1186)
  Fix 'DEBUG' option in build_ufs.sh (NOAA-EMC#1188)
  Update archive job memory request value for R&Ds (NOAA-EMC#1183)
  Reorder post so all flux files are generated when running offline (NOAA-EMC#1181)
  Stop checking for restarts on non-GFS CDUMPs (NOAA-EMC#1179)
  Add missing jobids in some pre-job scripts (NOAA-EMC#1176)
  Remove existing directory if it exists when getic runs (NOAA-EMC#1165)
  Add logging decorator, test and test for yaml_file (NOAA-EMC#1178)
  fix coding norm check in `hosts.py` (NOAA-EMC#1174)
  Fix some bugs and make other changes so ctest in GDASApp works (NOAA-EMC#1172)
  Support for the GDASApp testing in containers (NOAA-EMC#1151)
  ATM 3DVAR with and without IAU (NOAA-EMC#1113)
  Enable checking for python norms and fix violating code (NOAA-EMC#1168)
  Enforce decimal math in atmos post (NOAA-EMC#1171)
  Update marine DA j-jobs to new format (NOAA-EMC#1149)
  Add utility to manipulate files en masse  (NOAA-EMC#1166)
  add action to run pytests (NOAA-EMC#1167)
  Pin `differential-shellcheck` to `v3` tag (NOAA-EMC#1162)
  Add a task base class and basic logger (NOAA-EMC#1160)
  Recursively convert dict to AttrDict when making an AttrDict (NOAA-EMC#1154)
  move configuration.py to pygw. Use it from there.  return AttrDict after sourcing configs (NOAA-EMC#1153)
  JEDI based Marine DA tasks (NOAA-EMC#1134)
  Allow customizations based on user/configuration (NOAA-EMC#1146)
  First step towards making j-jobs consistent in use from ecflow and rocoto (NOAA-EMC#1120)
  enable APP=S2SWA on WCOSS2 (NOAA-EMC#1142)
  Fix typo in .shellcheckrc
  Remove prod_envir module load from WCOSS2 (NOAA-EMC#1138)
  Link staged GSI fix files instead of cloning them from gerrit (NOAA-EMC#1132)
  Address shellcheck warnings in env files (NOAA-EMC#1136)
  Adds group size and nmem for GEFS (NOAA-EMC#1127)
  Remove unnecessary sCDATE assignment in forecast_predet.sh (NOAA-EMC#1133)
  Convert archive jobs to proper j-jobs (NOAA-EMC#1115)
  Update C48 forecast to run with one thread (NOAA-EMC#1131)
  Improved error messages from atmos analysis (NOAA-EMC#1125)
  Update MODULEPATH for Orion (NOAA-EMC#1126)
  MPMD variable updates and fix (NOAA-EMC#1124)
  Introduce FHMAX_ENKF_GFS to extending ensemble forecast capabilities (NOAA-EMC#1122)
  Update R&D launcher commands for tasks and multi-prog (NOAA-EMC#1112)
  Correct crtm path in UFS DA atmospheric analysis scripts (NOAA-EMC#1111)
  Correct syntax in remaining sorc scripts (NOAA-EMC#1105)
  Add GSI background error covariance as an option for UFS DA variational assimilation (NOAA-EMC#1104)
  Add Early Cycle EnKF workflow (NOAA-EMC#1022)
  Correct errors with gdas and monitoring symlinks (NOAA-EMC#1101)
  Fixed gfs-utils links (NOAA-EMC#1099)
  Fix build scripts and bring into compliance (NOAA-EMC#1096)
  Feature/updates for gdas app (NOAA-EMC#1091)
  Change GLDAS USE_CFP to NO on Hera (NOAA-EMC#1094)
  Resource updates to support WCOSS2 (NOAA-EMC#1070)
  Set COMPILER in link for detect machine (NOAA-EMC#1092)
  gfs utils update (NOAA-EMC#1088)
  GFS-UTILS update for build and ush scripts (NOAA-EMC#1082)
  Update UFS version to 2022 Oct 19 (NOAA-EMC#1083)
  Use more cycledefs for task control (NOAA-EMC#1078)
  removing superfluous EFSOI-specific files from develop (NOAA-EMC#1079)
  Update UFS to Sept 9 version (NOAA-EMC#1073)
  Modify default file location for monitor data when using rocoto (NOAA-EMC#1065)
  Fix companion ocean resolution for C48 (NOAA-EMC#1066)
  Add trailing slash for gldas topo path (NOAA-EMC#1064)
  Limit number of CPU for post (NOAA-EMC#1061)
  Fix eupd trace (NOAA-EMC#1057)
  Port to S4 (NOAA-EMC#1023)
  Update to obsproc.v1.0.2 and prepobs.v1.0.1 (NOAA-EMC#1049)
  Add GDAS to the partial build list (NOAA-EMC#1050)
  Fix group number being treated as octal in gdas arch (NOAA-EMC#1053)
  Remove trace from link script (NOAA-EMC#1046)
  Update gfs-utils hash to 3a609ea (NOAA-EMC#1048)
  Fix link script usage statement (NOAA-EMC#1045)
  Replace preamble variable commands with functions (NOAA-EMC#1012)
  Implement fix reorg and remove gfs-utils code (NOAA-EMC#1009)
  Rename post scripts (NOAA-EMC#1038)
  Fix missing @ symbol with COMINsyn in config.base (NOAA-EMC#1039)
  WCOSS2 run support and script/config updates (NOAA-EMC#1030)
  Remove base_svn from Hera and Orion hosts files (NOAA-EMC#1036)
  initial commit for incoming yaml work (NOAA-EMC#1029)
  Fix radiance verification failing to find diag files (NOAA-EMC#1031)
  Supported resolutions on platforms and defaults for mode (NOAA-EMC#1026)
  Add GLDAS scripts & fix GLDAS job (NOAA-EMC#1018)
  Update GSI Monitor for radmon fix
  Correct shell linter config (NOAA-EMC#1013)
  Correct diagnostic file handling in ush/ozn_xtrct.sh (NOAA-EMC#1016)
  Add shell linter Github action for pull requests (NOAA-EMC#1007)
  Build updates for WCOSS2 (NOAA-EMC#1002)
  Update UFS_UTILS tag to `ufs_utils_1_8_0` (NOAA-EMC#1001)
  Fix preamble id (NOAA-EMC#996)
  Add missing "atmos" into job dependencies (NOAA-EMC#998)
  Bugfix in arch.sh to remove hardwired "htar" (NOAA-EMC#992)
  Add in stubs for aerosol DA tasks + bugfix for setup_expt where cycled and ATMA are used (NOAA-EMC#990)
  Add GSI monitor scripts (NOAA-EMC#969)
  Fix product generation at some fcst hrs (NOAA-EMC#988)
  Add initial config files for global aerosol DA (NOAA-EMC#986)
  Update diag table to remove wav-ocn coupling fields (NOAA-EMC#979)
  use a robust Findwgrib2.cmake to find wgrib2 built w/ native wgrib2 build (NOAA-EMC#970)
  Externals.cfg was stale and had drifted off (NOAA-EMC#965)
  Fix post comparison with zero-padded numbers (NOAA-EMC#964)
Labels
feature New feature or request