update MOM6 to GFDL 20210224 main branch commit #439

Merged

Conversation

Collaborator

@jiandewang commented Feb 25, 2021

Description

GFDL updated their main branch on 20210224 as step 1 of their FMS2-related code changes. No answer changes are expected. "mom6_files.cmake" needs to be modified to reflect the files added to and removed from the framework directory.

Issue(s) addressed

Issue #438
MOM6 issue 55 (NOAA-EMC/MOM6#55)

Testing

Ran ufs-weather-model with the updated MOM6 and the modified "mom6_files.cmake" on hera, orion and dell-P3.

Regression tests passed on:

  • cheyenne.gnu (auto-rt)
  • cheyenne.intel (auto-rt)
  • wcoss_cray
  • wcoss-dell
  • hera.gnu
  • hera.intel
  • jet.intel
  • gaea.intel
  • orion.intel

Dependencies

MOM6 PR 56 (NOAA-EMC/MOM6#56)

modify mom6_files.cmake to reflect the files being added and removed in framework directory
@DeniseWorthen
Collaborator

@jiandewang Please post the RegressionTests_platform.logs from orion, hera and dell-p3, which you have already run. You should be able to do hera-gnu and wcoss-cray also, correct?

I will do cheyenne, jet and gaea. I will post the logs to a directory on Hera which you can then copy and commit from your own checkout.

@DeniseWorthen added the No Baseline Change label Feb 26, 2021
@jiandewang
Collaborator Author

@DeniseWorthen thanks, I will post the log files from my part when the jobs have finished.

@BrianCurtis-NOAA
Collaborator

Log Name:rt_auto_hera.intel_20210226162045.log
Log Location:/scratch1/NCEPDEV/nems/Brian.Curtis/git2/ufs-community/ufs-weather-model/tests/auto
Logs are kept for one month

@jiandewang
Collaborator Author

@BrianCurtis-NOAA I saw that the hera job is done. Inside /scratch1/NCEPDEV/nems/Brian.Curtis/git2/ufs-community/ufs-weather-model/tests/auto, is "rt_auto_hera.intel_20210226162045.log" the file that I need to add and commit to my branch? And there will be no more "RegressionTests_hera.intel.log" file, right?

@BrianCurtis-NOAA
Collaborator

Log Name:rt_auto_gaea.intel_20210226114143.log
Log Location:/lustre/f2/pdata/ncep/Brian.Curtis/git/ufs-community/ufs-weather-model/tests/auto
Logs are kept for one month

@DusanJovic-NOAA
Collaborator

> @BrianCurtis-NOAA I saw that the hera job is done. Inside /scratch1/NCEPDEV/nems/Brian.Curtis/git2/ufs-community/ufs-weather-model/tests/auto, is "rt_auto_hera.intel_20210226162045.log" the file that I need to add and commit to my branch? And there will be no more "RegressionTests_hera.intel.log" file, right?

No. Log file is already committed.

@DusanJovic-NOAA
Collaborator

> Log Name:rt_auto_gaea.intel_20210226114143.log
> Log Location:/lustre/f2/pdata/ncep/Brian.Curtis/git/ufs-community/ufs-weather-model/tests/auto
> Logs are kept for one month

The Gaea test failed. RT-auto should not commit a log file when a regression test fails.

@BrianCurtis-NOAA
Collaborator

Log Name:rt_auto_jet.intel_20210226173218.log
Log Location:/mnt/lfs4/HFIP/h-nems/Brian.Curtis/git/ufs-community/ufs-weather-model/tests/auto
Logs are kept for one month

@BrianCurtis-NOAA
Collaborator

@DusanJovic-NOAA What's wrong with it sending logs if it fails, won't it just be overwritten later with a successful one?

@climbfuji
Collaborator

> @DusanJovic-NOAA What's wrong with it sending logs if it fails, won't it just be overwritten later with a successful one?

Does it produce a red flag somewhere, indicating that it failed?

@DusanJovic-NOAA
Collaborator

> @DusanJovic-NOAA What's wrong with it sending logs if it fails, won't it just be overwritten later with a successful one?

What am I supposed to do with that log file? It's misleading. One has to manually check each of these log files, scroll to the end and verify that the test didn't fail. This just creates more work.

@BrianCurtis-NOAA
Collaborator

> > @DusanJovic-NOAA What's wrong with it sending logs if it fails, won't it just be overwritten later with a successful one?

> Does it produce a red flag somewhere, indicating that it failed?

Even with a failed test, rt.sh does not return 1 at the end. The code checks for returncode != 0 and uses that to tell the logger to write stdout and stderr into the file.

@DusanJovic-NOAA
Collaborator

> > > @DusanJovic-NOAA What's wrong with it sending logs if it fails, won't it just be overwritten later with a successful one?

> > Does it produce a red flag somewhere, indicating that it failed?

> Even with a failed test, rt.sh does not return 1 at the end. The code checks for returncode != 0 and uses that to tell the logger to write stdout and stderr into the file.

Because rt.sh (the script) didn't fail. It successfully finished what it is supposed to do: parse rt.conf and run a sequence of COMPILE and RUN jobs. rt.sh returns a non-zero exit code if it itself fails to run, but not if tests fail.

@BrianCurtis-NOAA
Collaborator

Log Name:rt_auto_orion.intel_20210226115149.log
Log Location:/work/noaa/nems/bcurtis/git/ufs-community/ufs-weather-model/tests/auto
Logs are kept for one month

@BrianCurtis-NOAA
Collaborator

> Because rt.sh (the script) didn't fail. It successfully finished what it is supposed to do: parse rt.conf and run a sequence of COMPILE and RUN jobs. rt.sh returns a non-zero exit code if it itself fails to run, but not if tests fail.

I figured it was a nice start to at least do the work of running it without ever having to touch the HPC. For now, the FAILED text in the log file tells the PR owner they should go dive into why. The automation has a LONG way to go before it is "set it and forget it", but that is definitely the end goal.
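
Editorial sketch of the point under discussion: since rt.sh exits with 0 even when individual tests fail, a wrapper could look at both the exit code and the log contents before committing a log. This is only an illustration under assumptions, not the actual RT-auto code. The function names are hypothetical, and the marker string `FAILED` is assumed from the comment above; the real log format may differ.

```python
import subprocess
from pathlib import Path
from typing import Sequence

# Assumed marker text; the thread only says that "FAILED" appears in the log.
FAILURE_MARKER = "FAILED"


def log_reports_failure(log_text: str, marker: str = FAILURE_MARKER) -> bool:
    """Return True if any line of the log contains the failure marker."""
    return any(marker in line for line in log_text.splitlines())


def should_commit_log(returncode: int, log_text: str) -> bool:
    """Commit only if the script itself succeeded AND no test reported failure."""
    return returncode == 0 and not log_reports_failure(log_text)


def run_and_check(cmd: Sequence[str], log_path: Path) -> bool:
    """Run a regression script (hypothetical command), then vet its log."""
    result = subprocess.run(list(cmd), capture_output=True, text=True)
    if not log_path.exists():
        # No log at all is treated as a failure rather than a pass.
        return False
    return should_commit_log(result.returncode, log_path.read_text())
```

With a check like this, a zero exit code alone would no longer be enough for a log to be committed.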

@jiandewang
Collaborator Author

@jun and all: at this stage RT has passed on jet, orion, hera, cheyenne and cray; dell-p3 is still running. But there are issues on gaea: all non-debug coupled runs failed when compared with the baseline.
What shall we do here? I don't have a project account on gaea.

@climbfuji
Collaborator

Log Name:rt_auto_20210304110005.log
Log Location:/glade/work/heinzell/fv3/ufs-weather-model/auto-rt/control-20210226-new/tests/auto
Logs are kept for one month

@BrianCurtis-NOAA
Collaborator

Log Name:rt_auto_20210304093014.log
Log Location:/work/noaa/nems/bcurtis/git/ufs-community/ufs-weather-model/tests/auto
Logs are kept for one month

@BrianCurtis-NOAA can you chmod for the above ?

/work/noaa/nems/bcurtis/test/579823211/20210304093017/ufs-weather-model in case it's not readable let me know.

@jiandewang
Collaborator Author

Orion had another timed-out job; I will resubmit.

@BrianCurtis-NOAA
Collaborator

BrianCurtis-NOAA commented Mar 4, 2021 via email

@jiandewang
Collaborator Author

> Please let me run the orion test?

@BrianCurtis-NOAA I launched the job 30s before I saw your message

@jiandewang
Collaborator Author

> /work/noaa/nems/bcurtis/test/579823211/20210304093017/ufs-weather-model

it is accessible now.

@BrianCurtis-NOAA
Collaborator

> > Please let me run the orion test?

> @BrianCurtis-NOAA I launched the job 30s before I saw your message

No problem. If it fails again I'd like to run the next one.

@jiandewang
Collaborator Author

> > > Please let me run the orion test?

> > @BrianCurtis-NOAA I launched the job 30s before I saw your message

> No problem. If it fails again I'd like to run the next one.

hope this time it will give us good luck

@DeniseWorthen
Collaborator

@jiandewang which test is timing out?

@jiandewang
Collaborator Author

> @jiandewang which test is timing out?

coupled-frictional-C192

@jiandewang
Collaborator Author

@BrianCurtis-NOAA hera intel is not finished yet, I just want to confirm with you.

@DeniseWorthen
Collaborator

@jiandewang The cpld_controlfrac_c192 wall clock time on Hera is only ~6 minutes. If it is taking longer than 30min on Orion then something seems wrong.

@jiandewang
Collaborator Author

> @jiandewang The cpld_controlfrac_c192 wall clock time on Hera is only ~6 minutes. If it is taking longer than 30min on Orion then something seems wrong.

The first-round run from Brian timed out on a different job, so I believe this is a machine issue. I have had these kinds of issues before.

@uturuncoglu
Collaborator

@DeniseWorthen my experience with Orion indicates that model performance on that particular platform is not predictable and depends on the load. Sometimes the model hangs at the FV3 initialization stage or takes longer than usual. It could be a network or disk issue, but I am not sure at this point.

@DeniseWorthen
Collaborator

@jiandewang @uturuncoglu Thanks both of you. If it was a different test that timed out previously then yes I agree this is probably a machine issue.

@BrianCurtis-NOAA
Collaborator

Log Name:rt_auto_20210304190005.log
Log Location:/scratch1/NCEPDEV/nems/Brian.Curtis/git2/ufs-community/ufs-weather-model/tests/auto
Logs are kept for one month

@BrianCurtis-NOAA
Collaborator

hera.intel PASSED

Collaborator

@DeniseWorthen left a comment

Assuming there are no further issues w/ orion RTs, approve.

@jiandewang
Collaborator Author

No timed-out jobs on orion so far; only the last 5 jobs are still pending. I hope they don't bring bad luck.

Review thread on .gitmodules (outdated, resolved)
@jiandewang
Collaborator Author

ready for review

@junwang-noaa merged commit 35d1897 into ufs-community:develop Mar 5, 2021
AnningCheng-NOAA added a commit to AnningCheng-NOAA/ufs-weather-model that referenced this pull request Mar 8, 2021
* upstream/develop:
  update MOM6 to GFDL 20210224 main branch commit (ufs-community#439)
  Add GNU and Cheyenne Support to Automated RT (ufs-community#444)
  Move Noah MP init to CCPP and update Noah MP regression tests, ice flux init bug fix in CCPP (ufs-community#425)
  Feature/rt automation (ufs-community#403)
  Update ccpp-physics. Make RRTMGP thread safe (ufs-community#418)
  Update regression tests from GFSv15+Thompson to GFSv16+Thompson, include "Add one regional regression test in DEBUG mode. (ufs-community#419)" (ufs-community#421)
  UGWP v0 v1 combined (ufs-community#396)
  add optional mesh in MOM6; add dz_min and min_seaice as configurable variables for coupled model (ufs-community#399)
  updates FMS to 2020.04.01 (ufs-community#392)
  Move LSM vegetation lookup tables into CCPP, clean up RUC snow cover on ice initialization (remove IPD step 2)  (ufs-community#407)
  Update CMEPS for HAFS integration; add datm and coupled-model tests on Gaea (ufs-community#401)
  Remove legacy gnumake build from fv3atm and NEMS, remove legacy Python 2.7 support, rename v16beta to v16 and RT updates (ufs-community#384)
  MOM6 bugfixes, GFDL update, update CDMBGWD settings; fix for restart reproducibility (without waves) when USE_LA_LI2016=True, sign error on fprec passed to ocean, GFDL update, resolution dependent cdmbgwd settings (ufs-community#379)
  dycore options to add zero-gradient BC to reconstruct interface u/v and change dz_min as input (ufs-community#369)
  Update develop from NOAA-GSL: RUC ice, MYNN sfclay, stochastic land perturbations (ufs-community#386)
  update cpl gfsv16 tests, rrtmgp fix and bug fixes in cmeps (ufs-community#378)
  point fv3 to EMC develop branch (ufs-community#377)
  Remove IPD steps 3 and 5 (ufs-community#357)
  Update CMEPS  (ufs-community#345)
  Implementation of CCPP timestep_init and timestep_final phases (ufs-community#337)
  Remove unnecessary SIMD instruction sets for Jet, first round of cleanup in rt.conf, initialize cld_amt to zero for regional runs (dycore) (ufs-community#353)
  add frac grid input, update and add additional cpld tests (ufs-community#354)
  Add checkpoint restarts for ufs-cpld (ufs-community#342)
  Update the format of rt.conf (ufs-community#349)
  Remove IPD (step 1) (ufs-community#331)
  Feature/ww3update (ufs-community#334)
  Replace old regional SDF with FV3_GFS_v15_thompson_mynn (ufs-community#333)
  Update modules with hpc-stack v1.1.0 (ufs-community#319)
  Regression test log for PR ufs-community#323 for jet.intel (ufs-community#336)
  RRTMGP and Thompson MP coupling (ufs-community#323)
  Add 2 new tests for DATM-MOM6-CICE6 application (ufs-community#332)
  Add optional bulk flux calculation in ufs-datm (ufs-community#266)
  Final-final GFS v16 updates / restart reproducibility bugfixes (ufs-community#325)
  Updates to build for JEDI linking/control, add wcoss2 (ufs-community#295)
  Update CICE, Move regression test input outside baseline directory (ufs-community#270)
  Feature/update mom6 and retain b4b results for 025x025 resolution (ufs-community#290)
  Update for Jet, bug fixes in running with frac_grid=T and GFDL MP, and in restarting with frac_grid=T  (ufs-community#304)
  Updates to stochastic_physics_wrapper (ufs-community#280)
  Update develop from gsd/develop 2020/11/20: Unified gravity wave drag, updates to other GSL physics (ufs-community#297)
  Fix to allow quilting with non-factors for layout (ufs-community#250)
  rt update (ufs-community#261)
@jiandewang deleted the feature/update-MOM6-20210224 branch February 28, 2023 04:03
epic-cicd-jenkins pushed a commit that referenced this pull request Apr 17, 2023
## DESCRIPTION OF CHANGES: 
Modified the Jinja-formatted FV3LAM_wflow.xml template workflow to accommodate sub-hourly post-processing tasks that rely on sub-hourly FV3 output as a dependency. All changes are _additions_ to existing code and include the addition of a few keyword variables in the config.sh script. These new flags include...

- SUB_HOURLY_POST: a logical flag indicating whether or not sub-hourly post-processing is to be used
- DT_SUBHOURLY_POST_MNTS: the increment in minutes to sub-divide the hour

Additional post-processing tasks were added to FV3LAM_wflow.xml to account for the different FV3 output file names depending on whether sub-hourly FV3 output is used (the first FV3 output file has a different naming structure than the remaining output files).

setup.sh was updated to check whether valid entries were used for these two variables and also check that DT_ATMOS divides evenly into DT_SUBHOURLY_POST_MNTS so that the FV3 output is consistent with the requested frequency of UPP output. config_defaults.sh and valid_param_vals.sh were also updated accordingly.

## TESTS CONDUCTED: 
Have run generate_FV3LAM_wflow.sh on a large variety of settings of SUB_HOURLY_POST and DT_SUBHOURLY_POST_MNTS. Note that setup.sh is configured such that DT_SUBHOURLY_POST_MNTS = 0 will cause SUB_HOURLY_POST to be ignored. I have successfully tested cases in which DT_ATMOS _does not_ divide evenly into DT_SUBHOURLY_POST_MNTS and when DT_SUBHOURLY_POST_MNTS is specified as anything other than a two-digit value (strings vs. open integers both work).

The resulting workflows run successfully with rocotorun and output no error messages.
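
The consistency checks described above can be sketched in Python as follows. This is an illustrative approximation, not the code in setup.sh: the function name is hypothetical, and DT_ATMOS is assumed to be given in seconds, which is why the post interval is converted from minutes before the divisibility test.

```python
def check_subhourly_post(sub_hourly_post: bool,
                         dt_subhourly_post_mnts: int,
                         dt_atmos_secs: int) -> bool:
    """Return True if sub-hourly post should be active; raise on bad settings.

    Assumption: dt_atmos_secs (DT_ATMOS) is in seconds.
    """
    if not sub_hourly_post:
        return False
    if dt_subhourly_post_mnts == 0:
        # Per the description, a zero increment causes the flag to be ignored.
        return False
    if dt_subhourly_post_mnts < 0 or dt_atmos_secs <= 0:
        raise ValueError("time increments must be positive")
    interval_secs = dt_subhourly_post_mnts * 60
    if interval_secs % dt_atmos_secs != 0:
        # The model would never write output exactly on the requested minutes.
        raise ValueError(
            f"DT_ATMOS ({dt_atmos_secs} s) does not divide evenly into "
            f"DT_SUBHOURLY_POST_MNTS ({dt_subhourly_post_mnts} min)"
        )
    return True
```

For example, a 12-minute post interval (720 s) is accepted with a 40-second time step but rejected with a 50-second one.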

## ISSUE: 
Resolves issue #434

## CONTRIBUTORS:
@gsketefian.  Contributions:

1) Fixed bug in setup.sh in the test that checks whether DT_SUBHOURLY_POST_MNTS is set to 0:  should use the -eq operator instead of ==.
2) Fixed bug in the jinja XML template for rocoto (FV3LAM_wflow.xml) as follows:  rearranged the post-processing tasks so that the post task is run for only the first minute of the last hour (e.g. if the forecast is 3 hours long, post is run for 3:00 but not for 3:15, 3:30, etc).
3) Ran the following 3 WE2E tests [note that tests (b) and (c) are not yet in the regional_workflow repo and will be included in a future PR]:
  a) **grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta**.  This is without subhourly post-processing, i.e. SUB_HOURLY_POST is set to "FALSE".
  b) **subhourly_post**.  This is with subhourly post-processing, i.e. SUB_HOURLY_POST is set to "TRUE" (with DT_SUBHOURLY_POST_MNTS set to "12" minutes).
  c) **subhourly_post_ensemble_2mems**.  This is with subhourly post-processing and with ensemble forecasts enabled, i.e. SUB_HOURLY_POST and DO_ENSEMBLES are both set to "TRUE" (with DT_SUBHOURLY_POST_MNTS set to "12" minutes and NUM_ENS_MEMBERS set to "2" members).  This test is run because the changes in the jinja XML template FV3LAM_wflow.xml needed to add subhourly post involve code that executes ensemble forecasts.
**All three tests were successful.**
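
To illustrate the fix in item 2, here is a small sketch of which (hour, minute) pairs would get a post task. It is a guess at the intended schedule, not the Jinja template logic: the function name is hypothetical, and it assumes that post also runs at 0:00 and that the increment divides 60 evenly.

```python
from typing import List, Tuple


def post_times(fcst_len_hrs: int, dt_mnts: int) -> List[Tuple[int, int]]:
    """List the (hour, minute) pairs at which post runs.

    Assumptions: post runs at 0:00, and dt_mnts divides 60 evenly.
    """
    if dt_mnts <= 0 or 60 % dt_mnts != 0:
        raise ValueError("dt_mnts must be a positive divisor of 60")
    times = []
    # Every full hour before the last one gets all of its sub-hourly minutes.
    for hour in range(fcst_len_hrs):
        for minute in range(0, 60, dt_mnts):
            times.append((hour, minute))
    # The last hour gets only its first minute (e.g. 3:00, not 3:15).
    times.append((fcst_len_hrs, 0))
    return times
```

For a 3-hour forecast with a 15-minute increment, the list ends with (2, 45) and (3, 0), matching the example in item 2.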

Note that this PR changes the names of the grib2 files that UPP generates such that they now always include the minutes -- regardless of whether SUB_HOURLY_POST is set to "TRUE" or "FALSE" (if set to "FALSE", the minutes are always "00").  For example, previously, the grib2 file for forecast hour 1 was named `rrfs.t00z.bgdawpf001.tm00.grib2`; henceforth, it will be named `rrfs.t00z.bgdawpf00100.tm00.grib2`.
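
The naming change can be sketched as below. The format strings are inferred from the single example in the note above, so treat them as an assumption; the function names are hypothetical and only the `bgdawp` file type from that example is covered.

```python
def old_grib2_name(fhr: int, cycle_hr: int = 0) -> str:
    """Previous convention: three-digit forecast hour only."""
    return f"rrfs.t{cycle_hr:02d}z.bgdawpf{fhr:03d}.tm00.grib2"


def new_grib2_name(fhr: int, minute: int = 0, cycle_hr: int = 0) -> str:
    """New convention: three-digit hour followed by two-digit minutes."""
    return f"rrfs.t{cycle_hr:02d}z.bgdawpf{fhr:03d}{minute:02d}.tm00.grib2"


print(old_grib2_name(1))       # rrfs.t00z.bgdawpf001.tm00.grib2
print(new_grib2_name(1))       # rrfs.t00z.bgdawpf00100.tm00.grib2
print(new_grib2_name(1, 12))   # rrfs.t00z.bgdawpf00112.tm00.grib2
```

Downstream scripts that match on the old name pattern would need the extra two minute digits after this change.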
Labels
Baseline Updates Current baselines will be updated.

8 participants