Remove requirement for esmf and mapl debug versions, remove DEBUG_LINKMPI #1681

Merged
merged 14 commits into ufs-community:develop from feature/remove-debug-stuff
Apr 5, 2023

Conversation

Collaborator

@climbfuji climbfuji commented Mar 27, 2023

Description

Fixes #1680
Fixes #330

Note on expected/unexpected baseline changes. I ran the regression tests on Hera and thought I'd see changes in the results for at least some of the DEBUG tests. Instead I found this:

Hera/Intel
All tests passed against the existing baseline except one. That test does not use the DEBUG build, so the result change doesn't make sense to me (maybe something is wrong with the test or the code it tests):

baseline dir = /scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/develop-20230321/INTEL/rrfs_smoke_conus13km_hrrr_warm
working dir  = /scratch1/NCEPDEV/stmp2/Dom.Heinzeller/FV3_RT/rt_3521/rrfs_smoke_conus13km_hrrr_warm
Checking test 001 rrfs_smoke_conus13km_hrrr_warm results ....
 Comparing sfcf000.nc ............ALT CHECK......NOT OK
 Comparing sfcf001.nc ............ALT CHECK......NOT OK
 Comparing sfcf002.nc ............ALT CHECK......NOT OK
 Comparing atmf000.nc ............ALT CHECK......NOT OK
 Comparing atmf001.nc ............ALT CHECK......NOT OK
 Comparing atmf002.nc ............ALT CHECK......NOT OK

  0: The total amount of wall time                        = 138.606586
  0: The maximum resident set size (KB)                   = 959036

Test 001 rrfs_smoke_conus13km_hrrr_warm FAIL Tries: 2

I reran this test and it failed with the same b4b mismatch.

Hera/GNU

All tests passed against the existing baseline, except:

  • rrfs_smoke_conus13km_hrrr_warm: same test as for Intel. It does not use the DEBUG build, so the result change doesn't make sense to me (maybe something is wrong with the test or the code it tests?)
baseline dir = /scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/develop-20230321/GNU/rrfs_smoke_conus13km_hrrr_warm
working dir  = /scratch1/NCEPDEV/stmp2/Dom.Heinzeller/FV3_RT/rt_2336/rrfs_smoke_conus13km_hrrr_warm
Checking test 018 rrfs_smoke_conus13km_hrrr_warm results ....
 Comparing sfcf000.nc ............ALT CHECK......NOT OK
 Comparing sfcf001.nc ............ALT CHECK......NOT OK
 Comparing sfcf002.nc ............ALT CHECK......NOT OK
 Comparing atmf000.nc ............ALT CHECK......NOT OK
 Comparing atmf001.nc ............ALT CHECK......NOT OK
 Comparing atmf002.nc ............ALT CHECK......NOT OK

  0: The total amount of wall time                        = 761.822270
  0: The maximum resident set size (KB)                   = 652044

Test 018 rrfs_smoke_conus13km_hrrr_warm FAIL Tries: 2
  • cpld_control_p8 failed because it timed out (exceeded walltime) consistently. How does this complete within the walltime when using debug builds of ESMF and MAPL? It seems to hang in the first UFS Aerosols step:
  0:  ==============
  0:  final results
  0:  ==============
  0:  dbgx --fixratio: F F F F
 72:  CA cubic mosaic domain decomposition
 72: whalo =    1, ehalo =    1, shalo =    1, nhalo =    1
 72:   X-AXIS =  160 160 160
 72:   Y-AXIS =   60  60  60  60  60  60  60  60
 74:  dbgx --scale snwdph from sheleg         326   0.0000000000000000       0.10000000000000001
  0:  CA cubic mosaic domain decomposition
  0: whalo =    1, ehalo =    1, shalo =    1, nhalo =    1
  0:   X-AXIS =  160 160 160
  0:   Y-AXIS =   60  60  60  60  60  60  60  60
 48:  CA cubic mosaic domain decomposition
 48: whalo =    1, ehalo =    1, shalo =    1, nhalo =    1
 48:   X-AXIS =  160 160 160
 48:   Y-AXIS =   60  60  60  60  60  60  60  60
  0:  in radiation_clouds_prop=           8 F           4 F F           2           1
  0:  in radiation_clouds_prop=           8 F           4 F F           2           1
  0:  in radiation_clouds_prop=           8 F           4 F F           2           1
  0:  in radiation_clouds_prop=           8 F           4 F F           2           1
  0:  in radiation_clouds_prop=           8 F           4 F F           2           1
  0:  in radiation_clouds_prop=           8 F           4 F F           2           1
  0:  in radiation_clouds_prop=           8 F           4 F F           2           1
  0:  in radiation_clouds_prop=           8 F           4 F F           2           1
  0:  in radiation_clouds_prop=           8 F           4 F F           2           1
  0:  in radiation_clouds_prop=           8 F           4 F F           2           1
  0:  in radiation_clouds_prop=           8 F           4 F F           2           1
  0:  in radiation_clouds_prop=           8 F           4 F F           2           1
  0: PASS: fcstRUN phase 1, n_atmsteps =                0 time is        12.142189
  0: UFS Aerosols: Advancing from 2021-03-22T06:00:00 to 2021-03-22T06:12:00
_______________________________________________________________
Start Epilog v20.08.28 on node h36m03 for job 43297771 :: Mon Mar 27 13:38:21 UTC 2023
Job 43297771 (not serial) finished for user Dom.Heinzeller in partition hera with exit code 0:15
_______________________________________________________________
End Epilogue v20.08.28 Mon Mar 27 13:38:21 UTC 2023

Logs of rt runs on Hera attached here: rt_hera_intel_gnu_pr1681.tar.gz

Top of commit queue on: TBD

n/a - no changes to any of the submodules

Input data additions/changes

  • No changes are expected to input data.
  • There will be new input data.
  • Input data will be updated.

Anticipated changes to regression tests:

  • No changes are expected to any regression test. (but see above)
  • Changes are expected to the following tests:

Subcomponents involved:

  • AQM
  • CDEPS
  • CICE
  • CMEPS
  • CMakeModules
  • FV3
  • GOCART
  • HYCOM
  • MOM6
  • NOAHMP
  • WW3
  • stochastic_physics
  • none

Combined with PR's (If Applicable):

Commit Queue Checklist:

  • Link PR's from all sub-components involved
  • Confirm reviews completed in sub-component PR's
  • Add all appropriate labels to this PR.
  • Run full RT suite on either Hera/Cheyenne with both Intel/GNU compilers
  • Add list of any failed regression tests to "Anticipated changes to regression tests" section.

Linked PR's and Issues:

Fixes #1680
Fixes #330

Testing Day Checklist:

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.
  • Move new/updated input data on RDHPCS Hera and propagate input data changes to all supported systems.

Testing Log (for CM's):

  • RDHPCS
    • Intel
      • Hera
      • Orion
      • Jet
      • Gaea
      • Cheyenne
    • GNU
      • Hera
      • Cheyenne
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
    • Completed
  • opnReqTest
    • N/A
    • Log attached to comment

@climbfuji climbfuji marked this pull request as ready for review March 27, 2023 14:09
@SamuelTrahanNOAA
Collaborator

The PR that is about to be merged, #1658, makes major revisions to the smoke implementation and adds a debug test for it.

@DusanJovic-NOAA
Collaborator

I ran rrfs_smoke_conus13km_hrrr_warm test in develop and it failed:

$ cat /scratch2/NCEPDEV/fv3-cam/Dusan.Jovic/ufs/develop/ufs-weather-model/tests/RegressionTests_hera.intel.log
Mon Mar 27 12:43:33 UTC 2023
Start Regression test

Compile 001 elapsed time 494 seconds. -DAPP=ATM -DCCPP_SUITES=FV3_RAP,FV3_RAP_sfcdiff,FV3_HRRR,FV3_HRRR_smoke,FV3_RRFS_v1beta,FV3_RRFS_v1nssl -D32BIT=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Release

baseline dir = /scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/develop-20230321/INTEL/rrfs_smoke_conus13km_hrrr_warm
working dir  = /scratch1/NCEPDEV/stmp2/Dusan.Jovic/FV3_RT/rt_24722/rrfs_smoke_conus13km_hrrr_warm
Checking test 001 rrfs_smoke_conus13km_hrrr_warm results ....
 Comparing sfcf000.nc ............ALT CHECK......NOT OK
 Comparing sfcf001.nc ............ALT CHECK......NOT OK
 Comparing sfcf002.nc ............ALT CHECK......NOT OK
 Comparing atmf000.nc ............ALT CHECK......NOT OK
 Comparing atmf001.nc ............ALT CHECK......NOT OK
 Comparing atmf002.nc ............ALT CHECK......NOT OK

  0: The total amount of wall time                        = 139.780115
  0: The maximum resident set size (KB)                   = 969144

Test 001 rrfs_smoke_conus13km_hrrr_warm FAIL

FAILED TESTS: 
Test rrfs_smoke_conus13km_hrrr_warm 001 failed in check_result failed 

REGRESSION TEST FAILED
Mon Mar 27 12:58:06 UTC 2023
Elapsed time: 00h:14m:35s. Have a nice day!
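
The failing tests can be pulled out of a log like the one above mechanically. The following is a minimal sketch, assuming the per-test summary lines keep the `Test <number> <name> FAIL` layout seen in the logs in this thread. `list_failed_tests` is a hypothetical helper and not part of the regression test scripts.

```shell
# Hypothetical helper: print the names of tests whose summary line reports FAIL.
# Assumes lines of the form "Test <number> <name> FAIL [Tries: N]" on stdin.
list_failed_tests() {
  awk '$1 == "Test" && $4 == "FAIL" { print $3 }'
}

# Sample input copied from the logs in this thread.
list_failed_tests <<'EOF'
Test 001 rrfs_smoke_conus13km_hrrr_warm FAIL
Test rrfs_smoke_conus13km_hrrr_warm 001 failed in check_result failed
EOF
# prints: rrfs_smoke_conus13km_hrrr_warm
```

Only the per-test summary line matches. The `FAILED TESTS:` entry has `failed` in lower case in the fourth field, so each test is reported once.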

@SamuelTrahanNOAA
Collaborator

I ran rrfs_smoke_conus13km_hrrr_warm test in develop and it failed:

The new version hasn't been merged to develop yet, so that isn't the one you tested.

@climbfuji
Collaborator Author

Good to know that it's not my PR that causes the smoke test problems!

@DusanJovic-NOAA
Collaborator

I ran the test in develop branch about an hour ago, whatever is/was in develop at that time.

@SamuelTrahanNOAA
Collaborator

cpld_control_p8 failed because it timed out (exceeded walltime) consistently. How does this complete within the walltime when using debug builds of ESMF and MAPL? It seems to hang in the first UFS Aerosols step:

The cpld_debug_p8 does this, too. This has been a known issue for quite a while: #1432

@SamuelTrahanNOAA
Collaborator

I ran the test in develop branch about an hour ago, whatever is/was in develop at that time.

I sincerely hope that if you run it again after the PR is merged, you will see it pass.

@climbfuji
Collaborator Author

cpld_control_p8 failed because it timed out (exceeded walltime) consistently. How does this complete within the walltime when using debug builds of ESMF and MAPL? It seems to hang in the first UFS Aerosols step:

The cpld_debug_p8 does this, too. This has been a known issue for quite a while: #1432

Thanks, Sam. Once upon a time, all tests passed on all platforms ("an old man complaining" ;-) ).

@SamuelTrahanNOAA
Collaborator

Thanks, Sam. Once upon a time, all tests passed on all platforms ("an old man complaining" ;-) ).

Actually, the cpld_control_p8 was broken for a long time, for pretty much as long as that test has existed. The problem has gotten steadily worse, and it's at the point where three tries is usually not enough.

@junwang-noaa
Collaborator

@SamuelTrahanNOAA Thanks for creating the issue to report the problem. @jkbk2004 Do we know when the problem started and in which PRs the cpld_debug_p8 failed many times with gnu compiler? I agree with Dom, the test used to run fine. This is a surprise.

@SamuelTrahanNOAA
Collaborator

The problem is much older than the issue I submitted. I just got fed up with it one day, and submitted an issue.

@SamuelTrahanNOAA
Collaborator

The update to the smoke code and regression tests has been merged. You should update your branches and try again. Hopefully, your smoke issue will be gone.

@climbfuji
Collaborator Author

@SamuelTrahanNOAA I can confirm (for GNU, didn't test Intel) that after pulling in develop (after your smoke update was merged) only this test fails:

+ read -r failed_test_name
+ echo 'Test cpld_control_p8 047 failed in run_test failed '
Test cpld_control_p8 047 failed in run_test failed
+ echo 'Test cpld_control_p8 047 failed in run_test failed '
+ read -r failed_test_name

@BrianCurtis-NOAA
Collaborator

@climbfuji Was the cpld_control_p8 a timeout failure as well? If so I think we can still consider this ready for commit queue.

@DeniseWorthen
Collaborator

@BrianCurtis-NOAA If Dom confirms that the failed cpld_control_p8 was a timeout on hera.gnu, I think this might be a temporary work-around since this test is failing consistently on hera.gnu now

--- a/tests/tests/cpld_control_p8
+++ b/tests/tests/cpld_control_p8
@@ -85,3 +85,6 @@ export FV3_RUN=cpld_control_run.IN
 if [[ $MACHINE_ID = cheyenne.* ]]; then
   TPN=18
 fi
+if [[ $MACHINE_ID = hera.gnu ]]; then
+  export WLCLK=40
+fi

I can't test because hera is down today though.
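
The effect of that override can be sketched as a standalone function. This is an illustration only: `wallclock_for` is a hypothetical helper, and the 30-minute default passed in below is an assumption, not a value taken from the test configuration. It uses a POSIX `case` statement, which does the same kind of pattern match on the machine ID as the `[[ ... ]]` tests in the diff.

```shell
# Hypothetical helper, not part of tests/tests/cpld_control_p8.
# $1: machine ID (e.g. hera.gnu), $2: default wallclock limit in minutes.
# Prints the wallclock limit to use for that machine.
wallclock_for() {
  case "$1" in
    hera.gnu) echo 40 ;;    # the proposed work-around value
    *)        echo "$2" ;;  # every other machine keeps the default
  esac
}

wallclock_for hera.gnu 30     # prints: 40
wallclock_for hera.intel 30   # prints: 30 (assumed default)
```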

@climbfuji
Collaborator Author

@BrianCurtis-NOAA If Dom confirms that the failed cpld_control_p8 was a timeout on hera.gnu, I think this might be a temporary work-around since this test is failing consistently on hera.gnu now

--- a/tests/tests/cpld_control_p8
+++ b/tests/tests/cpld_control_p8
@@ -85,3 +85,6 @@ export FV3_RUN=cpld_control_run.IN
 if [[ $MACHINE_ID = cheyenne.* ]]; then
   TPN=18
 fi
+if [[ $MACHINE_ID = hera.gnu ]]; then
+  export WLCLK=40
+fi

I can't test because hera is down today though.

Yes, it was:

30 min. TEST 048 cpld_debug_p8 is running,  status: R jobid 43285682
Slurm unknown status CG. Check sacct ...
43285682                  TIMEOUT         rt_10960_048
43285682.ba+            CANCELLED                batch
43285682.ex+            COMPLETED               extern
43285682.0              CANCELLED              fv3.exe
31 min. TEST 048 cpld_debug_p8 is TIMEOUT,  status: CG jobid 43285682
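
Telling a timeout apart from other failures can be done from accounting output like the lines above. A minimal sketch, assuming each line has the job ID in the first column and the state in the second, as in that output; `job_timed_out` is a hypothetical helper.

```shell
# Hypothetical helper: succeed (exit 0) if any line on stdin reports state TIMEOUT.
# Assumes whitespace-separated columns: <jobid> <state> <jobname>.
job_timed_out() {
  awk '$2 == "TIMEOUT" { found = 1 } END { exit (found ? 0 : 1) }'
}

# Sample input copied from the output above.
if job_timed_out <<'EOF'
43285682                  TIMEOUT         rt_10960_048
43285682.ba+            CANCELLED                batch
43285682.ex+            COMPLETED               extern
43285682.0              CANCELLED              fv3.exe
EOF
then
  echo "timeout"
else
  echo "other failure"
fi
# prints: timeout
```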

@climbfuji
Collaborator Author

I just pulled in develop.

@BrianCurtis-NOAA BrianCurtis-NOAA added the No Baseline Change and Ready for Commit Queue labels Apr 4, 2023
@BrianCurtis-NOAA
Collaborator

@zach1221 @jkbk2004 This is ready to go. Please start jenkins items/AutoRT minus Hera. We can get Hera when it's back up later this evening/tomorrow morning.

@zach1221
Collaborator

zach1221 commented Apr 4, 2023

Thanks, @BrianCurtis-NOAA. Starting CI now.

@jkbk2004
Collaborator

jkbk2004 commented Apr 4, 2023

Automated RT Failure Notification
Machine: orion
Compiler: intel
Job: RT
[RT] Repo location: /work/noaa/epic-ps/jongkim/autort/pr/1290621074/20230404153010/ufs-weather-model
Please make changes and add the following label back: orion-intel-RT

@jkbk2004
Collaborator

jkbk2004 commented Apr 4, 2023

@FernandoAndrade-NOAA let me add the orion label again after locally adding module load git/2.28.0 to rt.py on orion.

@jkbk2004
Collaborator

jkbk2004 commented Apr 4, 2023

Automated RT Failure Notification
Machine: orion
Compiler: intel
Job: RT
[RT] Repo location: /work/noaa/epic-ps/jongkim/autort/pr/1290621074/20230404161518/ufs-weather-model
Please make changes and add the following label back: orion-intel-RT

@FernandoAndrade-NOAA
Collaborator

@jkbk2004, looks like there's still an issue with Orion. I'll update when my manual run finishes.

epic-cicd-jenkins and others added 8 commits April 4, 2023 15:34
on-behalf-of @ufs-community <ecc.platform@noaa.gov>
on-behalf-of @ufs-community <jong.kim@noaa.gov>
on-behalf-of @ufs-community <jong.kim@noaa.gov>
on-behalf-of @ufs-community <brian.curtis@noaa.gov>
Due to a combination of recent WCOSS2 and UFSWM changes, the wallclock needs to be bumped to 45 minutes for compiles to succeed.
@jkbk2004
Collaborator

jkbk2004 commented Apr 5, 2023

All tests are done. We can start the merging process. No dependencies. Please go ahead with final reviews and approvals.

@jkbk2004 jkbk2004 merged commit 9be8465 into ufs-community:develop Apr 5, 2023
@jkbk2004 jkbk2004 mentioned this pull request Apr 5, 2023
6 tasks
@climbfuji
Collaborator Author

Thanks everyone for this quick turnaround, highly appreciated!

@climbfuji climbfuji deleted the feature/remove-debug-stuff branch April 6, 2023 03:05
@DeniseWorthen
Collaborator

@climbfuji I thought this commit meant we were no longer compiling w/ the debug ESMF library. When I look at the PET logs in a PR branch I'm preparing, I still see this message:

20230419 180823.249 WARNING          PET000 !!! Calling ESMCI::Array::sparseMatMul() with CHECKFLAG!
20230419 180823.249 WARNING          PET000 !!! Extra checking comes at the cost  !!!
20230419 180823.249 WARNING          PET000 !!! of performance. Only use for      !!!
20230419 180823.249 WARNING          PET000 !!! debugging, NOT for production!    !!!

This looks like the ESMF being used is the debug version?

@DeniseWorthen
Collaborator

Gerhard gave me a clue about where this is being triggered. In CMEPS, we have

    ! local variables
    logical :: checkflag = .false.
    character(len=CS) :: lfldname
    real(ESMF_KIND_R8), parameter :: fillValue = 9.99e20_ESMF_KIND_R8
    character(len=*), parameter :: subname='(module_MED_map:med_map_field) '
    !---------------------------------------------------

    rc = ESMF_SUCCESS

#ifdef DEBUG
    checkflag = .true.
#endif
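
The snippet suggests the warning is driven by a DEBUG preprocessor macro in the calling code. One way to find other statements guarded the same way is to print every line that sits inside an `#ifdef DEBUG` block. A minimal sketch, assuming un-nested `#ifdef DEBUG` / `#endif` pairs; `debug_guarded_lines` is a hypothetical helper, and it reads the files given as arguments, or stdin if none are given.

```shell
# Hypothetical helper: print "<line number>: <text>" for every line found
# between "#ifdef DEBUG" and the next "#endif". Nested conditionals are not handled.
debug_guarded_lines() {
  awk '
    /^[ \t]*#[ \t]*ifdef[ \t]+DEBUG[ \t]*$/ { inside = 1; next }
    /^[ \t]*#[ \t]*endif/                   { inside = 0; next }
    inside                                  { print NR ": " $0 }
  ' "$@"
}

# Sample input copied from the excerpt above.
debug_guarded_lines <<'EOF'
    rc = ESMF_SUCCESS
#ifdef DEBUG
    checkflag = .true.
#endif
EOF
# prints: 3:     checkflag = .true.
```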

Labels
jenkins-ci, No Baseline Change, Ready for Commit Queue
Development

Successfully merging this pull request may close these issues.

Remove requirement for esmf and mapl debug versions DEBUG_LINKMPI