Add C48L127 atmosphere only test and turn on the control_csawmg test on jet/cheyenne #724

Merged
merged 25 commits into from
Aug 6, 2021

Conversation

junwang-noaa
Collaborator

@junwang-noaa junwang-noaa commented Jul 29, 2021

PR Checklist

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.

  • This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR

  • An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in the ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes contained in the PR. The Issue and the dependent sub-component PR
    are specified below.

  • If new or updated input data is required by this PR, it is clearly stated in the text of the PR.

Instructions: All subsequent sections of text should be filled in as appropriate.

The information provided below allows the code managers to understand the changes relevant to this PR, whether those changes are in the ufs-weather-model repository or in a subcomponent repository. Ufs-weather-model code managers will use the information provided to add any applicable labels, assign reviewers and place it in the Commit Queue. Once the PR is in the Commit Queue, it is the PR owner's responsibility to keep the PR up-to-date with the develop branch of ufs-weather-model.

Description

This PR will add a C48L127 atmosphere-only test to the regression test suite. The test needs few resources, which facilitates infrastructure and workflow development. The PR will also:

  • fix the control_csawmg test on jet by changing iaer to 1011. Because of this change, a new baseline is required.
  • remove LT_ENHANCE=3 in MOM_input_template_025 and MOM_input_template_050 (from @jiandewang )
  • update rt.sh to disable running rt.sh without ecflow on ecflow node (from @MinsukJi-NOAA )
  • turn on control_csawmg on cheyenne (from @climbfuji)
  • fix the lambert conformal projection for regional inline post

Issue(s) addressed

Link the issues to be closed with this PR, whether in this repository, or in another repository.
(Remember, issues must always be created before starting work on a PR branch!)

Testing

How were these changes tested? What compilers / HPCs was it tested with? Are the changes covered by regression tests? (If not, why? Do new tests need to be added?) Have regression tests and unit tests (utests) been run? On which platforms and with which compilers? (Note that unit tests can only be run on tier-1 platforms)

Dependencies

fv3atm PR#356
ufs-weather-model PR#724

tests/fv3_conf/control_run.IN (resolved)
tests/rt.conf (outdated, resolved)
@junwang-noaa
Collaborator Author

junwang-noaa commented Aug 3, 2021 via email

@climbfuji
Collaborator

Yes, please.

Testing this now.

@climbfuji
Collaborator

@junwang-noaa the test now runs on Cheyenne with Intel and is run-to-run reproducible.

I also tried it with GNU, but I get the following FPE from all tasks:

10:Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
10:#0  0x2b7323ad7aff in ???
10:#1  0x26bd698 in __tp_core_mod_MOD_pert_ppm
10:     at /glade/work/heinzell/fv3/ufs-weather-model/ufs-weather-model-c48-jun/gnu/FV3/atmos_cubed_sphere/model/tp_core.F90:1199
10:#2  0x26e58f8 in xppm
10:     at /glade/work/heinzell/fv3/ufs-weather-model/ufs-weather-model-c48-jun/gnu/FV3/atmos_cubed_sphere/model/tp_core.F90:632
10:#3  0x26ecdfd in __tp_core_mod_MOD_fv_tp_2d
10:     at /glade/work/heinzell/fv3/ufs-weather-model/ufs-weather-model-c48-jun/gnu/FV3/atmos_cubed_sphere/model/tp_core.F90:188
...

tests/rt.conf (outdated, resolved)
@climbfuji
Collaborator

@SMoorthi-emc @junwang-noaa @DusanJovic-NOAA I looked at the GNU problems with csawmg_debug. It crashes right away in the dycore init, it seems, see first screenshot.

The second screenshot shows that two variables, a4 and da1, are both extremely small (1e-40). If those were multiplied/divided first, the rest of the computation in this line would be fine. Maybe that is what Intel does? If, on the other hand, GNU evaluates 0.24/a4 first, this results in an FPE.

Two questions.

  1. Why is this only a problem with csawmg? Any special settings in the dycore nml section that are different from the other tests?
  2. What is Intel really doing? Are a4 and da1 small and it does evaluate da1**2/a4 first? I can do this test easily, just add a breakpoint in this line.

How about 1., any insights?

Screen Shot 2021-08-03 at 11 04 59 AM

Screen Shot 2021-08-03 at 11 04 00 AM

@climbfuji
Collaborator

I got around the crash in the dycore by changing

  hord_mt = 6
  hord_vt = 6
  hord_tm = 6
  hord_dp = 6
  hord_tr = 13

to

  hord_mt = 5
  hord_vt = 5
  hord_tm = 5
  hord_dp = -5
  hord_tr = 8

so that control_csawmg.nml.IN matches control.nml.IN. It then crashes in cs_conv:

95:
95:Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
95:
95:Backtrace for this error:
95:#0  0x2b802a283aff in ???
95:#1  0x467355f in fprec
95:	at /glade/work/heinzell/fv3/ufs-weather-model/ufs-weather-model-c48-jun/gnu/FV3/ccpp/physics/physics/cs_conv.F90:2638
95:#2  0x4667e54 in cumup
95:	at /glade/work/heinzell/fv3/ufs-weather-model/ufs-weather-model-c48-jun/gnu/FV3/ccpp/physics/physics/cs_conv.F90:2327
95:#3  0x4683907 in cs_cumlus
95:	at /glade/work/heinzell/fv3/ufs-weather-model/ufs-weather-model-c48-jun/gnu/FV3/ccpp/physics/physics/cs_conv.F90:1153
95:#4  0x4698abd in __cs_conv_MOD_cs_conv_run
95:	at /glade/work/heinzell/fv3/ufs-weather-model/ufs-weather-model-c48-jun/gnu/FV3/ccpp/physics/physics/cs_conv.F90:532
95:#5  0x3ccf483 in __ccpp_fv3_gfs_v16_csawmg_physics_cap_MOD_fv3_gfs_v16_csawmg_physics_run_cap
95:	at /glade/work/heinzell/fv3/ufs-weather-model/ufs-weather-model-c48-jun/gnu/tests/build_fv3/FV3/ccpp/physics/ccpp_FV3_GFS_v16_csawmg_physics_cap.F90:1547
95:#6  0x3986420 in __ccpp_static_api_MOD_ccpp_physics_run

The reason is that some of the variables, e.g. GDZTR (tropopause height), are becoming zero, which clearly doesn't make sense.

@junwang-noaa
Collaborator Author

@climbfuji Sorry, I forgot to change iaer in control_csawmg_debug. I just updated the code. Do you use the new iaer (1011) in the control_csawmg_debug test?

@climbfuji
Collaborator

I think the debug test didn't have the correct iaer option. But it runs (and always ran) with Intel, just not with GNU. Will try again with GNU, but I think we should change the hord options in control_csawmg.nml.IN to match what is in control.nml.IN, unless @SMoorthi-emc knows why they should be different.
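
For reference, a small Python sketch that lists which hord settings differ between the two namelists. It uses only the values quoted earlier in this thread. `parse_settings` and `diff_settings` are hypothetical helpers, and the parser handles only plain `name = value` lines, not full Fortran namelist syntax or the template variables in the real .IN files.

```python
def parse_settings(text: str) -> dict:
    """Parse plain `name = value` lines with integer values into a dict."""
    settings = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank lines and anything that is not an assignment
        name, value = line.split("=", 1)
        settings[name.strip()] = int(value.strip())
    return settings


def diff_settings(left: dict, right: dict) -> dict:
    """Return {name: (left_value, right_value)} for every differing name."""
    names = sorted(set(left) | set(right))
    return {n: (left.get(n), right.get(n)) for n in names if left.get(n) != right.get(n)}


# Values quoted in this thread for control_csawmg.nml.IN (before the change).
csawmg = parse_settings("""
  hord_mt = 6
  hord_vt = 6
  hord_tm = 6
  hord_dp = 6
  hord_tr = 13
""")

# Values quoted in this thread for control.nml.IN.
control = parse_settings("""
  hord_mt = 5
  hord_vt = 5
  hord_tm = 5
  hord_dp = -5
  hord_tr = 8
""")

for name, (old, new) in diff_settings(csawmg, control).items():
    print(f"{name}: {old} -> {new}")
```

All five hord settings differ, so aligning the two files means changing every one of them.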

@climbfuji
Collaborator

Ok, now I am back to the very old / original error I have seen a few years back, memory corruption:

140:*** Error in `./fv3.exe': free(): invalid next size (fast): 0x0000000032369490 ***
140:======= Backtrace: =========
140:/glade/u/apps/ch/os/lib64/libc.so.6(+0x721af)[0x2b39bf6731af]
140:/glade/u/apps/ch/os/lib64/libc.so.6(+0x779d6)[0x2b39bf6789d6]
140:/glade/u/apps/ch/os/lib64/libc.so.6(+0x78723)[0x2b39bf679723]
140:./fv3.exe[0x4682dbf]
140:./fv3.exe[0x46985a6]
140:./fv3.exe[0x3ccf219]
140:./fv3.exe[0x39863d1]
140:./fv3.exe[0x3982c8d]
140:/glade/u/apps/ch/opt/gnu/10.1.0/lib64/libgomp.so.1(GOMP_parallel+0x42)[0x2b39bdd66742]
140:./fv3.exe[0x3982295]
140:./fv3.exe[0x20e0927]
140:./fv3.exe[0x1ee771d]
...

Will try to track it down. Since this happened with GNU 8, 9 and 10, it is unlikely that it is a compiler bug.

@junwang-noaa
Collaborator Author

@SMoorthi-emc Are you OK with changing the hord variables in the dycore namelist, as Dom pointed out?

junwang-noaa added the "Baseline Updates" label (Current baselines will be updated.) on Aug 5, 2021
junwang-noaa changed the title from "Add C48L127 atmosphere only test" to "Add C48L127 atmosphere only test and turn on the control_csawmg test on jet/cheyenne" on Aug 5, 2021
junwang-noaa added the "Waiting for Reviews" label (The PR is waiting for reviews from associated component PR's.) on Aug 5, 2021
@BrianCurtis-NOAA
Collaborator

Machine: orion
Compiler: intel
Job: BL
Repo location: /work/noaa/nems/emc.nemspara/autort/pr/699473949/20210805103023/ufs-weather-model
Please manually delete: /work/noaa/stmp/bcurtis/stmp/bcurtis/FV3_RT/rt_124709
Please make changes and add the following label back:
orion-intel-BL

@MinsukJi-NOAA
Contributor

MinsukJi-NOAA commented Aug 5, 2021

Baseline did not get generated at all, although 92 baseline generation tests passed. @BrianCurtis-NOAA can you please copy /work/noaa/stmp/bcurtis/stmp/bcurtis/FV3_RT/REGRESSION_TEST_INTEL to /work/noaa/nems/emc.nemspara/RT/NEMSfv3gfs/develop-20210805/INTEL?

@BrianCurtis-NOAA
Collaborator

Machine: gaea
Compiler: intel
Job: BL
Repo location: /lustre/f2/pdata/ncep/emc.nemspara/autort/pr/699473949/20210805153011/ufs-weather-model
Please manually delete: /lustre/f2/scratch/emc.nemspara/FV3_RT/rt_29330
Baseline creation and move successful
Repo location: /lustre/f2/pdata/ncep/emc.nemspara/autort/pr/699473949/20210805162849/ufs-weather-model
Please manually delete: /lustre/f2/scratch/emc.nemspara/FV3_RT/rt_18825
Test control_wrtGauss_netcdf_parallel 023 failed failed
Test control_wrtGauss_netcdf_parallel 023 failed in run_test failed
Please make changes and add the following label back:
gaea-intel-BL

@@ -560,6 +561,8 @@ export FNALBC="'global_snowfree_albedo.bosu.t126.384.190.rg.grb',"
export FNVETC="'global_vegtype.igbp.t126.384.190.rg.grb',"
export FNSOTC="'global_soiltype.statsgo.t126.384.190.rg.grb',"
export FNSMCC="'global_soilmgldas.t126.384.190.grb',"
Collaborator

Do we need both FNSMCC and FNSMCC_control?

Collaborator Author

Yes, at this time. The non-control tests still use FNSMCC. I hope we can unify them when all the global atm tests are updated.

@MinsukJi-NOAA
Contributor

slurmstepd: error: *** JOB 268863209 ON nid00665 CANCELLED AT 2021-08-05T17:03:15 DUE TO TIME LIMIT ***
Last results files are at forecast hour 15 rather than 24.
Will manually rerun control_wrtGauss_netcdf_parallel and attach the log file.

@junwang-noaa junwang-noaa merged commit 4ff260a into ufs-community:develop Aug 6, 2021
@junwang-noaa junwang-noaa deleted the c48test branch August 6, 2021 14:20
@BrianCurtis-NOAA
Collaborator

Machine: cheyenne
Compiler: intel
Job: BL
Repo location: /glade/scratch/dtcufsrt/autort/tests/auto/pr/699473949/20210806081509/ufs-weather-model
Please manually delete: /glade/scratch/dtcufsrt/FV3_RT/rt_32675
Test control_c384gdas_wav 091 failed failed
Test control_c384gdas_wav 091 failed in run_test failed
Please make changes and add the following label back:
cheyenne-intel-BL

@climbfuji
Collaborator

@BrianCurtis-NOAA this is the same failing test as in the previous commit, when something (we still don't know exactly which script) got killed around 1am MT after being in the queue for 800 minutes.

@BrianCurtis-NOAA
Collaborator

BrianCurtis-NOAA commented Aug 9, 2021

@climbfuji I checked to see if PBS has a timeout setting in the queue, and I couldn't find info to support yes or no. At first glance it's either a PBS timeout or just coincidence that the machine had killed those jobs at around 1AM and it was 800 minutes in the queue.

@climbfuji
Collaborator

Sorry, my explanation was poor. The 800 minutes were from the previous failure, I didn't check how long it was in the queue this time before it got killed. We need to monitor if using the economy queue delays the jobs too much. The "process scrubber" on Cheyenne kills the cron jobs around 1am MT, if I remember correctly.

epic-cicd-jenkins pushed a commit that referenced this pull request Apr 17, 2023
Co-authored-by: Benjamin.Blake EMC <Benjamin.Blake@v71a1.ncep.noaa.gov>
Co-authored-by: Benjamin.Blake EMC <Benjamin.Blake@v72a1.ncep.noaa.gov>
Co-authored-by: Benjamin.Blake EMC <Benjamin.Blake@v71a3.ncep.noaa.gov>
Co-authored-by: chan-hoo <chan-hoo.jeon@noaa.gov>
Labels
Baseline Updates: Current baselines will be updated.
New Input Data Req'd: This PR requires new data to be sync across platforms
Waiting for Reviews: The PR is waiting for reviews from associated component PR's.
Development

Successfully merging this pull request may close these issues.

Setting up C48 atmosphere only test case
merra2 data used in control_csawmg test
7 participants