Use assumed-size arrays in CCPP, Fortran/metadata consistency fixes in CCPP #527

climbfuji · 2021-04-15T16:15:01Z

PR Checklist

This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.
This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR
An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in the ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes contained in the PR. The Issue and the dependent sub-component PR are specified below.
If new or updated input data is required by this PR, it is clearly stated in the text of the PR.

Description

Provide a detailed description of what this PR does. What bug does it fix, or what feature does it add? Is a change of answers expected from this PR? Are any library updates included in this PR (modulefiles etc.)?

This PR does the following:

Update the submodule pointers for ccpp-physics and fv3atm and the regression test baseline date tag
Small update of tests/ci/repo_check.sh from @MinsukJi-NOAA
Move coupled tests in rt.conf to the top, because these take the longest to wait in the queue and to run. However, the ATMW tests had to be moved to the end of rt.conf, since WW3 is not able to support parallel builds. With the ATMW tests at the end, we are as "lucky" as before in the sense that only ten builds can happen at the same time and the two builds that involve WW3 are sufficiently far apart not to overlap (see issue "WW3 does not support concurrent builds" #550)
Change location of temporary/personal baseline directories on Cheyenne from $WORK to $SCRATCH to avoid disk space limitations
Increase walltime for compile jobs on wcoss_dell_p3 from 50min to 1hr
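
The reasoning about build concurrency in the rt.conf item above can be checked with a small simulation. This is only a sketch under stated assumptions, not how rt.sh actually schedules builds: the function names, the build durations and the greedy slot assignment are hypothetical; only the limit of ten simultaneous builds comes from the description.

```python
import heapq

def schedule(durations, slots=10):
    """Assign builds, in list order, to the earliest free slot; return (start, end) per build."""
    free_at = [0.0] * slots  # time at which each build slot becomes free
    heapq.heapify(free_at)
    times = []
    for d in durations:
        start = heapq.heappop(free_at)
        times.append((start, start + d))
        heapq.heappush(free_at, start + d)
    return times

def overlap(a, b):
    """True if two (start, end) intervals share any time."""
    return a[0] < b[1] and b[0] < a[1]

# Hypothetical: 12 builds of 30 minutes each, at most 10 at a time.
times = schedule([30.0] * 12, slots=10)
print(overlap(times[0], times[1]))   # neighbours in the list run concurrently
print(overlap(times[0], times[11]))  # builds far apart in the list do not overlap
```

With real, unequal build times the outcome depends on the actual durations, which is why the separation is a matter of luck rather than a guarantee.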

New regression test baseline is required for Intel, since the Intel compiler optimization leads to different answers when using assumed-size arrays in ccpp-physics.
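
One common way an optimizing compiler can change answers without any bug being involved is by reordering floating-point operations, because floating-point addition is not associative. Whether this is the mechanism at work here is an assumption; the description only states that the Intel results differ. A minimal illustration in Python:

```python
# Floating-point addition is not associative: the grouping changes the last bits.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)      # False
print(abs(left_to_right - right_to_left))  # tiny, but non-zero
```

Differences of this size are enough to make a bit-for-bit comparison against an existing baseline fail.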

Issue(s) addressed

none in ufs-weather-model (but see NCAR/ccpp-physics#611)

Testing

Preliminary testing

Regression tests were run against the existing baselines on Hera with Intel and GNU twice, once on 2021/04/22 and once on 2021/04/28 (each time, the code was first updated to include all changes from the trunk).

With GNU, all regression tests pass (i.e. in both PROD and DEBUG mode). With Intel, all DEBUG tests pass and many of the PROD tests as well: 72 tests pass, 31 tests have different results (but all run to completion). The failing tests fall in one or more of the following categories:

they use Zhao-Carr, i.e. all GFS v14 PROD tests
they use satmedmf or satmedmfq, i.e. all GFS v16 PROD tests
they use csawmg under the hood
additional tests that have different results: fv3_rrfs_v1alpha, fv3_rrfs_v1beta (only RESTART/phy_data.tile?.nc and RESTART/sfc_data.tile?.nc change, all other restart files and all 24h diag files are identical)
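
The file patterns in the last item can be used to check that nothing else changed. The sketch below assumes a list of changed file names is already available from some comparison step; the helper name and the example file names are hypothetical, only the two patterns come from the list above.

```python
from fnmatch import fnmatch

# Patterns taken from the list above; '?' matches exactly one character (the tile number).
EXPECTED_TO_CHANGE = ["RESTART/phy_data.tile?.nc", "RESTART/sfc_data.tile?.nc"]

def unexpected_changes(changed_files, patterns=EXPECTED_TO_CHANGE):
    """Return the changed files that match none of the expected patterns."""
    return [f for f in changed_files if not any(fnmatch(f, p) for p in patterns)]

# Hypothetical list of files reported as different by a comparison step.
changed = ["RESTART/phy_data.tile1.nc", "RESTART/sfc_data.tile6.nc"]
print(unexpected_changes(changed))  # []
```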

For details, see the attached log files.

GNU

2021/04/22
rt_hera_gnu_verify_against_existing.log

2021/04/28
rt_hera_gnu_verify_against_existing.log

Intel

2021/04/22
rt_hera_intel_verify_against_existing.log
rt_hera_intel_verify_against_existing_fail_test.log
rt_hera_intel_verify_against_existing_log_hera.intel.tar.gz

2021/04/28
rt_hera_intel_verify_against_existing.log
rt_hera_intel_verify_against_existing_fail_test.log
rt_hera_intel_verify_against_existing_log_hera.intel.tar.gz

Also, for the 2021/04/28 update, I created new baselines on jet to see which tests run to completion and which do not. The following tests crash on jet.intel (but not on hera.intel) with the error
FATAL from PE 140: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability

fv3_csawmg
fv3_cpt
fv3_gfsv16_csawmgt
fv3_gfsv16_csawmg

Notably, all these tests change answers on hera.intel. Also, we were never able to run these tests with GNU, because they crashed with segmentation faults, which means that there is a bug somewhere in the csawmg suite. Switching to assumed-size arrays will hopefully allow us to add a debug test in the future and identify the problem. The jet crashes may also go away after the current RRTMGP PR is merged and this PR is updated, because currently the compiler flag modifications in FV3/ccpp/CMakeLists.txt are applied on all platforms except jet. This is fixed in the RRTMGP PR.

rt_jet_intel_create.log
rt_jet_intel_create_fail_test.log
rt_jet_intel_create_log_jet.intel.tar.gz

Update on the csawmg crashes on jet: after applying bugfix NCAR/ccpp-physics@41d34d0 in ccpp-physics, the test fv3_csawmg runs in debug mode on jet.intel until it times out. In PROD mode, the previously crashing tests now all run to completion. Yay. This bugfix, however, became necessary only after switching to assumed-size arrays, hence it will not be the solution for the long-standing GNU issues with csawmg.

Final regression testing.

CI tests passed for hash abf15d0

Note: After creating baselines on orion.intel, the verification step for test fv3_gfsv16_csawmg failed. I recreated the baseline for this test, updated the baseline and verified against it successfully. We may need to keep an eye on these tests and implement the DEBUG tests for them sooner rather than later (see issue #552).

Dependencies

NCAR/ccpp-physics#611
NOAA-EMC/fv3atm#284
#527

climbfuji · 2021-04-30T12:43:34Z

CI tests passed for hash abf15d0

climbfuji · 2021-04-30T13:16:21Z

@DusanJovic-NOAA @junwang-noaa @DeniseWorthen @SMoorthi-emc I wanted to make sure that switching to assumed sizes does not affect the performance. I compared the regression test logs for hera.gnu and hera.intel between the current head of develop (after RRTMGP from yesterday) and this PR. The runtimes are remarkably similar, often within 1s of each other. Longer-running tests are mostly within 10s of each other; sometimes this PR is faster, sometimes it is slower. Tests with larger I/O show larger deviations (as is usually the case), but again there is no clear signal for or against either version of the code.
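
A comparison of this kind can be sketched as follows, assuming the per-test runtimes have already been extracted from the two sets of regression test logs. The function names, the test names and the numbers are hypothetical and are not taken from the actual logs; the 10s tolerance mirrors the figure quoted above.

```python
def runtime_deltas(baseline, candidate):
    """Return {test: candidate - baseline} in seconds for tests present in both runs."""
    return {t: candidate[t] - baseline[t] for t in baseline if t in candidate}

def outliers(deltas, tolerance_s=10.0):
    """Return, sorted by name, the tests whose runtime differs by more than the tolerance."""
    return sorted(t for t, d in deltas.items() if abs(d) > tolerance_s)

# Hypothetical runtimes in seconds; not taken from the actual logs.
develop = {"test_a": 120.0, "test_b": 300.0, "test_c": 900.0}
this_pr = {"test_a": 121.0, "test_b": 293.0, "test_c": 925.0}

deltas = runtime_deltas(develop, this_pr)
print(deltas)            # positive means this PR is slower, negative means faster
print(outliers(deltas))  # tests outside the tolerance, e.g. I/O-heavy ones
```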

junwang-noaa · 2021-04-30T13:22:42Z

Dom, thanks for testing and confirming the performance.

SMoorthi-emc · 2021-04-30T14:12:51Z

The speed is more or less the same in my C384L127 coupled test.

DeniseWorthen

This seems to have been well-reviewed on the ncar/ccpp side and well tested/documented for which tests change and why for ufs-weather. I'll approve on that basis.

climbfuji · 2021-04-30T20:36:18Z

fv3atm hash updated - please check and merge if ok.

climbfuji · 2021-04-30T20:41:24Z

Thank you!

* Change RRTMGP to RRTMG in suite_FV3_GFS_v17_p8 and suite_FV3_GFS_v17_coupled_p8
* deleted or modified some SDFs related to RRTMGP or Thompson schemes
* added a new SDF file for P8 with rrtmgp

dustinswales and others added 10 commits April 9, 2021 20:41

Updates to RRTMGP in FV3/CCPP. New RT for regional configuration usin…

f0df3ec

…g RRTMGP

Updated submodule to my forked repo.

1ab3c0e

Changed default namelist setting for RRTMGP.

a8edc64

Turned on feature to use GP LW flux-adjustment.

45b5e84

Removed mistake in previous commit

7b3f875

Update .gitmodules and submodule pointer for fv3atm for code review a…

efa164f

…nd testing

Merge branch 'develop' of https://github.com/ufs-community/ufs-weathe…

8536b4a

…r-model into capgen_fixes_assumed_sizes

Update submodule pointer for fv3atm

05fcb3f

Merge branch 'develop' of https://github.com/ufs-community/ufs-weathe…

d161b27

…r-model into capgen_fixes_assumed_sizes

Merge branch 'develop' of https://github.com/ufs-community/ufs-weathe…

18e89b0

…r-model into capgen_fixes_assumed_sizes

This was referenced Apr 22, 2021

Wrapper PR for assumed sizes, Fortran/metadata consistency fixes, bugfix Thompson MP, etc. NCAR/ccpp-physics#611

Merged

Use assumed-size arrays in CCPP, Fortran/metadata consistency fixes in CCPP NOAA-EMC/fv3atm#284

Merged

climbfuji requested review from junwang-noaa and DusanJovic-NOAA April 22, 2021 22:52

climbfuji marked this pull request as ready for review April 22, 2021 22:52

climbfuji added the Baseline Updates and Waiting for Reviews labels Apr 22, 2021

update submodule pointer for fv3atm

0c2ade8

climbfuji changed the title ~~WORK IN PROGRESS: use assumed-size arrays in CCPP, Fortran/metadata consistency fixes in CCPP~~ Use assumed-size arrays in CCPP, Fortran/metadata consistency fixes in CCPP Apr 23, 2021

dustinswales and others added 11 commits April 27, 2021 20:36

Updates to RRTMGP RTs.

c872110

Changes from code review.

a32d050

Changes from code review.

8b0238a

Updated FV3.

2c94fb5

Updated FV3

f8ee25f

Merge branch 'develop' of https://github.com/ufs-community/ufs-weathe…

af6c11c

…r-model into addLWadj_fullProfile

Update baseline date in rt.sh

5b3e399

Updated RT nml.

61b71de

Updated GP regional RTs

df74c2a

Merge branch 'develop' of https://github.com/ufs-community/ufs-weathe…

c401c7a

…r-model into capgen_fixes_assumed_sizes

Update submodule pointer for fv3atm

f74f47b

climbfuji added 4 commits April 29, 2021 21:17

Update submodule pointer for fv3atm

48dc9a9

Merge branch 'addLWadj_fullProfile' of https://github.com/dustinswale…

bff809a

…s/ufs-weather-model into capgen_fixes_assumed_sizes

Move ATMW tests to end of rt.conf, since WW3 is apparently not able t…

1ccf3b2

…o support two concurrent builds

Merge branch 'develop' of https://github.com/ufs-community/ufs-weathe…

6870f11

…r-model into capgen_fixes_assumed_sizes

climbfuji mentioned this pull request Apr 30, 2021

WW3 does not support concurrent builds #550

Closed

Regression test log for hera.gnu; run-ci

abf15d0

climbfuji requested review from DeniseWorthen and JessicaMeixner-NOAA April 30, 2021 12:04

DeniseWorthen reviewed Apr 30, 2021

tests/rt.conf

Regression test logs for cheyenne.gnu and hera.intel

db77c13

climbfuji added 3 commits April 30, 2021 09:11

Regression test logs for gaea.intel, jet.intel, wcoss_cray

e861ac2

Increase walltime for compile jobs from 50 minutes to 1 hour in tests…

b2e8559

…/fv3_conf/compile_bsub.IN_wcoss_dell_p3

Regression test log for cheyenne.intel

ca990ba

climbfuji mentioned this pull request Apr 30, 2021

Add csawmg debug test, add to rt_gnu.conf #552

Closed

DeniseWorthen approved these changes Apr 30, 2021

DusanJovic-NOAA approved these changes Apr 30, 2021

junwang-noaa approved these changes Apr 30, 2021

climbfuji added 2 commits April 30, 2021 14:04

Regression test logs for orion.intel and wcoss_dell_p3

15e1eaa

Revert change to .gitmodules and update submodule pointer for fv3atm

fb6273e

climbfuji added the Ready for Commit Queue label Apr 30, 2021

climbfuji mentioned this pull request Apr 30, 2021

Update GNU compiler on Cheyenne to 10.1.0 #554

Closed

DusanJovic-NOAA merged commit 4820bb8 into ufs-community:develop Apr 30, 2021

MinsukJi-NOAA mentioned this pull request May 3, 2021

Automatically checking if repos are up to date #547

Closed
