Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use assumed-size arrays in CCPP, Fortran/metadata consistency fixes in CCPP #527

Merged

Conversation

climbfuji
Copy link
Collaborator

@climbfuji climbfuji commented Apr 15, 2021

PR Checklist

  • Ths PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.

  • This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR

  • An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in the ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes in contained in the PR. The Issue and the dependent sub-component PR
    are specified below.

  • If new or updated input data is required by this PR, it is clearly stated in the text of the PR.

Description

Provide a detailed description of what this PR does. What bug does it fix, or what feature does it add? Is a change of answers expected from this PR? Are any library updates included in this PR (modulefiles etc.)?

This PR does the following:

  • Update the submodule pointers for ccpp-physics and fv3atm and the regression test baseline date tag
  • Small update of tests/ci/repo_check.sh from @MinsukJi-NOAA
  • Move coupled tests in rt.conf to the top, because these take the longest to wait in the queue and to run; BUT: needed to move ATMW tests to end of rt.conf, since WW3 is not able to support parallel builds. By moving it to the end, we are as "lucky" as before in a sense that only ten builds can happen at the same time, and the two builds that involve WW3 are sufficiently far apart to not overlap - see issue WW3 does not support concurrent builds #550
  • Change location of temporary/personal baseline directories on Cheyenne from $WORK to $SCRATCH to avoid disk space limitations
  • Increase walltime for compile jobs on wcoss_dell_p3 from 50min to 1hr

New regression test baseline is required for Intel, since the Intel compiler optimization leads to different answers when using assumed-size arrays in ccpp-physics.

Issue(s) addressed

none in ufs-weather-model (but see NCAR/ccpp-physics#611)

Testing

Preliminary testing

Regression tests were run against the existing baselines on Hera with Intel and GNU twice, once 2021/04/22 and once 2021/04/28 (each time the code was updated to include all changes from the trunk).

With GNU, all regression tests pass (i.e. in both PROD and DEBUG mode). With Intel, all DEBUG tests pass and many of the PROD tests as well: 72 tests pass, 31 tests have different results (but all run to completion). The failing tests fall in one or more of the following categories:

  • they use Zhao-Carr, i.e. all GFS v14 PROD tests
  • they use satmedmf or satmedmfq, i.e. all GFS v16 PROD tests
  • they use csawmg under the hood
  • addititonal tests that have different results: fv3_rrfs_v1alpha, fv3_rrfs_v1beta (only RESTART/phy_data.tile?.nc and RESTART/sfc_data.tile?.nc change, all other restart files and all 24h diag files are identical)

For details, see the attached log files.

GNU

2021/04/22
rt_hera_gnu_verify_against_existing.log

2021/04/28
rt_hera_gnu_verify_against_existing.log

Intel

2021/04/22
rt_hera_intel_verify_against_existing.log
rt_hera_intel_verify_against_existing_fail_test.log
rt_hera_intel_verify_against_existing_log_hera.intel.tar.gz

2021/04/28
rt_hera_intel_verify_against_existing.log
rt_hera_intel_verify_against_existing_fail_test.log
rt_hera_intel_verify_against_existing_log_hera.intel.tar.gz

Also, for the 2021/04/28, I created new baselines on jet to see which tests run to completion and which not. The following tests crash with
FATAL from PE 140: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability on jet.intel (but not on hera.intel):

fv3_csawmg
fv3_cpt
fv3_gfsv16_csawmgt
fv3_gfsv16_csawmg

Notably, all these tests change answers on hera.intel. Also, we were never able to run these tests with GNU, because they crashed with segmentation faults. Which means that there is a bug somewhere in the csawmg suite. Switching to assumed-size arrays will allow us to add a debug test in the future and identify the problem, hopefully. The jet crashes may also go away after the current RRTMGP PR is merged and this PR is updated, because currently the compiler flag modifications in FV3/ccpp/CMakeLists.txt are not applied on jet (but on all other platforms). This is fixed in the RRTMGP PR.

rt_jet_intel_create.log
rt_jet_intel_create_fail_test.log
rt_jet_intel_create_log_jet.intel.tar.gz

Update on csawmg crashes on jet. After applying bugfix NCAR/ccpp-physics@41d34d0 in ccpp-physics, the test fv3_csawmg runs in debug mode on jet.intel until it times out. In PROD mode, they now all run to completion. Yay. This update, however, became necessary only after switching to assumed-sized arrays, hence this will not be the solution for the long-standing GNU issues with csawmg.

Final regression testing.

CI tests passed for hash abf15d0

  • hera.intel
  • hera.gnu
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss_cray
  • wcoss_dell_p3

Note. After creating baselines on orion.intel, the verification step for test fv3_gfsv16_csawmg failed. I recreated the baseline for this test, updated the baseline and verified against it successfully. We may need to keep an eye on these tests and implement the DEBUG tests for those rather sooner than later (see issue #552).

Dependencies

NCAR/ccpp-physics#611
NOAA-EMC/fv3atm#284
#527

@climbfuji climbfuji changed the title WORK IN PROGRESS: use assumed-size arrays in CCPP, Fortran/metadata consistency fixes in CCPP Use assumed-size arrays in CCPP, Fortran/metadata consistency fixes in CCPP Apr 23, 2021
@climbfuji
Copy link
Collaborator Author

CI tests passed for hash abf15d0

@climbfuji
Copy link
Collaborator Author

@DusanJovic-NOAA @junwang-noaa @DeniseWorthen @SMoorthi-emc I wanted to make sure that switching to assumed sizes does not affect the performance. I compared the regression test logs for hera.gnu and hera.intel between the current head of develop (after RRTMGP from yesterday) with this PR. The runtimes are remarkably similar often within 1s of each other. Tests running longer are mostly within 10s of each other, sometimes this PR is faster, sometimes it is slower. Tests bigger larger I/O show larger deviations (as usually is the case), but again no clear signal pro/con one version of the code.

@junwang-noaa
Copy link
Collaborator

junwang-noaa commented Apr 30, 2021 via email

@SMoorthi-emc
Copy link
Contributor

SMoorthi-emc commented Apr 30, 2021 via email

Copy link
Collaborator

@DeniseWorthen DeniseWorthen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to have been well-reviewed on the ncar/ccpp side and well tested/documented for which tests change and why for ufs-weather. I'll approve on that basis.

@climbfuji climbfuji added the Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked. label Apr 30, 2021
@climbfuji
Copy link
Collaborator Author

fv3atm hash updated - please check and merge if ok.

@DusanJovic-NOAA DusanJovic-NOAA merged commit 4820bb8 into ufs-community:develop Apr 30, 2021
@climbfuji
Copy link
Collaborator Author

Thank you!

pjpegion pushed a commit to NOAA-PSL/ufs-weather-model that referenced this pull request Apr 4, 2023
* Change RRTMGP to RRTMG in suite_FV3_GFS_v17_p8 and suite_FV3_GFS_v17_coupled_p8
* deleted or modified some SDFs related to RRTMGP or Thompson schemes
* added a new SDF file for P8 with rrtmgp
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Baseline Updates Current baselines will be updated. Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked. Waiting for Reviews The PR is waiting for reviews from associated component PR's.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants