Use assumed-size arrays in CCPP, Fortran/metadata consistency fixes in CCPP #527
Conversation
CI tests passed for hash abf15d0
@DusanJovic-NOAA @junwang-noaa @DeniseWorthen @SMoorthi-emc I wanted to make sure that switching to assumed-size arrays does not affect the performance. I compared the regression test logs for hera.gnu and hera.intel between the current head of develop (after the RRTMGP merge from yesterday) and this PR. The runtimes are remarkably similar, often within 1 s of each other. Tests running longer are mostly within 10 s of each other; sometimes this PR is faster, sometimes it is slower. Tests with larger I/O show larger deviations (as is usually the case), but again no clear signal for or against either version of the code.
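For context, here is a minimal, self-contained sketch of the two declaration styles involved in this switch. The subroutine and variable names are illustrative only, not taken from ccpp-physics:

```fortran
module demo_arrays
  implicit none
contains
  ! Assumed-shape dummy argument: bounds travel with the argument
  ! (a descriptor is passed) and an explicit interface is required.
  subroutine scale_shape(q, factor)
    real, intent(inout) :: q(:,:)
    real, intent(in)    :: factor
    q = q * factor
  end subroutine scale_shape

  ! Assumed-size dummy argument: only the base address and the leading
  ! extent are passed; the last dimension is '*', so the compiler can
  ! assume contiguous storage. The declared extents must then agree
  ! with what the caller (and, in CCPP, the metadata) promises.
  subroutine scale_size(q, im, km, factor)
    integer, intent(in) :: im, km
    real, intent(inout) :: q(im,*)
    real, intent(in)    :: factor
    integer :: i, k
    do k = 1, km
      do i = 1, im
        q(i,k) = q(i,k) * factor
      end do
    end do
  end subroutine scale_size
end module demo_arrays

program demo
  use demo_arrays
  implicit none
  real :: q(4,3)
  q = 1.0
  call scale_shape(q, 2.0)      ! q is now 2.0 everywhere
  call scale_size(q, 4, 3, 0.5) ! q is back to 1.0
  print *, q
end program demo
```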
Dom, thanks for testing and confirming the performance.
The speed is more or less the same in my C384L127 coupled test.
This seems to have been well reviewed on the ncar/ccpp side, and well tested and documented on the ufs-weather-model side in terms of which tests change and why. I'll approve on that basis.
fv3atm hash updated - please check and merge if ok.
Thank you!
* Change RRTMGP to RRTMG in suite_FV3_GFS_v17_p8 and suite_FV3_GFS_v17_coupled_p8
* deleted or modified some SDFs related to RRTMGP or Thompson schemes
* added a new SDF file for P8 with rrtmgp
PR Checklist
- This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.
- This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.
- An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes contained in the PR. The Issue and the dependent sub-component PRs are specified below.
- If new or updated input data is required by this PR, it is clearly stated in the text of the PR.
Description
Provide a detailed description of what this PR does. What bug does it fix, or what feature does it add? Is a change of answers expected from this PR? Are any library updates included in this PR (modulefiles etc.)?
This PR does the following:
- Updates `tests/ci/repo_check.sh` (change from @MinsukJi-NOAA) to support two concurrent builds.
- Moves the longest-running tests to the top of `rt.conf`, because these take the longest to wait in the queue and to run. BUT: the ATMW tests needed to move to the end of `rt.conf`, since WW3 is not able to support parallel builds. By moving them to the end, we are as "lucky" as before in the sense that only ten builds can happen at the same time, and the two builds that involve WW3 are sufficiently far apart to not overlap - see issue #550 (WW3 does not support concurrent builds).
- A new regression test baseline is required for Intel, since the Intel compiler optimization leads to different answers when using assumed-size arrays in ccpp-physics (see the sketch right after this list).
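To illustrate that last point, here is a minimal sketch (not from this PR) of why a compiler that reassociates or vectorizes a sum, which is easier with contiguous assumed-size arrays, can legitimately change floating-point answers:

```fortran
! Floating-point addition is not associative, so two evaluation
! orders of the same sum can round differently.
program fp_reassoc
  implicit none
  real :: a, b, c
  a = 1.0e8
  b = -1.0e8
  c = 1.0
  print *, '(a+b)+c =', (a + b) + c   ! prints 1.0
  print *, 'a+(b+c) =', a + (b + c)   ! prints 0.0 in single precision
end program fp_reassoc
```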
Issue(s) addressed
none in ufs-weather-model (but see NCAR/ccpp-physics#611)
Testing
Preliminary testing
Regression tests were run against the existing baselines on Hera with Intel and GNU twice, once on 2021/04/22 and once on 2021/04/28 (each time, the code was updated to include all changes from the trunk).
With GNU, all regression tests pass (i.e. in both PROD and DEBUG mode). With Intel, all DEBUG tests pass, as do many of the PROD tests: 72 tests pass, and 31 tests have different results (but all run to completion). The failing tests fall into one or more of the following categories:

- tests using `satmedmf` or `satmedmfq`, i.e. all GFS v16 PROD tests
- tests using `csawmg` under the hood
- `fv3_rrfs_v1alpha` and `fv3_rrfs_v1beta` (only `RESTART/phy_data.tile?.nc` and `RESTART/sfc_data.tile?.nc` change; all other restart files and all 24h diag files are identical)

For details, see the attached log files; a sketch of how individual restart fields can be compared follows the logs below.
GNU
2021/04/22
rt_hera_gnu_verify_against_existing.log
2021/04/28
rt_hera_gnu_verify_against_existing.log
Intel
2021/04/22
rt_hera_intel_verify_against_existing.log
rt_hera_intel_verify_against_existing_fail_test.log
rt_hera_intel_verify_against_existing_log_hera.intel.tar.gz
2021/04/28
rt_hera_intel_verify_against_existing.log
rt_hera_intel_verify_against_existing_fail_test.log
rt_hera_intel_verify_against_existing_log_hera.intel.tar.gz
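On the restart-file differences noted above: a hedged sketch of how a single field in two restart files could be compared using the netcdf-fortran API. The file paths, variable name, and tile size below are hypothetical, and this is not the comparison mechanism the regression test scripts actually use:

```fortran
program cmp_restart
  use netcdf
  implicit none
  integer, parameter :: nx = 96, ny = 96   ! hypothetical tile size
  real :: a(nx,ny), b(nx,ny)
  integer :: ncid1, ncid2, varid1, varid2, ierr

  ! Open the baseline and test restart files (paths are illustrative).
  ierr = nf90_open('baseline/phy_data.tile1.nc', NF90_NOWRITE, ncid1)
  ierr = nf90_open('run/phy_data.tile1.nc',      NF90_NOWRITE, ncid2)

  ! Read the same (hypothetical) variable from both files.
  ierr = nf90_inq_varid(ncid1, 'tsfc', varid1)
  ierr = nf90_inq_varid(ncid2, 'tsfc', varid2)
  ierr = nf90_get_var(ncid1, varid1, a)
  ierr = nf90_get_var(ncid2, varid2, b)

  ! Report the largest absolute difference; 0.0 means this field is
  ! bit-identical (ignoring file metadata).
  print *, 'max abs diff =', maxval(abs(a - b))

  ierr = nf90_close(ncid1)
  ierr = nf90_close(ncid2)
end program cmp_restart
```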
Also, for the 2021/04/28 runs, I created new baselines on jet to see which tests run to completion and which do not. The following tests crash with

FATAL from PE 140: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability

on jet.intel (but not on hera.intel); a minimal sketch of such a NaN guard follows the attached jet logs below. Notably, all these tests change answers on hera.intel. Also, we were never able to run these tests with GNU, because they crashed with segmentation faults, which means that there is a bug somewhere in the `csawmg` suite. Switching to assumed-size arrays will allow us to add a debug test in the future and, hopefully, identify the problem. The jet crashes may also go away after the current RRTMGP PR is merged and this PR is updated, because currently the compiler flag modifications in `FV3/ccpp/CMakeLists.txt` are not applied on jet (but are on all other platforms). This is fixed in the RRTMGP PR.

rt_jet_intel_create.log
rt_jet_intel_create_fail_test.log
rt_jet_intel_create_log_jet.intel.tar.gz
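For context, a minimal sketch of the kind of NaN guard behind that FATAL message. This is an illustration only, not the actual FMS `mpp_reproducing_sum` code:

```fortran
program nan_guard
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_value, &
                                           ieee_quiet_nan
  implicit none
  real :: field(4)
  field = 1.0
  field(3) = ieee_value(field(3), ieee_quiet_nan)  ! inject a NaN

  ! A reduction routine can check its input before summing and abort
  ! loudly instead of silently propagating the NaN into the result.
  if (any(ieee_is_nan(field))) then
    print *, 'FATAL: NaN in input field, this indicates numerical instability'
    stop 1
  end if
  print *, 'sum =', sum(field)
end program nan_guard
```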
Update on `csawmg` crashes on jet. After applying bugfix NCAR/ccpp-physics@41d34d0 in ccpp-physics, the test `fv3_csawmg` runs in debug mode on jet.intel until it times out. In PROD mode, the `csawmg` tests now all run to completion. Yay. This update, however, became necessary only after switching to assumed-size arrays, hence it will not be the solution for the long-standing GNU issues with `csawmg`.
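A hedged illustration of why a DEBUG test helps here: with bounds checking enabled (e.g. gfortran's `-fcheck=bounds`), the hypothetical off-by-one below aborts with a clear runtime error, whereas an optimized PROD build may silently corrupt memory or segfault, much like the `csawmg` GNU failures:

```fortran
! Hypothetical bug for illustration, not code from the csawmg scheme.
program oob_demo
  implicit none
  real :: q(10)
  integer :: i
  q = 0.0
  do i = 1, 11          ! off-by-one: i = 11 writes past the array
    q(i) = real(i)
  end do
  print *, q(10)
end program oob_demo
```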
Final regression testing
CI tests passed for hash abf15d0
Note: After creating baselines on orion.intel, the verification step for the test `fv3_gfsv16_csawmg` failed. I recreated the baseline for this test, updated the baseline, and verified against it successfully. We may need to keep an eye on these tests and implement the DEBUG tests for them sooner rather than later (see issue #552).

Dependencies
NCAR/ccpp-physics#611
NOAA-EMC/fv3atm#284
#527