Regional application bitwise reproducibility problem using different MPI layout and/or threads #196
Comments
Is this true for multiple physics suites or the specific ones listed here?
Any physics suite (two CCPP suites tested).
What is the C* resolution? Can it be divided by both 4 and 6?
Fanglin
Both C768 and C96.
I'll try different layouts. That still doesn't explain the differences between thread counts; maybe these changes will fix the threading issue as well!?
And the results still differ.
We do have threading tests for the global runs, and these pass on all machines every time we merge a commit. So this must be something specific to the regional application of the code. I know that there is quite some code in the dycore (…).

One thing we should do to further drill down on this is to test a nested config. It would be good to know if the problem exists only for …

In the past, I fixed some obviously wrong code in the dycore for regional applications (routine …).

To my knowledge, there is no code in the CCPP physics that depends on the number of tiles or whether it is a global, regional or nested setup. Thus it seems more likely - but no guarantee, of course - that this is a problem with the dycore or the fv3atm model (not initializing everything properly for coldstarts/restarts) than with the CCPP physics.

Do you want me to help debug this issue, or are you going to take care of it?
Here is an interesting twist; not sure if it is related, or whether it has to do with the jet software stack or build config. When I create a new baseline on jet using ecflow and then verify against it, I get b4b differences for …
Dom, we would really appreciate your help in solving this problem.
Job cards point to the same executable: …
Do you know if this runs with the release/public-v2 branch (essentially develop, just before the ESMF 8.1.0 bs21 update was made)?
This is from the git log.
BTW, we tested older code (March 2020), and we had the same results.
I'll use release/public-v2 then, since it is most relevant for this problem. We can also bring bugfixes back to develop if needed.
OK, I could reproduce the problem when compiling the code as …
Thus this also happens with double-precision dynamics.
Don't know if this is related, but the HWRF team recently had reproducibility issues on jet due to the heterogeneous nodes and compiler optimizations; these surfaced with an Intel version update (jet/HWRF had been using an old compiler version). So it could be a processor/node difference among the specific jet partitions (tjet, ujet, sjet, etc.).
Does anyone know what type of nodes the login nodes on jet are? Are they the same as one of the jet partitions?
All, this is definitely something in the dycore, and it happens right at the beginning. I am stopping the model in …
This is before CCPP (or IPD) is even initialized; only a first pass through the dycore has been made for initialization. At this point, the tracer array …
versus …
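For this kind of drill-down, a deterministic signature of the field printed at the stopping point makes the two runs easy to diff in the log files. Below is a minimal, hypothetical helper (not code from FV3; the module and subroutine names are made up). It assumes the array passed in is comparable between the two runs, e.g. a gathered or single-rank copy of the tracer field, since local sub-arrays from different layouts do not line up.

```fortran
module debug_signature_mod
  implicit none
contains
  ! Hypothetical helper, not part of FV3: print signatures of a field
  ! (min, max, and the raw bit pattern of a fixed-order sum) so that two
  ! runs can be compared line by line in their log files.
  subroutine print_array_signature(name, a)
    character(len=*), intent(in) :: name
    real(8),          intent(in) :: a(:,:,:)
    real(8)    :: s
    integer(8) :: bits
    integer    :: i, j, k
    s = 0.0d0
    do k = 1, size(a, 3)        ! fixed loop order -> deterministic sum
      do j = 1, size(a, 2)
        do i = 1, size(a, 1)
          s = s + a(i, j, k)
        end do
      end do
    end do
    bits = transfer(s, bits)    ! raw 64-bit pattern of the sum
    print '(a,2(1x,es24.16),1x,z16.16)', trim(name)//':', &
          minval(a), maxval(a), bits
  end subroutine print_array_signature
end module debug_signature_mod
```

Printing this for the tracer array in both the 4x6 and 6x4 runs at the same point shows immediately whether the field has already diverged there.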
This commit (1150bf5), made on Jun 5, which is the last commit before the FV3 dynamical core was updated to GFDL 201912, gives bit-identical outputs with the 4x6 and 6x4 layouts when configured with …
@DusanJovic-NOAA FYI, Chan-Hoo found a bug in the regional code in the dycore: a missing regional boundary update/exchange of the extra (fourth) row for the velocities on the C and D grids. PR to come. This solves the threading and layout b4b differences for the GFDL-MP (and presumably Zhao-Carr) physics runs, but not yet for Thompson+MYNN, which means that there is another bug, but this time in the physics.
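To illustrate the kind of bug described above, without reproducing the dycore's actual boundary routines, here is a minimal, self-contained MPI sketch (all names hypothetical) of exchanging one extra row between neighbouring subdomains. If such an exchange is missing for a row a neighbour actually uses, each rank computes with whatever stale value it holds locally, and the result then depends on the decomposition.

```fortran
! Minimal sketch, not the dycore's actual routine: exchange one extra row of
! a velocity-like field between neighbouring subdomains in a 1-D chain.
program halo_row_exchange
  use mpi
  implicit none
  integer, parameter :: nx = 8           ! local row length
  integer :: ierr, rank, nprocs, north, south
  real(8) :: interior_row(nx), halo_row(nx)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Neighbour ranks; MPI_PROC_NULL at the ends of the chain.
  north = rank + 1; if (north == nprocs) north = MPI_PROC_NULL
  south = rank - 1; if (south < 0)       south = MPI_PROC_NULL

  interior_row = dble(rank)              ! stand-in for the row a neighbour needs
  halo_row     = -9999.0d0               ! stale value if no exchange happens

  ! Send my last interior row north, receive my southern neighbour's row.
  call MPI_Sendrecv(interior_row, nx, MPI_DOUBLE_PRECISION, north, 0, &
                    halo_row,     nx, MPI_DOUBLE_PRECISION, south, 0, &
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)

  print '(a,i0,a,f8.1)', 'rank ', rank, ' halo row value = ', halo_row(1)
  call MPI_Finalize(ierr)
end program halo_row_exchange
```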
The halo boundary update bugfix in the FV3 dycore went in with PR #208 (NOAA-EMC/GFDL_atmos_cubed_sphere#40). Other issues such as the rewrite of the MPI reduce function and the bug in the dynamics-physics update step for Thompson MP still need to be addressed.
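For context on the MPI reduce rewrite mentioned above: floating-point addition is not associative, so a reduction that combines partial sums in a decomposition-dependent or implementation-defined order can change the last bits of the result. One standard remedy is to gather the field in global index order and add it in that fixed order. The following self-contained sketch (a 1-D decomposition with hypothetical names, not the model's actual routine) illustrates the idea.

```fortran
! Minimal sketch of a decomposition-independent global sum: every rank owns a
! contiguous slice of a 1-D "global" field; the slices are gathered in global
! index order and added in that fixed order, so the result does not change
! when the number of ranks changes.  MPI_Allreduce, by contrast, may combine
! the partial sums in an implementation-defined order.
program reproducible_sum
  use mpi
  implicit none
  integer, parameter :: nglobal = 1000
  integer :: ierr, rank, nprocs, i, nlocal, remainder, offset
  integer, allocatable :: counts(:), displs(:)
  real(8), allocatable :: local(:), global(:)
  real(8) :: fixed_sum

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Block-decompose the global index range 1..nglobal over the ranks.
  allocate(counts(nprocs), displs(nprocs))
  remainder = mod(nglobal, nprocs)
  do i = 1, nprocs
    counts(i) = nglobal / nprocs
    if (i <= remainder) counts(i) = counts(i) + 1
    displs(i) = sum(counts(1:i-1))
  end do
  nlocal = counts(rank+1)
  offset = displs(rank+1)

  ! Fill the local slice with values that depend only on the global index.
  allocate(local(nlocal))
  do i = 1, nlocal
    local(i) = 1.0d0 / dble(offset + i)
  end do

  ! Gather the full field on every rank in global index order, then sum it
  ! in that fixed order: the answer is bitwise identical for any nprocs.
  allocate(global(nglobal))
  call MPI_Allgatherv(local, nlocal, MPI_DOUBLE_PRECISION, &
                      global, counts, displs, MPI_DOUBLE_PRECISION, &
                      MPI_COMM_WORLD, ierr)
  fixed_sum = 0.0d0
  do i = 1, nglobal
    fixed_sum = fixed_sum + global(i)
  end do

  if (rank == 0) print '(a,es23.16)', 'fixed-order global sum = ', fixed_sum
  call MPI_Finalize(ierr)
end program reproducible_sum
```

The trade-off is the extra communication and memory for the gathered field, which is why such exact sums are typically reserved for diagnostics or for places where reproducibility matters more than speed.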
@climbfuji and @RatkoVasic-NOAA, was this problem solved? I thought it was. If yes, can you close this ticket?
@arunchawla-NOAA Yes. I'll close the ticket.
This was solved only for the SRW app public release branch. Following discussion with GFDL, this solution should not be brought over to the main development branch; see issue NOAA-EMC/GFDL_atmos_cubed_sphere#55 for more information. You can keep this issue closed, because we do have the issue open in the correct repository.
* update post lib to upp lib and add dzmin change in fv3 dycore
* add dycore change ufs-community#35
* merge with top of dev/emc dycore branch
* remove duplicate read_data in fms_mod in external_ic.F90
…ach repository (#304)

## DESCRIPTION OF CHANGES:
The new top-level cmake build for the SRW App ([SRW App PR#27](ufs-community/ufs-srweather-app#27)) results in some executables having different names. This PR makes modifications that
1. Allow the workflow to run successfully with the new cmake build and its different executable names, and
2. Allow back-compatibility with the old build system to allow for a gradual transition to the new build system.

This PR also explicitly disallows running the workflow without CCPP, which we decided against supporting several months ago. I don't think the capability even works, so this shouldn't affect anyone at this time.

## TESTS CONDUCTED:
- **Cheyenne**: Build and end-to-end test ("DOT_OR_USCORE" test case) was successful on Cheyenne with intel, both for the cmake build and the old build script (that will soon be deprecated).
- **Hera**: Build and end-to-end tests successful (aside from expected failures). Also built with the old build script successfully.
- **Jet**: Build test was successful.

## ISSUE:
It was not the primary aim of this PR, but it does partially resolve #196.
## DESCRIPTION OF CHANGES:
This PR removes the USE_CCPP variable from all scripts and other files. The workflow only supports running the FV3 model with CCPP, so USE_CCPP is deprecated.

## TESTS CONDUCTED:
Ran one WE2E test (regional_002) on hera. Succeeded.

## ISSUE (optional):
This resolves Issue #196.
Description
Regional FV3 produces different results when using a different MPI layout and/or a different number of threads. This application cannot pass the regression tests in ufs-weather-model. The current regression tests exercise only the restart and quilting capabilities, so this problem has probably existed for some time. An older version that was checked (03/2020) shows the same behavior.
To Reproduce:
We are seeing this problem on WCOSS machines and Hera. Jim Abeles managed to get bit-identical results on Orion with the old code (03/2020).
To replicate the problem:
1. Go to ufs-weather-model/tests/
2. Run rt.sh -fk, using a short, 2-line version of rt.conf:
NOTE: the -k option in rt.sh saves the run directory.
3. Go to the run directory, save the history files, and submit the job again (using job_card), but this time change only one line in input.nml:
from
layout = 4,6
to
layout = 6,4
4. Compare the saved and new results; a simple bitwise file comparison is sketched below.
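For step 4, the simplest check is a byte-wise comparison of the saved history files against the rerun output (for example with cmp). As a self-contained alternative in the same spirit, here is a small, hypothetical Fortran utility (not part of the repository) that reports the first differing byte between two files passed on the command line.

```fortran
! Hypothetical stand-alone checker (the usual tool is simply `cmp`): read two
! files as raw byte streams and report the first differing byte.
program bitwise_compare
  implicit none
  character(len=256) :: file1, file2
  integer(1) :: b1, b2
  integer    :: u1, u2, ios1, ios2
  integer(8) :: pos

  call get_command_argument(1, file1)
  call get_command_argument(2, file2)
  open(newunit=u1, file=trim(file1), access='stream', form='unformatted', status='old')
  open(newunit=u2, file=trim(file2), access='stream', form='unformatted', status='old')

  pos = 0
  do
    read(u1, iostat=ios1) b1
    read(u2, iostat=ios2) b2
    if (ios1 /= 0 .or. ios2 /= 0) exit
    pos = pos + 1
    if (b1 /= b2) then
      print '(a,i0)', 'files differ at byte ', pos
      stop 1
    end if
  end do
  if (ios1 /= ios2) then
    print '(a)', 'files have different lengths'
  else
    print '(a)', 'files are bit-for-bit identical'
  end if
end program bitwise_compare
```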