Regional application bitwise reproducibility problem using different MPI layout and/or threads #196

Closed
RatkoVasic-NOAA opened this issue Sep 2, 2020 · 20 comments
Labels: bug (Something isn't working)

@RatkoVasic-NOAA (Collaborator)

Description

Regional FV3 produces different results when run with different MPI layouts and/or different numbers of threads. Because of this, the application cannot pass regression tests in ufs-weather-model. The current regression tests only exercise restart and quilting capabilities, so the problem has probably existed for some time. An older version we checked (03/2020) shows the same behavior.

To Reproduce:

We are seeing this problem on WCOSS machines and Hera. Jim Abeles managed to get bit-identical results on Orion with the old code (03/2020).

To replicate the problem:
1. Go to ufs-weather-model/tests/
2. Run rt.sh -fk, using a short, two-line version of rt.conf:

COMPILE | CCPP=Y SUITES=FV3_GFS_2017_gfdlmp_regional 32BIT=Y REPRO=Y | standard | | fv3 |
RUN     | fv3_ccpp_regional_control                                  | standard | | fv3 |

NOTE: the -k option in rt.sh keeps the run directory.
3. Go to the run directory, save the history files, and submit the job again (using job_card), but this time change only one line in input.nml:
from
layout = 4,6
to
layout = 6,4
4. Compare the saved and new results (see the comparison sketch below).
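
For step 4, a minimal comparison sketch, assuming the history files are NetCDF files matching dynf*.nc and phyf*.nc and that the pre-change output was copied to ./saved/ (file and directory names are illustrative assumptions, not taken from the test):

   #!/bin/bash
   # Hypothetical sketch: byte-for-byte comparison of saved vs. rerun history files.
   # Assumes the first run's output was copied to ./saved/ before resubmitting.
   status=0
   for f in dynf*.nc phyf*.nc; do
     if cmp -s "saved/$f" "$f"; then
       echo "identical: $f"
     else
       echo "DIFFERS:   $f"
       status=1
     fi
   done
   exit $status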

@RatkoVasic-NOAA added the bug (Something isn't working) label on Sep 2, 2020
@arunchawla-NOAA

Is this true for multiple physics suites or the specific ones listed here?

@RatkoVasic-NOAA (Collaborator, Author)

Is this true for multiple physics suites or the specific ones listed here?

Any physics suite (two CCPP suites tested).

@yangfanglin (Collaborator)

yangfanglin commented Sep 2, 2020 via email

@RatkoVasic-NOAA (Collaborator, Author)

RatkoVasic-NOAA commented Sep 2, 2020

Both C768 and C96.
It's one face, but the number of points is:

       npx      = 211
       npy      = 193

I'll try different layouts. That still doesn't explain the differences with threads, though; maybe this change will fix the threading as well!?
UPDATE:
Unfortunately, that didn't help. I used a layout with both nx-1 and ny-1 divisible by 2 and by 3:

<        layout   = 2,3
---
>        layout   = 3,2

And the results still differ.

@climbfuji (Collaborator)

We do have threading tests for the global runs, and these pass on all machines every time we merge a commit. So this must be something specific to the regional application of the code. I know that there is quite a bit of code in the dycore (GFDL_atmos_cubed_sphere) that is only executed for regional and/or nested runs.

One thing we should do to further drill down on this is to test a nested config. It would be good to know if the problem exists only for ntiles=1 or also for ntiles=7.

In the past, I fixed some obviously wrong code in the dycore for regional applications (routine exchange_uv) that didn't cause a problem on any of the NOAA RDHPC systems, but did on Cheyenne (run-to-run differences with exactly the same setup). In that case it was an error in the MPI code in that routine.

To my knowledge, there is no code in the CCPP physics that depends on the number of tiles or whether it is a global, regional or nested setup. Thus it seems more likely - but no guarantee, of course - that this is a problem with the dycore or the fv3atm model (not initializing everything properly for coldstarts/restarts) than with the CCPP physics.

Do you want me to help debugging this issue, or are you going to take care of it?

@climbfuji (Collaborator)

Here is an interesting twist; I am not sure whether it is related, or whether it has to do with the jet software stack or build config.

When I create a new baseline on jet using ecflow and then verify against it, I get b4b differences for fv3_ccpp_decomp, i.e. when changing the decomposition. That is a global run. That said, I also get b4b differences for all tests when I don't use ecflow (i.e. compiling on the login node as opposed to on the compute node), so there might be something buggy with the jet setup in general.

@RatkoVasic-NOAA (Collaborator, Author)

Do you want me to help debugging this issue, or are you going to take care of it?

Dom, we would really appreciate your help in solving this problem.
Maybe this will help: on Hera, I created a small test setup at
/scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/wrk/REG_RT
with one source directory and two run directories differing only in layout:

Hera:/scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/wrk/REG_RT>diff  run_*/input.nml
39c39
<        layout   = 2,3
---
>        layout   = 3,2

Job cards point to the same executable:

Hera:/scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/wrk/REG_RT>ll run_*/job*
-rwxr--r-- 1 Ratko.Vasic fv3-cam 624 Sep  2 21:56 run_1/job_card
-rwxr--r-- 1 Ratko.Vasic fv3-cam 624 Sep  2 22:07 run_2/job_card
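
For anyone recreating a similar two-directory comparison, a minimal sketch (the paths and sed pattern below are illustrative assumptions, not copied from the directories above):

   # Hypothetical sketch: clone the run directory and flip only the layout entry.
   cp -r run_1 run_2
   sed -i 's/layout   = 2,3/layout   = 3,2/' run_2/input.nml
   # job_card is copied along unchanged, so both runs use the same executable.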

@climbfuji (Collaborator)

Do you want me to help debugging this issue, or are you going to take care of it?

Dom, we would really appreciate your help in solving this problem.
Maybe this will help: on Hera, I created a small test setup at
/scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/wrk/REG_RT
with one source directory and two run directories differing only in layout:

Hera:/scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/wrk/REG_RT>diff  run_*/input.nml
39c39
<        layout   = 2,3
---
>        layout   = 3,2

Job cards point to the same executable:

Hera:/scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/wrk/REG_RT>ll run_*/job*
-rwxr--r-- 1 Ratko.Vasic fv3-cam 624 Sep  2 21:56 run_1/job_card
-rwxr--r-- 1 Ratko.Vasic fv3-cam 624 Sep  2 22:07 run_2/job_card

Do you know if this runs with the release/public-v2 branch (essentially develop, just before the ESMF 8.1.0 bs21 update was made)?

@RatkoVasic-NOAA (Collaborator, Author)

This is from the git log.

commit 1e4edf0ac90d8de714becfa362c36de8758b8281 (HEAD -> develop, origin/develop, origin/HEAD)
Author: Dom Heinzeller <dom.heinzeller@icloud.com>
Date:   Wed Aug 26 09:40:41 2020 -0600

    develop: cleanup, remove legacy code, minor bugfixes (#190)

BTW, we tested older code (March 2020) and saw the same behavior.

@climbfuji (Collaborator)

This is from the git log.

commit 1e4edf0ac90d8de714becfa362c36de8758b8281 (HEAD -> develop, origin/develop, origin/HEAD)
Author: Dom Heinzeller <dom.heinzeller@icloud.com>
Date:   Wed Aug 26 09:40:41 2020 -0600

    develop: cleanup, remove legacy code, minor bugfixes (#190)

BTW, we tested older code (March 2020) and saw the same behavior.

I'll use release/public-v2 then, since that is most relevant for this problem. We can bring bugfixes back to develop if needed.

@climbfuji (Collaborator)

OK, I could reproduce the problem when compiling the code as

./compile_cmake.sh $PWD/.. hera.intel 'CCPP=Y SUITES=FV3_GFS_v15_thompson_mynn' '' NO NO 2>&1 | tee compile.log

Thus this also happens with double-precision dynamics.
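
For comparison, the 32-bit reproduction from the original report corresponds to a compile along the lines of the sketch below, reusing the 32BIT=Y and REPRO=Y options from the rt.conf line quoted earlier (the exact option string is an assumption):

   # Hypothetical sketch: same compile script, but with 32-bit dynamics as in the original report.
   ./compile_cmake.sh $PWD/.. hera.intel 'CCPP=Y SUITES=FV3_GFS_2017_gfdlmp_regional 32BIT=Y REPRO=Y' '' NO NO 2>&1 | tee compile.log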

@llpcarson (Collaborator)

I don't know if this is related, but the HWRF team recently had reproducibility issues on jet due to the heterogeneous nodes and compiler optimizations; these appeared after an Intel compiler version update (jet/HWRF had been using an old compiler version). So it could be a processor/node difference among the specific jet partitions (tjet, ujet, sjet, etc.).

@climbfuji (Collaborator)

Does anyone know what type of nodes the login nodes on jet are? Are they the same as one of the jet partitions?

@climbfuji (Collaborator)

All, this is definitely something in the dycore, and it happens right at the beginning. I am stopping the model in atmos_model.F90 around line 530, right after the call to

   call atmosphere_init (Atmos%Time_init, Atmos%Time, Atmos%Time_step,&
                         Atmos%grid, Atmos%area)

This is before CCPP (or IPD) is even initialized; only a first pass through the dycore has been made for initialization. At this point, the tracer array Atm(mygrid)%q is already different. Some of the other diagnostic output is also different:

0:  After adi: W max =    1.16725366723497       min =  -0.438392996489926
0:  na_ini Z500   5754.19968088326        5731.28070372451
0:   0.000000000000000E+000   5868.19062118756

versus

0:  After adi: W max =    1.16727826557961       min =  -0.438352988116583
0:  na_ini Z500   5754.19968090299        5731.28070391164
0:   0.000000000000000E+000   5868.19062037472
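
A quick way to line up these diverging diagnostics is to grep them out of both run logs and diff them; a sketch, assuming stdout was captured to out.run_1 and out.run_2 (hypothetical file names):

   # Hypothetical sketch: extract the early dycore diagnostics from both logs
   # and diff them to see where the two layouts first disagree.
   for tag in "After adi" "na_ini Z500"; do
     diff <(grep "$tag" out.run_1) <(grep "$tag" out.run_2)
   done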

@DusanJovic-NOAA (Collaborator)

This commit (1150bf5), made on Jun 5, is the last commit before the FV3 dynamical core was updated to GFDL 201912. It gives bit-identical output with the 4x6 and 6x4 layouts when configured with do_sat_adj = .F. Tested on Hera using the fv3_ccpp_regional_control test.
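
For reference, a sketch of flipping that switch in an existing run directory before rerunning the test (the sed pattern is an assumption about how do_sat_adj appears in input.nml):

   # Hypothetical sketch: disable the fast saturation adjustment before rerunning
   # fv3_ccpp_regional_control; adjust the pattern to the namelist's actual spelling.
   sed -i 's/do_sat_adj *= *\.[A-Za-z]*\./do_sat_adj = .F./' input.nml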

@climbfuji (Collaborator)

@DusanJovic-NOAA FYI, Chan-Hoo found a bug in the regional code in the dycore: a missing regional boundary update/exchange of the extra (fourth) row for the velocities on the C and D grids. A PR is to come. This resolves the threading and layout b4b differences for the GFDL-MP (and presumably Zhao-Carr) physics runs, but not yet for Thompson+MYNN, which means that there is another bug, this time in the physics.

@climbfuji (Collaborator)

The halo boundary update bugfix in the FV3 dycore went in with PR #208 (NOAA-EMC/GFDL_atmos_cubed_sphere#40).

Other issues such as the rewrite of the MPI reduce function and the bug in the dynamics-physics update step for Thompson MP still need to be addressed.

@arunchawla-NOAA

@climbfuji and @RatkoVasic-NOAA, was this problem solved? I thought it was. If so, can you close this ticket?

@RatkoVasic-NOAA (Collaborator, Author)

@arunchawla-NOAA Yes. I'll close the ticket.

@climbfuji (Collaborator)

This was solved only for the SRW app public release branch. Following discussion with GFDL, this solution should not be brought over to the main development branch; see issue NOAA-EMC/GFDL_atmos_cubed_sphere#55 for more information. You can keep this issue closed, because we do have the issue open in the correct repository.

pjpegion pushed a commit to NOAA-PSL/ufs-weather-model.p7b that referenced this issue Jul 20, 2021
* update post lib to upp lib and add dzmin change in fv3 dycore
* add dycore change ufs-community#35
* merge with top of dev/emc dycore branch
* remove duplicate read_data in fms_mod in external_ic.F90
epic-cicd-jenkins pushed a commit that referenced this issue Apr 17, 2023
…ach repository (#304)

## DESCRIPTION OF CHANGES: 
The new top-level cmake build for the SRW App ([SRW App PR#27](ufs-community/ufs-srweather-app#27)) results in some executables having different names. This PR makes modifications that
 1. Allow the workflow to run successfully with the new cmake build and its different executable names, and
 2. Allow back-compatibility with the old build system to allow for a gradual transition to new build system

This PR also explicitly disallows running the workflow without CCPP, which we decided against supporting several months ago. I don't think the capability even works, so this shouldn't affect anyone at this time.

## TESTS CONDUCTED: 
 - **Cheyenne**: Build and end-to-end test ("DOT_OR_USCORE" test case) was successful on Cheyenne with intel, both for the cmake build and the old build script (that will soon be deprecated). 
 - **Hera**: Build and end-to-end tests successful (aside from expected failures). Also built with old build script successfully.
 - **Jet**: Build test was successful. 

## ISSUE: 
It was not the primary aim of this PR, but it does partially resolve #196
epic-cicd-jenkins pushed a commit that referenced this issue Apr 17, 2023
## DESCRIPTION OF CHANGES:
This PR removes the USE_CCPP variable from all scripts and other files.  The workflow only supports running the FV3 model with CCPP, so USE_CCPP is deprecated.

## TESTS CONDUCTED: 
Ran one WE2E test (regional_002) on hera.  Succeeded.

## ISSUE (optional):
This resolves Issue #196.