Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quilting restart does not work for 32-bit physics #1882

Closed
jkbk2004 opened this issue Aug 28, 2023 · 10 comments · Fixed by #1893
Closed

Quilting restart does not work for 32-bit physics #1882

jkbk2004 opened this issue Aug 28, 2023 · 10 comments · Fixed by #1893
Assignees
Labels
bug Something isn't working

Comments

@jkbk2004
Copy link
Collaborator

Description

In #1880, conus13km_control_qr_intel, conus13km_control_qr_gnu and hrrr_c3_gnu were introduced but turned off due to crashing on Cheyenne. ESMF accesses uninitialized memory extensively during quilt server initialization. This is easy to detect on any platform using valgrind. The result is gibberish data sent to the NetCDF library. If that gibberish includes a signalling NaN, then the model will crash. A suspected bug in ESMF, or the model. The PR has some bug fixes for 32-bit physics with quilting restart, but not enough to get it to work.

To Reproduce:

Run baseline creation for those cases on Cheyenne. It dumps out a traceback error: 144: file: module_write_restart_netcdf.F90 line: 452 NetCDF: HDF error

Additional context

Output

@junwang-noaa
Copy link
Collaborator

@jkbk2004 @SamuelTrahanNOAA Will the changes in PR#1853 fix the issue?

@SamuelTrahanNOAA
Copy link
Collaborator

Will the changes in PR#1853 fix the issue?

I haven't tried, but I don't expect it to make any difference. The NetCDF library does automatic type conversion, so type mismatches between the provided data and output type shouldn't matter.

@SamuelTrahanNOAA
Copy link
Collaborator

@junwang-noaa @DusanJovic-NOAA I have tried Dusan's bug fixes that have been merged to develop, plus one more fix. The quilting restart still crashes for 32-bit physics. This happens with the FV3_RAP suite and the FV3_HRRR suite on Hera with the gnu compiler in debug mode.

The model aborts due to NaN longitudes in recover_fields when converting from single precision to double precision. Before it aborts, it encounters gibberish longitudes that look like uninitialized memory or a bad type conversion. Specifically, it fails at lon = lon4 around line 2524 of module_wrt_grid_comp.F90.

To detect this, I replaced the array copy with a loop that prints bad longitudes:

        do j=lbound(lonr4,2),ubound(lonr4,2)
          do i=lbound(lonr4,1),ubound(lonr4,1)
            if(.not.(lonr4(i,j) > -2000 .and. lonr4(i,j) < 2000)) then
              write(0,*) 'bad i,j,lon',i,j,lonr4(i,j)
            endif
            lon(i,j) = real(lonr4(i,j),kind=ESMF_KIND_R8)
          enddo
        enddo

The failure happens at the if(...) line. With standard-confirming floating-point, such a failure can only happen if lon4(i,j) is NaN.

Before it fails, I see bad longitudes:

Bad longitudes in log file

The first column is the MPI rank. The third and fourth are i and j. The fifth is the bad longitude value.

144:  bad i,j,lon           7           1  -4.90393997E+13
144:  bad i,j,lon          17           1  -4.90393955E+13
144:  bad i,j,lon          45           1  -4.90427887E+13
144:  bad i,j,lon          49           1  -4.90415472E+13
144:  bad i,j,lon          51           1  -4.90393158E+13
144:  bad i,j,lon          57           1  -4.90427887E+13
144:  bad i,j,lon          69           1  -4.90527376E+13
144:  bad i,j,lon          71           1  -4.90396178E+13
144:  bad i,j,lon          77           1  -4.90527376E+13
144:  bad i,j,lon          83           1  -4.90442316E+13
144:  bad i,j,lon          89           1  -4.90491808E+13
149:  bad i,j,lon           1           1   6.70109323E+31
149:  bad i,j,lon           2           1   7.20845618E+31
149:  bad i,j,lon           9           1   1.76149855E+19
149:  bad i,j,lon          10           1   4.42489548E+30
149:  bad i,j,lon          25           1   2.61705476E+32
149:  bad i,j,lon          26           1   4.06990390E+31
149:  bad i,j,lon          29           1   6.83059875E+22
149:  bad i,j,lon          33           1   3.04810548E+32
149:  bad i,j,lon          34           1   1.74064814E+25
149:  bad i,j,lon          57           1   3.02623147E+29
149:  bad i,j,lon          58           1   2.63808995E+23
149:  bad i,j,lon          66           1   52821.2617    
147:  bad i,j,lon           9           1  -4.37240905E+23
148:  bad i,j,lon           1           1   6.70109323E+31
148:  bad i,j,lon           2           1   7.20845618E+31
148:  bad i,j,lon           9           1   1.76149855E+19
148:  bad i,j,lon          10           1   4.42489548E+30
148:  bad i,j,lon          25           1   2.61705476E+32
148:  bad i,j,lon          26           1   4.06990390E+31
148:  bad i,j,lon          29           1   6.83059875E+22
148:  bad i,j,lon          33           1   3.04810548E+32
148:  bad i,j,lon          34           1   1.74064814E+25
148:  bad i,j,lon          57           1   3.02623147E+29
148:  bad i,j,lon          58           1   2.63808995E+23
148:  bad i,j,lon          66           1   52821.2617    
146:  bad i,j,lon           1           1   6.70109323E+31
146:  bad i,j,lon           2           1   7.20845618E+31
145:  bad i,j,lon           1           1   6.70109323E+31
145:  bad i,j,lon           2           1   7.20845618E+31

@SamuelTrahanNOAA
Copy link
Collaborator

SamuelTrahanNOAA commented Sep 7, 2023

I forgot to mention my branch. You'll find all my latest bug fixes here:

git clone --recursive --branch bugfix/qr-c3-stringlen https://github.com/SamuelTrahanNOAA/ufs-weather-model

https://github.com/SamuelTrahanNOAA/ufs-weather-model/tree/bugfix/qr-c3-stringlen

Cheyenne has no crashes, but Hera crashes in the new conus13km_debug_qr test.

In the branch, you will find bug fixes for the hrrr_c3 crash. Also, I replaced hrrr_c3 and hrrr_gf with debug versions. Luckily, the c3 scheme does work now. I'd like to get the quilting restart working too, rather than splitting the PR.

@SamuelTrahanNOAA
Copy link
Collaborator

SamuelTrahanNOAA commented Sep 8, 2023

The quilting restart fails less frequently after I added a bugfix from Dusan on top of the ones I already had. It disables an unneeded subroutine call. That avoids the code that causes a crash.

Unfortunately, the conus13km_debug_qr_gnu test still fails about 1/4 of the time (5 of 21 tests) on Hera. My PR #1893 enables conus13km_debug_qr_intel because I haven't seen it fail yet.

The Failure

It fails in FieldBundleRegridStore, in InitializeAdvertise. The MPI ranks that abort are different each time. Most likely, that depends on which rank reaches InitializeAdvertise first.

From PET144.ESMF_LogFile: PET144 calling into wrtFB(01,01) FieldBundleRegridStore()....

And from the job's stderr stream:

    144: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
    144: Backtrace for this error:
    ...
    144: #16  0x129f186 in initializeadvertise
    144:    at /scratch2/BMC/wrfruc/Samuel.Trahan/lakes/revise-tests/FV3/fv3_cap.F90:753

Line 753 is an "if" statement that tests a logical variable. The actual failure must happen at a different line.

Test case:

HERA: /scratch2/BMC/wrfruc/Samuel.Trahan/lakes/latest-conus13km_debug_qr_gnu
NOAA HPSS: /2year/BMC/wrfruc/Samuel.Trahan/latest-conus13km_debug_qr_gnu.tar

Past Information

Before the switch to Spack Stack, I was able to run UFS in valgrind. (It fails quite early now.) In that older version, the failure happened when ESMF was copying uninitialized data to a C++ std::vector.

Another older version had abundant uninitialized arrays in the clm lake and rrfs-sd restart fields. Initializing them reduced the failure rate.

Sam's hypothesis as to the reason for the failure.

This is merely a hypothesis, but it is supported by evidence so far. Some restart data arrays are not initialized before sending them to FMS or ESMF during the restart code. The arrays are initialized before the restart code is told to write something. FMS refrains from reading the arrays until it is told to write data. ESMF reads from the arrays immediately, not waiting for the code to ask for it. This makes sense, because the write component has to send that data to the write servers. Contents of uninitialized memory are highly unpredictable. They can vary from run to run and system to system. Hence, failure is unpredictable.

@DusanJovic-NOAA
Copy link
Collaborator

DusanJovic-NOAA commented Sep 9, 2023

@SamuelTrahanNOAA I found two uninitialized arrays that used in ESMF to compute route handles before being assigned the actual values at the restart time. Can you please make this change in atmos_cubed_sphere and try again, I made several test runs (5 or 6) using conus13km_debug_qr_gnu, and model did not fail.

diff --git a/driver/fvGFS/fv_ufs_restart_io.F90 b/driver/fvGFS/fv_ufs_restart_io.F90
index bb1940b..2621b69 100644
--- a/driver/fvGFS/fv_ufs_restart_io.F90
+++ b/driver/fvGFS/fv_ufs_restart_io.F90
@@ -132,6 +132,7 @@ module fv_ufs_restart_io_mod
    ! srf_wnd
    nvar2d_srf_wnd = 2
    allocate (srf_wnd_var2(nx,ny,nvar2d_srf_wnd), srf_wnd_var2_names(nvar2d_srf_wnd))
+   srf_wnd_var2 = 0.0
    srf_wnd_var2_names(1) = 'u_srf'
    srf_wnd_var2_names(2) = 'v_srf'
 
@@ -141,6 +142,7 @@ module fv_ufs_restart_io_mod
    nvar3d_tracers = ntprog+ntdiag
    tracers_zsize = size(Atm%q,3)
    allocate (tracers_var3(nx,ny,tracers_zsize,nvar3d_tracers), tracers_var3_names(nvar3d_tracers))
+   tracers_var3 = 0.0
 
    do nt = 1, ntprog
       call get_tracer_names(MODEL_ATMOS, nt, tracer_name)

@SamuelTrahanNOAA
Copy link
Collaborator

SamuelTrahanNOAA commented Sep 9, 2023

I made an issue for the uninitialized srf_wnd_var2 and tracers_va3 arrays here:

NOAA-GFDL/GFDL_atmos_cubed_sphere#281

@SamuelTrahanNOAA
Copy link
Collaborator

The conus13km_debug_qr_gnu is passing reliably for me on Hera. It seems the fixes in #1893 are enough to get quilting restart working with 32-bit physics. This is not the first time I thought the problem was fixed, though. I'll believe it is fixed once the tests pass on all platforms.

@SamuelTrahanNOAA
Copy link
Collaborator

SamuelTrahanNOAA commented Sep 29, 2023

There's a multitude of bug fixes in #1893 which, unfortunately, are not enough to fix the problem. The new conus13km_debug_qr tests can run on hera, hercules, jet, orion, acorn, and cactus. They fail on cheyenne with both intel and gnu compilers. There is an "hdf error" when writing 32-bit u wind to a restart file using parallel netcdf.

@SamuelTrahanNOAA
Copy link
Collaborator

This issue still occurs on Cheyenne, but not the other seven machines. Cheyenne is going to be retired soon, so I'll assume the issue is fixed until we see evidence to the contrary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
No open projects
4 participants