Quilting restart does not work for 32-bit physics #1882

jkbk2004 · 2023-08-28T13:53:16Z

Description

In #1880, conus13km_control_qr_intel, conus13km_control_qr_gnu and hrrr_c3_gnu were introduced but turned off due to crashing on Cheyenne. ESMF accesses uninitialized memory extensively during quilt server initialization. This is easy to detect on any platform using valgrind. The result is gibberish data sent to the NetCDF library. If that gibberish includes a signalling NaN, then the model will crash. A suspected bug in ESMF, or the model. The PR has some bug fixes for 32-bit physics with quilting restart, but not enough to get it to work.

To Reproduce:

Run baseline creation for those cases on Cheyenne. It dumps out a traceback error: 144: file: module_write_restart_netcdf.F90 line: 452 NetCDF: HDF error

Additional context

Output

junwang-noaa · 2023-08-28T14:05:06Z

@jkbk2004 @SamuelTrahanNOAA Will the changes in PR#1853 fix the issue?

SamuelTrahanNOAA · 2023-08-28T14:22:03Z

Will the changes in PR#1853 fix the issue?

I haven't tried, but I don't expect it to make any difference. The NetCDF library does automatic type conversion, so type mismatches between the provided data and output type shouldn't matter.

SamuelTrahanNOAA · 2023-09-07T13:20:05Z

@junwang-noaa @DusanJovic-NOAA I have tried Dusan's bug fixes that have been merged to develop, plus one more fix. The quilting restart still crashes for 32-bit physics. This happens with the FV3_RAP suite and the FV3_HRRR suite on Hera with the gnu compiler in debug mode.

The model aborts due to NaN longitudes in recover_fields when converting from single precision to double precision. Before it aborts, it encounters gibberish longitudes that look like uninitialized memory or a bad type conversion. Specifically, it fails at lon = lon4 around line 2524 of module_wrt_grid_comp.F90.

To detect this, I replaced the array copy with a loop that prints bad longitudes:

        do j=lbound(lonr4,2),ubound(lonr4,2)
          do i=lbound(lonr4,1),ubound(lonr4,1)
            if(.not.(lonr4(i,j) > -2000 .and. lonr4(i,j) < 2000)) then
              write(0,*) 'bad i,j,lon',i,j,lonr4(i,j)
            endif
            lon(i,j) = real(lonr4(i,j),kind=ESMF_KIND_R8)
          enddo
        enddo

The failure happens at the if(...) line. With standard-confirming floating-point, such a failure can only happen if lon4(i,j) is NaN.

Before it fails, I see bad longitudes:

Bad longitudes in log file

The first column is the MPI rank. The third and fourth are i and j. The fifth is the bad longitude value.

144:  bad i,j,lon           7           1  -4.90393997E+13
144:  bad i,j,lon          17           1  -4.90393955E+13
144:  bad i,j,lon          45           1  -4.90427887E+13
144:  bad i,j,lon          49           1  -4.90415472E+13
144:  bad i,j,lon          51           1  -4.90393158E+13
144:  bad i,j,lon          57           1  -4.90427887E+13
144:  bad i,j,lon          69           1  -4.90527376E+13
144:  bad i,j,lon          71           1  -4.90396178E+13
144:  bad i,j,lon          77           1  -4.90527376E+13
144:  bad i,j,lon          83           1  -4.90442316E+13
144:  bad i,j,lon          89           1  -4.90491808E+13
149:  bad i,j,lon           1           1   6.70109323E+31
149:  bad i,j,lon           2           1   7.20845618E+31
149:  bad i,j,lon           9           1   1.76149855E+19
149:  bad i,j,lon          10           1   4.42489548E+30
149:  bad i,j,lon          25           1   2.61705476E+32
149:  bad i,j,lon          26           1   4.06990390E+31
149:  bad i,j,lon          29           1   6.83059875E+22
149:  bad i,j,lon          33           1   3.04810548E+32
149:  bad i,j,lon          34           1   1.74064814E+25
149:  bad i,j,lon          57           1   3.02623147E+29
149:  bad i,j,lon          58           1   2.63808995E+23
149:  bad i,j,lon          66           1   52821.2617    
147:  bad i,j,lon           9           1  -4.37240905E+23
148:  bad i,j,lon           1           1   6.70109323E+31
148:  bad i,j,lon           2           1   7.20845618E+31
148:  bad i,j,lon           9           1   1.76149855E+19
148:  bad i,j,lon          10           1   4.42489548E+30
148:  bad i,j,lon          25           1   2.61705476E+32
148:  bad i,j,lon          26           1   4.06990390E+31
148:  bad i,j,lon          29           1   6.83059875E+22
148:  bad i,j,lon          33           1   3.04810548E+32
148:  bad i,j,lon          34           1   1.74064814E+25
148:  bad i,j,lon          57           1   3.02623147E+29
148:  bad i,j,lon          58           1   2.63808995E+23
148:  bad i,j,lon          66           1   52821.2617    
146:  bad i,j,lon           1           1   6.70109323E+31
146:  bad i,j,lon           2           1   7.20845618E+31
145:  bad i,j,lon           1           1   6.70109323E+31
145:  bad i,j,lon           2           1   7.20845618E+31

SamuelTrahanNOAA · 2023-09-07T13:25:12Z

I forgot to mention my branch. You'll find all my latest bug fixes here:

git clone --recursive --branch bugfix/qr-c3-stringlen https://github.com/SamuelTrahanNOAA/ufs-weather-model

https://github.com/SamuelTrahanNOAA/ufs-weather-model/tree/bugfix/qr-c3-stringlen

Cheyenne has no crashes, but Hera crashes in the new conus13km_debug_qr test.

In the branch, you will find bug fixes for the hrrr_c3 crash. Also, I replaced hrrr_c3 and hrrr_gf with debug versions. Luckily, the c3 scheme does work now. I'd like to get the quilting restart working too, rather than splitting the PR.

SamuelTrahanNOAA · 2023-09-08T21:28:36Z

The quilting restart fails less frequently after I added a bugfix from Dusan on top of the ones I already had. It disables an unneeded subroutine call. That avoids the code that causes a crash.

Unfortunately, the conus13km_debug_qr_gnu test still fails about 1/4 of the time (5 of 21 tests) on Hera. My PR #1893 enables conus13km_debug_qr_intel because I haven't seen it fail yet.

The Failure

It fails in FieldBundleRegridStore, in InitializeAdvertise. The MPI ranks that abort are different each time. Most likely, that depends on which rank reaches InitializeAdvertise first.

From PET144.ESMF_LogFile: PET144 calling into wrtFB(01,01) FieldBundleRegridStore()....

And from the job's stderr stream:

    144: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
    144: Backtrace for this error:
    ...
    144: #16  0x129f186 in initializeadvertise
    144:    at /scratch2/BMC/wrfruc/Samuel.Trahan/lakes/revise-tests/FV3/fv3_cap.F90:753

Line 753 is an "if" statement that tests a logical variable. The actual failure must happen at a different line.

Test case:

HERA: /scratch2/BMC/wrfruc/Samuel.Trahan/lakes/latest-conus13km_debug_qr_gnu
NOAA HPSS: /2year/BMC/wrfruc/Samuel.Trahan/latest-conus13km_debug_qr_gnu.tar

Past Information

Before the switch to Spack Stack, I was able to run UFS in valgrind. (It fails quite early now.) In that older version, the failure happened when ESMF was copying uninitialized data to a C++ std::vector.

Another older version had abundant uninitialized arrays in the clm lake and rrfs-sd restart fields. Initializing them reduced the failure rate.

Sam's hypothesis as to the reason for the failure.

This is merely a hypothesis, but it is supported by evidence so far. Some restart data arrays are not initialized before sending them to FMS or ESMF during the restart code. The arrays are initialized before the restart code is told to write something. FMS refrains from reading the arrays until it is told to write data. ESMF reads from the arrays immediately, not waiting for the code to ask for it. This makes sense, because the write component has to send that data to the write servers. Contents of uninitialized memory are highly unpredictable. They can vary from run to run and system to system. Hence, failure is unpredictable.

DusanJovic-NOAA · 2023-09-09T14:54:25Z

@SamuelTrahanNOAA I found two uninitialized arrays that used in ESMF to compute route handles before being assigned the actual values at the restart time. Can you please make this change in atmos_cubed_sphere and try again, I made several test runs (5 or 6) using conus13km_debug_qr_gnu, and model did not fail.

diff --git a/driver/fvGFS/fv_ufs_restart_io.F90 b/driver/fvGFS/fv_ufs_restart_io.F90
index bb1940b..2621b69 100644
--- a/driver/fvGFS/fv_ufs_restart_io.F90
+++ b/driver/fvGFS/fv_ufs_restart_io.F90
@@ -132,6 +132,7 @@ module fv_ufs_restart_io_mod
    ! srf_wnd
    nvar2d_srf_wnd = 2
    allocate (srf_wnd_var2(nx,ny,nvar2d_srf_wnd), srf_wnd_var2_names(nvar2d_srf_wnd))
+   srf_wnd_var2 = 0.0
    srf_wnd_var2_names(1) = 'u_srf'
    srf_wnd_var2_names(2) = 'v_srf'
 
@@ -141,6 +142,7 @@ module fv_ufs_restart_io_mod
    nvar3d_tracers = ntprog+ntdiag
    tracers_zsize = size(Atm%q,3)
    allocate (tracers_var3(nx,ny,tracers_zsize,nvar3d_tracers), tracers_var3_names(nvar3d_tracers))
+   tracers_var3 = 0.0
 
    do nt = 1, ntprog
       call get_tracer_names(MODEL_ATMOS, nt, tracer_name)

SamuelTrahanNOAA · 2023-09-09T15:54:52Z

I made an issue for the uninitialized srf_wnd_var2 and tracers_va3 arrays here:

NOAA-GFDL/GFDL_atmos_cubed_sphere#281

SamuelTrahanNOAA · 2023-09-09T17:10:34Z

The conus13km_debug_qr_gnu is passing reliably for me on Hera. It seems the fixes in #1893 are enough to get quilting restart working with 32-bit physics. This is not the first time I thought the problem was fixed, though. I'll believe it is fixed once the tests pass on all platforms.

SamuelTrahanNOAA · 2023-09-29T13:31:20Z

There's a multitude of bug fixes in #1893 which, unfortunately, are not enough to fix the problem. The new conus13km_debug_qr tests can run on hera, hercules, jet, orion, acorn, and cactus. They fail on cheyenne with both intel and gnu compilers. There is an "hdf error" when writing 32-bit u wind to a restart file using parallel netcdf.

SamuelTrahanNOAA · 2023-10-02T21:02:52Z

This issue still occurs on Cheyenne, but not the other seven machines. Cheyenne is going to be retired soon, so I'll assume the issue is fixed until we see evidence to the contrary.

jkbk2004 added the bug Something isn't working label Aug 28, 2023

jkbk2004 mentioned this issue Aug 28, 2023

Bug fixes for 32-bit physics & correct the lake scheme in FV3_HRRR_c3 & FV3_HRRR_gf #1880

Merged

34 tasks

SamuelTrahanNOAA mentioned this issue Sep 9, 2023

two arrays are not initialized in fv_ufs_restart_io.F90 NOAA-GFDL/GFDL_atmos_cubed_sphere#281

Closed

junwang-noaa assigned SamuelTrahanNOAA and DusanJovic-NOAA Sep 11, 2023

jkbk2004 closed this as completed in #1893 Oct 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Quilting restart does not work for 32-bit physics #1882

Quilting restart does not work for 32-bit physics #1882

jkbk2004 commented Aug 28, 2023

junwang-noaa commented Aug 28, 2023

SamuelTrahanNOAA commented Aug 28, 2023

SamuelTrahanNOAA commented Sep 7, 2023

SamuelTrahanNOAA commented Sep 7, 2023 •

edited

Loading

SamuelTrahanNOAA commented Sep 8, 2023 •

edited

Loading

DusanJovic-NOAA commented Sep 9, 2023 •

edited

Loading

SamuelTrahanNOAA commented Sep 9, 2023 •

edited

Loading

SamuelTrahanNOAA commented Sep 9, 2023

SamuelTrahanNOAA commented Sep 29, 2023 •

edited

Loading

SamuelTrahanNOAA commented Oct 2, 2023

Quilting restart does not work for 32-bit physics #1882

Quilting restart does not work for 32-bit physics #1882

Comments

jkbk2004 commented Aug 28, 2023

Description

To Reproduce:

Additional context

Output

junwang-noaa commented Aug 28, 2023

SamuelTrahanNOAA commented Aug 28, 2023

SamuelTrahanNOAA commented Sep 7, 2023

SamuelTrahanNOAA commented Sep 7, 2023 • edited Loading

SamuelTrahanNOAA commented Sep 8, 2023 • edited Loading

The Failure

Test case:

Past Information

Sam's hypothesis as to the reason for the failure.

DusanJovic-NOAA commented Sep 9, 2023 • edited Loading

SamuelTrahanNOAA commented Sep 9, 2023 • edited Loading

SamuelTrahanNOAA commented Sep 9, 2023

SamuelTrahanNOAA commented Sep 29, 2023 • edited Loading

SamuelTrahanNOAA commented Oct 2, 2023

SamuelTrahanNOAA commented Sep 7, 2023 •

edited

Loading

SamuelTrahanNOAA commented Sep 8, 2023 •

edited

Loading

DusanJovic-NOAA commented Sep 9, 2023 •

edited

Loading

SamuelTrahanNOAA commented Sep 9, 2023 •

edited

Loading

SamuelTrahanNOAA commented Sep 29, 2023 •

edited

Loading