-
Notifications
You must be signed in to change notification settings - Fork 246
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Quilting restart does not work for 32-bit physics #1882
Comments
@jkbk2004 @SamuelTrahanNOAA Will the changes in PR#1853 fix the issue? |
I haven't tried, but I don't expect it to make any difference. The NetCDF library does automatic type conversion, so type mismatches between the provided data and output type shouldn't matter. |
@junwang-noaa @DusanJovic-NOAA I have tried Dusan's bug fixes that have been merged to develop, plus one more fix. The quilting restart still crashes for 32-bit physics. This happens with the FV3_RAP suite and the FV3_HRRR suite on Hera with the gnu compiler in debug mode. The model aborts due to NaN longitudes in recover_fields when converting from single precision to double precision. Before it aborts, it encounters gibberish longitudes that look like uninitialized memory or a bad type conversion. Specifically, it fails at To detect this, I replaced the array copy with a loop that prints bad longitudes: do j=lbound(lonr4,2),ubound(lonr4,2)
do i=lbound(lonr4,1),ubound(lonr4,1)
if(.not.(lonr4(i,j) > -2000 .and. lonr4(i,j) < 2000)) then
write(0,*) 'bad i,j,lon',i,j,lonr4(i,j)
endif
lon(i,j) = real(lonr4(i,j),kind=ESMF_KIND_R8)
enddo
enddo The failure happens at the Before it fails, I see bad longitudes: Bad longitudes in log fileThe first column is the MPI rank. The third and fourth are i and j. The fifth is the bad longitude value. 144: bad i,j,lon 7 1 -4.90393997E+13
144: bad i,j,lon 17 1 -4.90393955E+13
144: bad i,j,lon 45 1 -4.90427887E+13
144: bad i,j,lon 49 1 -4.90415472E+13
144: bad i,j,lon 51 1 -4.90393158E+13
144: bad i,j,lon 57 1 -4.90427887E+13
144: bad i,j,lon 69 1 -4.90527376E+13
144: bad i,j,lon 71 1 -4.90396178E+13
144: bad i,j,lon 77 1 -4.90527376E+13
144: bad i,j,lon 83 1 -4.90442316E+13
144: bad i,j,lon 89 1 -4.90491808E+13
149: bad i,j,lon 1 1 6.70109323E+31
149: bad i,j,lon 2 1 7.20845618E+31
149: bad i,j,lon 9 1 1.76149855E+19
149: bad i,j,lon 10 1 4.42489548E+30
149: bad i,j,lon 25 1 2.61705476E+32
149: bad i,j,lon 26 1 4.06990390E+31
149: bad i,j,lon 29 1 6.83059875E+22
149: bad i,j,lon 33 1 3.04810548E+32
149: bad i,j,lon 34 1 1.74064814E+25
149: bad i,j,lon 57 1 3.02623147E+29
149: bad i,j,lon 58 1 2.63808995E+23
149: bad i,j,lon 66 1 52821.2617
147: bad i,j,lon 9 1 -4.37240905E+23
148: bad i,j,lon 1 1 6.70109323E+31
148: bad i,j,lon 2 1 7.20845618E+31
148: bad i,j,lon 9 1 1.76149855E+19
148: bad i,j,lon 10 1 4.42489548E+30
148: bad i,j,lon 25 1 2.61705476E+32
148: bad i,j,lon 26 1 4.06990390E+31
148: bad i,j,lon 29 1 6.83059875E+22
148: bad i,j,lon 33 1 3.04810548E+32
148: bad i,j,lon 34 1 1.74064814E+25
148: bad i,j,lon 57 1 3.02623147E+29
148: bad i,j,lon 58 1 2.63808995E+23
148: bad i,j,lon 66 1 52821.2617
146: bad i,j,lon 1 1 6.70109323E+31
146: bad i,j,lon 2 1 7.20845618E+31
145: bad i,j,lon 1 1 6.70109323E+31
145: bad i,j,lon 2 1 7.20845618E+31 |
I forgot to mention my branch. You'll find all my latest bug fixes here:
https://github.com/SamuelTrahanNOAA/ufs-weather-model/tree/bugfix/qr-c3-stringlen Cheyenne has no crashes, but Hera crashes in the new In the branch, you will find bug fixes for the hrrr_c3 crash. Also, I replaced hrrr_c3 and hrrr_gf with debug versions. Luckily, the c3 scheme does work now. I'd like to get the quilting restart working too, rather than splitting the PR. |
The quilting restart fails less frequently after I added a bugfix from Dusan on top of the ones I already had. It disables an unneeded subroutine call. That avoids the code that causes a crash. Unfortunately, the conus13km_debug_qr_gnu test still fails about 1/4 of the time (5 of 21 tests) on Hera. My PR #1893 enables conus13km_debug_qr_intel because I haven't seen it fail yet. The FailureIt fails in FieldBundleRegridStore, in InitializeAdvertise. The MPI ranks that abort are different each time. Most likely, that depends on which rank reaches InitializeAdvertise first. From PET144.ESMF_LogFile: And from the job's stderr stream:
Line 753 is an "if" statement that tests a logical variable. The actual failure must happen at a different line. Test case:HERA: /scratch2/BMC/wrfruc/Samuel.Trahan/lakes/latest-conus13km_debug_qr_gnu Past InformationBefore the switch to Spack Stack, I was able to run UFS in valgrind. (It fails quite early now.) In that older version, the failure happened when ESMF was copying uninitialized data to a C++ std::vector. Another older version had abundant uninitialized arrays in the clm lake and rrfs-sd restart fields. Initializing them reduced the failure rate. Sam's hypothesis as to the reason for the failure.This is merely a hypothesis, but it is supported by evidence so far. Some restart data arrays are not initialized before sending them to FMS or ESMF during the restart code. The arrays are initialized before the restart code is told to write something. FMS refrains from reading the arrays until it is told to write data. ESMF reads from the arrays immediately, not waiting for the code to ask for it. This makes sense, because the write component has to send that data to the write servers. Contents of uninitialized memory are highly unpredictable. They can vary from run to run and system to system. Hence, failure is unpredictable. |
@SamuelTrahanNOAA I found two uninitialized arrays that used in ESMF to compute route handles before being assigned the actual values at the restart time. Can you please make this change in atmos_cubed_sphere and try again, I made several test runs (5 or 6) using conus13km_debug_qr_gnu, and model did not fail. diff --git a/driver/fvGFS/fv_ufs_restart_io.F90 b/driver/fvGFS/fv_ufs_restart_io.F90
index bb1940b..2621b69 100644
--- a/driver/fvGFS/fv_ufs_restart_io.F90
+++ b/driver/fvGFS/fv_ufs_restart_io.F90
@@ -132,6 +132,7 @@ module fv_ufs_restart_io_mod
! srf_wnd
nvar2d_srf_wnd = 2
allocate (srf_wnd_var2(nx,ny,nvar2d_srf_wnd), srf_wnd_var2_names(nvar2d_srf_wnd))
+ srf_wnd_var2 = 0.0
srf_wnd_var2_names(1) = 'u_srf'
srf_wnd_var2_names(2) = 'v_srf'
@@ -141,6 +142,7 @@ module fv_ufs_restart_io_mod
nvar3d_tracers = ntprog+ntdiag
tracers_zsize = size(Atm%q,3)
allocate (tracers_var3(nx,ny,tracers_zsize,nvar3d_tracers), tracers_var3_names(nvar3d_tracers))
+ tracers_var3 = 0.0
do nt = 1, ntprog
call get_tracer_names(MODEL_ATMOS, nt, tracer_name)
|
I made an issue for the uninitialized srf_wnd_var2 and tracers_va3 arrays here: |
The conus13km_debug_qr_gnu is passing reliably for me on Hera. It seems the fixes in #1893 are enough to get quilting restart working with 32-bit physics. This is not the first time I thought the problem was fixed, though. I'll believe it is fixed once the tests pass on all platforms. |
There's a multitude of bug fixes in #1893 which, unfortunately, are not enough to fix the problem. The new conus13km_debug_qr tests can run on hera, hercules, jet, orion, acorn, and cactus. They fail on cheyenne with both intel and gnu compilers. There is an "hdf error" when writing 32-bit u wind to a restart file using parallel netcdf. |
This issue still occurs on Cheyenne, but not the other seven machines. Cheyenne is going to be retired soon, so I'll assume the issue is fixed until we see evidence to the contrary. |
Description
In #1880, conus13km_control_qr_intel, conus13km_control_qr_gnu and hrrr_c3_gnu were introduced but turned off due to crashing on Cheyenne. ESMF accesses uninitialized memory extensively during quilt server initialization. This is easy to detect on any platform using valgrind. The result is gibberish data sent to the NetCDF library. If that gibberish includes a signalling NaN, then the model will crash. A suspected bug in ESMF, or the model. The PR has some bug fixes for 32-bit physics with quilting restart, but not enough to get it to work.
To Reproduce:
Run baseline creation for those cases on Cheyenne. It dumps out a traceback error: 144: file: module_write_restart_netcdf.F90 line: 452 NetCDF: HDF error
Additional context
Output
The text was updated successfully, but these errors were encountered: