Debug gsi.x aborts in deter_sfc_gmi with invalid array index #778

Closed
RussTreadon-NOAA opened this issue Aug 7, 2024 · 19 comments · Fixed by #781

@RussTreadon-NOAA
Contributor

Encountered an unexpected error while working on issue #777.

Built a debug gsi.x on Hera from develop at e82365d and ran the 2023060712 case using files from /scratch2/NCEPDEV/stmp3/Emily.Liu/ROTDIRS/v17allskyens.

The debug gsi.x aborts in deter_sfc_mod.f90 with the message

132: forrtl: severe (408): fort: (3): Subscript #2 of the array ISLI_FULL has value 0 which is less than the lower bound of 1

The code in question is the isli_full(i,j) line below

     grid_dist=rearth * (rlats_sfc(klatp1) - rlats_sfc(klat1))
     n_grid=int(40000 / grid_dist) + 1
     klatn = max(klat1 - n_grid, 1)
     klonn = klon1 - n_grid
     if (klonn < 0)  klonn = nlon_sfc - klonn
     klatpn = min((klat1 + n_grid), nlat_sfc)
     klonpn = klon1 + n_grid
     if (klonpn > nlon_sfc)  klonpn = klonpn - nlon_sfc

     isflg=0
     outer: do i = klatn, klatpn
       ! assume n_grid > 2
       if (0 < klonpn - klonn .and. klonpn - klonn < nlon_sfc / 2) then
         do j = klonn, klonpn
           if (isli_full(i, j) /= 0) then
             isflg = 1
             exit outer
           end if
         end do

This code is in subroutine deter_sfc_gmi in file deter_sfc_mod.f90

Prints added to the code confirm that klonn is 0.

132: deter_sfc_mod: grid_dist=  26045.2     rlats_sfc(p1)=-0.831898     rlats_sfc(1)=-0.835986
132: deter_sfc_mod: klonn=     0 klon1=     2 n_grid=     2 nlon_sfc=  1536

j=0 is not a valid index value for array isli_full.
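
The printed values are enough to reproduce the bad index by hand: n_grid = int(40000 / 26045.2) + 1 = 2, so klonn = klon1 - n_grid = 2 - 2 = 0, and the `klonn < 0` test leaves a value of exactly 0 unchanged. A minimal sketch of that arithmetic follows. It uses only the numbers from the prints above, and the module and function names are hypothetical and not part of GSI.

```fortran
module klonn_repro
  implicit none
contains
  ! Number of grid points spanning roughly 40 km, as in the quoted code.
  pure integer function n_grid_from_dist(grid_dist) result(n_grid)
    real, intent(in) :: grid_dist
    n_grid = int(40000.0 / grid_dist) + 1
  end function n_grid_from_dist

  ! Western longitude index as computed by the quoted code,
  ! including its original "klonn < 0" wrap test.
  pure integer function klonn_original(klon1, n_grid, nlon_sfc) result(klonn)
    integer, intent(in) :: klon1, n_grid, nlon_sfc
    klonn = klon1 - n_grid
    if (klonn < 0) klonn = nlon_sfc - klonn
  end function klonn_original
end module klonn_repro
```

Calling `klonn_original(2, n_grid_from_dist(26045.2), 1536)` returns 0, matching the printed klonn.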

@emilyhcliu
Contributor

I have seen this error before. It went away when I ran again. The failure point was indeed in GMI, while performing spatial averaging. We need to fix this.

@emilyhcliu
Contributor

@xincjin-NOAA Could you take a look at the failure from GMI spatial averaging?

The initial conditions that Russ used in his test (2023060712 case) can be found in the following location on Hera:
/scratch2/NCEPDEV/stmp3/Emily.Liu/ROTDIRS/v17allskyens

@xincjin-NOAA
Contributor

xincjin-NOAA commented Aug 7, 2024

@emilyhcliu @RussTreadon-NOAA Can you give me some guidelines so that I can reproduce this error? Or should I just clone GSI, build develop at e82365d, and then run the ctests or a one-cycle test run?

I believe I found the issue in the code:

     klonn = klon1 - n_grid
     if (klonn < 0)  klonn = nlon_sfc - klonn
     klatpn = min((klat1 + n_grid), nlat_sfc)

The test klonn < 0 should be replaced with klonn < 1:

     if (klonn < 1)  klonn = nlon_sfc - klonn  
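
As a side note on the arithmetic of this one-line change: with the `< 1` test, a raw value of 0 maps to nlon_sfc, which is a valid index. For a strictly negative raw value, `nlon_sfc - klonn` evaluates above nlon_sfc, whereas a conventional periodic wrap would use `nlon_sfc + klonn`. The sketch below only compares the two expressions. The function names are hypothetical, the second variant is an assumption for illustration and not a statement about what any PR contains, and whether the difference matters depends on parts of the routine not shown in this thread.

```fortran
module lon_wrap_sketch
  implicit none
contains
  ! Wrap as written in the comment above: test "< 1", then nlon_sfc - klonn.
  pure integer function wrap_as_proposed(klonn_raw, nlon_sfc) result(klonn)
    integer, intent(in) :: klonn_raw, nlon_sfc
    klonn = klonn_raw
    if (klonn < 1) klonn = nlon_sfc - klonn
  end function wrap_as_proposed

  ! Hypothetical alternative: conventional periodic wrap, nlon_sfc + klonn.
  pure integer function wrap_periodic(klonn_raw, nlon_sfc) result(klonn)
    integer, intent(in) :: klonn_raw, nlon_sfc
    klonn = klonn_raw
    if (klonn < 1) klonn = nlon_sfc + klonn
  end function wrap_periodic
end module lon_wrap_sketch
```

With nlon_sfc = 1536, both variants map 0 to 1536. For a raw value of -1 (for example klon1 = 1 with n_grid = 2), the first gives 1537 and the second gives 1535.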

@xincjin-NOAA
Contributor

I am going to run an experiment on Hera to see if this solves the issue.

  • I just cloned the GSI (/scratch1/NCEPDEV/da/Xin.C.Jin/debug_gsi/GSI/),
  • built with Debug option,
  • and created a run_script (/scratch1/NCEPDEV/da/Xin.C.Jin/debug_gsi/run_script/rungsi_develop.sh).

I guess I have not set the paths for the initial conditions and/or obs data correctly. Can you take a look at it when you have time and give some advice? @emilyhcliu @RussTreadon-NOAA

@RussTreadon-NOAA
Contributor Author

@xincjin-NOAA , the paths to Emily's data are not quite correct. The script I used to run Emily's case is

/scratch1/NCEPDEV/stmp2/Russ.Treadon/rungsi384L127.sh

@xincjin-NOAA
Contributor

@RussTreadon-NOAA Thank you so much for sharing the run script with me. I have used the modified run script to reproduce the issue:

12:  READ_GMI: do_noise_reduction= F
 12: forrtl: severe (408): fort: (3): Subscript #2 of the array ISLI_FULL has value 0 which is less than the lower bound of 1
 12:
 12: Image              PC                Routine            Line        Source
 12: gsi.x              0000000007EBB18F  Unknown               Unknown  Unknown
 12: gsi.x              0000000003A02B14  deter_sfc_mod_mp_        1423  deter_sfc_mod.f90
 12: gsi.x              000000000512711A  read_gmi_                 699  read_gmi.f90
 12: gsi.x              0000000001C53AC7  read_obsmod_mp_re        1827  read_obs.F90
 12: gsi.x              0000000001765371  observermod_mp_se         329  observer.F90
 12: gsi.x              0000000003F0FC82  glbsoi_                   222  glbsoi.f90
 12: gsi.x              00000000010989FC  gsisub_                   200  gsisub.F90
 12: gsi.x              000000000042DBD8  gsimod_mp_gsimain        2431  gsimod.F90
 12: gsi.x              0000000000414C4B  MAIN__                    633  gsimain.f90
 12: gsi.x              0000000000414AA2  Unknown               Unknown  Unknown
 12: libc-2.28.so       00001481B4B49D85  __libc_start_main     Unknown  Unknown
 12: gsi.x              00000000004149AE  Unknown               Unknown  Unknown

After changing the code in deter_sfc_mod.f90 as stated above, rebuilding, and rerunning the script, the output shows:

12:  READ_GMI: do_noise_reduction= F
  8:  READ_SATWND: ictype(nc),rmesh,pflag,nlevp,pmesh,nc uv                254    200.00    1   12    100.00  140    0      2.00
  8:  READ_SATWND,nread,ndata,nreal,nodata=     3679708     1521151          26
  8:      3042302

Therefore this issue is solved.

Two notes:

  1. In order to reproduce the issue, I removed many other obs from the inputs; otherwise the run exits before this issue appears.
  2. The second run stopped because of other issues. I am not sure whether this is related to the removed obs inputs:
  0:  in init_sf_xy, jcap,s_ens_hv(  115 -  116), max diff(f0-f)=        69   1250.00    0.8033829157E-10
  0:  in init_sf_xy, jcap,s_ens_hv(  117 -  127), max diff(f0-f)=        69   1300.00    0.8033651522E-10
  0: GLBSOI: jiter,jiterstart,jiterlast,jiterend=    1    1    2    1
 27: [h2c31:278555:0:278555] Caught signal 8 (Floating point exception: floating-point invalid operation)
 27: ==== backtrace (tid: 278555) ====
 27:  0 0x00000000000534e9 ucs_debug_print_backtrace()  ???:0
 27:  1 0x0000000000012cf0 __funlockfile()  :0
 27:  2 0x00000000055e87fc rad_setup_mp_setuprad_()  /scratch1/NCEPDEV/da/Xin.C.Jin/debug_gsi/GSI/src/gsi/setuprad.f90:1650
 27:  3 0x0000000003fb0440 gsi_radoper_mp_setup__()  /scratch1/NCEPDEV/da/Xin.C.Jin/debug_gsi/GSI/src/gsi/gsi_radOper.F90:100
 27:  4 0x000000000263c3c5 setuprhsall_()  /scratch1/NCEPDEV/da/Xin.C.Jin/debug_gsi/GSI/src/gsi/setuprhsall.f90:492
 27:  5 0x0000000003f10fa7 glbsoi_()  /scratch1/NCEPDEV/da/Xin.C.Jin/debug_gsi/GSI/src/gsi/glbsoi.f90:323
 27:  6 0x00000000010989fc gsisub_()  /scratch1/NCEPDEV/da/Xin.C.Jin/debug_gsi/GSI/src/gsi/gsisub.F90:200
 27:  7 0x000000000042dbd8 gsimod_mp_gsimain_run_()  /scratch1/NCEPDEV/da/Xin.C.Jin/debug_gsi/GSI/src/gsi/gsimod.F90:2431
 27:  8 0x0000000000414c4b MAIN__()  /scratch1/NCEPDEV/da/Xin.C.Jin/debug_gsi/GSI/src/gsi/gsimain.f90:633
 27:  9 0x0000000000414aa2 main()  ???:0
 27: 10 0x000000000003ad85 __libc_start_main()  ???:0
 27: 11 0x00000000004149ae _start()  ???:0
 27: =================================
 27: forrtl: error (75): floating point exception
 27: Image              PC                Routine            Line        Source
 27: gsi.x              0000000007EC3BCB  Unknown               Unknown  Unknown
 27: libpthread-2.28.s  00001554B1051CF0  Unknown               Unknown  Unknown
 27: gsi.x              00000000055E87FC  rad_setup_mp_setu        1650  setuprad.f90
 27: gsi.x              0000000003FB0440  gsi_radoper_mp_se         100  gsi_radOper.F90
 27: gsi.x              000000000263C3C5  setuprhsall_              492  setuprhsall.f90
 27: gsi.x              0000000003F10FA7  glbsoi_                   323  glbsoi.f90
 27: gsi.x              00000000010989FC  gsisub_                   200  gsisub.F90
 27: gsi.x              000000000042DBD8  gsimod_mp_gsimain        2431  gsimod.F90
 27: gsi.x              0000000000414C4B  MAIN__                    633  gsimain.f90
 27: gsi.x              0000000000414AA2  Unknown               Unknown  Unknown
 27: libc-2.28.so       00001554B032ED85  __libc_start_main     Unknown  Unknown
 27: gsi.x              00000000004149AE  Unknown               Unknown  Unknown

@RussTreadon-NOAA
Contributor Author

@xincjin-NOAA , I added your bug fix to a working copy of the code from PR #779. The gsi was built in debug mode and run on Hera using all data for the 2023060712 case. The code fails in the same way as you report:

  0: GLBSOI: jiter,jiterstart,jiterlast,jiterend=    1    1    2    1
 27: [h25c06:1552739:0:1552739] Caught signal 8 (Floating point exception: floating-point invalid operation)
 27: ==== backtrace (tid:1552739) ====
 27:  0 0x00000000000534e9 ucs_debug_print_backtrace()  ???:0
 27:  1 0x0000000000012cf0 __funlockfile()  :0
 27:  2 0x00000000055e8d3c rad_setup_mp_setuprad_()  /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/thompson/src/gsi/setuprad.f90:1650
 27:  3 0x0000000003fb0800 gsi_radoper_mp_setup__()  /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/thompson/src/gsi/gsi_radOper.F90:100
 27:  4 0x000000000263c781 setuprhsall_()  /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/thompson/src/gsi/setuprhsall.f90:492
 27:  5 0x0000000003f11367 glbsoi_()  /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/thompson/src/gsi/glbsoi.f90:323
 27:  6 0x0000000001098db8 gsisub_()  /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/thompson/src/gsi/gsisub.F90:200
 27:  7 0x000000000042dbd8 gsimod_mp_gsimain_run_()  /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/thompson/src/gsi/gsimod.F90:2431

@xincjin-NOAA
Contributor

@RussTreadon-NOAA, I have cloned gsi from https://github.com/RussTreadon-NOAA/GSI.git and checked out branch feature/thompson_reff. However, I got different output at the point of failure:


 62: [h10c13:3954744:0:3954744] Caught signal 8 (Floating point exception: floating-point invalid operation)
 59: [h10c12:768002:0:768002] Caught signal 8 (Floating point exception: floating-point invalid operation)
122:
 30: CAL_TZTR compute  -2.07397       0.00000       0.00000      0.118892E-02  0.232059E-02  -1.80642       1.53929     -0.215549      0.807410E-04   12.0221      0.150000E-04  0.258911      RESET tztr to 0.5 .or. 1.5
 14: [h10c03:1330766:0:1330766] Caught signal 8 (Floating point exception: floating-point invalid operation)
126:  READ_BUFRTOVS:             1  versions of SpcCoeff found for amsua_n19
 59: ==== backtrace (tid: 768002) ====
 59:  0 0x00000000000534e9 ucs_debug_print_backtrace()  ???:0
 59:  1 0x0000000000012cf0 __funlockfile()  :0
 59:  2 0x00000000006fc09d MPIR_SUM()  /build/impi/_buildspace/release/../../src/mpi/coll/op/opsum.c:34
 59:  3 0x00000000007561e6 MPIR_Reduce_local()  /build/impi/_buildspace/release/../../src/mpi/coll/reduce_local/reduce_local.c:210
 59:  4 0x000000000074f56e MPIR_Reduce_intra_shum_ring()  /build/impi/_buildspace/release/../../src/mpi/coll/intel/reduce/reduce_intra_ring.c:169
 59:  5 0x000000000018a06b MPIDI_NM_mpi_reduce()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/intel/ofi_coll.h:802
 59:  6 0x000000000018a06b MPIDI_Reduce_intra_composition_beta()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_impl.h:1463
 59:  7 0x000000000018a06b MPID_Reduce_invoke()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:2379
 59:  8 0x000000000018a06b MPIDI_coll_invoke()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3421

I am not sure what I missed. My run script is /scratch1/NCEPDEV/da/Xin.C.Jin/debug_gsi/run_script/rungsi_dev_russ.sh, which is almost the same as yours mentioned above. The run directory is /scratch2/NCEPDEV/stmp1/Xin.C.Jin/tmp382/debug_778.2023060712

@RussTreadon-NOAA
Contributor Author

@xincjin-NOAA , the rest of the traceback in
/scratch2/NCEPDEV/stmp1/Xin.C.Jin/tmp382/debug_778.2023060712/stdout shows

14:  8 0x000000000018a06b MPIDI_coll_invoke()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3421
 14:  9 0x00000000001717ec MPIDI_coll_select()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:130
 14: 10 0x00000000002b49d0 MPID_Reduce()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:285
 14: 11 0x000000000075761c PMPI_Reduce()  /build/impi/_buildspace/release/../../src/mpi/coll/reduce/reduce.c:489
 14: 12 0x00000000000f09ee pmpi_reduce_()  /build/impi/_buildspace/release/../../src/binding/fortran/mpif_h/reducef.c:276
 14: 13 0x000000000376e625 combine_radobs_()  /scratch1/NCEPDEV/da/Xin.C.Jin/debug_gsi/gsi_russ/src/gsi/combine_radobs.f90:137
 14: 14 0x00000000051e46ac read_iasi_()  /scratch1/NCEPDEV/da/Xin.C.Jin/debug_gsi/gsi_russ/src/gsi/read_iasi.f90:984
 14: 15 0x0000000001c4950e read_obsmod_mp_read_obs_()  /scratch1/NCEPDEV/da/Xin.C.Jin/debug_gsi/gsi_russ/src/gsi/read_obs.F90:1757
 14: 16 0x000000000176572d observermod_mp_set__()  /scratch1/NCEPDEV/da/Xin.C.Jin/debug_gsi/gsi_russ/src/gsi/observer.F90:329
 14: 17 0x0000000003f10042 glbsoi_()  /scratch1/NCEPDEV/da/Xin.C.Jin/debug_gsi/gsi_russ/src/gsi/glbsoi.f90:222
 14: 18 0x0000000001098db8 gsisub_()  /scratch1/NCEPDEV/da/Xin.C.Jin/debug_gsi/gsi_russ/src/gsi/gsisub.F90:200
 14: 19 0x000000000042dbd8 gsimod_mp_gsimain_run_()  /scratch1/NCEPDEV/da/Xin.C.Jin/debug_gsi/gsi_russ/src/gsi/gsimod.F90:2431
 14: 20 0x0000000000414c4b MAIN__()  /scratch1/NCEPDEV/da/Xin.C.Jin/debug_gsi/gsi_russ/src/gsi/gsimain.f90:633
 14: 21 0x0000000000414aa2 main()  ???:0
 14: 22 0x000000000003ad85 __libc_start_main()  ???:0
 14: 23 0x00000000004149ae _start()  ???:0
 14: =================================

The code aborts due to a floating point exception in the MPI reduction operation:

     call mpi_reduce(data_all_in,data_all,nele*ndata,mpi_rtype,mpi_sum,&
          mype_root,mpi_comm_sub,ierror)

Dan Kokron comments on the behavior in GSI PR #772. He found that initializing data_all to zero in read_iasi.f90 gets the debug gsi.x past errors.

My working copy of thompson_reff on Hera initializes data_all to zero after it is allocated in read_iasi.f90.

@@ -437,6 +437,7 @@ subroutine read_iasi(mype,val_iasi,ithin,isfcalc,rmesh,jsatid,gstime,&
   allocate(allchan(2,1))     ! actual values set after ireadsb
   allocate(bufr_chan_test(1))! actual values set after ireadsb
   allocate(scalef(1))
+  data_all=zero

 ! Big loop to read data file
   next=0

This change is not needed on WCOSS2 for the debug gsi.x to run to completion. WCOSS2 and Hera compile GSI with different versions of the intel compiler and mpi libraries. A new GSI issue could be opened to document this problem and identify a robust solution.
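
The pattern in the diff above, setting a freshly allocated buffer to zero before it takes part in a sum, can be sketched in isolation. This is only an illustration under stated assumptions: it involves no MPI, `real64` stands in for the GSI real kind, and the module and subroutine names are hypothetical.

```fortran
module zero_init_sketch
  use iso_fortran_env, only: real64
  implicit none
contains
  ! Allocate an (nele, ndata) buffer and set every element to zero,
  ! so no element is left undefined before it is used in a sum.
  subroutine alloc_zeroed(buf, nele, ndata)
    real(real64), allocatable, intent(out) :: buf(:,:)
    integer, intent(in) :: nele, ndata
    allocate(buf(nele, ndata))
    buf = 0.0_real64
  end subroutine alloc_zeroed
end module zero_init_sketch
```

An `allocate` statement on its own leaves the array contents undefined, which is why the explicit assignment is the relevant step here.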

@RussTreadon-NOAA
Contributor Author

FYI @xincjin-NOAA . PR #779 was merged into develop at c1eb61c. Branch RussTreadon-NOAA:feature/thompson_reff has been deleted.

@xincjin-NOAA
Contributor

@RussTreadon-NOAA Thank you so much for the information. I made the changes you suggested above. However, the job was killed because of the 30-minute time limit of the debug queue. How many hours did you set for the run?

Thanks,
Xin

@RussTreadon-NOAA
Contributor Author

@xincjin-NOAA , I arbitrarily bumped the wall clock limit up to 3 hours, 30 minutes. Doing so requires the queue to be changed from debug to batch. The debug queue has a maximum wall clock limit of 30 minutes.

@xincjin-NOAA
Contributor

@RussTreadon-NOAA and @emilyhcliu I extended the time limit to 6.5 hours, ran a few experiments, and found that the cause of the issue is NaNs in the bias file: /scratch2/NCEPDEV/stmp3/Emily.Liu/ROTDIRS/v17allskyens/gdas.20230607/06/analysis/atmos/gdas.t06z.abias

3199 gmi_gpm                 12   0.288353E+01   0.348420E+06   101
             NaN    0.000000    0.000000         NaN         NaN         NaN         NaN    0.000000         NaN         NaN
             NaN         NaN

After I removed these NaNs and reran the experiment, it ran normally until it was canceled on reaching the 6.5-hour time limit.

 86: cost,grad,step,b,step? =   1  34  1.116154570241830079E+06  1.553607396778998373E+05  5.049087186508115499E-01  1.177244623397102563E+00  good
 86: cost,grad,step,b,step? =   1  35  1.113557267155878711E+06  1.069051637115771591E+05  4.602519601049419040E-01  1.131151246045327685E+00  good
  0: slurmstepd: error: *** STEP 64903831.1 ON h6c01 CANCELLED AT 2024-08-13T07:55:44 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
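
NaN entries like the ones shown above can be located with a plain text scan of the file. A minimal sketch, assuming the file is formatted text in which bad values appear as the literal token `NaN` (as in the excerpt); the module and function names are hypothetical, and lines longer than 1024 characters are only checked up to that length.

```fortran
module abias_nan_scan
  implicit none
contains
  ! Count the lines of a text file that contain the token "NaN".
  ! Returns -1 if the file cannot be opened.
  integer function count_nan_lines(path) result(nbad)
    character(len=*), intent(in) :: path
    character(len=1024) :: line
    integer :: lun, ios
    nbad = 0
    open(newunit=lun, file=path, status='old', action='read', iostat=ios)
    if (ios /= 0) then
      nbad = -1
      return
    end if
    do
      read(lun, '(A)', iostat=ios) line
      if (ios /= 0) exit          ! end of file or read error
      if (index(line, 'NaN') > 0) nbad = nbad + 1
    end do
    close(lun)
  end function count_nan_lines
end module abias_nan_scan
```

A nonzero return from `count_nan_lines` on an abias file would flag it for inspection before it is used in a run.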

@RussTreadon-NOAA
Contributor Author

Good detective work, @xincjin-NOAA. You should open a PR to get your fix to subroutine deter_sfc_gmi into develop.

@emilyhcliu , the operational abias and abias_int files contain NaN for gmi_gpm channel 12. Do we know why / how NaN got into gmi_gpm channel 12?

@emilyhcliu
Contributor

> Good detective work, @xincjin-NOAA. You should open a PR to get your fix to subroutine deter_sfc_gmi into develop.
>
> @emilyhcliu , the operational abias and abias_int files contain NaN for gmi_gpm channel 12. Do we know why / how NaN got into gmi_gpm channel 12?

@RussTreadon-NOAA In v16.3, GMI is not assimilated, so we set it to monitoring mode. Later, NESDIS upgraded their satellite ingest for GMI and a few other data types, and our obsproc team modified their operations accordingly. However, the size of the GMI data doubled because the data were duplicated. This caused a memory problem, with NaN appearing in the bias correction. The obsproc team fixed the problem and the size of the GMI data is back to normal.

xincjin-NOAA added a commit to xincjin-NOAA/GSI that referenced this issue Aug 13, 2024
@RussTreadon-NOAA
Contributor Author

Thank you @emilyhcliu for sharing how NaN got in abias. What's the best way to clean this up? It's a trivial change but our implementation process is not agile. Will GFS v17 assimilate gpm_gmi? This would require NCO to pick up a new abias file.

@emilyhcliu
Contributor

> Thank you @emilyhcliu for sharing how NaN got in abias. What's the best way to clean this up? It's a trivial change but our implementation process is not agile. Will GFS v17 assimilate gpm_gmi? This would require NCO to pick up a new abias file.

We can reset the bias correction in our obs upgrade before v17 implementation.

@xincjin-NOAA
Contributor

> Good detective work, @xincjin-NOAA. You should open a PR to get your fix to subroutine deter_sfc_gmi into develop.

@RussTreadon-NOAA I have created a pull request (#781). Do you know how to remove the fix directory from the changed files?

Thanks,

Xin

3 participants