OMIPp25+BLING and CM4 crash with Overflow in reproducing_EFP_sum(_2d) #589

Closed
nikizadehgfdl opened this issue Mar 27, 2024 · 5 comments
Labels
bug Something isn't working

@nikizadehgfdl

After updating MOM6-examples from commit 40e3937 (on 20231130) to commit ab0c120 (on 20240321), the regression test experiment OMIP_CORE2 (which has BLING on) crashes as follows:

FATAL from PE   125: Overflow in reproducing_EFP_sum(_2d) conversion of   9.56361E+43                                  
                                                                                                                       
Image              PC                Routine            Line        Source                                              
fms_MOM6_SIS2_com  0000000001F4A9A7  mpp_mod_mp_mpp_er          72  mpp_util_mpi.inc                                    
fms_MOM6_SIS2_com  000000000097B4C8  mom_error_handler         191  MOM_error_handler.F90                              
fms_MOM6_SIS2_com  00000000009E8921  mom_coms_mp_repro         203  MOM_coms.F90                                        
fms_MOM6_SIS2_com  0000000000C66F23  mom_spatial_means         391  MOM_spatial_means.F90                              
fms_MOM6_SIS2_com  0000000000B1F851  mom_generic_trace         689  MOM_generic_tracer.F90                              
fms_MOM6_SIS2_com  00000000009EF10C  mom_tracer_flow_c         725  MOM_tracer_flow_control.F90                        
fms_MOM6_SIS2_com  000000000104B38B  mom_sum_output_mp         530  MOM_sum_output.F90                                  
fms_MOM6_SIS2_com  0000000000BACBF0  mom_mp_finish_mom        3431  MOM.F90                                            
fms_MOM6_SIS2_com  00000000009D057E  ocean_model_mod_m         572  ocean_model_MOM.F90                                
fms_MOM6_SIS2_com  000000000041499F  MAIN__                   1063  coupler_main.F90  
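
For readers unfamiliar with this error: a reproducing sum of this kind typically converts each value to a fixed-point (integer) representation so that the total does not depend on summation order or processor layout, and the conversion has to reject values too large to represent. The Python sketch below is a toy illustration of that idea only. It is not the MOM_coms.F90 algorithm, and `SCALE`, `MAX_ABS`, `to_fixed`, and this `reproducing_sum` are hypothetical names and limits chosen for the example.

```python
SCALE = 2 ** 40        # hypothetical fixed-point resolution (assumption)
MAX_ABS = 2.0 ** 60    # hypothetical largest representable magnitude (assumption)


def to_fixed(x: float) -> int:
    """Convert a float to a scaled integer, rejecting NaN and out-of-range values."""
    if x != x or abs(x) > MAX_ABS:
        raise OverflowError(f"Overflow in fixed-point conversion of {x:.5E}")
    return int(round(x * SCALE))


def reproducing_sum(values) -> float:
    """Sum through integers, so the result is independent of summation order."""
    total = sum(to_fixed(v) for v in values)  # integer addition is exact
    return total / SCALE


if __name__ == "__main__":
    vals = [1e-3, 1e9, -1e9, 2.5]
    # Same answer regardless of ordering.
    print(reproducing_sum(vals) == reproducing_sum(list(reversed(vals))))
    try:
        reproducing_sum([1.0, 9.56361e43])
    except OverflowError as err:
        print(err)
```

Under these assumptions, a value such as 9.56361E+43 in a tracer field is rejected at conversion time, which suggests the overflow is a symptom of a corrupted field rather than a fault in the summation itself.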

For some layouts, it crashes like:

Nan!
fms_MOM6_SIS2_com  0000000001F3D5FA  mpp_mod_mp_mpp_mi          32  mpp_reduce_mpi.fh                                   
fms_MOM6_SIS2_com  0000000000CE0EC9  mom_horizontal_re          86  MOM_horizontal_regridding.F90                       
fms_MOM6_SIS2_com  00000000010461D5  mom_tracer_initia         220  MOM_tracer_initialization_from_Z.F90                
fms_MOM6_SIS2_com  0000000000B27EA5  mom_generic_trace         354  MOM_generic_tracer.F90                              
fms_MOM6_SIS2_com  00000000009F09E0  mom_tracer_flow_c         343  MOM_tracer_flow_control.F90                         
fms_MOM6_SIS2_com  0000000000BBC705  mom_mp_initialize        3323  MOM.F90  

which comes from the "stop" statement in
https://github.com/NOAA-GFDL/MOM6/blob/dev/gfdl/src/framework/MOM_horizontal_regridding.F90#L74
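
The linked source line is not reproduced in this thread, so the following is only an assumption about the kind of check that prints "Nan!": a scan of a field that aborts on the first value that is not equal to itself. The Python sketch below illustrates that idiom. `check_no_nan` is a hypothetical helper, not a MOM6 routine.

```python
def check_no_nan(field, name="field"):
    """Raise if any entry of a 2-D list is NaN, using the x != x idiom."""
    for j, row in enumerate(field):
        for i, x in enumerate(row):
            if x != x:  # only NaN compares unequal to itself
                raise FloatingPointError(f"NaN in {name} at (i={i}, j={j})")


if __name__ == "__main__":
    check_no_nan([[1.0, 2.0], [3.0, 4.0]])  # passes silently
    try:
        check_no_nan([[1.0, float("nan")]], name="tracer")
    except FloatingPointError as err:
        print(err)
```

A check like this reports where bad data first shows up, not where it was created, so both tracebacks can plausibly point back to the same earlier fault.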

@nikizadehgfdl
Author

Running in debug mode (-O0) gives division by 0 and the following traceback.

forrtl: error (73): floating divide by zero                                                                             
Image              PC                Routine            Line        Source                                              
libpthread-2.31.s  000014F967871910  Unknown               Unknown  Unknown                                             
fms_MOM6_SIS2_com  00000000019A5FDA  mom_remapping_mp_         754  MOM_remapping.F90                                   
fms_MOM6_SIS2_com  000000000198F533  mom_remapping_mp_         195  MOM_remapping.F90                                   
fms_MOM6_SIS2_com  0000000003423DDC  mom_ale_mp_ale_re        1335  MOM_ALE.F90                                         
fms_MOM6_SIS2_com  0000000001CCB578  mom_tracer_initia         204  MOM_tracer_initialization_from_Z.F90                
fms_MOM6_SIS2_com  0000000001B3728D  mom_generic_trace         354  MOM_generic_tracer.F90                              
fms_MOM6_SIS2_com  00000000020D3730  mom_tracer_flow_c         343  MOM_tracer_flow_control.F90                         
fms_MOM6_SIS2_com  0000000002016C03  mom_mp_initialize        3323  MOM.F90                                             
fms_MOM6_SIS2_com  0000000001A265B7  ocean_model_mod_m         278  ocean_model_MOM.F90      

The experiment runs fine when I turn off generic tracer BLING.

@nikizadehgfdl
Author

nikizadehgfdl commented Apr 24, 2024

The OM4p25+BLING crash seems to happen after applying the following MOM6 commit (around February 1st, 2024):
9a6ddee

This makes sense, since the crash happens only when BLING is turned on.

The crash is absent in the previous commit e7a7a82.

@favorliao

favorliao commented May 3, 2024

I think the reason is the missing hSrc in computing the thickness. The source of the issue is here:

dz_neglect = set_dz_neglect(GV, US, remap_answer_date, dz_neglect_edge)

The possible solution should be:

if (h_is_in_Z_units) then
  dz_neglect = set_dz_neglect(GV, US, remap_answer_date, dz_neglect_edge)
  !added to compute the hSrc
  GV_loc = GV ; GV_loc%ke = kd
  call dz_to_thickness_simple(dzSrc, hSrc, G, GV_loc, US)
  !finish adding
  call ALE_remap_scalar(remapCS, G, GV, kd, hSrc, tr_z, h, tr, all_cells=.false., answer_date=remap_answer_date, &
                        H_neglect=dz_neglect, H_neglect_edge=dz_neglect_edge)
else
  ! Equation of state data is not available, so a simpler rescaling will have to suffice,
  ! but it might be problematic in non-Boussinesq mode.
  GV_loc = GV ; GV_loc%ke = kd
  call dz_to_thickness_simple(dzSrc, hSrc, G, GV_loc, US)
  call ALE_remap_scalar(remapCS, G, GV, kd, hSrc, tr_z, h, tr, all_cells=.false., answer_date=remap_answer_date)
endif
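
The diagnosis above (a source-thickness array that is never filled before it reaches the remapping call) is consistent with the divide-by-zero seen in the debug build. The Python sketch below is a toy illustration, not MOM6 code. `column_mean` is a hypothetical thickness-weighted average, and the all-zero list is only one possible stand-in for an unset array, since uninitialized memory can hold any value.

```python
def column_mean(h_src, tr_src, h_neglect=0.0):
    """Thickness-weighted mean of a tracer over one column (toy example)."""
    total_h = sum(h_src)
    total_tr = sum(h * t for h, t in zip(h_src, tr_src))
    return total_tr / (total_h + h_neglect)


if __name__ == "__main__":
    dz_src = [10.0, 20.0, 30.0]   # populated layer extents
    h_unset = [0.0, 0.0, 0.0]     # stand-in for an array that was never filled
    tr_src = [1.0, 2.0, 4.0]

    print(column_mean(dz_src, tr_src))        # well defined
    try:
        column_mean(h_unset, tr_src)          # divides by zero
    except ZeroDivisionError as err:
        print("error:", err)
    # A negligible-thickness guard avoids the exception but not the bad input:
    print(column_mean(h_unset, tr_src, h_neglect=1e-30))
```

The last call shows that a guard term in the denominator only hides the symptom. The result is finite but meaningless unless the thickness array actually passed to the routine has been populated.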

Hallberg-NOAA added a commit to Hallberg-NOAA/MOM6 that referenced this issue May 24, 2024
  Corrected an argument to the call to ALE_remap_scalar() when h_in_Z_units is
true in MOM_initialize_tracer_from_Z(), to avoid the problems documented in
github.com/NOAA-GFDL/issues/589.  A comment was also added explaining the
logic of what is going on in this fork of the code.  This commit will change
answers with some generic tracers that are initialized from a Z-space input
file, restoring them to previous values that worked previously (before about
Feb. 1, 2024 on dev/gfdl) in Boussinesq configurations without dimensional
consistency testing, but in a new form that does pass the dimensional
consistency testing for depths and thicknesses.  All answers are bitwise
identical in any cases that do not use generic tracers.
@Hallberg-NOAA
Member

Thank you for tracking down the source of this problem, @favorliao, with the use of the uninitialized hSrc array when h_is_in_Z_units == .true., which only occurs with the generic tracers. I agree with your diagnosis, but not the solution you propose.

The issue here is that when h_is_in_Z_units == .true., the thickness variable 'h' is being provided in depth units, not thickness units. The subroutine dz_to_thickness_simple() converts vertical extents (in depth units) into thicknesses (in thickness units), but that is not what is needed here. Instead, I think that the solution is to replace call ALE_remap_scalar(..., hSrc, ...) with call ALE_remap_scalar(..., dzSrc, ...) inside of the h_is_in_Z_units == .true. block. I have put in a pull request (#650) that I think should address this problem, and it is passing the usual MOM6 regression tests. However, this particular generic-tracer related bug is not detected by our usual tests, so I would appreciate it if you could evaluate whether my proposed bug fix actually addresses this problem, @nikizadehgfdl.
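
The units argument can be illustrated with a toy remapping. The Python sketch below is not the ALE_remap_scalar implementation. `remap_pcm` is a hypothetical piecewise-constant overlap average, and the factor of 1000 is an arbitrary stand-in for a depth-to-thickness conversion. It shows that when the target grid is in depth units, the source extents must be in depth units too, or the two columns no longer line up.

```python
def edges(h):
    """Interface positions measured downward from the surface."""
    z = [0.0]
    for t in h:
        z.append(z[-1] + t)
    return z


def remap_pcm(h_src, tr_src, h_tgt):
    """Remap layer-mean values onto target layers by overlap-weighted averaging."""
    zs, zt = edges(h_src), edges(h_tgt)
    out = []
    for k in range(len(h_tgt)):
        top, bot = zt[k], zt[k + 1]
        acc = 0.0
        for n in range(len(h_src)):
            overlap = min(bot, zs[n + 1]) - max(top, zs[n])
            if overlap > 0.0:
                acc += overlap * tr_src[n]
        out.append(acc / (bot - top) if bot > top else 0.0)
    return out


if __name__ == "__main__":
    dz_src = [10.0, 10.0, 10.0]          # source extents in depth units
    tr_src = [1.0, 2.0, 3.0]
    h_tgt = [15.0, 15.0]                 # target extents, also in depth units

    print(remap_pcm(dz_src, tr_src, h_tgt))       # consistent units

    h_converted = [1000.0 * dz for dz in dz_src]  # hypothetical unit conversion
    print(remap_pcm(h_converted, tr_src, h_tgt))  # target sees only the top layer
```

With mismatched units, every target layer overlaps only the first source layer, so the deeper source values never reach the target column. Passing the depth-unit array alongside a depth-unit target keeps the two grids aligned.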

@Hallberg-NOAA Hallberg-NOAA added the bug Something isn't working label May 24, 2024
marshallward pushed a commit that referenced this issue May 29, 2024
@Hallberg-NOAA
Member

It has been verified that this issue was corrected when PR #650 was merged into dev/gfdl.
