Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

checking of nans in export field bundles #377

Merged
merged 7 commits into from
May 10, 2023

Conversation

mvertens
Copy link
Collaborator

@mvertens mvertens commented May 5, 2023

Description of changes

checking of nans in export field bundles #377

Specific notes

Introduced a new method med_methods_FB_check_for_nans in med_methods_mod and has all of the med_phases_prep_xxx_mod.F90 files call this method. If Nans are found in any field the model will abort, but it will first print out the number of nans found for each field and on each processor.

Contributors other than yourself, if any: @jedwards4b

CMEPS Issues Fixed: #370

Are changes expected to change answers? bfb

Any User Interface Changes (namelist or namelist defaults changes)? None

Testing performed

Verified that if nans were set in the 1d and 2d field in the call to med_phases_prep_atm, then PET files were written with the right statistics and the model aborted.

@mvertens mvertens requested a review from jedwards4b May 5, 2023 13:56
@mvertens mvertens marked this pull request as draft May 5, 2023 13:56
Copy link
Collaborator

@jedwards4b jedwards4b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please choose either FB_check_for_nans or fldbun_check_for_nans.

mediator/med_phases_prep_glc_mod.F90 Outdated Show resolved Hide resolved
mediator/med_phases_prep_lnd_mod.F90 Outdated Show resolved Hide resolved
mediator/med_phases_prep_rof_mod.F90 Outdated Show resolved Hide resolved
@mvertens mvertens marked this pull request as ready for review May 8, 2023 12:00
@mvertens mvertens marked this pull request as draft May 8, 2023 12:04
@mvertens mvertens marked this pull request as ready for review May 8, 2023 12:27
@DeniseWorthen
Copy link
Collaborator

I will test UFS, but I'd like to wait until after the WW3 unstructured mesh gets committed early this week.

@mvertens
Copy link
Collaborator Author

mvertens commented May 8, 2023

@DeniseWorthen - thanks!

@uturuncoglu @denise - This should have no impact on the UFS since I have inserted an #ifdef for this to only be relevant in CESM. The issue is that there is no code in UFS that implements the shr_infnan_mod logic that we are using in CESM. So until that type of functionality exists - UFS should just call this routine and immediately return.

@uturuncoglu - for CMCC this code can be activated since they are also using the share code that CESM is using.

@DeniseWorthen
Copy link
Collaborator

@mvertens Thanks. I hadn't actually looked at the changes yet ;-)

@billsacks
Copy link
Member

Thanks a lot for your work on this, @mvertens - this will be very nice to have!

@jedwards4b jedwards4b self-assigned this May 9, 2023
Copy link
Collaborator

@DeniseWorthen DeniseWorthen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mvertens We have the shr_infnan_mod.F90 available in our CDEPS repo so I will try to get this working on our side for a later PR.

@jedwards4b jedwards4b removed the request for review from uturuncoglu May 10, 2023 14:02
@jedwards4b jedwards4b merged commit 22b6278 into ESCOMP:main May 10, 2023
@DeniseWorthen
Copy link
Collaborator

DeniseWorthen commented May 11, 2023

@jedwards4b In med_methods_mod, we have

use shr_infnan_mod, only: shr_infan_isnan

should that be

use shr_infnan_mod, only: shr_infnan_isnan

?

@jedwards4b
Copy link
Collaborator

This should not have passed testing - I'm looking into it now.

@DeniseWorthen
Copy link
Collaborator

@mvertens @jedwards4b I've been able to turn this feature on for UFS now but I think I'm seeing a fairly big impact on run time. I ran our c96-1deg ocean+ice case for 5 days with this feature on and off. To turn off, I just inserted a return at the top of the med_methods_FB_check_for_nans. The wall clock times were: nan check on 1220s, nan check off 986s.

Was a similar impact seen in CESM?

I'm wondering how to best make this optional for UFS, since it would be a nice feature to have available. One way I thought is

diff --git a/mediator/med_methods_mod.F90 b/mediator/med_methods_mod.F90
index 1da8d6ac..4425e606 100644
--- a/mediator/med_methods_mod.F90
+++ b/mediator/med_methods_mod.F90
@@ -2530,6 +2530,11 @@ contains
     ! ----------------------------------------------
     rc = ESMF_SUCCESS

+#ifndef CESMCOUPLED
+    if (dbug_flag == 0) then
+       return
+    end if
+#endif
     call ESMF_FieldBundleGet(FB, fieldCount=fieldCount, rc=rc)
     if (ChkErr(rc,__LINE__,u_FILE_u)) return

Or, there could be a config variable to toggle. Or, it could be controlled w/ a #ifdef DEBUG. Any thoughts?

@jedwards4b
Copy link
Collaborator

@DeniseWorthen Thanks for looking at the performance impact of this change. I will run a similar test with PFS.ne30pg3_t061.B1850MOM and let you know. I think it would be best to add a namelist variable to control it - I'll open a issue to track that.

@jedwards4b
Copy link
Collaborator

For a 20 day run with PFS.ne30pg3_t061.B1850MOM
with nan checks: 3.36 ypd
without nan checks: 3.51 ypd

@DeniseWorthen
Copy link
Collaborator

wow, much less impact for you.

@mvertens
Copy link
Collaborator Author

@jedwards4b - thanks for this test. I think its worth keeping this check with this small of a hit- particularly since we don't turn on floating point trapping in production runs. Do you agree? I think it also should be discussed at a CSEG meeting to make sure there is agreement to keep this given that there is a cost.

@DeniseWorthen
Copy link
Collaborator

DeniseWorthen commented May 26, 2023

Another follow-up. It turns out that the impact UFS was seeing in run time w/ this feature on was not due to the NaN checking, but w/ the switching of the default PIO settings in an earlier PR (#378).

If PIO settings are not provided, med_io had been using

       cvalue = 'BOX'
       pio_rearranger = PIO_REARR_BOX

This was changed to

       cvalue = 'SUBSET'
       pio_rearranger = PIO_REARR_SUBSET

If I revert that change, turning on the check-for-nan feature has negligible impact on wall-clock for our 5day test (in fact, it was a tiny bit faster, but that is just noise).

I don't understand the PIO settings and how to choose them, but it looks I can specify our old settings as a configuration using

pio_rearranger = BOX

@jedwards4b
Copy link
Collaborator

@DeniseWorthen Thanks for looking into this further. The subset rearranger scales better but is slower at small task counts.

@DeniseWorthen
Copy link
Collaborator

@jedwards4b Interesting! Gerhard has been working w/ Jun on various scaling issues w/ the coupled model and one of the issues he highlighted was both CMEPS and CICE slow I/O. We have not been using PIO for CICE but started to look into it after Gerhard's evaluations. It sounds like we should test using the subset rearranger also.

@mvertens mvertens deleted the feature/check_for_nans branch June 16, 2023 08:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add NaN checks for coupling fields
4 participants