-
Notifications
You must be signed in to change notification settings - Fork 568
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
logic and restart corruption fixes #949
logic and restart corruption fixes #949
Conversation
* bring in changes from mesh cap branch to fix restart corruption * fix logic statements for domain decomp and grid type
Thanks @DeniseWorthen! I'll submit the regtests. |
Hera and GNU RTs with waves all pass. |
Great, thanks for that update @DeniseWorthen. The WW3 regtests are still running though should be done shortly. |
Note that whitepaces have changed in |
Update: fyi @DeniseWorthen. I found my compare script job on hera hung near the end. I'm going to re-submit it to get clean logs, but from what I can tell so far it doesn't look like any answers have changed. I'll be able to merge it this afternoon or early tomorrow. |
Hi @DeniseWorthen, the job I re-ran yesterday hung on the same test comparison, so it is being run once more to be certain there is no issue, and to get the log output. This would normally be ~1hr job, but the queue on |
@MatthewMasarik-NOAA OK, thanks. Are you sure that the issue isn't actually w/ the changes in the PR? |
@DeniseWorthen, that's a possibility. I'm hoping it is just a hiccup in running the compare script, which can happen, but I haven't ruled out that there is an issue with the PR. I will hopefully know more shortly. |
@MatthewMasarik-NOAA Is there any update on the RT testing for this PR? |
Hi @DeniseWorthen, the third attempt that went through over the weekend had the same effect of hanging, and I've been looking into the cause today. There's a file difference between your branch and the develop run for the |
Thanks for the update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On hera, both @MatthewMasarik-NOAA and I had ww3_ufs1.3 differences in the rmp_src_to_dst_conserv_002_001.nc and rmp_src_to_dst_conserv_003_001.nc files. I have looked multiple times at the code changes and there is no explanation for this. Moreover, the results other than the known diffs in the restart files are the same for this regression test. Looking at the results they are on the order of 1e-16. We're having trouble getting through the queue on hera, so I re-ran regression tests on orion, in which there was no issue. Those logs are posted here. The information about the ww3_ufs1.3 test is mentioned here in case we need to back trace problems.
Matrix results from running with intel on Orion:
matrixCompFull.txt
matrixCompSummary.txt
matrixDiff.txt
Only the expected not b4b tests:
**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR2_UQ_MPI_d2 (15 files differ)
mww3_test_03/./work_PR2_UNO_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2 (9 files differ)
mww3_test_03/./work_PR1_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2 (16 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c (12 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c (16 files differ)
mww3_test_03/./work_PR1_MPI_d2 (13 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e_c (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e_c (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2 (18 files differ)
ww3_ta1/./work_UPD0F_U (0 files differ)
ww3_tp2.10/./work_MPI_OMPH (7 files differ)
ww3_tp2.14/./work_OASACM5 (1 files differ)
ww3_tp2.16/./work_MPI_OMPH (4 files differ)
ww3_tp2.17/./work_ma (1 files differ)
ww3_tp2.17/./work_mc1 (4 files differ)
ww3_tp2.17/./work_mc (4 files differ)
ww3_tp2.17/./work_a (1 files differ)
ww3_tp2.17/./work_b (4 files differ)
ww3_tp2.17/./work_mb (4 files differ)
ww3_tp2.17/./work_c (4 files differ)
ww3_tp2.17/./work_ma1 (1 files differ)
ww3_ufs1.3/./work_a (3 files differ)
Thank you @JessicaMeixner-NOAA for the regtests and documentation of the experience we had on hera. And thank you to @DeniseWorthen for the all work that went into this which fixes 3 issues. Your hard work is very much appreciated. |
Pull Request Summary
This work is part of UFS Issue 1556
Fixes for logic and restart-write corruption for domain-decomposition case.
Description
Cleans up the logical if statements for combinations of lpdlib and grid type to clarify that lpdlib + structured is not valid use-case.
Fixes corruption of
Z0
exported by the wave model at the restart writing frequency by by-passing the code blocks inw3initmd
which initialize the mpi sends/recvs even when running with domain decomposition (where they are not used).Issue(s) addressed
Commit Message
Clarify allowed grid type logic when running with domain decomposition. By-pass mpi send/recv set-up in
w3initmd
when using domain decomposition.Check list
Testing
Tested in UFS branch containing mesh cap with these fixes for ATMW + CMEPS for 24hr control, restart, decomposition (results on 30PE are B4B w/ 20PE) and debug test. The coupled model case does not show decomposition reproducibility however.
UFS will also be run for existing RTs using the mesh cap branch with these changes.