Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

logic and restart corruption fixes #949

Merged

Conversation

DeniseWorthen
Copy link
Contributor

@DeniseWorthen DeniseWorthen commented Mar 16, 2023

Pull Request Summary

This work is part of UFS Issue 1556

Fixes for logic and restart-write corruption for domain-decomposition case.

Description

Cleans up the logical if statements for combinations of lpdlib and grid type to clarify that lpdlib + structured is not valid use-case.

Fixes corruption of Z0 exported by the wave model at the restart writing frequency by by-passing the code blocks in w3initmd which initialize the mpi sends/recvs even when running with domain decomposition (where they are not used).

Issue(s) addressed

Commit Message

Clarify allowed grid type logic when running with domain decomposition. By-pass mpi send/recv set-up in w3initmd when using domain decomposition.

Check list

Testing

  • How were these changes tested?
  • Are the changes covered by regression tests? (If not, why? Do new tests need to be added?)
  • Have the matrix regression tests been run (if yes, please note HPC and compiler)?
  • Please indicate the expected changes in the regression test output, (Note the list of known non-identical tests.)
  • Please provide the summary output of matrix.comp (matrix.Diff.txt, matrixCompFull.txt and matrixCompSummary.txt):

Tested in UFS branch containing mesh cap with these fixes for ATMW + CMEPS for 24hr control, restart, decomposition (results on 30PE are B4B w/ 20PE) and debug test. The coupled model case does not show decomposition reproducibility however.

UFS will also be run for existing RTs using the mesh cap branch with these changes.

* bring in changes from mesh cap branch to fix
restart corruption
* fix logic statements for domain decomp and grid type
@MatthewMasarik-NOAA
Copy link
Collaborator

Thanks @DeniseWorthen! I'll submit the regtests.

@DeniseWorthen
Copy link
Contributor Author

Hera and GNU RTs with waves all pass.

@MatthewMasarik-NOAA
Copy link
Collaborator

Great, thanks for that update @DeniseWorthen. The WW3 regtests are still running though should be done shortly.

@DeniseWorthen
Copy link
Contributor Author

Note that whitepaces have changed in w3initmd.F90 because an if/endif was removed. This will mean the indenting of the enclosed block will change.

@MatthewMasarik-NOAA
Copy link
Collaborator

Update: fyi @DeniseWorthen. I found my compare script job on hera hung near the end. I'm going to re-submit it to get clean logs, but from what I can tell so far it doesn't look like any answers have changed. I'll be able to merge it this afternoon or early tomorrow.

@MatthewMasarik-NOAA
Copy link
Collaborator

Hi @DeniseWorthen, the job I re-ran yesterday hung on the same test comparison, so it is being run once more to be certain there is no issue, and to get the log output. This would normally be ~1hr job, but the queue on hera seems to be quite slow today. My intention is to merge it today as soon as the job returns. Thanks for your patience.

@DeniseWorthen
Copy link
Contributor Author

@MatthewMasarik-NOAA OK, thanks. Are you sure that the issue isn't actually w/ the changes in the PR?

@MatthewMasarik-NOAA
Copy link
Collaborator

@DeniseWorthen, that's a possibility. I'm hoping it is just a hiccup in running the compare script, which can happen, but I haven't ruled out that there is an issue with the PR. I will hopefully know more shortly.

@DeniseWorthen
Copy link
Contributor Author

@MatthewMasarik-NOAA Is there any update on the RT testing for this PR?

@MatthewMasarik-NOAA
Copy link
Collaborator

Hi @DeniseWorthen, the third attempt that went through over the weekend had the same effect of hanging, and I've been looking into the cause today. There's a file difference between your branch and the develop run for the ww3_ufs1.3 regression test. We are currently investigating this. Please bear with us for the delay, with the fairshare levels we have had very slow turn around times the past few days. We've moved some over to orion to hopefully help speed things up. I'll report back as soon as I have some answers.

@DeniseWorthen
Copy link
Contributor Author

Thanks for the update.

@JessicaMeixner-NOAA JessicaMeixner-NOAA self-requested a review March 21, 2023 15:39
Copy link
Collaborator

@JessicaMeixner-NOAA JessicaMeixner-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On hera, both @MatthewMasarik-NOAA and I had ww3_ufs1.3 differences in the rmp_src_to_dst_conserv_002_001.nc and rmp_src_to_dst_conserv_003_001.nc files. I have looked multiple times at the code changes and there is no explanation for this. Moreover, the results other than the known diffs in the restart files are the same for this regression test. Looking at the results they are on the order of 1e-16. We're having trouble getting through the queue on hera, so I re-ran regression tests on orion, in which there was no issue. Those logs are posted here. The information about the ww3_ufs1.3 test is mentioned here in case we need to back trace problems.

Matrix results from running with intel on Orion:

matrixCompFull.txt
matrixCompSummary.txt
matrixDiff.txt

Only the expected not b4b tests:

**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR2_UQ_MPI_d2                     (15 files differ)
mww3_test_03/./work_PR2_UNO_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2                     (9 files differ)
mww3_test_03/./work_PR1_MPI_e                     (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2                     (16 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c                     (12 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e                     (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c                     (16 files differ)
mww3_test_03/./work_PR1_MPI_d2                     (13 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e_c                     (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e_c                     (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2                     (18 files differ)
ww3_ta1/./work_UPD0F_U                     (0 files differ)
ww3_tp2.10/./work_MPI_OMPH                     (7 files differ)
ww3_tp2.14/./work_OASACM5                     (1 files differ)
ww3_tp2.16/./work_MPI_OMPH                     (4 files differ)
ww3_tp2.17/./work_ma                     (1 files differ)
ww3_tp2.17/./work_mc1                     (4 files differ)
ww3_tp2.17/./work_mc                     (4 files differ)
ww3_tp2.17/./work_a                     (1 files differ)
ww3_tp2.17/./work_b                     (4 files differ)
ww3_tp2.17/./work_mb                     (4 files differ)
ww3_tp2.17/./work_c                     (4 files differ)
ww3_tp2.17/./work_ma1                     (1 files differ)
ww3_ufs1.3/./work_a                     (3 files differ)

@MatthewMasarik-NOAA
Copy link
Collaborator

Thank you @JessicaMeixner-NOAA for the regtests and documentation of the experience we had on hera.

And thank you to @DeniseWorthen for the all work that went into this which fixes 3 issues. Your hard work is very much appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants