Add support for regression tests on NCEP RDHPC orion machine #468

Merged
merged 26 commits into from
Sep 26, 2021


JessicaMeixner-NOAA
Collaborator

@JessicaMeixner-NOAA JessicaMeixner-NOAA commented Sep 8, 2021

Pull Request Summary

Adds support for NCEP's RDHPCS resource Orion and updates Hera to use "srun", as recommended.

Description

This PR enables running the regression tests on Orion. No comparison is done on Orion, and a few errors remain due to a NetCDF issue on Orion when using partitions (see #451).

Issue(s) addressed

Check list

Commit Message

  • Update NCEP regression tests to use srun and enable orion

Testing

  • How were these changes tested? WW3 standalone regtests
  • Are the changes covered by regression tests? yes
  • Have regression tests been run? yes
  • Which compiler / HPC you used to run the regression tests in the PR? hera.intel and orion.intel
  • Please provide the summary output of matrix.comp (matrixDiff.txt, matrixCompFull.txt and matrixCompSummary.txt):
    matrixDiff.txt
    matrixCompFull.txt
    matrixCompSummary.txt
    Please indicate the expected changes in the outputs (excluding the known list of non-identical tests). Just the typical non-b4b tests:
**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR2_UQ_MPI_d2                     (9 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c                     (8 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2                     (10 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2                     (8 files differ)
mww3_test_03/./work_PR1_MPI_d2                     (16 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c                     (9 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2                     (8 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e_c                     (1 files differ)
mww3_test_07/./work_PR3_UQ                     (3 files differ)
ww3_tp2.10/./work_MPI_OMPH                     (6 files differ)
ww3_tp2.14/./work_OASACM4                     (0 files differ)
ww3_tp2.14/./work_OASICM                     (0 files differ)
ww3_tp2.14/./work_OASACM6                     (0 files differ)
ww3_tp2.14/./work_OASOCM                     (0 files differ)
ww3_tp2.14/./work_OASACM2                     (0 files differ)
ww3_tp2.14/./work_OASACM5                     (0 files differ)
ww3_tp2.14/./work_OASACM                     (0 files differ)
ww3_tp2.16/./work_MPI_OMPH                     (4 files differ)
ww3_tp2.17/./work_ma1                     (0 files differ)
ww3_tp2.17/./work_mc1                     (1 files differ)
ww3_ufs1.1/./work_b                     (0 files differ)
ww3_ufs1.2/./work_b                     (0 files differ)
ww3_ufs1.3/./work_a                     (1 files differ)

  • Please list which labels code managers should add to indicate code changes:
    none

@JessicaMeixner-NOAA JessicaMeixner-NOAA added the enhancement New feature or request label Sep 8, 2021
@aliabdolali
Contributor

Tests went well on Orion, except for the ones affected by the NetCDF library issue on Orion when using partitions (see #451).

@JessicaMeixner-NOAA
Collaborator Author

If the ww3_tp2.16/./work_MPI_OMPH test is run on its own from a fresh clone, in both develop and this branch, the output is identical (the same holds when it is run twice from this branch). However, when multiple tests are run in this branch, the results are neither reproducible from run to run nor identical to the develop branch. Typically this type of error means w3_new has an error. I'm continuing to investigate, but it seems unrelated to this branch; for whatever reason we see it with srun (which we should be using) and not with mpirun.
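For anyone repeating the "run alone vs. run in sequence" check by hand, a bit-for-bit comparison of two work directories can be sketched as below. This is only an illustration in Python, not the matrix.comp script used by the regtests, which may skip or filter certain files (for example logs containing timestamps). The function names are made up for this sketch.

```python
import hashlib
from pathlib import Path
from typing import List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def differing_files(dir_a: str, dir_b: str) -> List[str]:
    """List relative paths that exist on only one side or differ in content."""
    a, b = Path(dir_a), Path(dir_b)
    files_a = {p.relative_to(a).as_posix() for p in a.rglob("*") if p.is_file()}
    files_b = {p.relative_to(b).as_posix() for p in b.rglob("*") if p.is_file()}
    diffs = list(files_a ^ files_b)  # present in only one directory
    for rel in files_a & files_b:
        if file_digest(a / rel) != file_digest(b / rel):
            diffs.append(rel)  # same name, different bytes
    return sorted(diffs)
```

Calling `differing_files(work_dir_from_develop, work_dir_from_branch)` with the two work_MPI_OMPH directories (paths are up to the user) returns an empty list only when every file is byte-identical.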

@JessicaMeixner-NOAA
Collaborator Author

Another thing I tried was using the tp2.10 OMPH switch file for the tp2.16 test, since one immediately follows the other; that did not help either. However, tp2.10 is a known non-b4b test (#321) and tp2.16 is very similar, just on a different grid (Arctic), so perhaps we were just lucky before to get reproducibility there?

@aliabdolali I've looked into this fairly extensively at this point. I'm not sure why srun hasn't been used all along. Since every other test is unchanged except this one, which is very similar to a known non-b4b test, I think our options at this point are:

  1. wait until the b4b issues are solved, which no one is currently working on and which would keep this PR on hold for a long time, or
  2. move forward with this PR as is, knowing it will add another test to the EMC non-b4b list, and add a note to issue #321 ("tp2.10 MPI_OMPH test not reproducing with different number of MPI/OMP threads") to include the tp2.16 OMPH test
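As general background on why a hybrid MPI/OpenMP test can fail to be b4b when the decomposition changes: floating-point addition is not associative, so splitting the same reduction into a different number of partial sums can change the last bits of the result. The Python sketch below shows only that general effect; whether it is the actual cause of the tp2.10 or tp2.16 differences in #321 has not been established in this thread. The helper names are illustrative.

```python
def naive_sum(values):
    """Left-to-right accumulation, like a simple serial reduction loop."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Sum each chunk separately, then add the partial sums in order.

    This mimics a reduction split across n_chunks workers.
    """
    size = -(-len(values) // n_chunks)  # ceiling division
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)


values = [1e16, 1.0, -1e16, 1.0]
print(chunked_sum(values, 1))  # 1.0
print(chunked_sum(values, 2))  # 0.0
```

The same four numbers give different answers depending only on how the work is grouped, which is why results can change with the number of MPI tasks or OMP threads even when the code is otherwise correct.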

@aliabdolali
Contributor

@JessicaMeixner-NOAA I think we should move forward; srun is more efficient, especially with OMP options. Given that, I will rerun the tests and merge afterward.
Thanks for all the detailed diagnoses.

@aliabdolali
Contributor

The tests ran successfully, with only the known non-b4b tests plus ww3_tp2.16 differing:


********************* non-identical cases ****************************


mww3_test_03/./work_PR2_UNO_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UQ_MPI_d2 (8 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2 (8 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c (9 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e_c (1 files differ)
mww3_test_03/./work_PR1_MPI_d2 (16 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c (8 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2 (8 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2 (8 files differ)
mww3_test_07/./work_PR3_UQ (3 files differ)
ww3_tp2.10/./work_MPI_OMPH (7 files differ)
ww3_tp2.16/./work_MPI_OMPH (4 files differ)
ww3_ufs1.3/./work_a (1 files differ)

Test ww3_tp2.16/./work_MPI_OMPH has been added to the list of non-identical cases, with a note in #321.
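Checking a pasted summary against the known non-b4b list can also be done mechanically. The Python sketch below assumes the line format seen in the excerpts in this thread (`<test>/./<work_dir> (N files differ)`); the full layout of matrixCompSummary.txt is not shown here, and the `known` set is an illustrative subset, not the official EMC list.

```python
import re

# One entry per line, e.g. "ww3_tp2.16/./work_MPI_OMPH (4 files differ)"
CASE_RE = re.compile(r"^(\S+)\s+\((\d+) files differ\)$")


def parse_non_identical(text):
    """Map each 'test/./work_dir' entry to its count of differing files."""
    cases = {}
    for line in text.splitlines():
        m = CASE_RE.match(line.strip())
        if m:  # banner and blank lines do not match and are skipped
            cases[m.group(1)] = int(m.group(2))
    return cases


def unexpected_cases(summary_text, known):
    """Return cases in the summary that are not on the known non-b4b list."""
    return sorted(set(parse_non_identical(summary_text)) - set(known))


summary = """
********************* non-identical cases ****************************
ww3_tp2.10/./work_MPI_OMPH (7 files differ)
ww3_tp2.16/./work_MPI_OMPH (4 files differ)
ww3_ufs1.3/./work_a (1 files differ)
"""
# Illustrative subset of known non-b4b cases (assumption, not the real list).
known = {"ww3_tp2.10/./work_MPI_OMPH", "ww3_ufs1.3/./work_a"}
print(unexpected_cases(summary, known))  # ['ww3_tp2.16/./work_MPI_OMPH']
```

Once ww3_tp2.16/./work_MPI_OMPH is added to `known`, the same call returns an empty list.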

@aliabdolali aliabdolali merged commit bc344b9 into NOAA-EMC:develop Sep 26, 2021
Development

Successfully merging this pull request may close these issues.

enable matrix_ncep on orion
3 participants