
Port of WW3 to S4 #424

Closed · DavidHuber-NOAA opened this issue Jun 25, 2021 · 11 comments · Fixed by #458

Labels
enhancement New feature or request

Comments

@DavidHuber-NOAA
Contributor

The WW3 model should be ported to the S4 cluster to support GFS and GDAS development efforts. A port of the production/GFS.v16 version is underway and will be tested soon.

@DavidHuber-NOAA
Contributor Author

I've completed an initial port of WW3 to S4. Almost all of the regression tests pass; the exceptions are a few tests that use parmetis, which suggests a problem with the build of that library. Since WW3 is built with -march=ivybridge, I suspect parmetis must be built with that flag as well. I will attempt that later today and rerun the regression tests.

Similarly, the GFS tests all pass, with the exception of control_atmwav, which crashes during the WW3 portion of that test. More details can be found in the GFS port issue (ufs-community/ufs-weather-model#738).

@aliabdolali
Contributor

aliabdolali commented Aug 26, 2021

@DavidHuber-NOAA Thanks for the update. The metis and parmetis for WW3 on our RDHPCs are compiled following
https://github.com/NOAA-EMC/WW3/wiki/FAQs-page#how-to-install-Metis-and-Parmetis
How did you compile them on S4?

@DavidHuber-NOAA
Contributor Author

@aliabdolali Thanks for the link. I had compiled both without declaring CFLAGS or specifying cc or cxx in the make command, so it looks like they were built with mpicc instead of mpiicc. I will give that a try.
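
For reference, a minimal sketch of a rebuild along the lines of the linked FAQ; the version numbers and install prefix are placeholders, and passing -march=ivybridge through a CFLAGS environment variable is an assumption that may need adjusting for a given metis/parmetis release:

# Placeholder install prefix; adjust for S4.
export PREFIX=/path/to/ww3-libs
# Assumption: forward the target architecture so it matches the WW3 build.
export CFLAGS="-march=ivybridge"

cd metis-5.1.0
make config cc=mpiicc prefix=$PREFIX
make install

cd ../parmetis-4.0.3
make config cc=mpiicc cxx=mpiicpc prefix=$PREFIX
make install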

DavidHuber-NOAA added a commit to DavidHuber-NOAA/WW3 that referenced this issue Sep 1, 2021
@DavidHuber-NOAA
Contributor Author

DavidHuber-NOAA commented Sep 1, 2021

After fixing the install of parmetis, those tests now pass.

However, I'm having issues with the OASIS tests and two of the ww3_multi tests.

I believe the issues with OASIS are partly related to #440, but on S4 we are also limited to running with srun. This causes an additional issue since the OASIS calls appear to be multiple-program-multiple-data (MPMD) invocations (i.e. run_test#L1690), and I was not sure how to mimic that with srun. I first thought this could be achieved just by calling srun twice, but the tests still did not pass that way. Correction: this is achieved by running srun --multi-prog, and these tests are now completing correctly.
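
For anyone hitting the same limitation, a minimal sketch of an MPMD launch with srun --multi-prog; the executable names, rank split, and configuration file name are hypothetical and not taken from the actual ww3_tp2.14 setup:

# mpmd.conf: maps MPI rank ranges to executables (hypothetical layout)
# ranks 0-11 run the wave model, ranks 12-23 run the coupled toy model
0-11   ./ww3_shel
12-23  ./toy_model

# launch both programs under one Slurm job step
srun -n 24 --multi-prog mpmd.conf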

The two ww3_multi tests that fail are called with
run_test -b slurm -c s4.intel -S -T -s MPI -s NO_PDLIB -w work_ma1 -m grdset_a1 -f -p srun -n 24 -o all ../model11 ww3_tp2.17
run_test -b slurm -c s4.intel -S -T -s MPI -s PDLIB -w work_mc1 -m grdset_c1 -f -p srun -n 24 -o all ../model11 ww3_tp2.17
Both fail when attempting to read the input file. The former attempts to read regtests/ww3_tp2.17/input/ww3_multi_grdset_a1.inp, but fails to read line 9 at this line of code. The first 10 lines of ww3_multi_grdset_a1.inp are

$
$ Input file to run with Inlet grid
$
1 0 F 1 T T
$
$'points'
$
$
 'inla'  'native' 'native' 'native' 'no' 'no' 'no' 'no'   1  1  0.00 1.00  F
$

The read fails when it attempts to read in a total of 10 strings, but there are only 7. I'm not sure where this input file is generated or copied from, nor do I know if this is an issue with the port or the input file. Could you advise if this input is correct or if there is a bug here?
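
One way to trace where that file comes from, sketched here under the assumption that the test inputs and driver scripts live under regtests/ (as the paths above suggest), is to search the test tree for the grdset_a1 name:

# Look for anything that generates, copies, or references the grdset_a1 input.
cd regtests
grep -rn "grdset_a1" ww3_tp2.17/ bin/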

@DavidHuber-NOAA
Contributor Author

@aliabdolali As a sanity check, I ran the test mentioned above (run_test -b slurm -c s4.intel -S -T -s MPI -s NO_PDLIB -w work_ma1 -m grdset_a1 -f -p srun -n 24 -o all ../model11 ww3_tp2.17) on Hera and I am seeing the same failure. The log file can be found here: /scratch1/NESDIS/nesdis-rdo2/David.Huber/gw_dev/sorc/fv3gfs.fd/WW3/regtests/bin/matrix11.out, with the errors occurring on lines 1078-1084.

Since this seems to be an issue with the test itself, I'd like to go ahead and push forward with a PR to port WW3 to S4 so that porting of the UFS and the global workflow can continue. I can then open a new issue to continue testing at a later time.

@aliabdolali
Contributor

Thanks @DavidHuber-NOAA. Let me check it; I will get back to you in 30 minutes.

@aliabdolali
Contributor

Hi @DavidHuber-NOAA
This is a known issue and the fix has been identified: #442.
Please go ahead with your PR.
What about the OASIS tests? Have you managed to get them working?

@DavidHuber-NOAA
Contributor Author

@aliabdolali All but one of the OASIS tests passed:
run_test -b slurm -c s4.intel -S -T -s OASACM3 -w work_OASACM3 -f -p srun -n 24/2 -o netcdf ../model11 ww3_tp2.14

When -n 24/2 is passed to srun, it returns an error: Invalid numeric value "24/2" for number of tasks. Is the intention of this test to run with 12 tasks?

@aliabdolali
Contributor

Yes, on some platforms it is accepted, while on others I have seen it rejected.
If you change it to 12, does it work?

@DavidHuber-NOAA
Contributor Author

@aliabdolali Yes, the test passes using 12. Interestingly, on Hera, the same line (with -n 24/2) runs 24 tasks. I could change the line in matrix.base to
$rtst -s OASACM3 -w work_OASACM3 -f -p $mpi -n $(( $np / 2 )) -o netcdf $ww3 ww3_tp2.14
which should come out as 12 (confirmed on Hera and S4).
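
A small sketch of the difference, using placeholder values; the point is that the arithmetic expansion hands srun a plain integer, whereas the literal string "24/2" is only handled by launchers that do their own parsing:

np=24
mpi=srun

# Literal string: srun receives "24/2" and rejects it as a task count.
echo "$mpi -n 24/2"            # -> srun -n 24/2

# Arithmetic expansion: the shell evaluates the division before srun runs.
echo "$mpi -n $(( np / 2 ))"   # -> srun -n 12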

DavidHuber-NOAA added a commit to DavidHuber-NOAA/WW3 that referenced this issue Sep 3, 2021
@aliabdolali
Contributor

@DavidHuber-NOAA I am going to open an issue, and we will fix matrix.base in a separate PR. Thanks for checking on Hera.
