
Add capability to run WW3 on unstructured mesh with domain decomposition #1556

Closed · DeniseWorthen opened this issue Jan 4, 2023 · 9 comments
Labels: enhancement (New feature or request)

@DeniseWorthen (Collaborator)

Description

WW3 currently runs on a structured mesh for UWM. The capability to run on an unstructured mesh using domain decomposition should be added.

Solution

The mesh cap in WW3 needs the ability to run on an unstructured mesh, either with card-deck or domain decomposition.

Alternatives

Related to

@DeniseWorthen (Collaborator Author) commented Mar 7, 2023

An issue has been found relating to the domain decomposition, the restart-write frequency, and the failure to reproduce when changing the decomposition for WW3.

The test case is the ATM-WAV app, coupled through CMEPS. This app resides within my fork and uses as a test case a global (77S-85N) unstructured mesh with the MOM6 1-deg tripole land mask (no land points exist in the mesh). The time-stepping mode for WW3 is explicit.

In ww3_shel.nml, the following values are set:

&output_date_nml
  date%field%outffile  = '1'
  date%field%stride    = '720'
  date%restart2%stride = '3600'

When run on 5 PEs for WW3, the following decomposition is used:

[figure: emesh decomp]

Using the mediator history file, the export of z0 from the wave model is shown at the time step prior to the restart write, at the restart write, and following the restart write. Note that the value of z0 has been multiplied by 1.0e3.

[figure: wavImp_Sw_z0]

The export of z0 is clearly corrupted on the last decomposition PE. This occurs each time the restart file is written. If N PEs are used for WW3, then every Nth value of UST has the value of 1.0e-4.

Using simple print statements, this was found to be related to UST and to occur just after the MPI_STARTALL line in w3wavemd.F90.

#ifdef W3_MPI
        IF ( FLOUT(8) .AND. NRQRS.NE.0 ) THEN
          IF ( DSEC21(TIME,TONEXT(:,8)).EQ.0. ) THEN
            CALL MPI_STARTALL ( NRQRS, IRQRS , IERR_MPI )
            FLGMPI(8) = .TRUE.
            NRQMAX    = MAX ( NRQMAX , NRQRS )
#endif
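
For readers unfamiliar with persistent MPI requests, below is a minimal, self-contained sketch (not WW3 code; the program name, rank roles, tag, and buffer size are all illustrative) of what MPI_STARTALL does: once a persistent receive created with MPI_RECV_INIT is started, the next matching message overwrites the receive buffer, which is consistent with the corruption appearing immediately after this call.

program persistent_recv_sketch
  ! Minimal sketch of persistent MPI requests; run with 2 ranks.
  use mpi
  implicit none
  integer, parameter :: n = 4
  real    :: buf(n)
  integer :: rank, ierr, req(1)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  if (rank == 0) then
    buf = 1.0                                        ! "good" local values
    call MPI_RECV_INIT(buf, n, MPI_REAL, 1, 99, MPI_COMM_WORLD, req(1), ierr)
    call MPI_STARTALL(1, req, ierr)                  ! activate the persistent receive
    call MPI_WAITALL(1, req, MPI_STATUSES_IGNORE, ierr)
    print *, 'rank 0 buffer has been overwritten:', buf
    call MPI_REQUEST_FREE(req(1), ierr)
  else if (rank == 1) then
    buf = 1.0e-4                                     ! illustrative value, echoing the 1.0e-4 noted above
    call MPI_SEND(buf, n, MPI_REAL, 0, 99, MPI_COMM_WORLD, ierr)
  end if
  call MPI_FINALIZE(ierr)
end program persistent_recv_sketch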

Commenting out the call to MPI_STARTALL removed the corruption. This is a repeat of the previous figure with the line commented out:

[figure: fix wavImp_Sw_z0]

With this line commented out, the ATM-WAV case, run out for 24 hours on either 20 or 30 PEs, reproduces. The 30-PE case also restart-reproduces.

In the ATM-WAV case, the failure to reproduce when changing the decomposition is due to the corrupted z0 field, which depends explicitly on the location of the final decomposition element and which occurs when WW3 writes a restart file. This is confirmed by setting date%restart2%stride to a value outside of the total run length, which gives reproducibility across tasks with no code modification.
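
For reference, this is a sketch of the setting used for that check (the stride value is illustrative; anything longer than the 24-hour run works):

&output_date_nml
  date%field%outffile  = '1'
  date%field%stride    = '720'
  date%restart2%stride = '108000'   ! illustrative: longer than the 24-h run, so no restart2 write occurs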

Finally, the WW3 history files that are written out do not show any corruption of the field ust. I believe this is because the z0 field is put into the export state after WW3 has completed its model advance, including any model output and restart writing.

@DeniseWorthen (Collaborator Author)

I've tested the FLOUT(4) restart and it also produces the same last-DE field corruption.

@DeniseWorthen (Collaborator Author)

@aliabdolali I have a fix for this issue which appears to work for both iostyp=0 and iostyp=1 and gives both restart reproducibility and reproducibility when changing the decomposition for WW3.

The fix is in w3initmd, where I exclude the LPDLIB case from setting up the sends and receives:

index 40b7efbe..aa6a8f98 100644
--- a/model/src/w3initmd.F90
+++ b/model/src/w3initmd.F90
@@ -4729,7 +4729,7 @@ CONTAINS
     IH     = 0
     IROOT  = NAPRST - 1
     !
-    IF ( FLOUT(4) .OR. FLOUT(8) ) THEN
+    IF ((FLOUT(4) .OR. FLOUT(8)) .and. (LPDLIB .eqv. .FALSE.)) THEN

This means that nrqrs = 0, so the MPI_STARTALL block shown above in w3wavemd.F90 is skipped.
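
As a readability note only (a sketch, not part of the actual change), the modified guard is logically equivalent to using .NOT., which some may find more idiomatic:

    IF ( ( FLOUT(4) .OR. FLOUT(8) ) .AND. .NOT. LPDLIB ) THEN
      ! persistent sends/receives for restart gathering are set up here;
      ! with LPDLIB active the block is skipped and nrqrs stays 0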

@DeniseWorthen (Collaborator Author)

@JessicaMeixner-NOAA I've run my test cases with and without the WISE PR merge. The results do not reproduce (I did not expect them to), but I still get reproducibility, restart reproducibility, and debug tests to pass against a "wise" baseline. I also still get decomposition reproducibility for the ATM-WAV case. I didn't see any anomalous behaviour.

For reference, this is the difference in exported Z0 after 12 hours, comparing before and after the WISE merge. I see nothing that appears systematic. I think we're fine proceeding with a PR for Issue #1668.

[figure: wavImp_Sw_z0_2021-03-22-64800]

@JessicaMeixner-NOAA (Collaborator)

@DeniseWorthen thank you for letting me know and doing this extra testing! I will move forward with including the WISE PR in the sync-merge PR.

@aliabdolali (Collaborator)

@DeniseWorthen thanks for the info, I am totally on board with your thoughts.

@JessicaMeixner-NOAA (Collaborator)

In terms of SCOTCH being installed: from my basic searches on Hera this morning, we do not have SCOTCH installed in official areas. (@MatthewMasarik-NOAA and I have installs of SCOTCH on Hera/Orion with Intel that we are using for now.) It's very likely that we'll need to update the build instructions and/or version to get past the SCOTCH scaling issue; debugging on that is actively ongoing. Do we want to wait for this? (I'm guessing no.) In that case, in terms of getting SCOTCH installed for this work in the meantime, how does that line up with the timing of the open draft PR to move to spack-stack? I.e., do we need to ask for SCOTCH on both hpc-stack and spack-stack?

@DeniseWorthen (Collaborator Author)

I'm fine committing this using METIS for now, and once SCOTCH is working correctly and installed on the platforms, it can be switched to SCOTCH.

Is METIS installed everywhere except WCOSS2? Or just on Cheyenne, Hera and Orion?

@JessicaMeixner-NOAA (Collaborator)

ParMETIS is not installed anywhere in any "official" location to my knowledge either (and is definitely not on WCOSS2). I can provide unofficial locations, but at that point we might as well just use SCOTCH?
