
Post fails for low-resolution experiments #1060

Closed
WalterKolczynski-NOAA opened this issue Oct 11, 2022 · 7 comments · Fixed by #1061 or #1112
Assignees
Labels
bug Something isn't working

Comments

@WalterKolczynski-NOAA
Contributor

Expected behavior
Post should run for any resolution.

Current behavior
When running C96, post fails with a "too many MPI tasks, max is 96 stopping" message. Presumably UPP limits the number of ranks to the resolution (a holdover from the spectral model?).

Machines affected
Discovered on Orion, but presumably on every machine.

To Reproduce

  1. Set up any experiment using C96 resolution
  2. Confirm the gdaspost and/or gfspost tasks fail in the first full cycle
  3. Check the outpost_gfs_${CDATE}_postcntrl_gfs_anl.xml file in ${DATA} to see the error message
@WalterKolczynski-NOAA WalterKolczynski-NOAA added the bug Something isn't working label Oct 11, 2022
@WalterKolczynski-NOAA WalterKolczynski-NOAA self-assigned this Oct 11, 2022
WalterKolczynski-NOAA added a commit to WalterKolczynski-NOAA/global-workflow that referenced this issue Oct 11, 2022
Limits the number of MPI tasks for post to the resolution of the
forecast. UPP seems to fail if it is given more ranks than the
resolution.

Fixes NOAA-EMC#1060
WalterKolczynski-NOAA added a commit that referenced this issue Oct 11, 2022
Limits the number of MPI tasks for post to the resolution of the
forecast. UPP seems to fail if it is given more ranks than the
resolution.

Fixes #1060
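As a minimal sketch of the idea in this commit (not the actual diff in #1061), clamping the post task count to the forecast resolution in config.resources could look something like the following; the res variable and the exact placement are assumptions:

```bash
# Sketch only: clamp npe_post to the forecast resolution (idea taken from the
# commit message above; "res" and its placement in config.resources are assumptions).
res=$(echo "${CASE}" | cut -c2-)      # e.g. CASE=C96 -> res=96
if (( npe_post > res )); then
   export npe_post=${res}             # UPP seems to fail when given more ranks than the resolution
fi
```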
@CoryMartin-NOAA
Contributor

I am not convinced this is fixed.
A C96 experiment in rocoto was given 3 nodes with 40 processors each on Orion, i.e. 120 PEs; since 96 < 120, it failed. My guess is that config.resources did something like:
126 -> 96 (for the resolution), but then 96/40 rounded up to 3 nodes (because of 40 PEs/node on Orion)

@WalterKolczynski-NOAA would it be possible to confirm this is working as you intended with an XML-generation test alone? Otherwise, I think the post scripts have to be modified to call srun with only the right number of PEs.

@WalterKolczynski-NOAA
Contributor Author

Yeah, I've seen it too. I have a fix; I just need to submit it. In env/ORION.env, change APRUN_NP in the post section to ${launcher} -n ${npe_post}
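For reference, that change would leave the post launcher line in env/ORION.env looking roughly like this; only the APRUN_NP assignment comes from the comment above, the other lines are illustrative:

```bash
# env/ORION.env, post section (sketch): launch post with exactly npe_post ranks
export launcher="srun -l --export=ALL"        # assumed Orion launcher definition
export npe_post=${npe_post:-96}               # illustrative value for a C96 case
export APRUN_NP="${launcher} -n ${npe_post}"  # the change described in the comment above
```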

@CoryMartin-NOAA
Contributor

@WalterKolczynski-NOAA thanks, just wanted to make sure I wasn't going crazy/had a bad setup.

@WalterKolczynski-NOAA
Contributor Author

I think the fix did work, but then subsequent updates forced a different solution.

@KateFriedman-NOAA KateFriedman-NOAA self-assigned this Nov 1, 2022
@KateFriedman-NOAA
Member

Taking this on.

KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this issue Nov 7, 2022
- Create mpmd_opt variable and set it to "--multi-prog".
- Replace instances of "--multi-prog" in launcher commands
with new mpmd_opt variable.

Refs NOAA-EMC#1060
KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this issue Nov 7, 2022
- Create new mpmd_opt variable (="--multi-prog") and replace
instances of "--multi-prog" in launcher commands with mpmd_opt variable.
- Update launcher commands that were missing the "-n $npe" flag
to now include the task-count flag.

Refs NOAA-EMC#1060
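A rough sketch of what these two changes amount to in an env file; APRUN_EXAMPLE and npe_task are placeholder names for illustration, not variables from the repository:

```bash
# Sketch of the mpmd_opt refactor (placeholder names, not the actual env file contents)
export mpmd_opt="--multi-prog"                # new variable replacing hard-coded "--multi-prog"

# Before: option hard-coded, and some commands lacked an explicit task count
#   export APRUN_EXAMPLE="${launcher} --multi-prog"
# After: explicit task count plus the shared mpmd_opt variable
export APRUN_EXAMPLE="${launcher} -n ${npe_task} ${mpmd_opt}"
```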
KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this issue Nov 7, 2022
- Increase tasks from 20 to 40 for the lowest resolutions for
the eobs jobs (C96 and C48).
- The C96 eobs job was hitting the walltime with only 20 tasks.

Refs NOAA-EMC#1060
KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this issue Nov 7, 2022
- Some launcher commands were missing the "-n $npe" flag.
- Add the "-n" flag and the task-count variable to launcher commands where missing.

Refs NOAA-EMC#1060
KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this issue Nov 14, 2022
KateFriedman-NOAA added a commit that referenced this issue Nov 15, 2022
* Update multi-prog in HERA.env and ORION.env
* Update launcher commands in HERA.env and ORION.env
* Adjust C96 & C48 eobs resources in config.resources

Refs #1060
@KateFriedman-NOAA
Member

@CoryMartin-NOAA This should now be fixed. Let me know if you encounter further issues with the low-res post jobs.

@CoryMartin-NOAA
Contributor

Thank you @KateFriedman-NOAA, will do!!

KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this issue Nov 17, 2022
- When the mpmd variable in the R&D env files was renamed to
mpmd_opt, the wave_mpmd setting in JGLOBAL_WAVE_INIT was not
updated to match, which broke the job when tested.
- Update the wave_mpmd setting in JGLOBAL_WAVE_INIT to use the
mpmd_opt variable defined in the env files instead of the old
mpmd variable.

Refs NOAA-EMC#1060
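A minimal sketch of the fix described here, assuming JGLOBAL_WAVE_INIT only needs to point wave_mpmd at the renamed variable:

```bash
# JGLOBAL_WAVE_INIT (sketch): pick up the renamed variable from the machine env file
# Before the rename this read something like: wave_mpmd=${mpmd:-}
export wave_mpmd=${mpmd_opt:-}                # assumes mpmd_opt is exported by the env file
```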
KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this issue Nov 17, 2022
- Make matching changes to Jet and S4 env files to set mpmd_opt
and use it in launcher commands in place of prior mpmd variable.

Refs NOAA-EMC#1060
KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this issue Nov 17, 2022