Install and test unified environment on supported HPCs #478
@climbfuji I can install this in the role.epic space on Orion, Jet, and Cheyenne to start. Hera may have to wait for an EPIC-owned installation because our nems account is at capacity. Would you mind sharing the install recipe?
Thanks for volunteering. I think we need to agree on the directory structure and naming conventions first, then create an install recipe that we can more or less copy and paste or automate with Jenkins. I wonder if this can wait until Thursday when we have our spack-stack meeting. Also, we need to update all site configs to have the compilers configured correctly. That can be a separate PR that goes in first. For example, we have this for Orion (https://github.com/NOAA-EMC/spack-stack/blob/develop/configs/sites/orion/packages.yaml):

    packages:
      all:
        compiler:: [***@***.***, ***@***.***, ***@***.***]
        providers:
          mpi:: [***@***.***, ***@***.***, ***@***.***]

but what we want is

    packages:
      all:
        compiler:: [***@***.***, ***@***.***]
        #compiler:: [***@***.***]
        providers:
          mpi:: [***@***.***, ***@***.***]
          #mpi:: [***@***.***]

and then our instructions/automation needs to take care of swapping between Intel-latest+GNU and Intel-18 for the global workflow. Also, most sites do not have an Intel 18 configuration. We need to add this for sites where users run the global workflow. This is only a small number of sites; all others are ok with just Intel-whatever-is-there-already+GNU-whatever-is-there-already.
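The compiler and MPI specs above are masked in the source. For context, a hypothetical unmasked version of such a preference block might look like the following; all version numbers here are illustrative placeholders, not the actual Orion specs:

    packages:
      all:
        # Default stack: latest Intel plus GNU (versions are made up for illustration)
        compiler:: [intel@2022.0.2, gcc@10.2.0]
        # Swapped in for global workflow builds:
        #compiler:: [intel@18.0.5]
        providers:
          mpi:: [intel-oneapi-mpi@2021.5.1, openmpi@4.0.4]
          #mpi:: [intel-mpi@2018.0.4]

With both sets of lines present in the site config, swapping stacks amounts to toggling which lines are commented out.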
Thanks for this information. Totally happy to wait until Thursday's meeting to discuss things before beginning the installs. Since these site configs need to be updated and some sites need Intel 18, there is plenty of prep to do. Are we using spack to install ***@***.*** on sites where the GW will be run that do not yet have it?
No, the global workflow runs on a few HPCs that all have Intel 18.
10-4
Once the GSI is able to move off of Intel 18 and onto the same Intel version as the other GFS components, we shouldn't need Intel 18 anywhere anymore. Hoping this happens soon!
@KateFriedman-NOAA @climbfuji speaking of: "Dear RDHPCS users, we plan to deprecate the software modules intel/18.0.5.274 and impi/2018.0.4 from Hera. You are receiving this email because you have loaded the module from either your login profile or your batch jobs during the past year. Deprecating a software module means: [...] If you believe this module should remain supported (un-deprecated), please start a help ticket to request reversing this change within 5 work days. Otherwise, no response is needed. https://rdhpcs-common-docs.rdhpcs.noaa.gov/wiki/index.php/Help_Requests. Thank you very much! RDHPCS User Support Group"
A similar email went out for Intel 18 and wgrib2/2.0.8 on Jet...
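For anyone auditing their own usage of the modules named in such a notice, Lmod can show whether a deprecated module still resolves. A minimal check, assuming Lmod on Hera:

    # Does the deprecated module still exist on the system?
    module spider intel/18.0.5.274
    # Which modules are loaded in the current shell?
    module list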
I'm working on #333 on Hera (testing the unified environment with esmf@8.4.1 and mapl@2.35.2), and I've run into the following:
    [error output not preserved]
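The exact commands are not preserved in the thread; a generic sketch of this kind of test using plain spack commands, where the environment name unified-env is a placeholder:

    # Activate an existing environment and request the updated packages
    spack env activate unified-env
    spack add esmf@8.4.1 mapl@2.35.2
    # Re-resolve the full dependency graph; version conflicts surface here
    spack concretize --force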
@AlexanderRichert-NOAA To get around the network access problem you should be able to transfer Hera's [...]. Regarding nco: do you not have write access? I can check whether I can delete the cached nco files. hdf5+threadsafe: not sure it's a good idea to remove +threadsafe and hope for the best; someone must have put it in for a reason. But if you know for sure that cdo only ever gets used without OpenMP parallelism, then it may be OK. Let's make sure first that hdf5+threadsafe is really the problem.
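What exactly is being transferred is lost above; one plausible reading is a local source mirror, which spack supports directly. A sketch, with paths made up for illustration:

    # On a machine with network access, with the environment active:
    # download the source tarballs the environment needs into a mirror directory
    spack mirror create -d /tmp/spack-sources --all
    # After copying the directory to the target machine, register it there:
    spack mirror add local-sources file:///path/to/spack-sources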
@AlexanderRichert-NOAA I removed the link and the source file behind it.
Well, rats, cdo does use OpenMP... and yet in hpc-stack, hdf5 is built without thread safety. @KateFriedman-NOAA do you know whether cdo could be run without OpenMP for global workflow? If so, then I could probably ease the thread safety requirement for cdo by adding "+openmp" to the [...]
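Assuming the spack cdo package exposes an openmp variant, the trade-off could be checked before changing any requirements. A sketch:

    # List cdo's variants and their defaults
    spack info cdo
    # Trial concretization: does cdo without OpenMP accept a non-threadsafe hdf5?
    spack spec cdo~openmp ^hdf5~threadsafe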
I do not know, unfortunately. I don't know much about cdo.
Done, finally. See #503.
Is your feature request related to a problem? Please describe.
We need to install and test the unified environment on all supported HPCs. A good starting point is the list of preconfigured and configurable (generic) platforms in https://spack-stack.readthedocs.io/en/latest/Platforms.html.
Describe the solution you'd like
Left over from previous PR #454:
- ncl from global-workflow-env (also affects macos site config)

See epic #503 for a list of final installations and successful tests. Consider this issue completed when all the required boxes are ticked in the epic.
Preliminary testing done beforehand:
Additional context
n/a