
Create template for unified environment, test install on Orion and AWS parallelcluster #454

Merged: 42 commits merged into JCSDA:develop on Feb 18, 2023

Conversation

@climbfuji (Collaborator) commented Jan 25, 2023

Description

This PR adds a new unified environment that can be built for multiple compilers at once on any given site. It also demonstrates this for Orion and AWS parallelcluster, where multiple compiler+MPI combinations are listed in the site-specific packages.yaml file. Similar changes will need to be made for other site configs in follow-up pull requests.
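For context, a minimal sketch of what such a site-specific packages.yaml might look like; the compiler and MPI versions below are illustrative placeholders, not the exact Orion values:

```yaml
# Hypothetical excerpt from a site-specific packages.yaml.
# Versions are placeholders, not the exact site configuration.
packages:
  all:
    # Compilers available at this site, in order of preference.
    compiler: [intel@2022.0.2, gcc@10.2.0, intel@18.0.5]
    providers:
      # MPI libraries available at this site, in order of preference.
      mpi: [intel-oneapi-mpi@2021.5.0, openmpi@4.0.4, intel-mpi@2018.4.274]
```

The unified environment's spack.yaml template can then expand its package list across these compiler+MPI combinations, so one environment serves all of them.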

The associated spack PRs JCSDA/spack#216 and JCSDA/spack#221 contain a number of necessary updates to support the unified environment (the latter adding Intel 18 support).

Potential caveat: as described in #455, spack stack setup-meta-modules works for multiple compiler+MPI combinations with lmod modules, but possibly not with tcl modules. However, this shouldn't be a problem: tcl modules are becoming less and less common, and the sites that currently use them typically use only one compiler.
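To illustrate why lmod handles this better: spack generates lmod modules hierarchically, so each compiler+MPI combination gets its own branch of the module tree, while tcl modules are flat. A hypothetical modules.yaml excerpt (the core compiler version is a placeholder for the site's OS compiler):

```yaml
# Hypothetical modules.yaml excerpt; values are placeholders.
modules:
  default:
    lmod:
      # Modules built with the core (OS) compiler sit at the top of
      # the hierarchy and are always visible.
      core_compilers:
      - gcc@4.8.5
      # All other modules are grouped per compiler and, one level
      # down, per MPI library, which is exactly the structure a
      # multi-compiler unified environment needs.
      hierarchy:
      - mpi
```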

Todo:

  • Update Orion site config to include all compiler+MPI combinations in the site-specific packages.yaml file and install in a test location (Intel latest, GNU, Intel 18):
    • Intel latest, GNU: /work2/noaa/da/dheinzel-new/spack-stack-unified-env-io-updates/envs/unified-dev-test4/install
    • Intel 18: /work2/noaa/da/dheinzel-new/spack-stack-unified-env-io-updates/envs/unified-dev-test4-intel-18/install
  • Update AWS pcluster site config to include all compiler+MPI combinations in the site-specific packages.yaml file and install in a test location (Intel 2021.4.0 for now): /mnt/experiments-efs/the-real-dom/r2d2-myql/spack-stack-r2d2-mysql/envs/unified-dev/install
  • Build locally on Dom's macOS with apple-clang + gfortran
  • Test the unified environment on Orion with ufs-weather-model, ufs-srw-app, global-workflow and jedi-bundle/skylab
    • Intel latest
      • ufs-weather-model
      • ufs-srw-app
      • global-workflow
      • skylab
    • GNU
      • ufs-weather-model
      • ufs-srw-app
      • global-workflow
      • skylab
    • Intel 18
      • global-workflow
  • Test the unified environment on AWS parallelcluster with ufs-weather-model, ufs-srw-app and jedi-bundle/skylab
    • Intel 2021.4.0
      • ufs-weather-model
      • ufs-srw-app
      • skylab
  • Don't update the CI tests to the new environment yet, but make sure they still work
    • They don't. The CI tests are not functional, and this has nothing to do with this PR: I ran the CI tests for another PR (Add Narwhal GNU site config, add self-hosted CI runner on macOS #476), which has no changes whatsoever that could affect the CI tests, and they still fail. Time to switch over to self-hosted runners that we control!
    • macOS apple-clang
    • Linux Intel
    • Linux gcc9/mvapich2
    • Linux gcc10/mpich
  • Ensure the self-hosted runner test on Dom's Mac laptop works (with the unified environment).
  • Update documentation!

Issues

Fixes #448
Fixes #471

Dependencies

Testing

See above

@srherbener (Collaborator) left a comment


I tested this on my M1 MacBook, and after making the change in my comment below, spack-stack built successfully. I then built jedi-bundle, which worked, but saw 30 ctest failures. However, these failures also occur with the current develop and jcsda_emc_spack_stack branches (in spack-stack and the spack fork, respectively). I don't think they are due to the changes in these PRs, but rather to known faulty handling of signals and exceptions on the Mac.

Review comments (resolved): configs/templates/unified-dev/spack.yaml; doc/source/Platforms.rst (outdated); doc/source/Quickstart.rst (outdated, two threads)

@climbfuji added the INFRA (JEDI Infrastructure) label on Jan 30, 2023
@KateFriedman-NOAA commented

@climbfuji FYI, I have begun testing global-workflow on Orion using the spack-stack install. I will update issue #471 with a more complete list of modules needed in the global-workflow-env/unified-dev module.

I also need modules that are missing from spack-stack (e.g. pio). Should I open a separate issue for those? If so, what issue type? I am still compiling a list of missing modules as I test.

@climbfuji (Collaborator, Author) commented Feb 17, 2023 via email

@srherbener (Collaborator) left a comment


I tested on my M1 (arm64) Mac using jedi-bundle. Both spack-stack and jedi-bundle built successfully. I'm seeing the saber test issue with netcdf-c 4.9.0, but that is a separate issue from this PR, so I think this is good to go (from my point of view). Thanks!

@climbfuji (Collaborator, Author) commented

Thanks, @srherbener and @ulmononian . I am waiting for the self-hosted runner test to complete on my macOS. If it does, then I'll merge the spack PR, update the submodule pointer here, and merge this PR.

@climbfuji (Collaborator, Author) commented

Tests on my macOS passed :-) Merging spack PR now.

@climbfuji merged commit b6298e9 into JCSDA:develop on Feb 18, 2023
@climbfuji deleted the feature/unified-env branch on Feb 18, 2023
@KateFriedman-NOAA commented

> PIO exists: parallelio

Ah, thanks! I'll use that.
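For reference, making a package like parallelio available in the unified environment comes down to adding a spec to the environment's spack.yaml; a hypothetical entry (the version is a placeholder):

```yaml
spack:
  specs:
  - parallelio@2.5.9
```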
