Switch to building system libraries with Spack #353

Merged
27 commits merged into MPAS-Dev:master on Apr 14, 2022

Conversation

xylar
Collaborator

@xylar xylar commented Apr 7, 2022

This merge switches to using Spack to build system libraries. This change:

  • adds support for Albany wherever possible (see conda/albany_supported.txt for supported machines, compilers and MPI libraries)
  • harnesses maintenance from the community for updating package versions and recipes
  • provides a path to adding future dependencies built with system compilers (e.g. PETSc)

To support Albany, we have moved away from E3SM supported compilers and modules on Badger. In the future, it may make sense to update the E3SM configuration for Badger to match.

To do:

  • Remove grizzly and cori-knl from the list of machines
  • Update the documentation

@xylar xylar added the enhancement (New feature or request) and dependencies and deployment (Changes relate to creating conda and Spack environments, and creating a load script) labels Apr 7, 2022
@xylar xylar self-assigned this Apr 7, 2022
@pep8speaks

pep8speaks commented Apr 7, 2022

Hello @xylar! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 308:80: E501 line too long (86 > 79 characters)

Comment last updated at 2022-04-14 07:55:54 UTC

@xylar
Collaborator Author

xylar commented Apr 7, 2022

Documentation needs to be updated.

@xylar
Collaborator Author

xylar commented Apr 7, 2022

@matthewhoffman and @trhille, could you please confirm that you can run the full_integration test suite or any other tests you would like to run?

The process for getting a conda environment and load script on each machine is:

./conda/configure_compass_env.py --env_name compass_spack --conda ~/miniconda3/ --compiler gnu --mpi <mpi> --spack <test_spack_path> --with_albany

The <test_spack_path> values are:

  • Anvil: /lcrc/soft/climate/compass/anvil/test_spack
  • Badger: /usr/projects/climate/SHARED_CLIMATE/compass/badger/test_spack
  • Chrysalis: /lcrc/soft/climate/compass/chrysalis/test_spack
  • Cori-Haswell: /global/cfs/cdirs/e3sm/software/compass/cori-haswell/test_spack

See https://github.com/xylar/compass/blob/switch_to_spack/conda/albany_supported.txt#L3-L7 for supported compilers (always gnu) and MPI libraries.
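
For example, on Anvil with gnu and openmpi, the filled-in command (using the template and path above) would look like:

./conda/configure_compass_env.py --env_name compass_spack --conda ~/miniconda3/ --compiler gnu --mpi openmpi --spack /lcrc/soft/climate/compass/anvil/test_spack --with_albany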

@xylar
Collaborator Author

xylar commented Apr 7, 2022

@mark-petersen, could you please confirm that you can run the nightly test suite or any other tests you would like to run on a few machines and compilers?

The process for getting a conda environment and load script on each machine is:

./conda/configure_compass_env.py --env_name compass_spack --conda ~/miniconda3/ --compiler <compiler> --mpi <mpi> --spack <test_spack_path>

This is the same as the instructions for Matt and Trevor, but without the --with_albany flag.

The <test_spack_path> values are:

  • Anvil: /lcrc/soft/climate/compass/anvil/test_spack
  • Badger: /usr/projects/climate/SHARED_CLIMATE/compass/badger/test_spack
  • Chrysalis: /lcrc/soft/climate/compass/chrysalis/test_spack
  • Compy: /share/apps/E3SM/conda_envs/compass/test_spack
  • Cori-Haswell: /global/cfs/cdirs/e3sm/software/compass/cori-haswell/test_spack

Grizzly and Cori-KNL will no longer be supported after this update, but all compiler and MPI configurations for the other machines in the following table should work:
https://mpas-dev.github.io/compass/latest/developers_guide/machines/index.html#supported-machines
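
Similarly, a filled-in example (using the template and paths above) for Chrysalis with intel and openmpi would be:

./conda/configure_compass_env.py --env_name compass_spack --conda ~/miniconda3/ --compiler intel --mpi openmpi --spack /lcrc/soft/climate/compass/chrysalis/test_spack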

@xylar
Collaborator Author

xylar commented Apr 7, 2022

Testing

I have run the ocean nightly test suite on all the following machines and configurations:

  • Anvil intel impi
  • Anvil intel openmpi
  • Anvil intel mvapich
  • Anvil gnu openmpi
  • Anvil gnu mvapich
  • Badger intel impi
  • Badger gnu mvapich
  • Chrysalis intel impi
  • Chrysalis intel openmpi
  • Chrysalis gnu openmpi
  • Compy intel impi
  • Cori-Haswell intel mpt
  • Cori-Haswell gnu mpt
  • Linux gnu mpich
  • Linux gnu openmpi (a few tests fail due to too few cores)
  • OSX gnu/clang mpich (QU tests fail for unknown reasons unrelated to this PR)
  • OSX gnu/clang openmpi (most tests fail due to too few processors)

I have run the landice full_integration test suite on all the following machines and configurations:

  • Anvil gnu openmpi
  • Anvil gnu mvapich
  • Badger gnu mvapich
  • Chrysalis gnu openmpi
  • Cori-Haswell gnu mpt
    Not all tests pass, as I have mentioned to @trhille and @matthewhoffman, but we don't think these failures are related to Spack, just to the lack of adequate testing capability prior to this work.

@trhille
Collaborator

trhille commented Apr 7, 2022

On Badger, I was able to run the full_integration suite, with one test failing during validation, as we have been seeing (i.e., probably not a Spack issue):

Test Runtimes:
00:11 PASS landice_dome_2000m_sia_restart_test
00:05 PASS landice_dome_2000m_sia_decomposition_test
00:07 PASS landice_dome_variable_resolution_sia_restart_test
00:04 PASS landice_dome_variable_resolution_sia_decomposition_test
00:30 PASS landice_enthalpy_benchmark_A
00:12 PASS landice_eismint2_decomposition_test
00:12 PASS landice_eismint2_enthalpy_decomposition_test
00:12 PASS landice_eismint2_restart_test
00:12 PASS landice_eismint2_enthalpy_restart_test
00:17 PASS landice_greenland_sia_restart_test
00:11 PASS landice_greenland_sia_decomposition_test
00:09 PASS landice_hydro_radial_restart_test
00:06 PASS landice_hydro_radial_decomposition_test
00:16 PASS landice_dome_2000m_fo_decomposition_test
00:15 PASS landice_dome_2000m_fo_restart_test
00:10 PASS landice_dome_variable_resolution_fo_decomposition_test
00:12 PASS landice_dome_variable_resolution_fo_restart_test
00:16 FAIL landice_circular_shelf_decomposition_test
00:31 PASS landice_greenland_fo_decomposition_test
00:38 PASS landice_greenland_fo_restart_test
00:32 PASS landice_thwaites_decomposition_test
00:44 PASS landice_thwaites_restart_test
Total runtime 06:06
FAIL: 1 test failed, see above.

@trhille
Collaborator

trhille commented Apr 7, 2022

On Cori, I had to cherry-pick commit ef2a1d9319d8efbfc71e2d96c79b89f49fbda1f4 in order to build because MALI has not yet merged this commit.
The same test failed on validation as on Badger (landice_circular_shelf_decomposition_test), but landice_thwaites_restart_test also failed on validation.

Test Runtimes:
00:36 PASS landice_dome_2000m_sia_restart_test
00:42 PASS landice_dome_2000m_sia_decomposition_test
00:40 PASS landice_dome_variable_resolution_sia_restart_test
00:28 PASS landice_dome_variable_resolution_sia_decomposition_test
00:54 PASS landice_enthalpy_benchmark_A
00:32 PASS landice_eismint2_decomposition_test
00:35 PASS landice_eismint2_enthalpy_decomposition_test
00:35 PASS landice_eismint2_restart_test
00:27 PASS landice_eismint2_enthalpy_restart_test
00:16 PASS landice_greenland_sia_restart_test
00:15 PASS landice_greenland_sia_decomposition_test
00:19 PASS landice_hydro_radial_restart_test
00:16 PASS landice_hydro_radial_decomposition_test
00:23 PASS landice_dome_2000m_fo_decomposition_test
00:24 PASS landice_dome_2000m_fo_restart_test
00:21 PASS landice_dome_variable_resolution_fo_decomposition_test
00:21 PASS landice_dome_variable_resolution_fo_restart_test
00:34 FAIL landice_circular_shelf_decomposition_test
00:38 PASS landice_greenland_fo_decomposition_test
00:43 PASS landice_greenland_fo_restart_test
00:35 PASS landice_thwaites_decomposition_test
00:58 FAIL landice_thwaites_restart_test
Total runtime 11:49
FAIL: 2 tests failed, see above.

This is from the end of case_outputs/landice_thwaites_restart_test.log:

Running: srun -n 36 ./landice_model -n namelist.landice.rst -s streams.landice.rst
PIO: ERROR: Writing variable (PIO_GLOBAL, varid=-1) attribute (parent_id) to file (output.nc, ncid=21) failed. Internal I/O library (PIO_IOTYPE_PNETCDF) call failed. NetCDF: Operation not allowed in data mode (error num=-38), (/tmp/xylar/spack-stage/spack-stage-scorpio-1.2.2-m4bgexru5on3bwi4sfbccikfl7nwzh47/spack-src/src/clib/pio_getput_int.c:408)
PIO: ERROR: Changing the define mode for file (output.nc) failed. Low-level I/O library API failed. NetCDF: Operation not allowed in data mode (error num=-38), (/tmp/xylar/spack-stage/spack-stage-scorpio-1.2.2-m4bgexru5on3bwi4sfbccikfl7nwzh47/spack-src/src/clib/pioc_support.c:3377)


compass calling: compass.landice.tests.thwaites.restart_test.RestartTest.validate()
  in /global/cfs/cdirs/piscees/trhille/compass/compass/landice/tests/thwaites/restart_test/__init__.py

thickness            Time index: 0, 1, 2, 3, 4, 5
5:  l1: 1.70530256582424e-13  l2: 1.27105748646260e-13  linf: 1.13686837721616e-13
  FAIL /global/cscratch1/sd/trhille/spack_20220307/landice/thwaites/restart_test/full_run/output.nc
       /global/cscratch1/sd/trhille/spack_20220307/landice/thwaites/restart_test/restart_run/output.nc
surfaceSpeed         Time index: 0, 1, 2, 3, 4, 5
4:  l1: 2.03759069417942e-18  l2: 1.72837878003098e-19  linf: 4.06575814682064e-20
5:  l1: 1.77898756491574e-18  l2: 1.09073291152169e-19  linf: 2.03287907341032e-20
  FAIL /global/cscratch1/sd/trhille/spack_20220307/landice/thwaites/restart_test/full_run/output.nc
       /global/cscratch1/sd/trhille/spack_20220307/landice/thwaites/restart_test/restart_run/output.nc
Internal test case validation failed
NoneType: None

@xylar
Collaborator Author

xylar commented Apr 7, 2022

I also saw the failures in landice_thwaites_restart_test on Cori-Haswell. I suspect they are not related to Spack itself but rather to different compilers and/or MPI libraries, but it would be worth investigating (outside of this PR).

@xylar
Collaborator Author

xylar commented Apr 7, 2022

The updated documentation built successfully but it's undoubtedly full of typos...'cause I wrote it...

@mark-petersen
Collaborator

I used gnu on badger successfully, start to finish, with:

./conda/configure_compass_env.py --env_name compass_spack --conda ~/miniconda3/ --compiler gnu --spack /usr/projects/climate/SHARED_CLIMATE/compass/badger/test_spack --with_albany
source load_compass_spack_badger_gnu_mvapich.sh
compass suite -s -c ocean -t nightly -f Config_dir/master -w $n/ocean_model_220407_8d33109b_ba_gfortran_openmp_debug_master

and then compiled e3sm/master and ran the suite successfully. But the intel compiler on badger fails. Starting on a fresh node,

./conda/configure_compass_env.py --env_name compass_spack --conda ~/miniconda3/ --compiler intel --spack /usr/projects/climate/SHARED_CLIMATE/compass/badger/test_spack --with_albany  --mpi impi
...
    raise ValueError(f'{compiler} with {mpi} is not supported with Albany on '
ValueError: intel with impi is not supported with Albany on badger

You have badger, impi, intel listed above. Is that correct? Or am I using it incorrectly?

@xylar
Collaborator Author

xylar commented Apr 8, 2022

@mark-petersen, you had separate instructions not to include the --with_albany flag. Albany is not supported with Intel so far.
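
For reference, here is a minimal sketch of the kind of check that produces the error Mark hit. It assumes a simple comma-separated format for conda/albany_supported.txt; the real implementation and file format in the compass deployment scripts may differ.

def check_albany_supported(supported_file, machine, compiler, mpi):
    # Read (machine, compiler, mpi) triples from albany_supported.txt,
    # skipping comments and blank lines (comma-separated format assumed).
    supported = []
    with open(supported_file) as f:
        for line in f:
            line = line.split('#')[0].strip()
            if line:
                supported.append(tuple(part.strip() for part in line.split(',')))
    if (machine, compiler, mpi) not in supported:
        raise ValueError(f'{compiler} with {mpi} is not supported with Albany '
                         f'on {machine}')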

Collaborator

@mark-petersen mark-petersen left a comment


Yes, this works with intel on badger when I remove --with_albany. Thanks!

./conda/configure_compass_env.py --env_name compass_spack --conda ~/miniconda3/ --compiler intel --spack /usr/projects/climate/SHARED_CLIMATE/compass/badger/test_spack  --mpi impi

@xylar
Collaborator Author

xylar commented Apr 8, 2022

Thanks @mark-petersen! I appreciate the testing and review.

Collaborator

@trhille trhille left a comment


@xylar, I'm approving this based on testing on Badger and Cori that I reported on above. @matthewhoffman will test on Anvil; I'm not sure we have access to Chrysalis. I've looked through the 37 updated files, but most of the particulars there are beyond me.


If you are on a login node, the script should automatically recognize what
machine you are on. You can supply the machine name with ``-m <machine>`` if
you run into trouble with the automatic recognition (e.g. if you're setting
up the environment on a compute node, which is not recommended).

If you are working with MALI, you should specify ``--with_albany``. This will
ensure that the Albany and Trilinos libraries are included along among those
Collaborator


Extra word here? "along among"

@matthewhoffman
Member

@xylar , this is maybe a minor detail, but I noticed the --help results for configure_compass_env.py give a different name for mpich (mpich) than the name listed in https://github.com/xylar/compass/blob/switch_to_spack/conda/albany_supported.txt#L3-L7 (mvapich). Also, the mpt MPI library on Cori-Haswell is not listed. I realize it would be hard to maintain the list in the help output. In that case, maybe the list should be removed and users referred to the docs instead?

$ ./conda/configure_compass_env.py --help
usage: configure_compass_env.py [-h] [-m MACHINE] [--conda CONDA_BASE]
                                [--spack SPACK_BASE] [--env_name ENV_NAME]
                                [-p PYTHON] [-i MPI] [-c COMPILER]
                                [--env_only] [--recreate] [-f CONFIG_FILE]
                                [--check] [--use_local] [--update_spack]
                                [--tmpdir TMPDIR] [--with_albany]

Deploy a compass conda environment

optional arguments:
...
  -i MPI, --mpi MPI     The MPI library (nompi, mpich, openmpi or a system
                        flavor) to deploy
...

@matthewhoffman
Member

matthewhoffman commented Apr 12, 2022

@xylar , I tried this on Anvil (gnu, mvapich), and I was able to create the conda env and build MALI without any problems, which is amazing! When I try compass run on the test suite I set up, I get this error

$ compass run
landice/dome/2000m/sia_restart_test
Traceback (most recent call last):
  File "/home/ac.mhoffman/miniconda3/envs/compass_spack/bin/compass", line 33, in <module>
    sys.exit(load_entry_point('compass', 'console_scripts', 'compass')())
  File "/gpfs/fs1/home/ac.mhoffman/mpas/compass/compass/__main__.py", line 62, in main
    commands[args.command]()
  File "/gpfs/fs1/home/ac.mhoffman/mpas/compass/compass/run.py", line 336, in main
    run_suite(suite, quiet=args.quiet)
  File "/gpfs/fs1/home/ac.mhoffman/mpas/compass/compass/run.py", line 99, in run_suite
    set_cores_per_node(test_case.config)
  File "/gpfs/fs1/home/ac.mhoffman/mpas/compass/compass/parallel.py", line 87, in set_cores_per_node
    node = os.environ['SLURMD_NODENAME']
  File "/home/ac.mhoffman/miniconda3/envs/compass_spack/lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'SLURMD_NODENAME'

I'm guessing this is an Anvil env issue and not related to the spack functionality, but I'm not familiar enough with COMPASS configuration to know what to do. Is the answer obvious to you? (Note: I've never tried to run COMPASS or MALI on Anvil before.)
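
For context, SLURMD_NODENAME is only defined inside a Slurm allocation, which is why the lookup fails on a node obtained without one. A minimal sketch of a more defensive lookup (not the actual compass code, which reads the variable directly, as shown in the traceback):

import os

node = os.environ.get('SLURMD_NODENAME')
if node is None:
    # Outside a Slurm allocation (e.g. on a login node or after an
    # incomplete salloc), the variable is unset, so fail with a clearer message.
    raise RuntimeError('SLURMD_NODENAME is not set; run the suite inside '
                       'a Slurm allocation (e.g. one obtained with salloc).')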

@matthewhoffman
Member

@xylar , in addition to my partial success on Anvil, I tried this on Badger, and things worked great end-to-end! I get the same results as Trevor - everything passes except a validation failure on the landice/circular_shelf/decomposition_test test. The failure there is too big to be the typical decomposition error we sometimes see when we update Albany (i.e. needing to change the decomp test tolerance slightly). It looks like something more fundamental, but I suspect it is related to refactoring of Albany b.c. that Mauro did a while ago, which we haven't been able to test easily until now. In any case, we can ignore that one failure for the purposes of this PR.

Regarding the <test_spack_path> option, do you think we could have this be selected by the machine you are on if you don't provide that option? I think we'd still want the option to provide our own spack path if we are working with a development branch of Albany or on an unsupported machine, but for the standard supported machines, it would be a nice convenience to not need to provide that option.

Regarding additional testing, I'm happy with Trevor having successfully tried Badger and Cori and me successfully testing Badger and Anvil (once we resolve the Anvil issue I noted above). I've never run anything on Chrysalis. How similar is that to running on Anvil? (You can follow up directly with me about that.)

Note: A side effect of this spack config in COMPASS is that the total runtime for our full integration suite on Badger went down from 12:45 to 6:00. I think this has to do with us having had a wonky i/o layer in our old library stack (we used to get a bunch of i/o warnings in our output logs), and the clean, consistent library stack being generated by Spack fixes that. That is really nice!

@matthewhoffman
Member

@xylar , one other question - what was the reason for abandoning Cori-KNL? We've still been using it occasionally with MALI. I don't think it would be a major limitation to not have that ability here, but I'm wondering if there was a reason beyond it being slow and not preferred.

@xylar
Collaborator Author

xylar commented Apr 13, 2022

@matthewhoffman

I realize it would be hard to maintain the list in the help output. In that case, maybe the list should be removed and one should be referred to the docs?

I agree. I've changed the --mpi help to just point developers to the documentation. I was thinking of giving a URL, but even that is complicated and not very concise.
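
A hypothetical illustration of that kind of change (the description and help wording are assumptions, not the exact text in configure_compass_env.py):

import argparse

# Sketch only: points developers to the docs instead of listing MPI flavors.
parser = argparse.ArgumentParser(
    description='Deploy a compass conda environment')
parser.add_argument('-i', '--mpi', dest='mpi',
                    help='The MPI library to deploy; see the machines section '
                         'of the documentation for supported flavors.')
args = parser.parse_args()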

Get rid of ESMF details

Use path to `nc-config`, `nf-config` and `pnetcdf-config` to find
NetCDF and pNetCDF paths on Cori (just like on other machines).
@xylar
Collaborator Author

xylar commented Apr 13, 2022

in addition to my partial success on Anvil, I tried this on Badger, and things worked great end-to-end! I get the same results as Trevor - everything passes except a validation failure on the landice/circular_shelf/decomposition_test test. The failure there is too big to be the typical decomposition error we sometimes see when we update Albany (i.e. needing to change the decomp test tolerance slightly). It looks like something more fundamental, but I suspect it is related to refactoring of Albany b.c. that Mauro did a while ago, which we haven't been able to test easily until now. In any case, we can ignore that one failure for the purposes of this PR.

I also suspect that this is due to changes in Albany. You'll have to decide how you want to deal with that, but I'm happy to help provide "before" and "after" Spack environments for assessing changes in Albany if that's needed. It won't be trivial, so just keep that in mind.

Regarding the <test_spack_path> option, do you think we could have this be selected by the machine you are on if you don't provide that option? I think we'd still want the option to provide our own spack path if we are working with a development branch of Albany or on an unsupported machine, but for the standard supported machines, it would be a nice convenience to not need to provide that option.

The plan is that, before I merge this PR, I will build all 13 Spack environments in the standard locations. There are already standard locations defined for each machine, and supplying the --update_spack flag without the --spack flag will update the Spack environment in the standard location. After the first build, updating Spack will only rebuild packages if there are changes in the packages themselves (or if the version of mache has changed, in which case it will start over from scratch). Here is an example of where the default path is defined:
https://github.com/xylar/compass/blob/switch_to_spack/compass/machines/anvil.cfg#L33
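
As a sketch combining the flags discussed above, updating the Spack environment in the standard location (omitting --spack so the default path from the machine config is used) might look like:

./conda/configure_compass_env.py --env_name compass_spack --conda ~/miniconda3/ --compiler gnu --mpi openmpi --update_spack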

Regarding additional testing, I'm happy with Trevor having successfully tried Badger and Cori and me successfully testing Badger and Anvil (once we resolve the Anvil issue I noted above). I've never run anything on Chrysalis. How similar is that to running on Anvil? (You can follow up directly with me about that.)

E3SM has a preference for running on Anvil rather than Chrysalis, but Chrysalis is much faster. I think it's rather important to do MALI testing on Chrysalis because it is one of the primary machines for E3SM simulations. But that really only matters once we get Intel working for Albany, since the production simulations are always done with Intel.

what was the reason for abandoning Cori-KNL? We've been using sometimes still recently with MALI. I don't think it would be a major limitation to not have that ability here, but I'm wondering if there was a reason beyond it's slow and not preferred.

I struggled a lot trying to get Spack to work for KNL. It is very complicated because the packages need to be compiled on KNL compute nodes (i.e. interactive jobs) rather than on login nodes, and Python performance on KNL is also abysmal. I gave up after more than a month of trying to make it happen, and I would discourage anyone else from wasting time on this for a machine (and an architecture) that has limited remaining lifespan.

@xylar
Collaborator Author

xylar commented Apr 13, 2022

@matthewhoffman, could you re-review based on my responses as soon as you are able? That way, I can start building the 13 Spack environments and then merge.

Member

@matthewhoffman matthewhoffman left a comment


@xylar , thanks for your explanations to the questions I left yesterday. I was able to confirm I can run successfully on Anvil after correcting my salloc command.

I also tried running the MALI full_integration suite on Chrysalis. I was able to set up the compass conda env using Spack with Albany without a problem (gnu/openmpi), and I was able to use compass suite to set up the test suite. However, when I ran the tests, I got this MALI runtime error:

/lcrc/group/e3sm/ac.mhoffman/COMPASS_TESTS/BASELINE_chrys/landice/dome/2000m/sia_restart_test/full_run/./landice_model: error while loading shared libraries: libhdf5.so.103: cannot open shared object file: No such file or directory

It's possible there is some cross-contamination between my Anvil and Chrysalis setups, but I tried to be careful to avoid that. In any case, I'm approving this PR in case you want to proceed with what is known to be working. I don't anticipate us using Chrysalis for MALI any time soon, so I'm happy to follow up on sorting that out in a later PR if you prefer. Or we can try to figure it out now - whatever is best for your workload.
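
A general way to diagnose this kind of runtime linking failure (a suggested check, not something reported in this thread) is to list which shared libraries the executable cannot resolve, for example using the path from the error message above:

ldd /lcrc/group/e3sm/ac.mhoffman/COMPASS_TESTS/BASELINE_chrys/landice/dome/2000m/sia_restart_test/full_run/landice_model | grep 'not found'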

Review threads on compass/machines/default.cfg and docs/developers_guide/quick_start.rst were resolved.
xylar added 18 commits April 14, 2022 08:36
This allows us to use modules other than the E3SM defaults where
needed.

This merge also adds such a custom template for building with
gnu and mvapich on Badger.
If the spack environment includes Albany, the libraries needed to
link in Albany will be added to the environment variable
`MPAS_EXTERNAL_LIBS`.
Spack is running out of space on `/tmp` on some machines (e.g.
Compy) and needs a different temporary directory.
We want to exclude it by default because it adds a lot of build
overhead.
This gives us a lot more freedom to explore compiler and MPI
library combinations.
It's not working still...
A file lists the machines, compilers and MPI libraries that work
with Albany.  When a developer creates a compass environment with
a given set and using the `--with_albany` flag, an error will be
raised if the configuration is not supported.
This is currently only needed on Anvil and Chrysalis with Albany
and OpenMPI.
@xylar
Collaborator Author

xylar commented Apr 14, 2022

@matthewhoffman, regarding your difficulty on Chrysalis, we should try to track this down. I was able to run without a problem, but it isn't great if I'm the only one who can. But we can leave that for the future.

@xylar
Collaborator Author

xylar commented Apr 14, 2022

Deployment

I have deployed Spack environments for the following machines and configurations:

  • Anvil intel impi
  • Anvil intel openmpi
  • Anvil intel mvapich
  • Anvil gnu openmpi
  • Anvil gnu mvapich
  • Badger intel impi
  • Badger gnu mvapich
  • Chrysalis intel impi
  • Chrysalis intel openmpi
  • Chrysalis gnu openmpi
  • Compy intel impi
  • Cori-Haswell intel mpt
  • Cori-Haswell gnu mpt

@xylar xylar merged commit 6b95495 into MPAS-Dev:master Apr 14, 2022
@xylar xylar deleted the switch_to_spack branch April 14, 2022 14:26