Addition of dual Serial/MPI builds #649

Open
1 of 4 tasks
lexming opened this issue Sep 15, 2020 · 3 comments

@lexming
Contributor

lexming commented Sep 15, 2020

Following the conf call on dual serial/MPI builds, we reached some milestones and defined a path forward. The main conclusions are:

  1. It is desirable that the serial and MPI modules of a given package can be loaded at the same time
  2. Packages will have a suffix for serial builds and another suffix for MPI builds (tentative names X.serial and X.MPI)
  3. Both the serial and MPI modules will be visible by default
  4. The toolchain of the serial builds will be decided on a per package basis

Any further actions taken will be tracked in the projects page: https://github.com/orgs/easybuilders/projects/8

First wave of libraries to be dualized:

Anybody interested in contributing to this project can do so by:

  • Making PRs and testing the serial and MPI builds of the aforementioned packages
  • Searching, reporting or PRing easyconfigs that could benefit from switching to serial dependencies. Please follow these requirements:
    • Only easyconfigs in the EasyBuild tree
    • Only easyconfigs in the latest toolchains (foss/2020a or intel/2020a and compatible toolchains)
    • PRs of easyconfigs for this project should target the branch dual_serial_mpi
@lexming lexming added this to the next release (4.3.1) milestone Sep 15, 2020
@Micket
Contributor

Micket commented Sep 15, 2020

So, an important point to resolve before this can proceed is how compatible HDF5.Serial and HDF5(.MPI) are. One shouldn't be allowed to break everything by simply loading 2 modules normally, which we can boil down to, e.g.

module load TensorFlow    # will load HDF5 (MPI) as a dependency (via h5py)
module load SRA-Toolkit   # will load HDF5.Serial as a dependency

mpirun python my_machine_learning_project.py

can't be allowed to result in anything broken, and we can't assume use of RPATH. If this leads to anything broken (and I strongly suspect it does, because... how could it not?) then we must use a versionsuffix (e.g. HDF5-1.12.0-gompi-2020a-serial.eb) instead to ensure a package conflict, rather than pretending that the two are compatible and quietly breaking things.
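To make the concern concrete, here is a minimal Python sketch of a first-match lookup along LD_LIBRARY_PATH. It is a toy model, not the real dynamic linker (which also considers sonames, RPATH/RUNPATH and the ld.so cache). The directory names, the shared file name `libhdf5.so`, and the assumption that each module prepends its lib directory on load are illustrative and not taken from the actual easyconfigs.

```python
import os
import tempfile


def resolve_library(lib_name, ld_library_path):
    """Return the first directory on ld_library_path containing lib_name, or None."""
    for directory in ld_library_path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, lib_name)):
            return directory
    return None


def demo():
    """Mimic 'module load TensorFlow' followed by 'module load SRA-Toolkit'."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical install prefixes for the two HDF5 variants.
        mpi_dir = os.path.join(root, "HDF5-MPI", "lib")
        serial_dir = os.path.join(root, "HDF5-serial", "lib")
        for lib_dir in (mpi_dir, serial_dir):
            os.makedirs(lib_dir)
            # Assumption: both variants ship a library with the same file name.
            open(os.path.join(lib_dir, "libhdf5.so"), "w").close()

        # First module load prepends the MPI variant ...
        path = mpi_dir
        # ... the second one prepends the serial variant in front of it.
        path = serial_dir + os.pathsep + path

        winner = resolve_library("libhdf5.so", path)
        return os.path.basename(os.path.dirname(winner))


print(demo())
```

In this model the variant loaded last shadows the other one, so an MPI-enabled consumer would pick up the serial library unless something (RPATH, a module conflict, or ordering rules) prevents it.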

@lexming
Contributor Author

lexming commented Sep 15, 2020

@Micket I totally agree, that is why we will work in the dual_serial_mpi branch for now, to test such cases and see where things break. If we hit any insurmountable issues, we can revise our strategy.

@smoors
Contributor

smoors commented Sep 15, 2020

the valid concerns raised by @Micket may be (partially?) mitigated if the following rules are adopted:

  • prerequisite: MPI variant is superset of serial variant, that is, all symbols in serial variant are also in MPI variant
  • rule 1) MPI variant depends on serial variant
  • rule 2) serial variant does not depend on MPI variant
  • rule 3) serial variant is not loaded when MPI variant is already loaded, which prevents the serial variant from jumping ahead of the MPI variant in LD_LIBRARY_PATH

possible implementation of rule 3) in Lmod:

if not ( isloaded("X.MPI") ) then
    -- set environment for X.serial here
end

possible implementation of rule 3) in Tcl:

if { ![ is-loaded X.MPI ] } {
    # set environment for X.serial here
}

expected results:

  • case 1: loading only MPI variant
    due to 1), serial variant is loaded first and MPI variant last, thus libs of MPI variant come first in LD_LIBRARY_PATH
  • case 2: loading only serial variant
    due to 2), only serial variant is in LD_LIBRARY_PATH
  • case 3: loading MPI variant first, serial variant second
    due to 3), only MPI variant is in LD_LIBRARY_PATH
  • case 4: loading serial variant first, MPI variant second
    due to order of loading, libs of MPI variant come first in LD_LIBRARY_PATH
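The four cases can be walked through with a small Python toy model of the three rules. This is only a sketch under stated assumptions: modules are reduced to the names `"serial"` and `"mpi"`, each load prepends a made-up directory to a list standing in for LD_LIBRARY_PATH, and loading an already loaded module is a no-op. It does not reproduce actual Lmod or Tcl module behaviour.

```python
def load(module, loaded, path):
    """Load 'serial' or 'mpi' into the toy environment, mutating loaded and path."""
    if module in loaded:
        return  # already loaded: nothing to do
    if module == "mpi":
        load("serial", loaded, path)   # rule 1: MPI variant depends on serial variant
        loaded.append("mpi")
        path.insert(0, "mpi/lib")      # prepend, as a module file typically would
    elif module == "serial":
        # rule 2: serial variant never pulls in the MPI variant
        loaded.append("serial")
        if "mpi" not in loaded:        # rule 3: skip environment if MPI variant is loaded
            path.insert(0, "serial/lib")
    else:
        raise ValueError(f"unknown module: {module}")


def run(order):
    """Load the modules in the given order and return the resulting search path."""
    loaded, path = [], []
    for module in order:
        load(module, loaded, path)
    return path


for case in (["mpi"], ["serial"], ["mpi", "serial"], ["serial", "mpi"]):
    print(case, "->", run(case))
```

In this simplified model, case 3 still ends up with the serial directory on the path, behind the MPI one, because rule 1 already pulled it in as a dependency. The property the sketch checks is only that the MPI directory comes first whenever the MPI variant is loaded.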
