Upgrade components and install PMIx #329

Merged
merged 78 commits into from
Apr 3, 2024
Changes from 76 commits
c66db4c
add json file
LiquidPT Mar 19, 2024
3155c8f
first pass of adding config calls
LiquidPT Mar 20, 2024
6a6aaa8
cross reference from JSON
LiquidPT Mar 20, 2024
d10c567
triple check
LiquidPT Mar 20, 2024
764e32e
fix impi in alma
LiquidPT Mar 20, 2024
1db4f98
add prerequisites
LiquidPT Mar 20, 2024
0795b7f
update execute permissions
LiquidPT Mar 20, 2024
e731d63
fix azcopy casing
LiquidPT Mar 20, 2024
8ad8073
fix azcopy path
LiquidPT Mar 20, 2024
48729ee
don't upgrade ubuntu 20.04 to see if it fixes lustre
LiquidPT Mar 20, 2024
f324ea8
fix MVAPICH2 and IMPI
LiquidPT Mar 20, 2024
6570c27
put the 20.04 upgrade back
LiquidPT Mar 20, 2024
92b5f16
update cuda version
LiquidPT Mar 20, 2024
0c7ba59
fix cuda and intel libs
LiquidPT Mar 20, 2024
352429c
update lustre version
LiquidPT Mar 20, 2024
1211980
fix nccl install
LiquidPT Mar 20, 2024
489fbef
fix intel lib on alma
LiquidPT Mar 20, 2024
a7e3b42
fix intel libs on ubuntu
LiquidPT Mar 20, 2024
73bb63d
add execute to disable_user_namespaces
LiquidPT Mar 20, 2024
3385add
fix alma impi
LiquidPT Mar 21, 2024
f9703cf
Update requirements with packages for mar2024
KimPhillips128 Mar 21, 2024
1b54c51
update mpis stripping trailing zero from version for mv statement-imp…
KimPhillips128 Mar 22, 2024
f2c4235
fix path for impi_2021
KimPhillips128 Mar 22, 2024
5af1c03
Update run-tests with package versions
KimPhillips128 Mar 22, 2024
b99c806
update run-tests fixing mkl version location
KimPhillips128 Mar 22, 2024
24005bc
fix impi2021 path reference in run_tests
KimPhillips128 Mar 22, 2024
77c0cdf
delete old distros
LiquidPT Mar 22, 2024
f22fb75
remove extension scripts
LiquidPT Mar 22, 2024
2ac9ac3
remove Ubuntu 18.04 references
LiquidPT Mar 22, 2024
e0c48c7
fix symbolic link for impi2021
KimPhillips128 Mar 22, 2024
6fbc029
removing .0 from impi_2021 folders.
KimPhillips128 Mar 22, 2024
8c3846a
remove MOFED LTS
LiquidPT Mar 22, 2024
17b14be
Merge branch 'liquidpt/master-requirements-json' into kim/mar2024requ…
LiquidPT Mar 23, 2024
3b95d07
tweak impi config
LiquidPT Mar 23, 2024
03e1a91
add version to end of module path
LiquidPT Mar 25, 2024
1e1fd0d
Merge remote-tracking branch 'upstream/master' into kim/mar2024requir…
LiquidPT Mar 25, 2024
22f793d
test no apt upgrade
LiquidPT Mar 25, 2024
2bd7c94
Revert "test no apt upgrade"
LiquidPT Mar 25, 2024
00991ba
stop fabric manager before starting
LiquidPT Mar 25, 2024
f163a9c
upgrade nvidia drivers to 550.54.15
LiquidPT Mar 26, 2024
7e9c75e
disable fabric manager initialization
LiquidPT Mar 26, 2024
69702c0
enable fabric manager
LiquidPT Mar 26, 2024
ecbaef4
add starting fabric manager to install
LiquidPT Mar 26, 2024
9446c45
update alma
LiquidPT Mar 26, 2024
ad2dc2d
fix impi module filename in alma
LiquidPT Mar 26, 2024
1128732
add logging to mpi install
LiquidPT Mar 26, 2024
d524d0b
add pmix scripts
LiquidPT Mar 27, 2024
ac8d543
update execute permissions
LiquidPT Mar 27, 2024
8b1b247
add pmix version config
LiquidPT Mar 27, 2024
1e5073f
call PMIX from install
LiquidPT Mar 27, 2024
2757327
clean up line endings
LiquidPT Mar 27, 2024
5e72617
install developer libraries
LiquidPT Mar 28, 2024
466aef5
update ubuntu to use the LTS kernel
LiquidPT Mar 28, 2024
b4370ae
move PMC repo to prerequisites
LiquidPT Mar 28, 2024
533c6ab
update LTS kernel package name
LiquidPT Mar 29, 2024
507a334
revert back to upgrading all packages
LiquidPT Mar 29, 2024
095030a
fix prerequisites
LiquidPT Mar 29, 2024
d148896
cleanup whitespace
LiquidPT Mar 29, 2024
a465207
remove IMPI 2018
LiquidPT Mar 30, 2024
0f37b6a
update mpi installs
LiquidPT Mar 30, 2024
7d394e2
fix pmix path
LiquidPT Mar 30, 2024
8716385
fix path to repo file
LiquidPT Mar 30, 2024
e4c50da
fix pmix pathing
LiquidPT Mar 30, 2024
46f8149
set repo dir
LiquidPT Mar 31, 2024
e9e2456
roll back openmpi
LiquidPT Mar 31, 2024
48e9487
fix alma ompi test directory
LiquidPT Mar 31, 2024
0cd10b6
fix for ompi 5 on alma
LiquidPT Apr 1, 2024
1271d54
update alma ompi test
LiquidPT Apr 1, 2024
14af60c
use pmix-libdir
LiquidPT Apr 1, 2024
bd7c124
update pmix path
LiquidPT Apr 1, 2024
6efa1fb
don't remove files
LiquidPT Apr 1, 2024
7e47ab0
Revert "update pmix path"
LiquidPT Apr 2, 2024
dd54c4d
Revert "use pmix-libdir"
LiquidPT Apr 2, 2024
27e228a
Revert "don't remove files"
LiquidPT Apr 2, 2024
95ebd1a
Reapply "don't remove files"
LiquidPT Apr 2, 2024
54696a9
remove PMIX from mvapich2
LiquidPT Apr 2, 2024
f34d85c
Revert "Reapply "don't remove files""
LiquidPT Apr 2, 2024
b953e8c
remove commented out lines
LiquidPT Apr 2, 2024
11 changes: 7 additions & 4 deletions alma/alma-8.x/alma-8.7-hpc/install.sh
@@ -19,6 +19,9 @@ $ALMA_COMMON_DIR/install_lustre_client.sh "8"
# install mellanox ofed
./install_mellanoxofed.sh

+# install PMIX
+$ALMA_COMMON_DIR/../alma-8.x/common/install_pmix.sh
+
# install mpi libraries
./install_mpis.sh

@@ -32,10 +35,10 @@ $ALMA_COMMON_DIR/install_lustre_client.sh "8"
./install_intel_libs.sh

# cleanup downloaded tarballs - clear some space
-rm -rf *.tgz *.bz2 *.tbz *.tar.gz *.run *.deb *_offline.sh
-rm -rf /tmp/MLNX_OFED_LINUX* /tmp/*conf*
-rm -rf /var/intel/ /var/cache/*
-rm -Rf -- */
+#rm -rf *.tgz *.bz2 *.tbz *.tar.gz *.run *.deb *_offline.sh
+#rm -rf /tmp/MLNX_OFED_LINUX* /tmp/*conf*
+#rm -rf /var/intel/ /var/cache/*
+#rm -Rf -- */

# Install NCCL
./install_nccl.sh
10 changes: 9 additions & 1 deletion alma/alma-8.x/alma-8.7-hpc/install_mpis.sh
@@ -19,12 +19,20 @@ HPCX_DOWNLOAD_URL=$(jq -r '.url' <<< $hpcx_metadata)
TARBALL=$(basename $HPCX_DOWNLOAD_URL)
HPCX_FOLDER=$(basename $HPCX_DOWNLOAD_URL .tbz)

PMIX_VERSION=$(jq -r '.pmix."'"$DISTRIBUTION"'".version' <<< $COMPONENT_VERSIONS)
PMIX_PATH=${INSTALL_PREFIX}/pmix/${PMIX_VERSION:0:-2}

$COMMON_DIR/download_and_verify.sh $HPCX_DOWNLOAD_URL $HPCX_SHA256
tar -xvf ${TARBALL}

sed -i "s/\/build-result\//\/opt\//" ${HPCX_FOLDER}/hcoll/lib/pkgconfig/hcoll.pc
mv ${HPCX_FOLDER} ${INSTALL_PREFIX}
HPCX_PATH=${INSTALL_PREFIX}/${HPCX_FOLDER}
$COMMON_DIR/write_component_version.sh "HPCX" $HPCX_VERSION

# rebuild HPCX with PMIx
${HPCX_PATH}/utils/hpcx_rebuild.sh --with-hcoll --ompi-extra-config --with-pmix=${PMIX_PATH}

# exclude ucx from updates
sed -i "$ s/$/ ucx*/" /etc/dnf/dnf.conf

@@ -38,7 +46,7 @@ cat << EOF >> /usr/share/Modules/modulefiles/mpi/hpcx-${HPCX_VERSION}
# HPCx ${HPCX_VERSION}
#
conflict mpi
-module load ${HPCX_PATH}/modulefiles/hpcx
+module load ${HPCX_PATH}/modulefiles/hpcx-rebuild
EOF

# Create symlinks for modulefiles
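Review note on the `sed -i "$ s/$/ ucx*/" /etc/dnf/dnf.conf` context line in this hunk: the expression appends ` ucx*` to whatever the last line of the file happens to be, so it only excludes the package if that last line is the `exclude=` entry. The sketch below reproduces the behaviour on a temporary file. The file contents are an assumption for illustration, not a copy of the real `dnf.conf`.

```shell
# Hypothetical stand-in for /etc/dnf/dnf.conf; the real file is not touched.
conf=$(mktemp)
printf '[main]\ngpgcheck=1\nexclude=kernel*\n' > "$conf"

# Same sed expression as the install script, without -i so the result goes to stdout.
result=$(sed '$ s/$/ ucx*/' "$conf")

# Only the last line of the file is changed.
last_line=$(printf '%s\n' "$result" | tail -n 1)
echo "$last_line"

rm -f "$conf"
```

If the last line of the real file is ever something other than the `exclude=` entry, the suffix lands on that line instead, so matching on `^exclude=` may be a safer address.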
47 changes: 0 additions & 47 deletions alma/alma-8.x/common/install_mpis.sh

This file was deleted.

4 changes: 4 additions & 0 deletions alma/alma-8.x/common/install_nvidiagpudriver.sh
@@ -72,6 +72,10 @@ yum install -y ./nvidia-fabric-manager-${NVIDIA_FABRICMANAGER_VERSION}.x86_64.rp
sed -i "$ s/$/ nvidia-fabric-manager/" /etc/dnf/dnf.conf
$COMMON_DIR/write_component_version.sh "NVIDIA_FABRIC_MANAGER" ${NVIDIA_FABRICMANAGER_VERSION}

+systemctl enable nvidia-fabricmanager
+systemctl start nvidia-fabricmanager
+systemctl is-active --quiet nvidia-fabricmanager
+
# cleanup downloaded files
rm -rf *.run *tar.gz *.rpm
rm -rf -- */
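Review note on the fabric manager lines added above: `systemctl is-active --quiet` runs once, immediately after `systemctl start`. If the service ever needs a moment to settle, a small polling helper could make the check more tolerant. The sketch below is an assumption about how that might look; `retry_until` is a hypothetical name and is not part of this repository.

```shell
# Hypothetical helper: run a check command up to N times, one second apart.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry_until() {
    attempts=$1
    shift
    i=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$i" -ge "$attempts" ]; then
            return 1
        fi
        i=$((i + 1))
        sleep 1
    done
}

# Intended use in the install script (not executed here):
#   retry_until 10 systemctl is-active --quiet nvidia-fabricmanager
```

The helper takes the check as ordinary arguments, so the same function would work for any other service or readiness probe.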
20 changes: 20 additions & 0 deletions alma/alma-8.x/common/install_pmix.sh
@@ -0,0 +1,20 @@
#!/bin/bash
set -ex

PMIX_VERSION=$(jq -r '.pmix."'"$DISTRIBUTION"'".version' <<< $COMPONENT_VERSIONS)
REPO_DIR="$ALMA_COMMON_DIR/../alma-8.x/common"

cp ${REPO_DIR}/slurmel8.repo /etc/yum.repos.d/slurm.repo

## This package is pre-installed in all hpc images used by cyclecloud, but if customer wants to
## build an image from generic marketplace images then this package sets up the right gpg keys for PMC.
if [ ! -e /etc/yum.repos.d/microsoft-prod.repo ];then
curl -sSL -O https://packages.microsoft.com/config/rhel/8/packages-microsoft-prod.rpm
rpm -i packages-microsoft-prod.rpm
rm packages-microsoft-prod.rpm
fi

dnf config-manager --set-enabled powertools
yum -y install pmix-$PMIX_VERSION.el8 hwloc-devel libevent-devel munge-devel
Review comment (Contributor) on the line above: OK for this PR, but please add a follow-up PR (after release) to move all the dependencies to a single file.

$COMMON_DIR/write_component_version.sh "PMIX" ${PMIX_VERSION}
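Review note on the version lookup used in this script: the filter `'.pmix."'"$DISTRIBUTION"'".version'` splices a shell variable into the jq program through nested quoting. jq's `--arg` option passes the value as a jq variable instead, which may be easier to read. The sketch below assumes a small sample document; the distribution key `almalinux8.7`, the version `4.2.9-1`, and the helper name `lookup_version` are all hypothetical and do not come from the PR.

```shell
# Hypothetical sample of the versions document; keys and values are assumptions.
COMPONENT_VERSIONS='{"pmix": {"almalinux8.7": {"version": "4.2.9-1"}}}'

# Hypothetical helper: print .<component>.<distribution>.version.
# $1 = component name, $2 = distribution key
lookup_version() {
    printf '%s' "$COMPONENT_VERSIONS" |
        jq -r --arg c "$1" --arg d "$2" '.[$c][$d].version'
}

# Intended use (requires jq on the PATH):
#   PMIX_VERSION=$(lookup_version pmix "$DISTRIBUTION")
```

Passing the key with `--arg` also avoids breakage if a distribution name ever contains a quote character.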
7 changes: 7 additions & 0 deletions alma/alma-8.x/common/slurmel8.repo
@@ -0,0 +1,7 @@
[slurm]
name=Slurm Workload Manager
baseurl=https://packages.microsoft.com/yumrepos/slurm-el8-insiders
enabled=1
gpgcheck=1
gpgkey=https://packages.microsoft.com/keys/microsoft.asc
priority=10
35 changes: 21 additions & 14 deletions alma/common/install_mpis.sh
@@ -1,12 +1,15 @@
#!/bin/bash
-set -e
+set -ex

GCC_VERSION=$1
HPCX_PATH=$2

+PMIX_VERSION=$(jq -r '.pmix."'"$DISTRIBUTION"'".version' <<< $COMPONENT_VERSIONS)
+
HCOLL_PATH=${HPCX_PATH}/hcoll
UCX_PATH=${HPCX_PATH}/ucx
INSTALL_PREFIX=/opt
+PMIX_PATH=${INSTALL_PREFIX}/pmix/${PMIX_VERSION:0:-2}

# Load gcc
export PATH=/opt/${GCC_VERSION}/bin:$PATH
@@ -41,7 +44,9 @@ OMPI_FOLDER=$(basename $OMPI_DOWNLOAD_URL .tar.gz)
$COMMON_DIR/download_and_verify.sh $OMPI_DOWNLOAD_URL $OMPI_SHA256
tar -xvf $TARBALL
cd $OMPI_FOLDER
-./configure --prefix=${INSTALL_PREFIX}/openmpi-${OMPI_VERSION} --with-ucx=${UCX_PATH} --with-hcoll=${HCOLL_PATH} --enable-mpirun-prefix-by-default --with-platform=contrib/platform/mellanox/optimized && make -j$(nproc) && make install
+./configure --prefix=${INSTALL_PREFIX}/openmpi-${OMPI_VERSION} --with-ucx=${UCX_PATH} --with-hcoll=${HCOLL_PATH} --with-pmix=${PMIX_PATH} --enable-mpirun-prefix-by-default --with-platform=contrib/platform/mellanox/optimized
+make -j$(nproc)
+make install
cd ..
$COMMON_DIR/write_component_version.sh "OMPI" ${OMPI_VERSION}

@@ -57,7 +62,9 @@ IMPI_OFFLINE_INSTALLER=$(basename $IMPI_DOWNLOAD_URL)

$COMMON_DIR/download_and_verify.sh $IMPI_DOWNLOAD_URL $IMPI_SHA256
bash $IMPI_OFFLINE_INSTALLER -s -a -s --eula accept
-mv ${INSTALL_PREFIX}/intel/oneapi/mpi/${IMPI_VERSION}/modulefiles/mpi ${INSTALL_PREFIX}/intel/oneapi/mpi/${IMPI_VERSION}/modulefiles/impi
+
+impi_2021_version=${IMPI_VERSION:0:-2}
+mv ${INSTALL_PREFIX}/intel/oneapi/mpi/${impi_2021_version}/etc/modulefiles/mpi ${INSTALL_PREFIX}/intel/oneapi/mpi/${impi_2021_version}/etc/modulefiles/impi
$COMMON_DIR/write_component_version.sh "IMPI" ${IMPI_VERSION}

# Setup module files for MPIs
@@ -100,25 +107,25 @@ setenv MPI_HOME /opt/openmpi-${OMPI_VERSION}
EOF

#IntelMPI-v2021
-cat << EOF >> /usr/share/Modules/modulefiles/mpi/impi_${IMPI_VERSION}
+cat << EOF >> /usr/share/Modules/modulefiles/mpi/impi_${impi_2021_version}
#%Module 1.0
#
-# Intel MPI ${IMPI_VERSION}
+# Intel MPI ${impi_2021_version}
#
conflict mpi
-module load /opt/intel/oneapi/mpi/${IMPI_VERSION}/modulefiles/impi
-setenv MPI_BIN /opt/intel/oneapi/mpi/${IMPI_VERSION}/bin
-setenv MPI_INCLUDE /opt/intel/oneapi/mpi/${IMPI_VERSION}/include
-setenv MPI_LIB /opt/intel/oneapi/mpi/${IMPI_VERSION}/lib
-setenv MPI_MAN /opt/intel/oneapi/mpi/${IMPI_VERSION}/man
-setenv MPI_HOME /opt/intel/oneapi/mpi/${IMPI_VERSION}
+module load /opt/intel/oneapi/mpi/${impi_2021_version}/etc/modulefiles/impi/${impi_2021_version}
+setenv MPI_BIN /opt/intel/oneapi/mpi/${impi_2021_version}/bin
+setenv MPI_INCLUDE /opt/intel/oneapi/mpi/${impi_2021_version}/include
+setenv MPI_LIB /opt/intel/oneapi/mpi/${impi_2021_version}/lib
+setenv MPI_MAN /opt/intel/oneapi/mpi/${impi_2021_version}/share/man
+setenv MPI_HOME /opt/intel/oneapi/mpi/${impi_2021_version}
EOF

# Create symlinks for modulefiles
ln -s /usr/share/Modules/modulefiles/mpi/mvapich2-${MVAPICH2_VERSION} /usr/share/Modules/modulefiles/mpi/mvapich2
ln -s /usr/share/Modules/modulefiles/mpi/openmpi-${OMPI_VERSION} /usr/share/Modules/modulefiles/mpi/openmpi
-ln -s /usr/share/Modules/modulefiles/mpi/impi_${IMPI_VERSION} /usr/share/Modules/modulefiles/mpi/impi-2021
+ln -s /usr/share/Modules/modulefiles/mpi/impi_${impi_2021_version} /usr/share/Modules/modulefiles/mpi/impi-2021

# cleanup downloaded tarballs and other installation files/folders
-rm -rf *.tar.gz *offline.sh
-rm -rf -- */
+# rm -rf *.tar.gz *offline.sh
+# rm -rf -- */
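Review note on `${IMPI_VERSION:0:-2}` and `${PMIX_VERSION:0:-2}` in this file: the expansion drops the last two characters regardless of what they are, so it only gives the intended result while the version string ends in a two-character suffix such as `.0`. A suffix-aware alternative is sketched below. `strip_suffix` is a hypothetical helper name, and the version strings in the usage comments are assumed examples rather than values taken from the PR.

```shell
# Hypothetical helper: remove a literal suffix from a value only if it is present.
# $1 = value, $2 = suffix to drop
strip_suffix() {
    # Quoting "$2" makes the suffix match literally rather than as a glob pattern.
    printf '%s\n' "${1%"$2"}"
}

# Intended use (assumed example values):
#   impi_2021_version=$(strip_suffix "$IMPI_VERSION" ".0")   # 2021.11.0 -> 2021.11
#   strip_suffix "2021.11" ".0"                              # unchanged: 2021.11
```

With a value such as `2021.11`, the fixed-length form `${v:0:-2}` would yield `2021.`, while the suffix form leaves the value untouched.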
5 changes: 3 additions & 2 deletions customizations/ndv4.sh
@@ -15,8 +15,9 @@ EOF

## NVIDIA Fabric manager
systemctl enable nvidia-fabricmanager
-systemctl start nvidia-fabricmanager
-systemctl is-active --quiet nvidia-fabricmanager
+# systemctl stop nvidia-fabricmanager
+# systemctl start nvidia-fabricmanager
+# systemctl is-active --quiet nvidia-fabricmanager

error_code=$?
if [ ${error_code} -ne 0 ]
5 changes: 3 additions & 2 deletions customizations/ndv5.sh
@@ -15,8 +15,9 @@ EOF

## NVIDIA Fabric manager
systemctl enable nvidia-fabricmanager
-systemctl start nvidia-fabricmanager
-systemctl is-active --quiet nvidia-fabricmanager
+# systemctl stop nvidia-fabricmanager
+# systemctl start nvidia-fabricmanager
+# systemctl is-active --quiet nvidia-fabricmanager

error_code=$?
if [ ${error_code} -ne 0 ]
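Review note on `error_code=$?` in `ndv4.sh` and `ndv5.sh`: with the `is-active` line commented out, `$?` now holds the exit status of `systemctl enable`, the last command still executed before the assignment, so the check that follows no longer reflects whether the service is running. The sketch below shows the effect with stand-in functions. Both function names and the non-zero status are assumptions for illustration only.

```shell
# Stand-ins for the real commands; both names are hypothetical.
enable_service() { return 0; }   # plays the role of `systemctl enable`
check_active()   { return 3; }   # plays the role of a failing `systemctl is-active`

# Shape of the code after this change: the check is commented out.
enable_service
# check_active
status_without_check=$?          # status of enable_service

# Shape of the code before this change: the check runs last.
enable_service
status_with_check=0
check_active || status_with_check=$?

echo "without check: $status_without_check, with check: $status_with_check"
```

If the intent is still to fail the customization when the fabric manager is down, the `error_code` block may need either the `is-active` call restored or its own explicit check.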