EESSI hackathon Dec'21
- when: week of Dec 13-17 2021
- main goal: focused effort on various tasks in EESSI
- expectations:
- joining kickoff/sync/show&tell meetings
- spending a couple of hours that week on one or more of the outlined tasks (in group)
- take extensive notes (to integrate into documentation later)
- registration: https://doodle.com/poll/xha7h6pawwuk5xc2
- original list of potential tasks
- GitHub repo for EESSI hackathon(s): https://github.com/EESSI/hackathons
Mon Dec 13th 2021, 09:00 UTC: kickoff
- clarify expectations
- overview of tasks
- getting organised: who works on what, form groups
Wed Dec 15th 2021, 09:00 UTC: sync
- sync meeting notes:
- quick progress report per group
- briefly discuss next steps
- notes: (see below)
Fri Dec 17th 2021, 13:00 UTC: show & tell
- each group briefly demos/presents what they worked on
- outline follow-up steps
- slides: https://raw.githubusercontent.com/EESSI/meetings/main/meetings/EESSI_hackathon_2021-12_show_and_tell.pdf
- recording: https://www.youtube.com/watch?v=H6Wx6hAO-r0
If you plan to actively participate in this hackathon:
- add your name + affiliation + GitHub handle below (or ask someone to do it for you)
- feel free to pick ONE task you would like to work on, and add your name to the list of people working on that task (see below)
Joining:
- Kenneth Hoste (HPC-UGent) - @boegel
- Thomas Röblitz (HPC-UBergen) - @trz42
- Bob Dröge (HPC-UGroningen) - @bedroge
- Ward Poelmans (VUB-HPC) - @wpoely86
- Jurij Pečar (EMBL) - @jpecar
- Martin Errenst (HPC.NRW / University of Wuppertal) - @stderr-enst
- Axel Rosén (HPC-UOslo) - @rungitta
- Terje Kvernes (UOslo) - @terjekv
- Alan O'Cais (CECAM) - @ocaisa
- Caspar van Leeuwen (SURF) - @casparvl
- Ahmad Hesam (SURF)
- Michael Hübner (UBonn - HPC.NRW)
- Erica Bianco (HPCNow!) - @kErica
- Hugo Meiland (Microsoft Azure) - @hmeiland
- Jörg Saßmannshausen (NHS/GSTT) - @sassy-crick
Please use the virtual clusters we have set up for this hackathon!
- EESSI pilot repository is readily available
- Different CPU types supported
- Singularity is installed
- Magic Castle cluster, managed by Alan: all info at https://github.com/EESSI/hackathons/tree/main/2021-12/magic_castle
- CitC cluster, managed by Kenneth: all info at https://github.com/EESSI/hackathons/tree/main/2021-12/citc
If you need help, contact us via the EESSI Slack (join via https://www.eessi-hpc.org/join)
General hackathon channel: #hackathon.
See also task-specific channels!
Based on the doodle, a subset of main tasks was selected for this hackathon:
- [02] Installing software on top of EESSI
- task lead: Kenneth
- participating: Kenneth, Erica, Martin, (Ahmad)
- notes: (see below)
- Slack channel: #hackathon-software_on_top
- Zoom: https://uib.zoom.us/j/65344277321?pwd=THVPY3hZQmlRa0loOWd6b2xKaFRrZz09
- [03] Workflow to propose additions to EESSI software stack
- task lead: Bob
- participating: Bob, Jörg (+ Kenneth)
- notes: (see below)
- Slack channel: #hackathon-contribution_workflow
- Zoom: https://uib.zoom.us/j/69823235860?pwd=UjRNYmV0UGoxSmdGMkZsclpBSGJZQT09
- [05] GPU support
- task lead: Alan
- participating: Alan, Michael, Ward
- notes: (see below)
- Slack channel: #hackathon-gpu_support
- Zoom: https://uib.zoom.us/j/69890745932?pwd=bWlxV2prTyswS0Q4SWptMzA3bDVBQT09
- [06] EESSI test suite
- task lead: Caspar
- participating: Caspar, Vasileios, Thomas, Hugo, (Bob)
- notes: (see below)
- Slack channel: #hackathon-test_suite
- Zoom: https://uib.zoom.us/j/63178835002?pwd=SnUzTmFpcmlhS0VueWRwM2RicGtBdz09
- [16] Export a version of the EESSI stack to a tarball and/or container image
- task lead: Jure
- participating: Jure
- notes: (see below)
- Slack channel: #hackathon-export_software_stack
Lone wolves:
- Axel + Terje: monitoring
- notes: (see below)
- Zoom: https://uib.zoom.us/j/61135526605?pwd=VkZnRXhMVTI1RkIxTis2Vm4yUkRtQT09
- Ahmad + Axel: private Stratum-1
- notes: (see below)
- Zoom: https://uib.zoom.us/j/62100150823?pwd=OFNpWk9RZ3llTWdqZ0VId1VUMG03UT09
- Hugo (+ Matt): Azure support in CitC
Task progress:
- task notes: (see below)
- executive summary:
- great progress by Martin on including RPATH wrappers into the GCCcore installation in EESSI, to facilitate building software manually on top of EESSI
- Hugo got WRF installed on top of EESSI using EasyBuild
- TODO:
- Figure out best way to add support to GCC easyblock to opt-in to also installing RPATH wrappers
- Check on interference between included RPATH wrappers and the dynamic ones set up by EasyBuild
- Documentation on installing software on top of EESSI
- Fully autonomous build script (in Prefix env + build container, etc.)
- use stdin trick to run stuff in Prefix env (sketched below): /.../startprefix <<< ...
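A minimal illustration of the stdin trick, assuming configure_easybuild is available in the working directory (the startprefix path is the one used elsewhere in these notes):
# run a one-off command inside the Gentoo Prefix environment by feeding it on stdin
/cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/$(uname -m)/startprefix <<< "source configure_easybuild && eb example.eb"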
- task notes: (see below)
- executive summary:
- initial planning + implementation done
- GitHub App/bot is being developed in https://github.com/EESSI/eessi-bot-software-layer
- bot can already react to opening of PR
- support was added to replay events (to facilitate testing)
- Jörg's build container/script can be used in "backend" of bot
- test PR: https://github.com/EESSI/hackathons/pull/2/files
- Is the bot used by CVMFS public?
- How will distributed resources be used?
- One bot that talks to other resources
- Multiple bots negotiating
- Bot should report back results of build/test (especially in terms of failure)
- task notes: (see below)
- executive summary:
- ...
- would be useful to have more recent toolchains installed in 2021.12 (for AMD Rome)
- task notes: (see below)
- executive summary:
- ReFrame intro by Vasileios
- Created list of compat & software layer tests needed (see task notes)
- Set up ReFrame 3.9.2 (on top of EESSI! :) ) on Magic Castle
- Made GROMACS EESSI test (written on top of CSCS GROMACS libtest) work on Magic Castle
- Only works from Jupyter terminal because of SELinux, but should be resolved in the future (see https://github.com/ComputeCanada/puppet-magic_castle/issues/163)
- Default mem on Magic Castle jobs limited => Need to add memory request / requirements to GROMACS EESSI test
- Looking into compat tests (https://github.com/EESSI/compatibility-layer/blob/main/test/compat_layer.py)
- so far mostly trying to understand ReFrame, rerunning tests, ...
- Started WRF test, working on download benchmark and prepping rundir
- separate CVMFS repo in EESSI for shipping large data files (benchmark inputs)?
- ReFrame: don't copy large files?
- CVMFS: dealing with large files?
- task notes: (see below)
- executive summary:
- see detailed notes
- copy takes time, needed variant symlinks not yet in place
- alternative approach could be a separate archive repo + docs to use it
- task notes: (see below)
- executive summary:
- see notes
- Hugo: CVMFS through Azure blob will make monitoring more challenging
- also applies to updating the repo contents
- task notes: (see below)
- executive summary:
- see notes
- most work done by Ahmad, testing by Axel
- test suite to verify Stratum-1? => https://github.com/EESSI/filesystem-layer/issues/111
- task notes: ???
- executive summary: WIP
- credit status in AWS:
- $110 on Mon ($25 GPU node, ~$45 EFA nodes in Magic Castle)
- $130 on Tue ($40 GPU node, ~$45 EFA nodes in Magic Castle, ~$20 CitC nodes)
- separate note: working towards user-facing docs: https://hackmd.io/irkuPm4BSye6OL24wKmpgw
- building with GCC included in EESSI -- For notes, see section below
- Have PythonPackage installations work on top of EESSI (with EasyBuild)
- have a simple EB recipe working on EESSI out of the box (configuring EasyBuild correctly)
- start the Prefix environment and then configure EasyBuild correctly:
/cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/$(uname -m)/startprefix
- https://github.com/EESSI/software-layer/blob/main/configure_easybuild
- some documentation on it
- Hugo: building WRF on top of EESSI (with EasyBuild)
- standalone script to install software on top of EESSI
startprefix <<<
source configure_easybuild
eb example.eb
- `module load GCC` should set up RPATH wrappers, such that `gcc` and other tools do the right thing with respect to RPATH
- Should be transparent to users for simple builds
- Users should still be aware of the issue => documentation needed
- Development workflow:
- `module load GCC` with unchanged module
- Write a Python script that generates RPATH wrappers from EasyBuild framework functions
- Put the resulting script in `PATH` to replace `gcc` and other commands, and forward to the correct commands within the script
- If everything works, figure out how to create & ship the wrappers in the install step of GCC
- Look into `prepare_rpath_wrapper` from easybuild-framework
- What to do with filters and include paths?
- Make use of environment variables
- Environment variables could be set during the EasyBuild installation step
- Environment variables should be discussed in the documentation if users want to exclude/inject certain libraries in their build process
- Just for reference, ComputeCanada provide a script to patch binaries: https://github.com/ComputeCanada/easybuild-computecanada-config/blob/main/setrpaths.sh
- And they ship an ld-wrapper with their compatibility layer: https://github.com/ComputeCanada/gentoo-overlay/blob/8fdb45ba676a5fbb19f165bd85a9c82470218753/sys-devel/binutils-config/files/ld-wrapper.sh
- First script available
- Assumes `source /cvmfs/pilot.eessi-hpc.org/2021.06/init/bash` and `module load EasyBuild/4.4.1`
- Produces wrapper scripts in `/tmp/eb-.../tmp...../rpath_wrappers/{gcc,gxx,gfortran,ld.bfd,ld.gold,ld}_wrapper`
- Output looks like this:
export PATH=/tmp/eb-d3kwvfit/tmpg3zkeuoc/rpath_wrappers/ld.bfd_wrapper:<other wrapper paths + original $PATH>
- You can set the environment variables `$RPATH_FILTER_DIRS` and `$RPATH_INCLUDE_DIRS` to exclude/include certain paths as RPATHs. Right now the variables are expected to be comma-separated lists. Should we change that to ':' or a configurable separator?
- Compiling MCFM as a test project: `readelf -d` lists RPATHs and the program seems to work.
- Compiling a hello world example with and without this modified `PATH` gives:
$ readelf -d hello_world_* | grep rpath
File: hello_world_norpath
File: hello_world_rpath
0x000000000000000f (RPATH) Library rpath: [/cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/intel/haswell/software/GCCcore/9.3.0/lib/gcc/x86_64-pc-linux-gnu/9.3.0:/cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/intel/haswell/software/GCCcore/9.3.0/lib/gcc/x86_64-pc-linux-gnu/9.3.0/../../../../lib64:/cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64/lib/../lib64:/cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64/usr/lib/../lib64:/cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/intel/haswell/software/GCCcore/9.3.0/lib/gcc/x86_64-pc-linux-gnu/9.3.0/../../..:/cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64/lib:/cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64/usr/lib]
- Integrate the RPATH wrapper creating function into the corresponding EasyBuild easyblocks:
- Creates wrappers
- moves them to `bin/rpath_wrappers` in the install location
- sets `self.wrapperdir`
- Function should be made more generic, reusable for other compiler easyblocks as well, not just `gcc.py`
- Not an opt-in option yet, but used whenever `build_option('rpath')` is set
- Setting `PATH` while loading GCCcore can just be done by adding `guesses['PATH'].insert(0, self.wrapperdir)` to `gcc.py#make_module_req_guess()`
- Left to do (bold = WIP):
- Refactor `create_rpath_wrappers` to a generic location
- Integrate RPATH wrappers as an opt-in option `install_rpath_wrappers = True` in EasyBuild
- Include environment variable `DISABLE_RPATH_WRAPPERS` in the module file
- How to handle ld wrappers when loading the GCC module? How to do this in EasyBuild vs. how to do this in EESSI with binutils from the compatibility layer?
- Make sure RPATH wrappers don't mess up loading GCC modules within EasyBuild
- Ship a similar feature for other popular compilers?
- write documentation in the EasyBuild RPATH and compiler documentation (also in the module file itself?)
- Continuing discussion here: EasyBuild PR 2638
#!/usr/bin/env python
# set up the EasyBuild configuration, then generate the RPATH wrapper scripts via the framework
from easybuild.tools.options import set_up_configuration
set_up_configuration(silent=True)
from easybuild.tools.toolchain.toolchain import Toolchain
tc = Toolchain()
tc.prepare_rpath_wrappers([], [])
- The following procedure will build correct rpath binaries for WRF
/cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/$(uname -m)/startprefix
source /cvmfs/pilot.eessi-hpc.org/2021.06/init/bash
ml load EasyBuild/4.4.1
export EASYBUILD_PREFIX=/project/def-sponsor00/easybuild
export EASYBUILD_IGNORE_OSDEPS=1
export EASYBUILD_SYSROOT=${EPREFIX}
export EASYBUILD_RPATH=1
export EASYBUILD_FILTER_ENV_VARS=LD_LIBRARY_PATH
export EASYBUILD_FILTER_DEPS=Autoconf,Automake,Autotools,binutils,bzip2,cURL,DBus,flex,gettext,gperf,help2man,intltool,libreadline,libtool,Lua,M4,makeinfo,ncurses,util-linux,XZ,zlib
export EASYBUILD_MODULE_EXTENSIONS=1
eb -S WRF
# -> from the output, set CFGS1 to the correct easyconfigs path, e.g.:
export CFGS1=/cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/intel/<arch>/software/EasyBuild/4.4.1/easybuild/easyconfigs
eb -r $CFGS1/w/WRF/WRF-3.9.1.1-foss-2020a-dmpar.eb
- configure easybuild properly
- feed a script with easybuild recipe to install it on top of EESSI
1. start EESSI environment
source /cvmfs/pilot.eessi-hpc.org/2021.06/init/bash
2. load EasyBuild
ml EasyBuild/4.4.1
3. install your recipe
eb YOURAPP.eb
- check the different archs to build the app upon
- read the recipe(s) and the related patches → `git diff` with the actual app list?
- build the app(s) on the different archs, ideally in parallel (a sketch of a possible `install_on_top.sh` is shown below the loop):
# ARCHLIST = list of architectures
# REPO = where your recipes and patches are stored
for arch in $ARCHLIST
do
    install_on_top.sh $REPO $arch
done
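A rough sketch of what the hypothetical install_on_top.sh used in the loop above could look like, combining the EESSI init script and the configure_easybuild setup described in these notes; the location of configure_easybuild and the assumption that the script runs on a node of the requested architecture are not taken from the original notes:
#!/bin/bash
# install_on_top.sh (hypothetical): install all easyconfigs from a repo on top of EESSI
# usage: install_on_top.sh <repo_dir> <arch>
set -e
repo=$1
arch=$2
source /cvmfs/pilot.eessi-hpc.org/2021.06/init/bash
module load EasyBuild/4.4.1
# configure EasyBuild as described above (path to configure_easybuild is an assumption)
source ./configure_easybuild
echo "Installing easyconfigs from ${repo} for ${arch} (detected subdir: ${EESSI_SOFTWARE_SUBDIR})"
eb --robot=${repo} ${repo}/*.eb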
eessi-bot@3.250.220.9
https://github.com/EESSI/eessi-bot-software-layer
- Go through brainstorm meeting notes [Everyone]
- Set up the app on VM [Bob]
- Make a very simple easystack example [Jörg]
- Collect some event data [Bob]
- Can easily be done using our Smee URL: https://smee.io/7PIXBDoqczjEVXaf
- Meet at 4pm CET
- Set up bot account on CitC [Kenneth - DONE]
- Install + start app on bot account [Bob - DONE]
- Make pull request with an easystack file
- Collect some event data [Bob]
- use the `hackathon` repo
- Add possibility to use dummy event data as input [Bob]
- Add backend script to bot account on AWS node [Jörg]
- Adapt backend scripts for EESSI/AWS cluster [Jörg]
- Add some basic functionality [Bob]
- React to PR
- Grab the easystack file from the pull request (checkout of branch)
- Submit a job that launches the build script for the given easystack
- On the (build/login) node where the app is running:
- Create unique working directory for the job (use event id?)
- Checkout the branch with the easystack file
- Submit the job
- Take and upload the log(s) in case of failures
- (submit job to) do some eb run (a rough job-script sketch is shown after this list)
- Install the apps from the easystack
- Run tests (`eb --sanity-check-only`)
- Make a tarball
- report back in PR
- (submit job to) do test run
- Different OS
- Unpack tarball
- Re-run tests
- report back in PR
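A minimal sketch of the job script such a bot could submit, under assumptions: the working directory and easystack file are passed in as arguments, the install goes into the job's own prefix, and easystack support is used via EasyBuild's experimental flag (all of these are placeholders, not the bot's actual implementation):
#!/bin/bash
#SBATCH --time=4:00:00
# hypothetical build job for one PR; paths, times and file names are placeholders
set -e
workdir=$1       # unique working directory created for this event
easystack=$2     # easystack file taken from the PR branch (already checked out into workdir)
cd ${workdir}
source /cvmfs/pilot.eessi-hpc.org/2021.06/init/bash
module load EasyBuild/4.4.1
export EASYBUILD_PREFIX=${workdir}/easybuild   # install into the job's working directory
# install the apps listed in the easystack (easystack support is experimental in this EasyBuild version)
eb --experimental --easystack ${easystack} --robot
# re-run the sanity checks as a basic test (as mentioned above)
eb --experimental --easystack ${easystack} --sanity-check-only
# pack up the result; the exact layout depends on how EasyBuild was configured
tar czf eessi-build.tar.gz -C ${EASYBUILD_PREFIX} software modules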
sbatch -C shape=c4.2xlarge  # => haswell
# Clone the repo
git clone https://github.com/EESSI/eessi-bot-software-layer
# Run smee (in screen)
cd eessi-bot-software-layer
./smee.sh
# Run the app itself with a Python virtual environment
cd eessi-bot-software-layer
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
./run.sh
https://github.com/PyGithub/PyGithub/issues/1766#issuecomment-749519409
import github

# Fill in the required information
APP_ID =
PRIVATE_TOKEN =
INSTALLATION_ID =
github_integration = github.GithubIntegration(APP_ID, PRIVATE_TOKEN)
# Note that installation access tokens last only for 1 hour, you will need to regenerate them after they expire.
access_token = github_integration.get_access_token(INSTALLATION_ID)
login = github.Github(access_token)
- `APP_ID` can be found at: https://github.com/organizations/EESSI/settings/apps/eessi-bot-software-layer
- `PRIVATE_TOKEN` is a private key that can be generated on the same page
- `INSTALLATION_ID` can be found by going to this page, selecting the configuration button for the installed app, and copying it from the URL: https://github.com/organizations/EESSI/settings/apps/eessi-bot-software-layer/installations
- or: use `github_integration.get_installation('EESSI', 'software-layer')` (or some other repo to which the app is subscribed)
CUDA cannot (currently) be distributed by EESSI!
Install compatibility layers from CUDA to deal with CUDA/driver (mis)matching.
Use host injection for EESSI to get CUDA available to EESSI. Use a symlink in EESSI that needs to point to the correct CUDA path on the local site. A check is needed to verify it's actually working.
So EESSI needs to set up CUDA on the local site.
Approach:
- EESSI software expects CUDA to be at a certain location (a broken symlink by default)
- Host side should 'provide' the symlink location (host injection)
- NVIDIA drivers need to be available on the host system (outside EESSI)
- EESSI will set up CUDA on the host side (needs a writable path)
- Lmod visibility hook: only show CUDA-dependent modules when CUDA works (maybe check a variable EESSI_GPU_SUPPORT_ACTIVE=1? see the sketch after this list)
- Use EasyBuild to install the CUDA module on the host side.
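A rough sketch of what such a client-side check could look like, assuming the host-injection library path used later in these notes; the EESSI_GPU_SUPPORT_ACTIVE variable is only the working assumption mentioned above, not an agreed interface:
# only advertise GPU support if the host-injected CUDA compat libraries are actually usable
host_injection_lib=/cvmfs/pilot.eessi-hpc.org/host_injections/2021.06/compat/linux/x86_64/lib
if [ -e "${host_injection_lib}/libcuda.so.1" ] && nvidia-smi > /dev/null 2>&1; then
    export EESSI_GPU_SUPPORT_ACTIVE=1
else
    unset EESSI_GPU_SUPPORT_ACTIVE
fi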
Planning:
- Alan will try to reproduce what he already did before but didn't document ;)
- Ward has a solid block of available time on Thursday
- Michael helps out wherever he can
Related issues on GitHub:
- Enabling end-user GPU support and ABI-compatible overrides
- GPU support in the compatibility layer
- Some canary in the coalmine issues with GPU support
Working environment is on eessi-gpu.learnhpc.eu.
A shared space for installations at /project/def-sponsor00/easybuild is to be created. To behave similarly to the EESSI installation script, we need to drop into a Gentoo Prefix shell with:
$EPREFIX/startprefix
The full EasyBuild environment used was
source /etc/profile.d/z-01-site.sh
export EASYBUILD_PREFIX=/project/def-sponsor00/easybuild
export EASYBUILD_IGNORE_OSDEPS=1
export EASYBUILD_SYSROOT=${EPREFIX}
export EASYBUILD_RPATH=1
export EASYBUILD_FILTER_ENV_VARS=LD_LIBRARY_PATH
export EASYBUILD_FILTER_DEPS=Autoconf,Automake,Autotools,binutils,bzip2,cURL,DBus,flex,gettext,gperf,help2man,intltool,libreadline,libtool,Lua,M4,makeinfo,ncurses,util-linux,XZ,zlib
export EASYBUILD_MODULE_EXTENSIONS=1
module load EasyBuild
At this point we can install software with EasyBuild. Nothing special here, just a standard installation:
eb CUDAcore-11.3.1.eb
Once installed, we need to make the module available:
module use /project/def-sponsor00/easybuild/modules/all/
We just installed CUDA 11.3, but let's check the CUDA version supported by our driver:
[ocaisa@gpunode1 ocaisa]$ nvidia-smi
Mon Dec 13 13:57:00 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GRID V100-4C On | 00000000:00:05.0 Off | N/A |
| N/A N/A P0 N/A / N/A | 304MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
We do not (necessarily) need to update our drivers to use the latest CUDA. NVIDIA have long-term support drivers (R450 and R470 for now) with which you can use CUDA compatibility libraries to use the latest CUDA. Making these libraries findable by the compat layer is enough to give you a working CUDA.
For details see https://docs.nvidia.com/datacenter/tesla/drivers/ (and specifically https://docs.nvidia.com/datacenter/tesla/drivers/#cuda-drivers), where they say: "A production branch that will be supported and maintained for a much longer time than a normal production branch is supported. Every LTSB is a production branch, but not every production branch is an LTSB." We can parse https://docs.nvidia.com/datacenter/tesla/drivers/releases.json to figure out the LTS branches (and whether someone should upgrade).
At any point in time, it is best to install the latest version of the compat libraries (since these will track the driver versions). To find the right compat libraries to install we need to be able to navigate https://developer.download.nvidia.com/compute/cuda/repos/, selecting the right OS and the latest version of the compat libraries.
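A hedged helper sketch for that step, assuming the repo index is a plain HTML listing that contains the cuda-compat package file names (the OS/arch in the URL would need to be adjusted per system):
# list cuda-compat packages for RHEL8/x86_64 and pick the newest by version sort
repo_url=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/
latest=$(curl -s ${repo_url} | grep -oE 'cuda-compat-[0-9]+-[0-9]+-[0-9.]+-[0-9]+\.x86_64\.rpm' | sort -uV | tail -n1)
echo "Latest compat package: ${repo_url}${latest}"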
Once we know this, we install the CUDA compatibility libraries so that 11.3 will work with our driver version. Let's put the drivers in a place that is automatically found by the EESSI linker (/cvmfs/pilot.eessi-hpc.org/host_injections/2021.06/compat/linux/x86_64/lib) and set things up so we can universally upgrade to a later version of the compat libraries. /cvmfs/pilot.eessi-hpc.org/host_injections points to /opt/eessi by default, and I have made that group-writable on our cluster:
# Create a general space for our NVIDIA compat drivers
mkdir -p /cvmfs/pilot.eessi-hpc.org/host_injections/nvidia
cd /cvmfs/pilot.eessi-hpc.org/host_injections/nvidia
# Grab the latest compat library RPM
wget https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-compat-11-5-495.29.05-1.x86_64.rpm
# Unpack it
rpm2cpio cuda-compat-11-5-495.29.05-1.x86_64.rpm | cpio -idmv
mv usr/local/cuda-11.5 .
rm -r usr
# Add a symlink that points to the latest version
ln -s cuda-11.5 latest
# Create the space to host the libraries
mkdir -p /cvmfs/pilot.eessi-hpc.org/host_injections/2021.06/compat/linux/x86_64
# Symlink in the path to the latest libraries
ln -s /cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/latest/compat /cvmfs/pilot.eessi-hpc.org/host_injections/2021.06/compat/linux/x86_64/lib
Now we can again check the supported CUDA version:
[ocaisa@gpunode1 ~]$ nvidia-smi
Mon Dec 13 14:06:15 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GRID V100-4C On | 00000000:00:05.0 Off | N/A |
| N/A N/A P0 N/A / N/A | 304MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Note that Compute Canada are considering putting the compatibility libraries directly into their Gentoo Prefix compatibility layer (see https://github.com/ComputeCanada/software-stack/issues/79). I'm not sure whether this can really be OS-independent.
Make a local copy of the CUDA examples:
module load CUDAcore
cp -r $EBROOTCUDACORE/samples ~/
Build the CUDA samples with GCC from EESSI as the host compiler:
module load GCC CUDAcore
cd ~/samples
make HOST_COMPILER=$(which g++)
Unfortunately this seems to fail for some samples:
make[1]: Leaving directory '/home/ocaisa/samples/7_CUDALibraries/conjugateGradientUM'
/cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/amd/zen2/software/GCCcore/9.3.0/lib/gcc/x86_64-pc-linux-gnu/9.3.0/../../../../lib64/libstdc++.so: error: undefined reference to 'fstat64', version 'GLIBC_2.33'
/cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/amd/zen2/software/GCCcore/9.3.0/lib/gcc/x86_64-pc-linux-gnu/9.3.0/../../../../lib64/libstdc++.so: error: undefined reference to 'stat', version 'GLIBC_2.33'
/cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/amd/zen2/software/GCCcore/9.3.0/lib/gcc/x86_64-pc-linux-gnu/9.3.0/../../../../lib64/libstdc++.so: error: undefined reference to 'lstat', version 'GLIBC_2.33'
collect2: error: ld returned 1 exit status
make[1]: *** [Makefile:363: simpleCUFFT_callback] Error 1
make[1]: Leaving directory '/home/ocaisa/samples/7_CUDALibraries/simpleCUFFT_callback'
make: *** [Makefile:51: 7_CUDALibraries/simpleCUFFT_callback/Makefile.ph_build] Error 2
This looks related to the compatibility layer.
Once built, we can test some of the resulting executables:
[ocaisa@gpunode1 samples]$ ./bin/x86_64/linux/release/deviceQuery
./bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GRID V100-4C"
CUDA Driver Version / Runtime Version 11.3 / 11.3
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 4096 MBytes (4294967296 bytes)
(080) Multiprocessors, (064) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1380 MHz (1.38 GHz)
Memory Clock rate: 877 Mhz
Memory Bus Width: 4096-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 98304 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 7 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: No
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 5
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.3, CUDA Runtime Version = 11.3, NumDevs = 1
Result = PASS
Compute Canada do this with their setrpaths.sh script.
This needed a tiny modification for use with EESSI (the linker path inside was incorrect). It throws errors though, the source of which we are not really sure about (but suspect it is related to permissions on files):
ldd: warning: you do not have execution permission for `/project/def-sponsor00/easybuild/software/CUDAcore/11.3.1/nsight-systems-2021.1.3/target-linux-x64/libcupti.so.11.1'
patchelf: open: Permission denied
ldd: warning: you do not have execution permission for `/project/def-sponsor00/easybuild/software/CUDAcore/11.3.1/nsight-systems-2021.1.3/target-linux-x64/libcupti.so.11.3'
patchelf: open: Permission denied
ldd: warning: you do not have execution permission for `/project/def-sponsor00/easybuild/software/CUDAcore/11.3.1/nsight-systems-2021.1.3/target-linux-x64/libcupti.so.11.2'
patchelf: open: Permission denied
patchelf: open: Permission denied
patchelf: open: Permission denied
patchelf: open: Permission denied
If the user (or more specifically an admin) updates the drivers on the system then the compat libraries will cease to work with errors like:
[ocaisa@gnode1 release]$./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 803
-> system has unsupported display driver / cuda driver combination
Result = FAIL
We can't really control this. The best we could do is check `nvidia-smi` and make sure we are still using a compatible compat library:
driver_cuda=$(nvidia-smi -q --display=COMPUTE | grep CUDA | awk 'NF>1{print $NF}' | sed s/\\.//)
eessi_cuda=$(LD_LIBRARY_PATH=/cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/latest/compat/:$LD_LIBRARY_PATH nvidia-smi -q --display=COMPUTE | grep CUDA | awk 'NF>1{print $NF}' | sed s/\\.//)
if [ "$driver_cuda" -gt "$eessi_cuda" ]; then echo "You need to update your CUDA compatability libraries"; fi
This could be done on shell initialisation.
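For example, the check above could be dropped into a profile script; the file name below is just a suggestion, not an agreed location:
# /etc/profile.d/eessi-cuda-compat-check.sh (suggested name)
# warn at login if the host driver supports a newer CUDA than the installed compat libraries
if command -v nvidia-smi > /dev/null 2>&1; then
    driver_cuda=$(nvidia-smi -q --display=COMPUTE | grep CUDA | awk 'NF>1{print $NF}' | sed s/\\.//)
    eessi_cuda=$(LD_LIBRARY_PATH=/cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/latest/compat/:$LD_LIBRARY_PATH nvidia-smi -q --display=COMPUTE | grep CUDA | awk 'NF>1{print $NF}' | sed s/\\.//)
    if [ "$driver_cuda" -gt "$eessi_cuda" ]; then
        echo "You need to update your CUDA compatibility libraries"
    fi
fi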
See https://github.com/EESSI/hackathons/tree/05_gpu/2021-12/05_gpu for the scripts that more or less capture the content discussed here.
This is a good solution for CUDA, but doesn't cover the GL and EGL libraries that would be needed for visualisation. Having said that, if we adopt the GL approach taken at JSC, we should be able to figure out how to correctly set the paths to find the system NVIDIA GL/EGL libraries without needing to do any other magic.
- Thomas Röblitz
- Hugo Meiland
- Caspar van Leeuwen
- Vasileios Karakasis (support for ReFrame questions)
- (Bob, input on required compat layer tests)
14:00-15:00 CEST: intro to ReFrame (by Vasileios). Directly after (~15:00-16:00 CEST): planning & dividing tasks
module use -a /project/def-sponsor00/easybuild/modules/all
module load ReFrame/3.9.2
PYTHONPATH=$PYTHONPATH:~/reframe/:~/software-layer/tests/reframe/ reframe -C config/settings.py -c eessi-checks/applications/ -l -t CI -t singlenode
Current tests @ https://github.com/EESSI/compatibility-layer/blob/main/test/compat_layer.py
We emerge e.g. https://github.com/EESSI/gentoo-overlay/blob/main/etc/portage/sets/eessi-2021.12-linux-x86_64
Some issues
- Some architectures don't support all packages, e.g. OPA-PSM is not supported on Arm => use `skip_if` to skip those on selected architectures?
- New features (e.g. host injections for libraries, GPU support, building on top of EESSI)
List of tests we need
- Compiler tests (Thomas, see e.g. on how to catch https://github.com/EESSI/software-layer/issues/26)
- Python (Thomas)
- RDMA core (Thomas)
- OPA-PSM (Thomas)
- test host-injections library (Thomas)
Probably it makes sense to not test all individual libraries (cairo, boost, etc) but (at least start with) end-user applications. Maybe a few very key low level libs (that are part of toolchains for example) could be tested though.
- WRF (Hugo Meiland)
- OpenFOAM
- GROMACS (Caspar van Leeuwen)
- ParaView
- QuantumESPRESSO
- Python
- R
- R-bundle-Bioconductor
- UCX
- OpenMPI
- OSU-Micro-Benchmarks
- OpenBLAS?
- FFTW?
- ScaLAPACK?
- Brainstorm on software deployment procedure & testing: https://github.com/EESSI/meetings/wiki/Brainstorm-meeting-software-deployment-Nov-24th-2021
- ReFrame library test GROMACS https://github.com/eth-cscs/reframe/blob/v3.9.2/hpctestlib/sciapps/gromacs/benchmarks.py
- CSCS implementation GROMACS https://github.com/eth-cscs/reframe/blob/master/cscs-checks/apps/gromacs/gromacs_check.py
- Potential EESSI implementation GROMACS https://github.com/casparvl/software-layer/blob/gromacs_cscs/tests/reframe/eessi-checks/applications/gromacs_check.py
- Getting access to AWS test cluster https://github.com/EESSI/hackathons/tree/main/2021-12/magic_castle
- kickoff, intro reframe, discussion of tasks/goals
- Thomas:
- get access to Magic Castle and CitC resources
- revisit some simple ReFrame tutorials on eessi.learnhpc.eu
- looking at https://github.com/EESSI/compatibility-layer/blob/main/test/compat_layer.py
- The test script does not assume it is run from within an EESSI pilot environment. The env vars it accesses at the beginning (EESSI_VERSION, EESSI_OS & EESSI_ARCH) need to be set before the script is run. They must not be confused with the env vars set by a pilot environment (e.g., EESSI_PILOT_VERSION, EESSI_OS_TYPE & EESSI_CPU_FAMILY). Of course, when running in a pilot environment one could reuse those to set the env vars used in the script:
export EESSI_VERSION=${EESSI_PILOT_VERSION}
export EESSI_OS=${EESSI_OS_TYPE}
export EESSI_ARCH=${EESSI_CPU_FAMILY}
- The GitHub Action which uses this test script is available at https://github.com/EESSI/compatibility-layer/blob/main/.github/workflows/pilot_repo.yml#L74
- Playing with setting the above variables to 'odd' values, e.g., macos (on a linux machine), aarch64 (on a x86_64 machine), results in various errors.
- idea: check for reasonable values of these vars first, only execute other tests if the variables have meaningful values
- revisiting reframe tutorials 3 (https://reframe-hpc.readthedocs.io/en/stable/tutorial_deps.html) & 4 (https://reframe-hpc.readthedocs.io/en/stable/tutorial_fixtures.html)
- TODO check ReFrame repo for python/gcc tests in cscs-checks
- TODO check EESSI issue https://github.com/EESSI/software-layer/issues/26
The test from https://github.com/casparvl/software-layer/blob/gromacs_cscs/tests/reframe/eessi-checks/applications/gromacs_check.py does not run out of the box.
- The default launcher in my settings.py file was `srun`. That doesn't work, since there is no SLURM integration with `pmi2` between the EESSI stack and the host SLURM. Changed the config file to use `mpirun`.
- By default, jobs on the test cluster only get 9 GB of memory. That seems to not be enough
- The most portable way to fix this seems to be to define an `extra_resources` in the test: https://reframe-hpc.readthedocs.io/en/stable/tutorial_advanced.html?highlight=memory#adding-job-scheduler-options-per-test
- It is not very portable though: you need agreement between the test and the `settings.py` on the name of the extra resource (in this case `memory` seems the most sensible...)
- Should this be defined differently (e.g. with a fixed keyword) in the settings file? There are two aspects to it: how to get more resources (e.g. passing the `--mem` flag to SLURM) and describing how much resources a partition has (how much memory do the nodes have?)
- The job now seems to run, though I still get `[node1.int.eessi.learnhpc.eu:23967] pml_ucx.c:273 Error: Failed to create UCP worker`
- The test still fails, `Reason: permission error: [Errno 13] Permission denied: '/home/casparl/.../md.log'`. No clue why, the file has the correct file permissions... `-rw-rw-r-- 1 casparl casparl 27410 Dec 14 10:27 /home/casparl/.../md.log`
In this test, especially with requesting extra memory, I realize I still struggle with some limitations in ReFrame regarding requesting extra resources / checking if sufficient resources are present. I have some ideas on improving portability when it comes to extra resources (memory, GPUs, etc.). I think it would be good if ReFrame standardized some resources.
Essentially, there are two components:
- I might need to add extra flags to my scheduler to get extra resources (GPU, memory).
- I might want to programmatically check from a test if certain resources are available.
Right now, for the first, I could define an extra_resource, but the part I don't like about that from a portability point of view is that the names of extra resources are free text fields. I.e. in the example in the docs, it's called 'name': 'memory'. That means you create a tight relation between the test and the associated config file: both need to agree that this resource is called memory (and not e.g. mem or something else). That means that if I write a (portable) test suite that uses memory as an extra resource, I have to instruct all its users that they have to define a resource in their config file with that exact name.
For the 2nd point, I'd like to have something similar to the processor object, which describes what is present in that particular partition in terms of hardware. E.g. simply a memory_per_node item that describes the maximum amount of memory that is present per node. For GPUs, I now use devices, but it has the same issue: device names are free text, and thus it creates a tight relation between the devices named in the test and in the config file. I circumvent this by isolating this in the `utils` and `hooks`.
Steps to be taken in the ReFrame test (consolidated as a shell sketch after this list)
- Download Conus benchmark dataset from http://www2.mmm.ucar.edu/wrf/bench/conus12km_v3911/bench_12km.tar.bz2
- create mirrors? above link is not too fast....
- or even host some benchmark datasets in cvmfs?
- mkdir wrf-workdir && cd wrf-workdir
- ln -s $(dirname $(which wrf.exe))/../run/* .
- rm namelist.input
- ln -s bench_12km/* .
- ml load WRF
- mpirun wrf.exe
- on single node, 16 cores -> 2m44s
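The manual steps above, consolidated into a single shell sketch; it assumes the benchmark tarball extracts into bench_12km/ and that WRF is already installed on top of EESSI as described earlier:
# consolidated sketch of the manual WRF benchmark run described above
source /cvmfs/pilot.eessi-hpc.org/2021.06/init/bash
module load WRF
mkdir wrf-workdir && cd wrf-workdir
wget http://www2.mmm.ucar.edu/wrf/bench/conus12km_v3911/bench_12km.tar.bz2
tar xjf bench_12km.tar.bz2
ln -s $(dirname $(which wrf.exe))/../run/* .
rm namelist.input
ln -s bench_12km/* .
mpirun wrf.exe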
- On the Magic Castle hackathon login node: when I run `reframe --list-tags` I get the message `/project/def-sponsor00/easybuild/software/ReFrame/3.9.2/bin/reframe: check path '/project/60005/easybuild/software/ReFrame/3.9.2/lib/python3.9/site-packages/checks' does not exist`
- Where does this come from? Can/should I change this?
- You can set the ReFrame search path where it searches for tests using the `-c` argument. Just point it to the dir where you are developing tests.
- Use `extra_resources` to get the GROMACS test to ask for extra memory? Or should we just instruct people to use `--mem` in the `access` item of the ReFrame config to ask for the maximum amount of memory?
- The first is pretty tricky: we'd need to check for every use case how much memory is needed (and it potentially varies with node count).
- Probably go for the option of adding `--mem=<max_available>` to the `access` config item for now
export CHECKOUT_PREFIX=~/eessi-testsuite
mkdir -p $CHECKOUT_PREFIX
# Checkout relevant git repo's
cd $CHECKOUT_PREFIX
git clone https://github.com/casparvl/software-layer.git
git clone https://github.com/eth-cscs/reframe.git
cd reframe
git fetch --all --tags
git checkout tags/v3.9.2
cd ..
cd software-layer/tests/reframe
git checkout gromacs_cscs
# Note: PYTHONPATH needs to be set to find the hpctestlib that comes with ReFrame, as well as eessi-utils/hooks.py and eessi-utils/utils.py
export PYTHONPATH=$PYTHONPATH:$CHECKOUT_PREFIX/reframe/:$CHECKOUT_PREFIX/software-layer/tests/reframe/
# Demonstrating selection of tests:
# List tests to be run in CI on build node (only smallest GROMACS test case, single node):
reframe -C config/settings_magic_castle.py -c eessi-checks/applications/ -l -t CI -t singlenode
# List tests to be run in monitoring
reframe -C config/settings_magic_castle.py -c eessi-checks/applications/ -l -t monitoring
# Run actual tests
reframe -C config/settings_magic_castle.py -c eessi-checks/applications/ -r -t CI -t singlenode --performance-report
- Milestone 1: get familiar with the env
- git repo
- aws cluster
- enough disk space
- Milestone 2: explore some ideas
- make local copy of 2021.06 ... observations:
- the cvmfs-to-CitC copy is kinda slow (26h); there's potential cache pollution from this action
- plenty of files in compat layer only readable to root and cvmfs; need to understand if this makes root privs necessary to perform this action
- CVMFS_HIDE_MAGIC_XATTRS issue with cvmfs 2.9.0 ... I rolled back to 2.8.2
- examine directory structure and variant symlinks
- idea: this should allow the user to set a single env var to point variant symlinks to the tree of choice, even if this tree is outside of the /cvmfs hierarchy
- need issue 32 finalized and in place to try this
- see if we can get binaries in local copy to work as expected
- see how to run compat layer test suite on local copy
- Milestone 3: script this "make local copy" process
- Milestone 4: explore archival/restore of local copy
- figure out if anything more than just tar/untar is needed
- Milestone 5: script this archival/restore
- Milestone 6: explore creation of a container with a local copy of EESSI
- figure out what to base it on - minimal el8?
- figure out to what degree this makes sense to be scripted
- Milestone 7: if archiving a whole EESSI version turns out to be unfeasible, explore making a local copy of just a specific module and its dependencies (see the sketch after this list)
- module load + env | grep EBROOT and copy only those + the compat layer
- observations:
- tar of local copy of eessi is slow too - maybe the underlying aws fs is also not too happy with small files
- copy+tar of x86_64 compat+foss is 131min
- almost 2GB of stuff in compat layer /var can be ignored, which brings us down to 40min for foss and under 1h for bioconductor
- resulting tarball sizes are 1.5GB for foss, 1.8GB for Gromacs and 8.2GB for bioconductor
- Milestone 8: wrap that up in a container
- we can possibly adopt some of Jörg's scripts
- I picked latest centos8 as a base
- env vars handling is a big todo
- naming of resulting image also needs to be done better
- tar/untar is there because this script was developed on two systems, can be dropped if everything is available on the same system
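A rough sketch of the Milestone 7 idea (copy only what a loaded module needs, plus the compat layer); the target directory, the example module, and the use of rsync to skip the compat layer's /var are assumptions:
# copy only the installation directories referenced by a loaded module (GROMACS as an example)
export TARGET=/tmp/eessi-partial-copy
source /cvmfs/pilot.eessi-hpc.org/2021.06/init/bash
module load GROMACS
for root in $(env | grep '^EBROOT' | cut -d= -f2); do
    mkdir -p "${TARGET}${root%/*}"
    cp -a "${root}" "${TARGET}${root%/*}/"
done
# the compat layer also needs to be copied (skipping its /var, as noted above)
mkdir -p "${TARGET}/cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64"
rsync -a --exclude='/var' /cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64/ \
    "${TARGET}/cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64/"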
Question from my boss: Can we assign something like a DOI to these containers?
- Every Stratum1 gets its own installation of prometheus and grafana.
- If the S1 is public, open ports so monitoring.eessi-infra.org can fetch the prometheus data
https://github.com/cloudalchemy/ansible-prometheus https://github.com/cloudalchemy/ansible-node-exporter https://github.com/cloudalchemy/ansible-grafana
Add https://gitlab.cern.ch/cloud/cvmfs-prometheus-exporter with the accompanying Grafana dashboard.
- Create ansible playbook that installs prometheus, node exporter, and grafana, ensure that they listen to localhost only (see URLs above)
- Extend same ansible playbook to include CVMFS prometheus exporter
- Install default grafana dashboard (copy the json file to /var/lib/dashboards on the server). See https://github.com/cloudalchemy/ansible-grafana/blob/master/defaults/main.yml
- Write something smart about alerts from each S1
- Open ports for monitoring.eessi-infra.org (we can fix the local firewall, https://docs.ansible.com/ansible/latest/collections/ansible/posix/firewalld_module.html, but what about the rules for ACLs for the node itself? see the example after this list)
- Install grafana on monitoring.eessi-infra.org and make some pretty dashboards and alerts. For multiple data sources in a single dashboard, see https://stackoverflow.com/questions/63349357/how-to-configure-a-grafana-dashboard-for-multiple-prometheus-datasources
- Do we need https://prometheus.io/docs/prometheus/latest/federation/ on our monitoring.eessi-infra.org?
- Auth and TLS for the services?
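For the firewall part, a minimal example of what the rule could look like on a Stratum 1; the monitoring server IP and the Prometheus port are placeholders:
# allow only the monitoring server to reach the local Prometheus (port 9090); IP is a placeholder
sudo firewall-cmd --permanent --zone=public \
  --add-rich-rule='rule family="ipv4" source address="192.0.2.10/32" port protocol="tcp" port="9090" accept'
sudo firewall-cmd --reload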
Decouple the ansible roles for stratum0s, stratum1s, clients, and proxies from filesystem-layer into galaxy repos (hosted on github or wherever). For multi-role support in the same repo, see https://github.com/ansible/ansible/issues/16804
Purposes
- Faster access to EESSI software stack / alleviate access to further-away servers
- Offline access through private network
Resources:
- Existing Stratum 1 Ansible script: https://github.com/EESSI/filesystem-layer/blob/main/stratum1.yml
Link to CVMFS workshop: https://cvmfs-contrib.github.io/cvmfs-tutorial-2021/
- Take a run through the current instructions to setup S1, and document anything site-specific that needs to be set
- Check how to define a specific directory to download the S0 snapshot to
- Test whether a client outside the private S1's allowed IP range is blocked from accessing the private S1 (result: the S1 is accessible by any client)
- Test offline usage of private S1
- Install Ansible & the required Ansible roles [1] (see the example after this list)
- Apply for a GeoIP license [2]
- Set IP ranges for clients accessing this S1 in `local_site_specific_vars.yml`
- Set the IP of the S1 as `hosts` in `local_site_specific_vars.yml`
- Execute the `stratum1.yml` playbook with `ansible-playbook -b -e @inventory/local_site_specific_vars.yml stratum1.yml` (takes several hours -> downloads ~65GB at the time of writing)
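For the first step, installing the required roles could look like this, assuming the requirements file from [1] is taken from a local clone of the filesystem-layer repo (depending on its contents, an `ansible-galaxy collection install -r requirements.yml` may be needed as well):
# install Ansible and the roles needed by the EESSI filesystem-layer playbooks
pip3 install --user ansible
git clone https://github.com/EESSI/filesystem-layer.git
cd filesystem-layer
ansible-galaxy role install -r requirements.yml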
[1] https://github.com/EESSI/filesystem-layer/blob/main/requirements.yml [2] https://www.maxmind.com/en/geolite2/signup/
Commands to be run after following the instructions here: https://github.com/EESSI/filesystem-layer#clients
echo 'CVMFS_SERVER_URL="http://<S1_IP>/cvmfs/@fqrn@;$CVMFS_SERVER_URL"' | sudo tee -a /etc/cvmfs/domain.d/eessi-hpc.org.local
export CVMFS_SERVER_URL="http://<S1_IP>/cvmfs/@fqrn@;$CVMFS_SERVER_URL"
sudo cvmfs_config reload -c pilot.eessi-hpc.org
- CVMFS allows setting the directory to which a repository is mounted (by default: `/srv/cvmfs`). This should be exposed in the EESSI config too. Maybe it is already?
- I wasn't able to run the Ansible script as localhost on the S1 directly; I get an SSH error (it shouldn't SSH when the host is localhost)
- `sudo cvmfs_server check pilot.eessi-hpc.org`: verifies the repository content on the S1
- `curl --head http://<S1_IP>/cvmfs/pilot.eessi-hpc.org/.cvmfspublished`: check the connection to the S1 with IP S1_IP
- `cvmfs_config stat -v pilot.eessi-hpc.org`: check which S1 a client uses to connect to
- `cvmfs_config showconfig pilot.eessi-hpc.org`: show the configuration used by a client (to make sure a local config file is picked up correctly)
- `sudo cvmfs_config killall`: kill all CVMFS processes (requires an `ls <path_to_repo>` to remount the repository under /cvmfs)
- `sudo cvmfs_config reload -c pilot.eessi-hpc.org`: force-reload the client configuration
- `curl "http://S1_IP/cvmfs/pilot.eessi-hpc.org/api/v1.0/geo/CLIENT_IP/S1_IP,aws-eu-west1.stratum1.cvmfs.eessi-infra.org,azure-us-east1.stratum1.cvmfs.eessi-infra.org,bgo-no.stratum1.cvmfs.eessi-infra.org,rug-nl.stratum1.cvmfs.eessi-infra.org"`: returns a list of indices (e.g. 1, 5, 4, 3, 2), which ranks the servers (after CLIENT_IP/) from closest to farthest
- If you get `"Reload FAILED! CernVM-FS mountpoints unusable. pilot.eessi-hpc.org: Failed to initialize root file catalog (16)"`, it probably means your CVMFS_SERVER_URL is not set properly. It should end up in the form `"http://<S1_IP>/cvmfs/pilot.eessi-hpc.org"` if you run `cvmfs_config stat -v pilot.eessi-hpc.org`
- Can we download the snapshot onto an external (persistent) disk, and can that disk be attached to another S1 when the current S1 fails? -> avoids re-downloading the stack