Sync meeting on EESSI test suite (2023-06-28)
- every 2 weeks on Thursday at 14:00 CE(S)T
- next meetings:
- Thu 13 July 14:00 => Kenneth/Sam are on summer vacation, OK for Lara/Caspar/Satish
- Thu 27 July 14:00 => Kenneth is on summer vacation, OK for Lara/Caspar/Satish/Sam(maybe)
- Wed 9 Aug 14:00
- Fri 25 Aug 10:00
- Wed 6 Sept 10:30
- Wed 20 Sept 14:00
- notes of previous meetings:
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-06-15)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-05-31)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-05-17)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-04-20)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-03-30)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-03-10) (incl. 2023-02-23)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-02-09)
- check on progress towards v0.1 release - https://github.com/EESSI/test-suite/milestone/2
- open PRs
- filter `valid_systems` on `gpu_vendor` and extras (PR #60) - only small changes needed
- configuration files for
- AWS CitC Slurm cluster (PR #53)
- `--constraint=shape` was essentially ignored when using a space rather than `=` to set the value! - bug in ReFrame?
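A minimal sketch of the two forms, assuming the option ends up as a directive in the job script that ReFrame generates (the `shape` constraint value comes from the notes above):

```bash
# Both forms are valid for sbatch itself; per the notes, the space-separated
# form was effectively ignored when passed through ReFrame:
#SBATCH --constraint=shape    # value set with '=': picked up as expected
#SBATCH --constraint shape    # value set with a space: essentially ignored
```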
- some trouble with CPU auto-detect
- worked at some point, but then not anymore?!
- can access a worker node of a running job by using `srun --overlap --jobid=JOBID --pty bash -l`
- need to temporarily use `srun` as launcher in the config file to run CPU auto-detect - cfr. https://github.com/reframe-hpc/reframe/issues/2926 (see sketch below)
- auto-detect gives way more info than just CPU arch, like whether hyperthreading is enabled or not
- the `pip install --upgrade pip` that is run in the job script submitted by ReFrame to do CPU auto-detect hangs
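A minimal sketch of the debugging workflow from the items above; the job ID is a placeholder:

```bash
# Attach an interactive shell to a node of an already running job:
srun --overlap --jobid=123456 --pty bash -l

# On that node, ReFrame can print the topology it auto-detects
# (sockets, cores, caches, hyperthreading, ...) without submitting anything:
reframe --detect-host-topology
```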
- Vega (EuroHPC) (PR #62)
- requires `export OMPI_MCA_pml=ucx` for GROMACS due to old `foss` toolchain + InfiniBand
- does `srun` work on Vega as launcher?
- Caspar is running the EESSI test suite with GROMACS every week on Vega
- => change to daily frequency, only with EESSI
- => run for 1+2+4+16 nodes
- Caspar's script to run the test suite as a cron job on Vega:

```bash
#!/bin/bash

# This dir is only available on request, see https://en-doc.vega.izum.si/mountpoints/
#TEMPDIR=/ceph/hpc/scratch/user/${USER}
TEMPDIR=$(mktemp --directory --tmpdir=/tmp -t rfm.XXXXXXXXXX)

# Create virtualenv for ReFrame using system python
python3 -m venv $TEMPDIR/reframe_421
source $TEMPDIR/reframe_421/bin/activate
python3 -m pip install reframe-hpc==4.2.1

# Clone reframe repo to have the hpctestlib:
git clone git@github.com:reframe-hpc/reframe.git --branch v4.2.1 $TEMPDIR/reframe
export PYTHONPATH=$PYTHONPATH:$TEMPDIR/reframe

# Clone test suite repo
git clone git@github.com:EESSI/test-suite.git $TEMPDIR/test-suite
export PYTHONPATH=$PYTHONPATH:$TEMPDIR/test-suite/

# Start the EESSI environment
unset MODULEPATH
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash

# Needed in order to make sure the reframe from our TEMPDIR is first on the PATH,
# prior to the one shipped with the 2021.12 compat layer.
# Probably no longer needed with newer compat layer that doesn't include ReFrame.
deactivate
source $TEMPDIR/reframe_421/bin/activate

# Run ReFrame
cd
echo "PYTHONPATH: $PYTHONPATH"
#reframe -C $TEMPDIR/test-suite/config/izum_vega.py -c $TEMPDIR/test-suite/eessi/testsuite/tests/apps/ -R -t CI -t "1_node|2_nodes" -l
# As long as the config isn't upstreamed yet, we take it from my local checkout
reframe -C test-suite/config/izum_vega.py -c test-suite/eessi/testsuite/tests/apps/ -R -t CI -t "1_node|2_nodes|4_nodes|8_nodes|16_nodes" -r

# Cleanup
rm -rf ${TEMPDIR}
```

This script is then wrapped in:

```bash
#!/bin/bash

# logfile
mkdir -p ~/rfm_weekly_logs
datestamp=$(date +%Y%m%d_%H%M%S)
LOGFILE=~/rfm_weekly_logs/rfm_weekly_${datestamp}.log
touch $LOGFILE

~/run_reframe.sh > $LOGFILE 2>&1
```
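A hypothetical crontab entry for the daily frequency agreed on above, assuming the wrapper is saved as `~/rfm_wrapper.sh` (name and time of day are illustrative):

```bash
# m h dom mon dow  command
0 2 * * * $HOME/rfm_wrapper.sh
```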
- TensorFlow (WIP) (PR #38)
- TensorFlow module in EESSI is too old (missing stuff in API for Caspar's test, requires TensorFlow >= 2.4)
- progress on binding issues
- separate hook for process binding vs thread binding
- no thread binding used for TensorFlow (unclear threading mechanism)
- problems on Vega due to hyperthreading
- we want to run one task per physical core (so bind to core)
- need to ask Slurm for 256 cores (2 tasks - one per socket, 128 cores per task)
- but we want to only use 64 cores per task
- running one task per virtual core results in low performance
- PR to add TensorFlow test should be separate from controlling threads per task and core binding
- Sam: try `srun --hint=nomultithread` or `#SBATCH --ntasks-per-core=1` - only relates to binding? (see sketch below)
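A sketch of the job geometry discussed above, assuming a Vega-like node (2 sockets x 64 physical cores, hyperthreading enabled); the test script name is hypothetical:

```bash
#!/bin/bash
#SBATCH --ntasks-per-node=2     # one task per socket
#SBATCH --cpus-per-task=64      # only the 64 physical cores per socket
#SBATCH --hint=nomultithread    # Sam's suggestion: don't schedule on hyperthreads

# bind each task to physical cores instead of letting it float across hyperthreads
srun --cpu-bind=cores python tensorflow_test.py   # hypothetical test script
```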
- OSU Microbenchmarks (PR #54)
- now using separate tests for point-to-point (2 scales) vs collectives
- generalising is hard if we can't rely on Slurm (cfr. Karolina which has OpenPBS)
- controlling number of switches requires detailed knowledge of the system
- can support it as a feature in the configuration file, and change submit options if supported (sketch below)
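For illustration, on Slurm the number of leaf switches a job spans can be requested at submit time; this is the kind of submit option that would only be set when the site's configuration file declares support for it (values and job script name are illustrative):

```bash
# place the 2-node job within a single leaf switch, waiting up to 60 minutes
# for such an allocation before accepting any placement:
sbatch --nodes=2 --switches=1@60 osu_pt2pt_job.sh
```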
- Satish is looking into adding ReFrame as a module to EESSI 2023.06