
Sync meeting on EESSI test suite (2023 06 28)


EESSI test suite sync meetings

Planning

  • every 2 weeks on Thursday at 14:00 CE(S)T
  • next meetings:
    • Thu 13 July 14:00 => Kenneth/Sam are on summer vacation, OK for Lara/Caspar/Satish
    • Thu 27 July 14:00 => Kenneth is on summer vacation, OK for Lara/Caspar/Satish/Sam (maybe)
    • Wed 9 Aug 14:00
    • Fri 25 Aug 10:00
    • Wed 6 Sept 10:30
    • Wed 20 Sept 14:00

Previous meetings

Notes for 2023-06-28

  • check on progress towards v0.1 release - https://github.com/EESSI/test-suite/milestone/2
  • open PRs
    • filter valid_systems on gpu_vendor extras (PR #60)
      • only small changes needed
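      • for reference, a generic sketch of ReFrame's feature/extras constraint syntax for valid_systems, which lets a test select only partitions whose configuration declares a matching gpu_vendor in their extras; this is not necessarily how PR #60 implements the filtering, and the class name, executable and 'NVIDIA' value are made up for illustration:
      import reframe as rfm
      import reframe.utility.sanity as sn


      @rfm.simple_test
      class GpuVendorFilterExample(rfm.RunOnlyRegressionTest):
          # only partitions whose 'extras' in the site configuration contain
          # gpu_vendor == 'NVIDIA' will be selected for this test
          valid_systems = ['%gpu_vendor=NVIDIA']
          valid_prog_environs = ['builtin']
          executable = 'nvidia-smi'
          sanity_patterns = sn.assert_found(r'NVIDIA-SMI', 'stdout')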
    • configuration files for
      • AWS CitC Slurm cluster (PR #53)
        • --constraint=shape was essentially ignored when using space rather than = to set the value!
          • bug in ReFrame?
        • some trouble with CPU auto-detect
          • worked at some point, but then not anymore?!
          • can access workernode of running job by using srun --overlap --jobid=JOBID --pty bash -l
          • need to temporarily use srun as launcher in config file to run CPU autodetect - cfr. https://github.com/reframe-hpc/reframe/issues/2926 (see the config sketch below)
        • auto-detect gives way more info than just CPU arch, like whether hyperthreading is enabled or not
        • the pip install --upgrade pip that is run in the job script submitted by ReFrame to do CPU auto-detection hangs
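        • a minimal sketch of the relevant part of a ReFrame site configuration for the srun-launcher workaround mentioned above; the system/partition names and hostnames are placeholders, not the actual PR #53 config:
        # temporary config snippet to get remote CPU auto-detection working;
        # switch 'launcher' back to the real launcher afterwards
        site_configuration = {
            'systems': [
                {
                    'name': 'citc',
                    'descr': 'AWS CitC Slurm cluster',
                    'hostnames': ['.*'],
                    'partitions': [
                        {
                            'name': 'cpu',
                            'scheduler': 'slurm',
                            # use srun (rather than e.g. mpirun) so that the job script
                            # ReFrame submits for CPU auto-detection runs correctly,
                            # cfr. https://github.com/reframe-hpc/reframe/issues/2926
                            'launcher': 'srun',
                            'environs': ['builtin'],
                        }
                    ],
                }
            ],
            'general': [
                {
                    # also auto-detect the CPU topology of remote partitions
                    'remote_detect': True,
                }
            ],
        }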
      • Vega (EuroHPC) (PR #62)
        • requires export OMPI_MCA_pml=ucx for GROMACS due to old foss toolchain + Infiniband
        • does srun work on Vega as launcher?
        • Caspar is running the EESSI test suite with GROMACS every week on Vega
          • => change to daily frequency, only with EESSI
          • => run for 1+2+4+16 nodes
          • Caspar's script to run the test suite as a cron job on Vega:
          #!/bin/bash
          
          # This dir is only available on request, see https://en-doc.vega.izum.si/mountpoints/
          #TEMPDIR=/ceph/hpc/scratch/user/${USER}
          TEMPDIR=$(mktemp --directory --tmpdir=/tmp  -t rfm.XXXXXXXXXX)
          
          # Create virtualenv for ReFrame using system python
          python3 -m venv $TEMPDIR/reframe_421
          source $TEMPDIR/reframe_421/bin/activate
          python3 -m pip install reframe-hpc==4.2.1
          
          # Clone reframe repo to have the hpctestlib:
          git clone git@github.com:reframe-hpc/reframe.git --branch v4.2.1 $TEMPDIR/reframe
          export PYTHONPATH=$PYTHONPATH:$TEMPDIR/reframe
          
          # Clone test suite repo
          git clone git@github.com:EESSI/test-suite.git $TEMPDIR/test-suite
          export PYTHONPATH=$PYTHONPATH:$TEMPDIR/test-suite/
          
          # Start the EESSI environment
          unset MODULEPATH
          source /cvmfs/pilot.eessi-hpc.org/latest/init/bash
          
          # Needed in order to make sure the reframe from our TEMPDIR is first on the PATH,
          # prior to the one shipped with the 2021.12 compat layer
          # Probably no longer needed with newer compat layer that doesn't include ReFrame
          deactivate
          source $TEMPDIR/reframe_421/bin/activate
          
          # Run ReFrame
          cd
          echo "PYTHONPATH: $PYTHONPATH"
          #reframe -C $TEMPDIR/test-suite/config/izum_vega.py -c $TEMPDIR/test-suite/eessi/testsuite/tests/apps/ -R -t CI -t "1_node|2_nodes" -l
          # As long as the config isn't upstreamed yet, we take it from my local checkout
          reframe -C test-suite/config/izum_vega.py -c test-suite/eessi/testsuite/tests/apps/ -R -t CI -t "1_node|2_nodes|4_nodes|8_nodes|16_nodes" -r
          
          # Cleanup
          rm -rf ${TEMPDIR}
          
          This script is then wrapped in a small wrapper script that redirects all output to a timestamped log file:
          #!/bin/bash
          
          # logfile
          mkdir -p ~/rfm_weekly_logs
          
          datestamp=$(date +%Y%m%d_%H%M%S)
          LOGFILE=~/rfm_weekly_logs/rfm_weekly_${datestamp}.log
          touch $LOGFILE
          
          ~/run_reframe.sh > $LOGFILE 2>&1
          
    • TensorFlow (WIP) (PR #38)
      • TensorFlow module in EESSI is too old (Caspar's test needs API functionality that is only available in TensorFlow >= 2.4)
      • progress on binding issues
        • separate hook for process binding vs thread binding
          • no thread binding used for TensorFlow (unclear threading mechanism)
        • problems on Vega due to hyperthreading
          • we want to run one task per physical core (so bind to core)
          • need to ask Slurm for 256 cores (2 tasks - one per socket, 128 cores per task)
            • but we want to only use 64 cores per task
          • running one task per virtual core results in low performance
        • PR to add TensorFlow test should be separate from controlling threads per task and core binding
        • Sam: try srun --hint=nomultithread or #SBATCH --ntasks-per-core=1
          • only relates to binding?
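        • a minimal sketch of how those suggestions could be applied as a ReFrame test hook (not part of PR #38; the class, executable and task/core counts are made up, assuming a node with 2 sockets of 64 physical cores like Vega):
        import reframe as rfm
        import reframe.utility.sanity as sn


        @rfm.simple_test
        class PhysicalCoreBindingExample(rfm.RunOnlyRegressionTest):
            valid_systems = ['*']
            valid_prog_environs = ['builtin']
            executable = 'true'
            num_tasks = 2            # one task per socket
            num_cpus_per_task = 64   # physical cores per socket, not hardware threads
            sanity_patterns = sn.assert_true(True)

            @run_before('run')
            def request_physical_cores_only(self):
                # extra batch options: only schedule one hardware thread per physical core
                self.job.options += ['--hint=nomultithread']
                # alternative suggestion from the meeting:
                # self.job.options += ['--ntasks-per-core=1']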
    • OSU Microbenchmarks (PR #54)
      • now using separate tests for point-to-point (2 scales) vs collectives
      • generalising is hard if we can't rely on Slurm (cfr. Karolina which has OpenPBS)
      • controlling number of switches requires detailed knowledge of the system
        • can support it as a feature in configuration file, and change submit options if supported
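      • a sketch of that idea as a ReFrame hook, assuming a (made-up) 'slurm_switches' entry in the partition's 'features' list in the site configuration; the test itself is a dummy:
      import reframe as rfm
      import reframe.utility.sanity as sn


      @rfm.simple_test
      class SwitchAwareExample(rfm.RunOnlyRegressionTest):
          valid_systems = ['*']
          valid_prog_environs = ['builtin']
          executable = 'true'
          num_tasks = 2
          sanity_patterns = sn.assert_true(True)

          @run_before('run')
          def limit_switch_count(self):
              # only pass the (Slurm-specific) --switches option on systems whose
              # configuration declares that it is supported
              if 'slurm_switches' in self.current_partition.features:
                  self.job.options += ['--switches=1']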
    • Satish is looking into adding ReFrame as a module to EESSI 2023.06