Sync meeting on EESSI Software layer (2023-10-03)

EESSI software layer sync meeting

planning

  • next meeting
    • Tue 10 Oct at 09:00 CEST

Meeting (2023-10-03)

  • attending: Kenneth, Lara, Bob, Julián, Pedro, Thomas, Richard, Caspar, Alan
  • status update eessi.io
    • Stratum-0 is set up at RUG
    • single Stratum-1 running in AWS (using S3 backend)
      • test setup
      • required lots of manual work (creating the VM + S3 bucket) because Atlantis wasn't working
      • Ansible playbooks sort of worked, but do not support S3 buckets yet
      • see WIP filesystem-layer PR #160 (and the setup sketch after this section)
      • GeoAPI doesn't work well with S3 buckets, since clients go straight to the S3 bucket
    • need to figure out:
      • how many Stratum-1s do we want (initially)?
        • currently we have 4 for eessi-hpc.org (AWS, Azure, RUG in NL, BGO in Norway)
      • how to deal with S3 buckets vs GeoAPI
      • who should have admin access?
      • DNS
      • using CDN (CloudFlare)?
    • sync meeting being planned
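
    A minimal sketch of what the manual S3-backed replica setup involves, assuming a standard cvmfs-server installation; the bucket name, region, keys, and hostnames below are placeholders, not the actual EESSI values:

        # /etc/cvmfs/s3.conf -- S3 backend settings for the Stratum-1 (placeholder values)
        CVMFS_S3_HOST=s3.amazonaws.com
        CVMFS_S3_REGION=eu-west-1
        CVMFS_S3_BUCKET=eessi-stratum1-example
        CVMFS_S3_ACCESS_KEY=<access key>
        CVMFS_S3_SECRET_KEY=<secret key>

        # create the replica, with its data stored in the S3 bucket; clients then
        # fetch straight from the bucket endpoint, which is why the GeoAPI is bypassed
        cvmfs_server add-replica -o $(whoami) -s /etc/cvmfs/s3.conf \
            -w http://eessi-stratum1-example.s3.amazonaws.com \
            http://<stratum-0>/cvmfs/software.eessi.io /etc/cvmfs/keys/eessi.io.pub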
  • bot
    • merged PRs:
      • add shared_fs_path configuration setting (PR #214; see the config sketch after this section)
      • README updated (PR #215)
    • v0.1.0 released: https://github.com/EESSI/eessi-bot-software-layer/releases/tag/v0.1.0
    • develop branch
      • for active development (PRs)
      • main branch always corresponds to latest release
    • open PRs:
      • script to clean up tarballs of jobs given a PR number (PR #217; see the sketch after this section)
        • can let the bot use this when a PR is merged/closed
        • only cleans up large "checkpoint" tarballs for now; should eventually clean up everything related to a PR?
    • next steps
      • test step in between build & deploy
      • make deploy step agnostic of EESSI
    • new Slurm clusters for bot
      • new Slurm clusters are being set up with Magic Castle
        • in AWS: set up, need to test bot there
          • will (very) soon replace current CitC cluster...
          • next steps
            • create more accounts
            • increase disk space to a couple of TBs (no EFS used there)
        • in Azure: still to be set up, need to figure out account/API stuff
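
    To make the shared_fs_path setting from PR #214 concrete: a sketch of the relevant snippet of the bot's app.cfg (the section name and the path are assumptions for illustration, not copied from the actual configuration):

        [buildenv]
        # filesystem location shared between the bot host and the compute nodes,
        # so that job directories are visible on both ends (hypothetical value)
        shared_fs_path = /shared/eessi-bot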
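    A rough sketch of what the cleanup script from PR #217 does, not the actual implementation; the JOBS_BASE_DIR location and the pr_<number> directory layout are assumptions:

        #!/usr/bin/env python3
        # clean_up_pr.py -- remove large job tarballs for a given PR (sketch)
        import glob
        import os
        import sys

        JOBS_BASE_DIR = "/shared/eessi-bot/jobs"  # hypothetical location

        def clean_up_tarballs(pr_number):
            """Remove 'checkpoint' tarballs of all jobs tied to a PR."""
            pattern = os.path.join(JOBS_BASE_DIR, f"pr_{pr_number}", "**", "*.tar.gz")
            for tarball in glob.glob(pattern, recursive=True):
                print(f"removing {tarball} ({os.path.getsize(tarball)} bytes)")
                os.remove(tarball)

        if __name__ == "__main__":
            clean_up_tarballs(int(sys.argv[1]))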
  • software-layer
    • merged PRs
      • foss/2023a (PR #334)
      • foss/2022a (PR #310)
      • R v4.1.0 w/ foss/2021a (PR #328)
      • add YAML file to keep track of known issues in EESSI pilot 2023.06 (PR #340)
      • only increase limit for numerical test failures for OpenBLAS for aarch64/neoverse_v1 (PR #345)
    • open PRs
      • TensorFlow
        • TensorFlow v2.7.1 with foss/2021b (PR #321)
          • several test failures on aarch64/* targets
          • may be fixable by backporting a couple of patches, but maybe not worth the trouble?
        • TensorFlow v2.8.4 with foss/2021b (PR #343)
          • assembler errors on aarch64/* when building XNNPACK
            • due to use of -mcpu=native, which clashes with the custom -march=... options used by the XNNPACK build procedure
            • see also easyconfigs issue #18899
            • should be fixed by making sure that -mcpu=... is not used when building XNNPACK, see easyblocks PR #3011 (and the sketch after this section)
        • TensorFlow v2.11.0 with foss/2022a (PR #346)
        • TensorFlow v2.13.0 with foss/2022b (PR #347)
          • 928 failing scipy tests on aarch64/neoverse_v1...
          • build error on x86_64/intel/haswell because /usr/include/stdio.h is picked up
            • need to set $TF_SYSROOT? (see the sketch after this section)
      • matplotlib v3.4.3 with foss/2021b (PR #339)
      • ESPResSo
        • with foss/2021a (PR #332)
        • with foss/2022a (PR #331)
          • wrong Python installation is picked up
      • WRF
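
    To make the XNNPACK flag clash concrete: a sketch of the kind of fix applied in easyblocks PR #3011 (illustrative only, not the actual patch), filtering the -mcpu=... option out of the compiler flags before XNNPACK is built so its own per-object -march=... options can take effect:

        # a global -mcpu=native (used for aarch64/* optimization) conflicts with the
        # -march=... flags that XNNPACK's build procedure sets itself, so drop it
        export CFLAGS="${CFLAGS/-mcpu=native/}"
        export CXXFLAGS="${CXXFLAGS/-mcpu=native/}"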
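    On the $TF_SYSROOT question for PR #347: TensorFlow's configure step supports a TF_SYSROOT environment variable that tells Bazel which sysroot to resolve system headers from; pointing it at the EESSI compatibility layer should keep the host's /usr/include/stdio.h out of the build (the path below follows the pilot repo layout, but whether this is the right fix is still to be confirmed):

        # hypothetical: make Bazel pick up system headers from the compat layer
        # rather than the host's /usr/include
        export TF_SYSROOT=/cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/$(uname -m)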
  • notes
    • should add a "missing" YAML file (like for old TensorFlow versions on aarch64/*)
    • next packages
      • OpenFOAM
      • newer R
      • Bioconductor
      • AlphaFold (GPU)
    • GPU
      • we should set up a meeting to figure out the right steps...
      • plan is to look into supporting GPUs in software.eessi.io CVMFS repo
      • is ldconfig OK with non-existing paths (to system paths)? also, order matters
        • Apptainer also uses ldconfig to figure out paths to required libraries
      • CUDA compat libs (could be avoided, only needed as a fallback)
      • last location: Apptainer libs
      • first step should be to get it working on the assumption that the GPU driver is sufficiently recent (see the sketch below)
      • Alan will look into planning a sync meeting on GPU support
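
      A minimal sketch of that first step, checking whether the host GPU driver is recent enough before exposing CUDA software from the repo (the nvidia-smi query is standard; the minimum version and the compat-libs fallback are assumptions about the eventual design):

          driver_version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
          min_version=520.61.05  # hypothetical minimum for the targeted CUDA version
          if [ "$(printf '%s\n' "$min_version" "$driver_version" | sort -V | head -n 1)" = "$min_version" ]; then
              echo "driver $driver_version is recent enough"
          else
              echo "driver too old; would need CUDA compat libs as a fallback"
          fi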
