meeting 2024 04 04

Kenneth Hoste edited this page Jun 6, 2024 · 3 revisions

Notes for 2024-04-04 meeting

  • date & time: Thu 4 Apr 2024 - 14:00 CEST (12:00 UTC)
    • (every first Thursday of the month)
  • venue: (online, see mail for meeting link, or ask in Slack)
  • agenda:
    • Quick introduction by new people
    • EESSI-related meetings and events in last month
    • Progress update per EESSI layer
    • Update on EESSI production repository software.eessi.io
    • Update on EESSI test suite + build-and-deploy bot
    • EESSI support portal
    • AWS/Azure sponsorship update
    • Update on MultiXscale EuroHPC project
    • Upcoming/recent events: EuroHPC Summit + EasyBuild User Meeting 2024 + ISC’24
    • Q&A

Slides

Meeting notes

(by Bob, Kenneth)

Quick introduction by new people

  • Craig Gross: Research Consultant, Michigan State University
    • making EESSI available on a very heterogeneous cluster (incl. Grace Hopper, using about five EESSI CPU targets)
    • rebuilding everything on top of EESSI

EESSI-related meetings in last month

(see slides)

  • "CernVM-FS + EESSI tutorial for EuroHPC hosting entities" is a nice getting-started tutorial for people who want to make EESSI available on their cluster/infrastructure
  • worth looking into making oneAPI available

Progress update per EESSI layer

Filesystem layer

(see slides)

  • maybe an opportunity to work together with Canadian CernVM-FS experts on a public Ansible role for CernVM-FS clients/servers/proxies?
  • (Hugo) is exposing EESSI directly through S3/Azure blob possible, or is a CernVM-FS "frontend" required?
    • 50-60 regions in Azure, would be nice to only do this via blob
    • client can directly access repository contents via S3/Azure blob
    • would be good to document this use case
  • riscv.eessi.io repository is a development repository, we retain the freedom to remove stuff from here...

Compatibility layer

(see slides)

  • (Hugo) interest in notes on getting StarFive VisionFive 2 set up
    • Thomas has some notes on this that he can share
    • good place to ask for help is #riscv channel in EESSI Slack

Software layer

(see slides)

  • we should start actively discouraging use of pilot.eessi-hpc.org
    • init script should refuse to set up environment unless you're actively opting into using it
  • ingesting rebuilt software requires manual intervention
    • old installation has to be removed first
    • this has to be done in the same transaction
  • possible use case of the site-specific hooks:
    • disable libfabric on systems that are hitting an OpenMPI bug with new versions of OFED
  • what do we do with software that doesn't allow you to build fat binaries supporting different CUDA compute capabilities?
  • can we detect GPU architecture, supported CUDA compute capabilities?
  • support for installing CUDA compatibility libraries when required is not in place yet
  • (Hugo) what is the minimal driver version that you need for CUDA software?
    • this is not entirely clear, but the compatibility layers are definitely useful here. We know how to do this, but the scripts are not available (yet). This may be a nice hackathon task.
  • the --from-commit options have two advantages over --from-pr
    • it's more reproducible (and secure): commits can never change
    • it does not use the GitHub API (it only downloads the tarball for the specified commit): no more issues with hitting the GitHub API rate limits
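
To illustrate why a commit-based download avoids the GitHub API, here is a minimal sketch that builds the archive URL for a given commit. It is not EasyBuild's actual implementation: the function name commit_tarball_url and the owner/repository values in the usage note are hypothetical, and insisting on a full 40-character SHA is a choice made for this sketch. The only thing it relies on is GitHub's public archive URL pattern, which is served as a plain download and does not go through api.github.com.

```python
import re


def commit_tarball_url(owner: str, repo: str, commit: str) -> str:
    """Return the GitHub archive URL for one specific commit.

    Hypothetical helper for illustration only. A full SHA-1 is required
    here so the reference is unambiguous and always points at the same
    contents, which is what makes it reproducible.
    """
    if not re.fullmatch(r"[0-9a-f]{40}", commit):
        raise ValueError("expected a full 40-character lowercase commit SHA")
    # Plain archive download: no call to the GitHub API is involved.
    return f"https://github.com/{owner}/{repo}/archive/{commit}.tar.gz"
```

For example, commit_tarball_url("example-org", "example-repo", sha) returns a URL that can be fetched with any HTTP client, without an API token.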

Build-and-deploy bot

(see slides)

  • the bot should now be fully independent of what it's building
  • (Alan) in the development repo we (and the bot) can be a bit looser with respect to the policies
    • e.g. not required that all builds for all CPU targets have succeeded
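
The difference between a strict and a looser policy could be sketched as follows. This is purely illustrative and not taken from the bot's code: the function may_deploy, its strict flag, and the CPU target names in the usage note are all assumptions, and "at least one build succeeded" is just one possible reading of a looser policy.

```python
from typing import Mapping


def may_deploy(build_ok: Mapping[str, bool], strict: bool = True) -> bool:
    """Decide whether a set of builds may be deployed.

    build_ok maps a CPU target name to whether its build succeeded.
    strict=True  : every target must have succeeded.
    strict=False : at least one target must have succeeded
                   (assumed looser policy for a development repository).
    """
    if not build_ok:
        # Nothing was built, so there is nothing to deploy.
        return False
    if strict:
        return all(build_ok.values())
    return any(build_ok.values())
```

With results such as {"x86_64/generic": True, "aarch64/generic": False}, the strict policy blocks deployment while the looser one allows it.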

software.eessi.io repository

(see slides)

EESSI documentation

(see slides)

  • sync server is still missing on the status page
    • requires some additional work, because it's using S3 and doesn't have all the JSON files that the scraper is looking for
  • overview of available software is generated using a script, so it should be easy to automatically update it (e.g. using a GitHub Action)

EESSI test suite

(see slides)

Support for EESSI

(see slides)

AWS/Azure sponsored credits

(see slides)

  • A new Slurm cluster has been spun up on Azure, and we will start using it for doing Zen 4 builds.
    • we will probably set up a separate branch to catch up with all the missing software installations
    • NESSI has experience with adding new CPU targets

MultiXscale EU project

(see slides)

Events

(see slides)

Q&A

  • Next meeting: Thu 2 May 2024 at 14:00 CEST (12:00 UTC)