meeting 2024 09 05

Bob Dröge edited this page Sep 5, 2024 · 5 revisions

Notes for 2024-09-05 meeting

  • date & time: Thu 5 Sept 2024 - 14:00 CEST (12:00 UTC)
    • (every first Thursday of the month)
  • venue: (online, see mail for meeting link, or ask in Slack)
  • agenda:
    • Quick introduction by new people
    • EESSI-related meetings and events in last month(s)
    • Progress update per EESSI layer
    • Update on EESSI production repository software.eessi.io
    • Modulefile for initializing the EESSI stack
    • Update on EESSI documentation + test suite + build-and-deploy bot
    • EESSI as backend in Ramble
    • Status page and monitoring
    • AWS/Azure sponsorship update
    • Upcoming/recent events
    • Frequency of EESSI update meeting
    • Q&A

Slides

Meeting notes

(by Pedro/Bob)

Quick introduction by new people

  • James Simone (Fermilab)
  • Leonardo Honfi Camilo (Wageningen University)
  • Both are EB users and EUM attendees. Welcome!

EESSI-related meetings in last month

(see slides)

  • Thomas will present an EESSI update at upcoming CernVM workshop.
  • Hackathons (usually) every third week to focus on advancing specific topics

Progress update per EESSI layer

Filesystem layer

(see slides)

  • Updated CernVM-FS to new minor release on all servers
  • WIP: Grafana dashboard and alerting for the infrastructure (details later)

Compatibility layer

(see slides)

  • Upcoming compat layer to include OpenSSL 3 and EasyBuild 5.0

Software layer

(see slides)

  • Many software packages have been added in the last two months
  • For zen4 we have to skip one toolchain that is not supported

software.eessi.io repository

(see slides)

  • Focus on catching up with the previously supported CPU targets. Contributions are very welcome, and we are able to help with submissions that run into issues.

Modulefile for initializing the EESSI stack

(see slides)

  • Compared to the existing bash init script, the module file plays more nicely with existing software stacks on a site. There are (possible) quirks, e.g., around sticky modules. If a local Lmod stack is already present, its modules will be "mixed" with the EESSI ones, which shouldn't be a problem in most cases. Please try it out, feedback is very welcome!
  • Using a shell other than bash is likely not a problem.

EESSI documentation + test suite + build-and-deploy bot

(see slides)

  • Possible next step for bot clean-up is to have a cron job that deletes these directories every month
  • Jobs can have a unique name, which is necessary to run several bot instances by the same account.
  • First step for accelerator build support. E.g., bot: build accelerator:nvidia/Y
  • Community contribution to customize bot build jobs (more time, RAM, etc.)
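
The monthly clean-up idea from the notes could look roughly like the sketch below. It is only an illustration, not the bot's actual clean-up code: the function name `remove_stale_dirs`, the 30-day threshold, and the assumption that every immediate subdirectory of one root directory is a candidate are all hypothetical.

```python
import shutil
import time
from pathlib import Path


def remove_stale_dirs(root, max_age_days=30, now=None, dry_run=False):
    """Delete immediate subdirectories of `root` not modified for `max_age_days`.

    Hypothetical helper: returns the sorted names of the directories that
    were removed (or would be removed when dry_run=True).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # threshold in seconds since the epoch
    removed = []
    for entry in Path(root).iterdir():
        # Only consider real directories; skip regular files and symlinks.
        if not entry.is_dir() or entry.is_symlink():
            continue
        if entry.stat().st_mtime < cutoff:
            if not dry_run:
                shutil.rmtree(entry)
            removed.append(entry.name)
    return sorted(removed)
```

A cron job could call such a script once a month; running it with `dry_run=True` first shows what would be deleted.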

EESSI documentation

(see slides)

  • Documentation/tutorial for developers who already know how to write ReFrame tests (which is already documented) and now want to write them in a portable way. Useful for test suite contributors.
  • Blog post by Julián on installing Extrae in riscv.eessi.io, which was also expanded to include software.eessi.io. Interesting as an example and overview of a complex installation, including on emerging architectures.
  • Improvements to the available software page. Application pages include a short description and loading instructions. zen4 is now included in the software availability pages.
  • Starting point for list of sites that already include EESSI. Maybe suggest in the set-up documentation that sysadmins include themselves.
  • CI/CD component in EESSI (GitHub Actions and GitLab Components)

EESSI test suite

(see slides)

  • Tutorial on writing a ReFrame test and making it portable, using mpi4py as the example
  • Improvements on handling duplicate modules in the module path (now users are warned)
  • New hook to handle per node memory usage, queried from the master node in the allocation.
  • Open MetalWalls PR from a contributor, under review.
  • Open WIP PR to handle 'mandatory' hooks better. Hooks would run by default because they would be inherited.
  • Progress on the dashboard (WIP), which is not publicly available yet. Test reports go to an Elasticsearch database. Useful to see the consistency of performance from day to day. Some data may not be made public.
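
As an illustration of the duplicate-module handling mentioned above, a minimal sketch follows. It is not the test suite's actual code: the function name `unique_modules` and the choice to keep the first occurrence are assumptions made for this example.

```python
import warnings


def unique_modules(modules):
    """Return module names in order, warning about and dropping duplicates.

    Hypothetical helper: `modules` is a list of module names as they would be
    found while scanning the module path; the first occurrence is kept.
    """
    seen = set()
    unique = []
    for name in modules:
        if name in seen:
            warnings.warn(f"Module {name} found more than once in the module path")
        else:
            seen.add(name)
            unique.append(name)
    return unique
```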

EESSI as backend in Ramble

(see slides)

  • Ramble is a Google Cloud tool for benchmarking using existing tools. It now supports EESSI, and this is documented.
  • It is used in another project called BenchPark https://github.com/LLNL/benchpark

Status page and monitoring

(see slides)

  • Revamped status page code (now in Rust). It now exports JSON files so the information can be easily scraped and accessed. Overall status depends on individual components (stratum-1s, etc.).
  • Also exports statuses as Prometheus metrics to be picked up downstream.
  • WIP: Grafana, Prometheus, alerts, and exporters for specific information (versions, sync status of stratum-1 with stratum-0, etc.). Work on the monitoring server is under way, and exporters are being added to the cvmfs servers. Thresholds and tweaks are under way, as is an alert channel in Slack. Installation of the exporters is almost done.
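
The idea of deriving an overall status from the individual components and exporting it as JSON and as Prometheus metrics can be sketched as follows. This is an illustration in Python, not the actual (Rust) status page code: the status names, their severity order, the component names, and the metric name `eessi_component_status` are all assumptions.

```python
import json

# Assumed severity order for this sketch; a higher number means worse.
SEVERITY = {"OK": 0, "DEGRADED": 1, "FAILED": 2}


def overall_status(components):
    """Return the worst status among all components ("OK" if there are none)."""
    return max(components.values(), key=SEVERITY.__getitem__, default="OK")


def to_json(components):
    """Serialize per-component and overall status as a JSON document."""
    return json.dumps(
        {"overall": overall_status(components), "components": components},
        indent=2,
        sort_keys=True,
    )


def to_prometheus(components):
    """Render the statuses as gauges in the Prometheus text exposition format."""
    lines = ["# TYPE eessi_component_status gauge"]
    for name, status in sorted(components.items()):
        lines.append(f'eessi_component_status{{component="{name}"}} {SEVERITY[status]}')
    return "\n".join(lines) + "\n"
```

For example, `overall_status({"stratum0": "OK", "stratum1-a": "DEGRADED"})` yields `"DEGRADED"`, since one degraded component is enough to degrade the overall status in this sketch.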

AWS/Azure sponsored credits

(see slides)

  • No major news, likely refresh of credits to happen soon. Azure usage ramping up due to zen4 builds.

Events

(see slides)

  • Thomas presents at CernVM-FS workshop 16-18 September
  • EuroHPC User Day 2024 - EESSI presentation (paper, to be submitted tomorrow) and also presence at CoE "event".
  • SC'24 Birds-of-a-Feather session accepted! HPCNow! will go.

Frequency of EESSI update meeting

  • Proposal to have update meeting every two months. Approved. Next meeting: November 7th
  • Suggestion to have a regular EESSI elevator pitch and demonstration of topics

Q&A
