
AWS meeting 2024 03 14

Kenneth Hoste edited this page Mar 14, 2024 · 1 revision

EESSI/AWS sync meetings

Next meeting

  • Every 2 months on 2nd Thursday, 13:00 GMT/BST a.k.a. 14:00 CE(S)T
    • Thu 14 Mar 2024
    • Thu 9 May 2024 => move to June?

Notes 14 Mar 2024 (13:00 UTC)

  • sponsored credits: ~$28.5k left (expire 2024-12-31)
    • current burn rate: ~$6k/month (so OK until ~July'24)
  • project review of MultiXscale EuroHPC CoE
    • good feedback on EESSI aspect of the project
    • interest within EuroHPC community is growing
  • updates
    • OpenMPI bug fixed by Luke
    • private preview for Graviton4 (Neoverse_V2)
      • very similar to NVIDIA Grace (interesting for JUPITER system at JSC)
      • getting access is quite competitive currently
      • r8g instance type
  • ISC'24 (12-16 May'24)
    • EESSI attendees: Alan, Kenneth, Lara, Pedro?
    • our tutorial submissions on EESSI, CernVM-FS, Magic Castle did not get accepted
    • AWS tutorial on Sunday
      • fixed program, mostly DevOps focus
    • Arm Neoverse V1 tutorial on Sunday
      • reach out to Filippo to see if EESSI could be part of this?
    • EESSI BoF session (Tue 9am)
    • paper for RISC-V workshop in the works
    • thinking about submission to Arm workshop (AHUG)...
    • AWS booth?
      • no AWS booth at ISC'24
    • EESSI social event
      • sponsoring opportunity?
      • Tue evening joint AWS/NVIDIA party
  • EUM'24 (23-25 April 2024)
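
The runway estimate for the sponsored credits noted above (~$28.5k left at ~$6k/month) can be reproduced with a small calculation. This is only a sketch: the function names are made up for illustration, a constant burn rate is assumed, and a month is approximated as 30.44 days.

```python
from datetime import date, timedelta


def runway_months(credits_left: float, burn_per_month: float) -> float:
    """Number of months the remaining credits last at a constant burn rate."""
    if burn_per_month <= 0:
        raise ValueError("burn rate must be positive")
    return credits_left / burn_per_month


def depletion_date(start: date, months: float) -> date:
    """Approximate date the credits run out (assumes 30.44 days per month)."""
    return start + timedelta(days=round(months * 30.44))


# Figures from the 14 Mar 2024 notes: ~$28.5k left, ~$6k/month burn rate.
months = runway_months(28_500, 6_000)
depleted = depletion_date(date(2024, 3, 14), months)
print(f"runway: {months:.2f} months, credits run out around {depleted}")
```

Under these assumptions the credits last through July 2024, well before the 2024-12-31 expiry date, so the burn rate rather than the expiry is the limiting factor.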

Notes 12 Oct 2023 (12:00 UTC)

  • sponsored credits
    • monthly spend is ramping up
      • April-Aug'23: ~$3k/month
      • Sept'23: $4.1k/month
    • $25k was added by Brendan on 14 Sept'23
      • currently $28k left
      • should suffice until Feb'24 (extra $25k expires 2024-02-29)
  • migrating away from Slurm cluster set up with Cluster-in-the-Cloud, to Magic Castle
    • Magic Castle is developed by The Alliance (a.k.a. ComputeCanada), and is actively maintained and supported
    • combo of Terraform and Puppet
    • supports AWS, Azure, OpenStack, GCP, OVH
    • supports auto-scaling (power up nodes as jobs are queued)
    • very good fit for EESSI
    • EESSI build-and-deploy bot is now running on Slurm cluster in AWS managed with Magic Castle
      • Arm login node, x86_64 mgmt node
      • mix of Arm and x86_64 partitions
    • Rocky 8 is quickly becoming most popular OS in HPC, see https://docs.easybuild.io/user-survey/#operating-system
  • preparing switch to CVMFS repo under eessi.io domain
    • current EESSI pilot repo @ /cvmfs/pilot.eessi-hpc.org
    • new repo @ /cvmfs/software.eessi.io
    • Stratum-0 (central server) set up @ Univ. of Groningen (NL), funded via MultiXscale EuroHPC project
    • temporary Stratum-1 (mirror) servers running in AWS, Azure
    • both backed by regular storage and S3
    • S3-backed is preferable in some ways, but not in others
      • CVMFS GeoAPI feature not compatible with serving CVMFS repo via S3
        • GeoAPI is used to figure out which Stratum-1 is geographically closest (assumed to be fastest)
      • CVMFS client can go straight to S3 (no need to talk to Stratum-1 server)
      • can a Stratum-1 use S3 as storage backend but still operate as a proper Stratum-1 server?
      • CDN is something we want to try, but it may complicate things?
      • Spack is using one big S3 bucket in us-west for their binary cache
      • ~1.7GB on x86_64, ~1GB on aarch64 to launch TensorFlow starting from scratch
  • integration of EESSI in ParallelCluster
  • Steve Messenger (AWS) joined last EESSI update meeting
  • "CVMFS for HPC" tutorial
    • preliminary date for online session 1st week of Dec'23
    • may become an ISC'24 tutorial submission as well
    • next to "Getting started with EESSI" tutorial submission
  • some changes to how latest libfabric (1.19.0) and OpenMPI (4.1.6) stack on top of each other w.r.t. EFA
    • fixes w.r.t. intranode MPI msgs in libfabric
  • more details on SC'23?
    • EESSI submission for HUST'23 workshop was not selected
    • Alan + HPCNow! + Henk-Jan (RUG) will be attending
      • also some people from Univ. of Brussels (VUB) who are interested in EESSI
    • Magic Castle tutorial on Sun 12 Nov'23
    • AWS/NVIDIA/Arm "Welcome to Denver" open bar night on Sunday
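
The idea behind the GeoAPI point above (pick the Stratum-1 that is geographically closest, on the assumption that it is fastest) can be illustrated with a great-circle distance ranking. This is a rough sketch, not the CVMFS implementation: the server names and all coordinates below are hypothetical placeholders, and the real GeoAPI determines the client location on the server side.

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in km (spherical Earth)."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def order_by_distance(client, servers):
    """Return server names sorted from geographically closest to farthest."""
    return sorted(servers, key=lambda name: haversine_km(*client, *servers[name]))


# Hypothetical Stratum-1 mirrors with approximate (latitude, longitude).
servers = {
    "stratum1-a.example.org": (53.3, -6.3),   # somewhere in western Europe
    "stratum1-b.example.org": (38.9, -77.0),  # somewhere on the US east coast
}
client = (53.2, 6.6)  # hypothetical client in the northern Netherlands

ranked = order_by_distance(client, servers)
print(ranked)
```

A client that fetches objects straight from an S3 bucket bypasses the Stratum-1 server that would do this ranking, which is why the two approaches do not combine.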

Notes previous meetings
