
AWS meeting 2023 10 12


EESSI/AWS sync meetings

Next meeting

  • Thu 9 Nov 2023, 12:00 UTC

Notes 12 Oct 2023 (12:00 UTC)

  • sponsored credits
    • monthly spend is ramping up
      • April-Aug'23: ~$3k/month
      • Sept'23: $4.1k/month
    • $25k was added by Brendan on 14 Sept'23
      • currently $28k left
      • should suffice until Feb'24 (extra $25k expires 02/29/2024)
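    • quick sanity check on those figures (a sketch based only on the numbers above; actual spend keeps ramping up):

```python
# Rough runway check for the sponsored AWS credits (illustrative only).
remaining = 28_000        # USD left (Oct'23)
monthly_spend = 4_100     # USD/month (Sept'23 figure, still ramping up)

months_left = remaining / monthly_spend
print(f"~{months_left:.1f} months of runway at the Sept'23 burn rate")
# => ~6.8 months, i.e. nominally into spring 2024; the binding constraint
#    is the expiry of the extra $25k on 29 Feb 2024, not the burn rate.
```
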
  • migrating away from the Slurm cluster set up with Cluster-in-the-Cloud to Magic Castle
    • Magic Castle is developed by The Alliance (a.k.a. Compute Canada), and is actively maintained and supported
    • combo of Terraform and Puppet
    • supports AWS, Azure, OpenStack, GCP, OVH
    • supports auto-scaling (powers up nodes as jobs are queued); see the conceptual sketch below
    • very good fit for EESSI
    • EESSI build-and-deploy bot is now running on Slurm cluster in AWS managed with Magic Castle
      • Arm login node, x86_64 mgmt node
      • mix of Arm and x86_64 partitions
    • Rocky 8 is quickly becoming the most popular OS in HPC, see https://docs.easybuild.io/user-survey/#operating-system
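    • conceptual sketch of the auto-scaling idea mentioned above (Magic Castle itself builds on Slurm's power-saving / cloud-scheduling support, not a script like this; partition name and thresholds below are made up):

```python
import subprocess

# Toy illustration of "power up nodes as jobs are queued"; NOT how Magic
# Castle implements auto-scaling, only the basic decision logic.

def pending_jobs(partition: str) -> int:
    """Count pending jobs in a Slurm partition using squeue."""
    out = subprocess.run(
        ["squeue", "--noheader", "--states=PENDING", "--partition", partition],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(1 for line in out.splitlines() if line.strip())

def nodes_to_power_up(partition: str, jobs_per_node: int = 1, max_nodes: int = 8) -> int:
    """How many extra nodes to bring up for the currently queued work."""
    queued = pending_jobs(partition)
    return min(max_nodes, -(-queued // jobs_per_node))  # ceiling division

if __name__ == "__main__":
    # 'aarch64' is a hypothetical partition name for the Arm build nodes.
    print(f"would power up {nodes_to_power_up('aarch64')} node(s)")
```
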
  • preparing switch to CVMFS repo under eessi.io domain
    • current EESSI pilot repo @ /cvmfs/pilot.eessi-hpc.org
    • new repo @ /cvmfs/software.eessi.io
    • Stratum-0 (central server) set up @ Univ. of Groningen (NL), funded via MultiXscale EuroHPC project
    • temporary Stratum-1 (mirror) servers running in AWS, Azure
    • both backed by regular storage and S3
    • S3-backed is preferable in some ways, but not in others
      • CVMFS GeoAPI feature not compatible with serving CVMFS repo via S3
        • GeoAPI is used to figure out which Stratum-1 is geographically closest (assumed to be fastest); see the sketch after this list
      • CVMFS client can go straight to S3 (no need to talk to Stratum-1 server)
      • open question: can a Stratum-1 use S3 as its storage backend but still operate as a proper Stratum-1 server?
      • CDN is something we want to try, but it may complicate things?
      • Spack is using one big S3 bucket in us-west for their binary cache
      • ~1.7GB on x86_64, ~1GB on aarch64 to launch TensorFlow starting from scratch
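    • to make the GeoAPI point concrete: a client (or proxy) asks a Stratum-1 to order a list of servers by proximity, which a plain S3 bucket cannot answer; a minimal sketch of such a query, following the GeoAPI URL scheme described in the CernVM-FS documentation (the Stratum-1 host names and the proxy placeholder below are illustrative):

```python
import urllib.request

# Ask a Stratum-1's GeoAPI to rank candidate servers by distance to us.
# URL scheme (per the CernVM-FS documentation):
#   http://<stratum1>/cvmfs/<repo>/api/v1.0/geo/<proxy>/<host1,host2,...>
# The reply is a comma-separated list of 1-based indices, closest first.

repo = "software.eessi.io"                     # the new EESSI repository
stratum1 = "s1-aws.example.org"                # illustrative host names
candidates = ["s1-aws.example.org", "s1-azure.example.org"]

url = (f"http://{stratum1}/cvmfs/{repo}/api/v1.0/geo/"
       f"placeholder/{','.join(candidates)}")  # 'placeholder' stands in for the proxy name
with urllib.request.urlopen(url, timeout=10) as resp:
    order = [int(i) for i in resp.read().decode().strip().split(",")]

ranked = [candidates[i - 1] for i in order]    # indices are 1-based
print("closest Stratum-1 first:", ranked)
```
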
  • integration of EESSI in ParallelCluster
  • Steve Messenger (AWS) joined the last EESSI update meeting
  • "CVMFS for HPC" tutorial
    • preliminary date for online session 1st week of Dec'23
    • may become an ISC'24 tutorial submission as well
    • next to "Getting started with EESSI" tutorial submission
  • some changes to how the latest libfabric (1.19.0) and OpenMPI (4.1.6) stack on top of each other w.r.t. EFA
    • fixes w.r.t. intra-node MPI messages in libfabric
  • more details on SC'23?
    • EESSI submission for HUST'23 workshop was not selected
    • Alan + HPCNow! + Henk-Jan (RUG) will be attending
      • also some people from the Vrije Universiteit Brussel (VUB) who are interested in EESSI
    • Magic Castle tutorial on Sun 12 Nov'23
    • AWS/NVIDIA/Arm "Welcome to Denver" open bar night on Sunday

Notes previous meetings
