AWS meeting 2023 10 12
- link to AWS project doc: https://docs.google.com/document/d/1CHG9fCh2LkfJ-EI8J-_Wr5NpHL5iwm8Wu6syfK9h7-c
- next meeting: Thu 9 Nov 2023, 12:00 UTC
- sponsored credits
- monthly spend is ramping up
- April-Aug'23: ~$3k/month
- Sept'23: $4.1k/month
- $25k was added by Brendan on 14 Sept'23
- currently $28k left
- should suffice until Feb'24: at ~$4k/month, $28k covers roughly 7 months, but the extra $25k expires 02/29/2024
- migrating away from Slurm cluster set up with Cluster-in-the-Cloud, to Magic Castle
- Magic Castle is developed by The Alliance (a.k.a. Compute Canada) + actively maintained and supported
- combo of Terraform and Puppet
- supports AWS, Azure, OpenStack, GCP, OVH
- supports auto-scaling (powers up nodes as jobs are queued); see the Terraform sketch below
- very good fit for EESSI
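
For context, a minimal sketch of what a Magic Castle Terraform configuration for an AWS cluster looks like; all names, instance types, counts and the version pin below are hypothetical, and the exact module interface is documented in the Magic Castle repo:

```hcl
module "aws" {
  # module source and Puppet config used by Magic Castle (pin a real release)
  source         = "git::https://github.com/ComputeCanada/magic_castle.git//aws"
  config_git_url = "https://github.com/ComputeCanada/puppet-magic_castle.git"
  config_version = "13.2.0"    # hypothetical version pin

  cluster_name = "eessi-demo"  # hypothetical
  domain       = "example.org" # hypothetical
  image        = "ami-..."     # e.g. a Rocky 8 AMI for the chosen region

  instances = {
    mgmt  = { type = "t3.large",    count = 1, tags = ["mgmt", "puppet", "nfs"] }
    login = { type = "t3.medium",   count = 1, tags = ["login", "public", "proxy"] }
    # "pool"-tagged nodes are powered up/down on demand (auto-scaling)
    node  = { type = "c7g.4xlarge", count = 4, tags = ["node", "pool"] }
  }

  volumes = {
    nfs = {
      home    = { size = 50 }
      project = { size = 100 }
    }
  }

  public_keys = [file("~/.ssh/id_rsa.pub")]
  nb_users    = 10
}
```

Mixing x86_64 and Arm instance types across entries in `instances` is what gives a mixed-partition setup like the one mentioned below.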
- EESSI build-and-deploy bot is now running on Slurm cluster in AWS managed with Magic Castle
- Arm login node, x86_64 mgmt node
- mix of Arm and x86_64 partitions
- Rocky 8 is quickly becoming most popular OS in HPC, see https://docs.easybuild.io/user-survey/#operating-system
- preparing switch to CVMFS repo under eessi.io domain
- current EESSI pilot repo @ /cvmfs/pilot.eessi-hpc.org
- new repo @ /cvmfs/software.eessi.io (see the init-script sketch below)
- Stratum-0 (central server) set up @ Univ. of Groningen (NL), funded via MultiXscale EuroHPC project
- temporary Stratum-1 (mirror) servers running in AWS, Azure
- both backed by regular storage and S3
- S3-backed is preferable in some ways, but not in others (see the client config sketch below)
- CVMFS GeoAPI feature not compatible with serving CVMFS repo via S3
- GeoAPI is used to figure out which Stratum-1 is geographically closest (assumed to be fastest)
- CVMFS client can go straight to S3 (no need to talk to Stratum-1 server)
- can a Stratum-1 use S3 as storage backend but still operate as a proper Stratum-1 server?
- a CDN is something we want to try, though it may complicate things
- Spack is using one big S3 bucket in us-west for their binary cache
- ~1.7GB on x86_64, ~1GB on aarch64 to launch TensorFlow starting from scratch
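
For users of the software stack, the switch should mostly come down to sourcing the init script from the new repository; a minimal sketch, assuming the new repo keeps an init script layout similar to the pilot one (the `versions/2023.06` subdirectory is an assumption):

```sh
# old: EESSI pilot repository
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash

# new: production repository under the eessi.io domain
# (the versions/2023.06 layout is an assumption here)
source /cvmfs/software.eessi.io/versions/2023.06/init/bash

# first use on a cold client cache pulls ~1.7GB (x86_64) / ~1GB (aarch64)
module load TensorFlow
```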
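
The GeoAPI-vs-S3 trade-off shows up directly in the CVMFS client configuration: Stratum-1 URLs can be distance-sorted via the GeoAPI, while a plain S3 (or CDN) endpoint is just another URL with no GeoAPI behind it. A sketch with illustrative hostnames (not the real EESSI ones):

```sh
# /etc/cvmfs/domain.d/eessi.io.conf (illustrative hostnames)

# classic Stratum-1 servers: the client can order these by distance via the GeoAPI
CVMFS_SERVER_URL="http://aws-s1.example.org/cvmfs/@fqrn@;http://azure-s1.example.org/cvmfs/@fqrn@"
CVMFS_USE_GEOAPI=yes

# alternative: fetch straight from an S3-backed endpoint -- no Stratum-1 in the
# request path, but also no GeoAPI to pick the geographically closest mirror
#CVMFS_SERVER_URL="http://eessi-example-bucket.s3.amazonaws.com/cvmfs/@fqrn@"
#CVMFS_USE_GEOAPI=no
```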
- integration of EESSI in ParallelCluster
- Steve Messenger (AWS) joined the last EESSI update meeting
- some connection via Univ. of Luxembourg
- they're using Graviton to prep for future EuroHPC chip (Rhea?)
- could be interesting w.r.t. the problems we've been seeing on Graviton3 (with SVE); see the quick check below
- see https://github.com/EESSI/software-layer/blob/2023.06/eessi-2023.06-known-issues.yml
- we should provide more details to Brendan on that to get in touch with Arm experts
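
As a first step towards pinning those issues down, a quick check of what the hardware actually exposes (Graviton3 implements SVE with 256-bit vectors; the second path requires an SVE-enabled kernel):

```sh
# does the CPU advertise SVE at all?
grep -m1 -o 'sve' /proc/cpuinfo

# default SVE vector length in bytes (32 bytes = 256 bits on Graviton3)
cat /proc/sys/abi/sve_default_vector_length
```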
- "CVMFS for HPC" tutorial
- preliminary date for online session: 1st week of Dec'23
- may become an ISC'24 tutorial submission as well
- next to "Getting started with EESSI" tutorial submission
- some changes to how the latest libfabric (1.19.0) and OpenMPI (4.1.6) stack on top of each other w.r.t. EFA (see the quick check below)
- fixes w.r.t. intranode MPI msgs in libfabric
- more details at SC'23?
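
A quick way to sanity-check that the libfabric EFA provider is visible and actually used by Open MPI, using standard tooling (the benchmark binary is just a placeholder):

```sh
# list libfabric's EFA provider (fi_info ships with the libfabric utilities)
fi_info -p efa

# force Open MPI onto the OFI path so the libfabric/EFA stack is exercised
mpirun --mca pml cm --mca mtl ofi -n 2 ./osu_latency
```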
- EESSI submission for HUST'23 workshop was not selected
- Alan + HPCNow! + Henk-Jan (RUG) will be attending
- also some people of Vrije Universiteit Brussel (VUB) who are interested in EESSI
- Magic Castle tutorial on Sun 12 Nov'23
- AWS/NVIDIA/Arm "Welcome to Denver" open bar night on Sunday
- notes of previous meetings:
- 14 Sept 2023: no notes were taken :(
- 10 Aug 2023: https://github.com/EESSI/meetings/wiki/AWS-meeting-2023-08-10
- July 2023: (skipped)
- 8 June 2023: https://github.com/EESSI/meetings/wiki/AWS-meeting-2023-06-08
- 11 May 2023: https://github.com/EESSI/meetings/wiki/AWS-meeting-2023-05-11
- 13 April 2023: https://github.com/EESSI/meetings/wiki/AWS-meeting-2023-04-13
- 9 Mar 2023: https://github.com/EESSI/meetings/wiki/AWS-meeting-2023-03-09
- 11 Jan 2023: https://github.com/EESSI/meetings/wiki/AWS-meeting-2023-01-11