AWS meeting 2024 03 14
Kenneth Hoste edited this page Mar 14, 2024 · 1 revision
- link to AWS project doc: https://docs.google.com/document/d/1CHG9fCh2LkfJ-EI8J-_Wr5NpHL5iwm8Wu6syfK9h7-c
- Every 2 months on 2nd Thursday, 13:00 GMT/BST a.k.a. 14:00 CE(S)T
- Thu 14 Mar 2024
- Thu 9 May 2024 => move to June?
- sponsored credits: ~$28.5k left (expires 2024-12-31)
- current burn rate: ~$6k/month (so OK until ~July'24)
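The runway estimate above can be reproduced with a quick calculation. This is only a sketch: it assumes a flat burn rate, whereas the notes suggest spending is still ramping up, and the function name `runway_months` is made up for illustration.

```python
def runway_months(credits_left_usd: float, burn_rate_usd_per_month: float) -> float:
    """Months of credits left, assuming the monthly spend stays constant."""
    return credits_left_usd / burn_rate_usd_per_month

# Figures from the notes: ~$28.5k left, ~$6k/month burn rate.
months = runway_months(28_500, 6_000)
print(f"Runway: {months:.2f} months")
```

Counted from mid-March 2024, a little under five months is in the same ballpark as the ~July'24 estimate, and well short of the 2024-12-31 expiry date.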
- project review of MultiXscale EuroHPC CoE
- good feedback on EESSI aspect of the project
- interest within EuroHPC community is growing
- updates
- over 3,500 software installations in place in EESSI production repo (software.eessi.io)
- ~250 different open source software projects
- incl. TensorFlow, PyTorch, OpenFOAM, WRF, ...
- initial support for NVIDIA GPUs: https://www.eessi.io/docs/gpu
- see also latest EESSI update meeting: https://github.com/EESSI/meetings/blob/main/meetings/EESSI_meeting_20240307.pdf
- OpenMPI bug fixed by Luke
- private preview for Graviton4 (Neoverse_V2)
- very similar to NVIDIA Grace (interesting for JUPITER system at JSC)
- getting access is quite competitive currently
- r8g instance type
- ISC'24 (12-16 May'24)
- EESSI attendees: Alan, Kenneth, Lara, Pedro?
- our tutorial submissions on EESSI, CernVM-FS, Magic Castle did not get accepted
- AWS tutorial on Sunday
- fixed program, mostly DevOps focus
- Arm Neoverse V1 tutorial on Sunday
- reach out to Filippo to see if EESSI could be part of this?
- EESSI BoF session (Tue 9am)
- paper for RISC-V workshop in the works
- thinking about submission to Arm workshop (AHUG)...
- AWS booth?
- no AWS booth at ISC'24
- EESSI social event
- sponsoring opportunity?
- Tue evening joint AWS/NVIDIA party
- EUM'24 (23-25 April 2024)
- https://easybuild.io/eum24
- sync on ISC'24 plans (Brendan, Kenneth): Thu 18 April 14:00 CEST
- sponsored credits
- monthly spend is ramping up
- April-Aug'23: ~$3k/month
- Sept'23: $4.1k/month
- $25k was added by Brendan on 14 Sept'23
- currently $28k left
- should suffice until Feb'24 (extra $25k expires 02/29/2024)
- migrating away from Slurm cluster set up with Cluster-in-the-Cloud, to Magic Castle
- Magic Castle is developed by The Alliance (a.k.a. ComputeCanada), and is actively maintained and supported
- combo of Terraform and Puppet
- supports AWS, Azure, OpenStack, GCP, OVH
- supports auto-scaling (power up nodes as jobs are queued)
- very good fit for EESSI
- EESSI build-and-deploy bot is now running on Slurm cluster in AWS managed with Magic Castle
- Arm login node, x86_64 mgmt node
- mix of Arm and x86_64 partitions
- Rocky 8 is quickly becoming most popular OS in HPC, see https://docs.easybuild.io/user-survey/#operating-system
- preparing switch to CVMFS repo under eessi.io domain
- current EESSI pilot repo @ /cvmfs/pilot.eessi-hpc.org
- new repo @ /cvmfs/software.eessi.io
- Stratum-0 (central server) set up @ Univ. of Groningen (NL), funded via MultiXscale EuroHPC project
- temporary Stratum-1 (mirror) servers running in AWS, Azure
- both backed by regular storage and S3
- S3-backed is preferable in some ways, but not in others
- CVMFS GeoAPI feature not compatible with serving CVMFS repo via S3
- GeoAPI is used to figure out which Stratum-1 is geographically closest (assumed to be fastest)
- CVMFS client can go straight to S3 (no need to talk to Stratum-1 server)
- can a Stratum-1 use S3 as storage backend but still operate as a proper Stratum-1 server?
- CDN is something we want to try, but it may complicate things?
- Spack is using one big S3 bucket in us-west for their binary cache
- ~1.7GB on x86_64, ~1GB on aarch64 to launch TensorFlow starting from scratch
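To illustrate the GeoAPI point above (picking the geographically closest Stratum-1 on the assumption that it is the fastest), here is a minimal sketch of ordering mirrors by great-circle distance. It is not the CVMFS implementation: the server names, coordinates, and function names are hypothetical and only show the idea.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two (latitude, longitude) points, in km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def order_by_distance(client, servers):
    """Return server names sorted from nearest to farthest from the client."""
    return sorted(servers, key=lambda name: great_circle_km(*client, *servers[name]))

# Hypothetical mirror names with approximate coordinates (assumptions).
servers = {
    "stratum1-us-east": (38.9, -77.4),
    "stratum1-eu-west": (53.35, -6.26),
}
client = (53.22, 6.57)  # roughly Groningen (NL)

print(order_by_distance(client, servers))
```

A client that fetches objects straight from an S3 bucket skips this ordering step, which is one way to read the incompatibility noted above.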
- integration of EESSI in ParallelCluster
- Steve Messenger (AWS) joined last EESSI update meeting
- some connection via Univ. of Luxembourg
- they're using Graviton to prep for future EuroHPC chip (Rhea?)
- could be interesting w.r.t. the problems we've been seeing on Graviton3 (with SVE)
- see https://github.com/EESSI/software-layer/blob/2023.06/eessi-2023.06-known-issues.yml
- we should provide more details to Brendan on that to get in touch with Arm experts
- "CVMFS for HPC" tutorial
- preliminary date for online session 1st week of Dec'23
- may become an ISC'24 tutorial submission as well
- next to "Getting started with EESSI" tutorial submission
- some changes to how latest libfabric (1.19.0) and OpenMPI (4.1.6) stack on top of each other w.r.t. EFA
- fixes w.r.t. intranode MPI msgs in libfabric
- more details on SC'23?
- EESSI submission for HUST'23 workshop was not selected
- Alan + HPCNow! + Henk-Jan (RUG) will be attending
- also some people of Univ. of Brussels (VUB) who are interested in EESSI
- Magic Castle tutorial on Sun 12 Nov'23
- AWS/NVIDIA/Arm "Welcome to Denver" open bar night on Sunday
- 12 Oct 2023: https://github.com/EESSI/meetings/wiki/AWS-meeting-2023-10-12
- 14 Sept 2023: no notes were taken :(
- 10 Aug 2023: https://github.com/EESSI/meetings/wiki/AWS-meeting-2023-08-10
- July 2023: (skipped)
- 8 June 2023: https://github.com/EESSI/meetings/wiki/AWS-meeting-2023-06-08
- 11 May 2023: https://github.com/EESSI/meetings/wiki/AWS-meeting-2023-05-11
- 13 April 2023: https://github.com/EESSI/meetings/wiki/AWS-meeting-2023-04-13
- 9 Mar 2023: https://github.com/EESSI/meetings/wiki/AWS-meeting-2023-03-09
- 11 Jan 2023: https://github.com/EESSI/meetings/wiki/AWS-meeting-2023-01-11