Magic Castle EESSI 2023 10 04
Kenneth Hoste edited this page Oct 5, 2023 · 1 revision
- attendees: Thomas, Alan, Lara, Kenneth
- test Slurm cluster with Magic Castle running in AWS
- using https://github.com/EESSI/mc_aws_rocky8_aarch64_202309_test
- breakdown of steps/problems/fixes in https://github.com/EESSI/mc_aws_rocky8_aarch64_202309_test/issues/1
- done:
- correct deployment of login node (currently `aarch64`) + mgmt node (must be `x86_64`)
- deployment of `x86_64` + `aarch64` workernodes
- separate partition per node type
- auto-scaling setup
- requires automatic plan+apply enabled in Terraform Cloud
- custom node image for workernodes (Rocky Linux 8.8 + Apptainer)
- don't initialize EESSI by default
- accounts created (incl. `bot`)
- TODO:
- [Kenneth] rename `EESSI/mc_aws_rocky8_aarch64_202309_test` to `EESSI/magic_castle_clusters` + use branches
- [Kenneth] use default `def-sponsor00` group for users
- [Kenneth] increase disk space in `/project` (now only 50GB)
- should be 5-10TB
- or better: place bot job working dirs in `/project`, where other accounts can also access them
- [Kenneth] create `/project` subdirectory for bot build logs
- [Thomas?] install and configure bot
- for both eessi-hpc.org and eessi.io ?
- via new (private) `EESSI/bot-configs` repo
- [Kenneth] try setting up accounts without using a password
- [Kenneth] install security update of Slurm when available (in login + mgmt node + 2 node images)
- [Kenneth,Thomas] switch to bot on MC AWS for EESSI/software-layer PRs
- ideally on Mon 9 Oct (evening)
- (software build sync meeting on Tue 10 Oct 09:00 CEST)
- [Kenneth,Alan] notifications for runs in Terraform Cloud via `magic-castle@eessi.io`
- [Alan] software.eessi.io support in Magic Castle (13.x)
- [Alan+Thomas?] set up Magic Castle cluster in Azure
- [Alan] requires service principal, see Azure docs
- [Thomas] set up Magic Castle via `EESSI/magic-castle-clusters`
- [Alan?] EFS support (https://github.com/ComputeCanada/magic_castle/issues/256)
- pain points
- manual fixes for puppet_magic-castle since there's no Magic Castle 12.6.5 release yet
- re-encrypting secrets with new certificate on mgmt node when starting from scratch
- how to (frequently) update node images, update login/mgmt node vs setting up a new Magic Castle cluster from scratch every N months
- problems
- smee container doesn't work (only available for x86_64)
- can use smee without container as workaround
- not using default `def-sponsor00` user group is creating more work
- can't use `/project` to share stuff (like build logs for failing jobs)
- next sync meeting: 11 Oct 2023, 13:00 CEST