Magic Castle EESSI 2023 10 04

Kenneth Hoste edited this page Oct 5, 2023 · 1 revision

Magic Castle clusters for EESSI

Sync meeting 2023-10-04 (12:00 CEST)

  • attendees: Thomas, Alan, Lara, Kenneth

  • test Slurm cluster with Magic Castle running in AWS

    • using https://github.com/EESSI/mc_aws_rocky8_aarch64_202309_test
    • breakdown of steps/problems/fixes in https://github.com/EESSI/mc_aws_rocky8_aarch64_202309_test/issues/1
    • done:
      • correct deployment of login node (currently aarch64) + mgmt node (must be x86_64)
      • deployment of x86_64 + aarch64 workernodes
        • separate partition per node type
      • auto-scaling setup
        • requires automatic plan+apply enabled in Terraform Cloud
      • custom node image for workernodes (Rocky Linux 8.8 + Apptainer)
      • don't initialize EESSI by default
      • accounts created (incl. bot)
    • TODO:
      • [Kenneth] rename EESSI/mc_aws_rocky8_aarch64_202309_test to EESSI/magic_castle_clusters + use branches
      • [Kenneth] use default def-sponsor00 group for users
      • [Kenneth] increase disk space in /project (now only 50GB)
        • should be 5-10TB
        • or better: put bot job working dirs in /project, where other accounts can also access them
      • [Kenneth] create /project subdirectory for bot build logs
      • [Thomas?] install and configure bot
        • for both eessi-hpc.org and eessi.io?
        • via new (private) EESSI/bot-configs repo
      • [Kenneth] try setting up accounts without using a password
      • [Kenneth] install security update of Slurm when available (in login + mgmt node + 2 node images)
      • [Kenneth,Thomas] switch to bot on MC AWS for EESSI/software-layer PRs
        • ideally on Mon 9 Oct (evening)
        • software build sync meeting on Tue 10 Oct (09:00 CEST)
      • [Kenneth,Alan] notifications for runs in Terraform Cloud via magic-castle@eessi.io
      • [Alan] software.eessi.io support in Magic Castle (13.x)
      • [Alan+Thomas?] set up Magic Castle cluster in Azure
        • [Alan] requires service principal, see Azure docs
        • [Thomas] set up Magic Castle via EESSI/magic-castle-clusters
      • [Alan?] EFS support (https://github.com/ComputeCanada/magic_castle/issues/256)
    • pain points
      • manual fixes for puppet_magic-castle since there's no Magic Castle 12.6.5 release yet
      • re-encrypting secrets with new certificate on mgmt node when starting from scratch
      • how to (frequently) update node images and the login/mgmt nodes, vs. setting up a new Magic Castle cluster from scratch every N months
    • problems
      • smee container doesn't work (only available for x86_64)
        • can use smee without container as workaround
      • not using default def-sponsor00 user group is creating more work
        • can't use /project to share stuff (like build logs for failing jobs)
  • next sync meeting: 11 Oct 2023, 13:00 CEST


Previous meetings