
meeting Nov 4 2021

Bob Dröge edited this page Nov 4, 2021 · 6 revisions

Notes for 20211104 meeting


  • date & time: Thu November 4th 2021 - 2pm CET (13:00 UTC)
    • (every first Thursday of the month)
  • venue: (online, see mail for meeting link, or ask in Slack)
  • agenda:
    • Quick introduction by new people
    • EESSI-related meetings in last month
    • 2021.06 version of pilot repository
    • Testing of pilot version 2021.06
    • Progress update per EESSI layer
    • Infrastructure status and updates
    • AWS/Azure sponsorship update
    • Update on EESSI journal paper
    • EESSI risk analysis
    • Upcoming events
    • Q&A

Slides

Meeting notes

(by Alan, Bob)

  • Quick introduction by new people
    • Ahmad from SURF
      • Making EESSI available by default in the cloud
      • Plan is to have their own repo for SURF
    • Michael Hubner from UniBonn
    • Bartosz Kostrzewa also from UniBonn
      • Plan to contribute on behalf of HPC-NRW
      • Michael will be doing a lot of the work
      • Interested in CUDA compatibility and testing infrastructure
    • Hugo Meiland from Azure
  • EESSI-related meetings in last month
    • Oct. 12 CernVM-FS coordination
      • New release, 2.9, soon
      • Discussion on IP vs. DNS entries in client configuration
        • What happens with a DNS outage?
        • Both ways have pros and cons
        • Better not to use both, as you should only have 5-10 Stratum 1 servers
      • Their main Stratum 1 serves tens of TB a month
      • Can we have private Stratum 1 in our own config?
        • Yes, this is a good idea
        • Should document this (some information already in the wiki, probably needs updating)
          • Ahmad willing to help with this
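As a starting point for the documentation mentioned above: the CernVM-FS client supports site-local configuration overrides, so a private Stratum 1 could be prepended to the server list. A minimal sketch (the file path follows the CernVM-FS domain-config convention, but the hostname is a placeholder; `@fqrn@` is expanded by CernVM-FS to the fully qualified repository name):

```shell
# /etc/cvmfs/domain.d/eessi-hpc.org.local  -- illustrative; hostname is a placeholder
# Prepend a private Stratum 1 so clients try it before the public ones.
CVMFS_SERVER_URL="http://stratum1.example.org/cvmfs/@fqrn@;$CVMFS_SERVER_URL"
```

After changing the configuration, `cvmfs_config reload` applies it to already-mounted repositories.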
  • 2021.06 version of pilot repository
    • 2021.06 is the default (2021.03 is gone)
    • Also includes a Zen 3 stack
    • Next pilot version
      • NVidia/CUDA support
        • Some thought and effort gone into this already
        • Alan has already built CUDA software on top of EESSI
        • Script required to put drivers in the right place so that they are picked up by EESSI
          • This process can be useful to also install CUDA and driver compatibility libraries
        • Linker or compiler wrappers
        • More software
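The driver-placement script mentioned above boils down to exposing the host's NVIDIA driver libraries at a location the EESSI stack can find. A hedged sketch of that idea (the library names are the usual driver libraries, but the directory layout and mechanism here are illustrative assumptions, not EESSI's actual script):

```python
import os
import tempfile

# Driver libraries that CUDA applications resolve from the host
# (illustrative subset).
DRIVER_LIBS = ("libcuda.so.1", "libnvidia-ml.so.1")

def link_host_drivers(host_libdir, target_dir):
    """Symlink host NVIDIA driver libraries into target_dir, so the EESSI
    software stack can be pointed at them (e.g. via LD_LIBRARY_PATH).
    Returns the list of libraries that were linked."""
    os.makedirs(target_dir, exist_ok=True)
    linked = []
    for name in DRIVER_LIBS:
        src = os.path.join(host_libdir, name)
        if os.path.exists(src):
            dst = os.path.join(target_dir, name)
            if os.path.lexists(dst):
                os.remove(dst)  # refresh stale links after driver updates
            os.symlink(src, dst)
            linked.append(name)
    return linked

# Demo on a throwaway directory with a fake driver library
demo = tempfile.mkdtemp()
host = os.path.join(demo, "host")
os.makedirs(host)
open(os.path.join(host, "libcuda.so.1"), "w").close()
print(link_host_drivers(host, os.path.join(demo, "eessi_drivers")))
```

The same pattern could serve the point above about installing CUDA driver compatibility libraries: the loop just links whatever matching libraries the host provides.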
  • Testing of pilot version 2021.06
    • Hugo has been doing a bit of this recently
      • Installation works just fine
      • Potential points for end-user improvements
        • Initialisation script is good, but you only find out about it by listing the directory
        • Building WRF
          • Some warnings from OpenMPI were being picked up by EasyBuild and caused the build to fail
          • Fixed in OpenMPI 4.1, so a newer toolchain will probably resolve this
      • Some improvements to archspec to detect the interconnect which could be leveraged by EESSI
      • Needed some hand-holding to get things started, would be great to get this documented/scripted
      • What about the Intel compiler?
        • Maybe we can treat that like the CUDA idea?
        • Azure has good contacts, could try to make this smoother
          • Azure is allowed to deliver CUDA in their open images
    • Thomas tested installing some R packages
      • Using the packages failed "GLIBC_2.33 not found"
        • Because we don't have a linker wrapper
        • Compute Canada does this
        • Using a linker wrapper is a sizeable change in behaviour, worth a dedicated meeting!
      • Needed to fix some SELinux stuff to get EESSI to run (but that might be image specific)
  • Progress update per EESSI layer
    • filesystem layer
      • Another Stratum 1 on FENIX, but not in client configuration yet
      • deb/yum repos created for client configuration packages
        • can also use this for CVMFS client packages for POWER/Arm (which we currently have to build)
      • Fixed issue with automated ingestion
    • compatibility layer
      • No security updates
      • Removed 2021.03 tests from GitHub actions
    • software layer
      • Some improvements to the build script
        • Make sure temporary directory is available
        • Upper directory needs support for extended attributes; the script now warns if this is not the case
      • Zen3 stack on AMD Milan in Azure
      • Some changes in archspec that require changes in our detection scripts
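The extended-attributes requirement on the upper directory comes from overlay filesystems, which store metadata in `user.*` attributes. A Linux-only sketch of how such a check could look (this is an illustration of the idea, not the build script's actual code):

```python
import os
import tempfile

def supports_user_xattrs(directory):
    """Probe whether the filesystem holding `directory` accepts user.*
    extended attributes, as required for an overlayfs upper directory."""
    fd, probe = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        os.setxattr(probe, "user.xattr_probe", b"1")
        return True
    except (OSError, AttributeError):
        # OSError: filesystem rejects user xattrs;
        # AttributeError: os.setxattr only exists on Linux.
        return False
    finally:
        os.unlink(probe)

print(supports_user_xattrs(tempfile.gettempdir()))
```

A build script could run this probe up front and emit the warning mentioned above instead of failing later with an obscure overlay error.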
  • Infrastructure status and updates
    • Yum and apt repositories now available at
      • Node is completely ephemeral (pulls from GitHub on creation and updates every hour)
      • No meta package for deb (yet)
      • Note that config repo does not require cvmfs package, we should probably look at that
        • If CernVM-FS don't want to distribute packages for certain archs, we could do that though
        • Could also be used to create an EESSI init package that puts something in /usr/local/bin
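For reference, a yum repo definition for the client configuration packages could look like the sketch below; the `baseurl` and GPG key URL are placeholders, not the actual repository location:

```ini
# /etc/yum.repos.d/eessi.repo -- illustrative only
[eessi]
name=EESSI client configuration packages
baseurl=http://repo.example.org/eessi/rhel/$releasever/$basearch
enabled=1
gpgcheck=1
gpgkey=http://repo.example.org/eessi/RPM-GPG-KEY-eessi
```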
  • AWS/Azure sponsorship update
    • AWS
      • Spent about $861 of AWS credits
      • ~$7K spent
      • $18K remaining, credits expire at end of January!
    • Azure
      • Spent about €550
  • Update on EESSI journal paper
    • No major updates
    • Deadline approaching quickly, progress being made
    • If you want to read it, let us know!
  • EESSI risk analysis (Thomas)
    • Started looking at this for NESSI (Norwegian project)
    • First assessment of risk plus initial feedback collected
    • Long list
      • nothing sensitive, can share this (just ask)
      • lots of risks relate primarily to CernVM-FS
  • FENIX
    • Got resources last year at the second attempt
    • This year we rehashed last year's proposal
    • Swift storage only became available a few weeks ago
    • If it gets approved we will get access from the 1st of January
  • Upcoming events
    • EESSI talk at SC21 during HPC System Testing BoF
    • Computing Insight UK in December
      • Waiting for decision
    • Compute Canada will present at the Packaging conference (EESSI gets a mention)
  • Q&A
    • Resources on AWS
      • Can we use those to emulate ARM/Power?
        • AWS has ARM
        • No Power though, could use QEMU
          • We have access to Power VMs in the US
      • Alan: Could repeat the scaling tests that we did for GROMACS for the paper
        • Create a big Magic Castle cluster which includes EFA fabric support for this
    • Will send out doodle for meeting regarding next pilot version