Sync meeting on EESSI Software layer (2023-10-03)
- next meeting
  - Tue 10 Oct at 09:00 CEST
- attending: Kenneth, Lara, Bob, Julián, Pedro, Thomas, Richard, Caspar, Alan
- status update eessi.io
  - Stratum-0 is set up at RUG
  - single Stratum-1 running in AWS (using S3 backend)
    - test setup
    - required lots of manual work (create VM + S3 bucket) because Atlantis wasn't working
    - Ansible playbooks sort of worked, but do not support S3 buckets yet
      - see WIP filesystem-layer PR #160
    - GeoAPI doesn't work well with S3 buckets, clients go straight to the S3 bucket (see the Geo API sketch below)
  - need to figure out:
    - how many Stratum-1's do we want (initially)?
      - currently we have 4 for eessi-hpc.org (AWS, Azure, RUG in NL, BGO in Norway)
    - how to deal with S3 buckets vs GeoAPI
    - who should have admin access?
    - DNS
    - using CDN (CloudFlare)?
  - sync meeting being planned
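The Geo API is served by a Stratum-1's web frontend, which a plain S3 bucket does not provide; the sketch below (not an EESSI tool; hostnames and the proxy name are made up, assuming the CVMFS Geo API endpoint format `/cvmfs/<repo>/api/v1.0/geo/<proxy>/<server list>`) shows the kind of request a client uses to get replicas ordered by distance, and hence what is lost when clients talk straight to the bucket.

```python
# Hedged sketch: ask a Stratum-1's Geo API to order a list of replica servers by
# distance from a client. A plain S3 bucket only serves objects, so it cannot answer
# this request -- which is why clients pointed straight at the bucket bypass
# geo-sorting. All hostnames below are hypothetical.
import urllib.request

stratum1 = "http://aws-eu-west-s1.eessi.example.org"   # hypothetical Stratum-1 host
repo = "software.eessi.io"
servers = ",".join([
    "aws-eu-west-s1.eessi.example.org",                # hypothetical replica hosts
    "azure-us-east-s1.eessi.example.org",
    "rug-nl-s1.eessi.example.org",
])
# Geo API endpoint (assumed format): /cvmfs/<repo>/api/v1.0/geo/<proxy>/<server list>
url = f"{stratum1}/cvmfs/{repo}/api/v1.0/geo/my-site-proxy.example.org/{servers}"

with urllib.request.urlopen(url, timeout=10) as resp:
    # Reply is a comma-separated list of 1-based indices, closest server first.
    order = [int(i) for i in resp.read().decode().strip().split(",")]
print("preferred server order:", order)
```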
- bot
  - merged PRs:
    - v0.1.0 released: https://github.com/EESSI/eessi-bot-software-layer/releases/tag/v0.1.0
      - `develop` branch - for active development (PRs)
      - `main` branch always corresponds to latest release
  - open PRs:
    - script to clean up tarballs of jobs given a PR number (PR #217); see the clean-up sketch below
      - can let bot use this when a PR is merged/closed
      - only cleans up large "checkpoint" tarballs for now, should eventually clean up everything related to a PR?
  - next steps
    - test step in between build & deploy
    - make deploy step agnostic of EESSI
- new Slurm clusters for bot
  - new Slurm clusters are being set up with Magic Castle
    - in AWS: set up, need to test bot there
      - will (very) soon replace current CitC cluster...
      - next steps
        - create more accounts
        - increase disk space to couple of TBs (no EFS used there)
    - in Azure: to set up, need to figure out account/API stuff
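The actual script in PR #217 may look quite different; the sketch below only illustrates the idea, assuming a hypothetical `<jobs_base_dir>/<PR number>/.../*.tar.gz` layout and hypothetical option names.

```python
#!/usr/bin/env python3
# Hedged sketch of the tarball clean-up idea from PR #217 (not the actual script):
# remove job tarballs that belong to a given PR number, assuming a hypothetical
# layout <jobs_base_dir>/<pr_number>/<job_dir>/*.tar.gz.
import argparse
import os
from pathlib import Path

def cleanup_tarballs(jobs_base_dir: Path, pr_number: int, dry_run: bool = True) -> None:
    pr_dir = jobs_base_dir / str(pr_number)
    if not pr_dir.is_dir():
        print(f"no job directory found for PR #{pr_number} under {jobs_base_dir}")
        return
    for tarball in sorted(pr_dir.rglob("*.tar.gz")):
        size_gb = tarball.stat().st_size / 1e9
        action = "would remove" if dry_run else "removing"
        print(f"{action} {tarball} ({size_gb:.1f} GB)")
        if not dry_run:
            tarball.unlink()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Clean up job tarballs for a given PR")
    parser.add_argument("pr_number", type=int)
    parser.add_argument("--jobs-base-dir", type=Path,
                        default=Path(os.getenv("JOBS_BASE_DIR", ".")))
    parser.add_argument("--delete", action="store_true",
                        help="actually delete (default: dry run)")
    args = parser.parse_args()
    cleanup_tarballs(args.jobs_base_dir, args.pr_number, dry_run=not args.delete)
```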
- software-layer
  - merged PRs
    - foss/2023a (PR #334)
      - ignore flaky failing FFTW.MPI tests (see issue #325)
      - use patch to fix detection of Neoverse V1 in OpenBLAS (cfr. easyconfigs PR #18870)
    - foss/2022a (PR #310)
    - R v4.1.0 w/ foss/2021a (PR #328)
    - add YAML file to keep track of known issues in EESSI pilot 2023.06 (PR #340)
    - only increase limit for numerical test failures for OpenBLAS for aarch64/neoverse_v1 (merged PR #345)
  - open PRs
    - TensorFlow
      - TensorFlow v2.7.1 with `foss/2021b` (PR #321)
        - several test failures in `aarch64/*` targets
          - may be fixable by backporting a couple of patches, but maybe not worth the trouble?
      - TensorFlow v2.8.4 with `foss/2021b` (PR #343)
        - assembler errors on `aarch64/*` when building XNNPACK
          - due to use of `-mcpu=native`, which clashes with custom `-march=...` options used by the XNNPACK build procedure
            - see also easyconfigs issue #18899
          - should be fixed by making sure that `-mcpu=...` is not used when building XNNPACK, see easyblocks PR #3011 (see the flag-filtering sketch below)
      - TensorFlow v2.11.0 with `foss/2022a` (PR #346)
        - assembler errors on `aarch64/*` when building XNNPACK, fixed with easyblocks PR #3011
      - TensorFlow v2.13.0 with `foss/2022b` (PR #347)
        - 928 failing `scipy` tests on `aarch64/neoverse_v1` ...
        - build error on `x86_64/intel/haswell` because `/usr/include/stdio.h` is picked up
          - need to set `$TF_SYSROOT`?
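For reference, the fix in easyblocks PR #3011 amounts to keeping `-mcpu=...` out of the compiler flags used for the XNNPACK build; the snippet below is an illustrative sketch of that kind of flag filtering, not the actual easyblock code.

```python
# Illustrative sketch (not the code from easyblocks PR #3011): strip any -mcpu=...
# option from the compiler flags passed to the XNNPACK build, since it clashes with
# the -march=... options that XNNPACK's build procedure sets itself.
def strip_mcpu(flags: str) -> str:
    """Remove any -mcpu=<value> option from a string of compiler flags."""
    return " ".join(flag for flag in flags.split() if not flag.startswith("-mcpu="))

# Example with flags similar to what aarch64 builds use (-mcpu=native):
print(strip_mcpu("-O2 -ftree-vectorize -mcpu=native -fno-math-errno"))
# -> -O2 -ftree-vectorize -fno-math-errno
```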
    - matplotlib v3.4.3 with `foss/2021b` (PR #339)
      - open PR for Pillow in EasyBuild: PR #18881
    - ESPResSo
    - WRF (PR #336)
      - failing netCDF tests due to RPATH issue (see the RPATH check sketch below)
        - seems to be caused by `-DCMAKE_SKIP_RPATH=ON` that was added in https://github.com/easybuilders/easybuild-easyblocks/pull/1031 (Nov 2016)
          - maybe needed due to a bug in old CMake versions
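One quick way to confirm the RPATH hypothesis is to check whether the installed netCDF libraries still carry an RPATH/RUNPATH entry; the helper below is a hedged sketch (the library path is hypothetical), not part of the WRF PR.

```python
# Hedged helper: check whether a built binary or library carries an RPATH/RUNPATH
# entry, to confirm that -DCMAKE_SKIP_RPATH=ON stripped it. Path below is hypothetical.
import subprocess
import sys

def has_rpath(binary: str) -> bool:
    out = subprocess.run(["readelf", "-d", binary],
                         capture_output=True, text=True, check=True)
    return any(tag in out.stdout for tag in ("RPATH", "RUNPATH"))

if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "/path/to/netcdf/lib/libnetcdf.so"
    print(f"{target}: RPATH/RUNPATH present: {has_rpath(target)}")
```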
  - notes
    - should add "missing" YAML file (like for old TensorFlow versions on `aarch64/*`)
  - next packages
    - OpenFOAM
    - newer R
      - Bioconductor
    - AlphaFold (GPU)
- GPU
  - we should set up a meeting to figure out the right steps...
  - plan is to look into supporting GPUs in `software.eessi.io` CVMFS repo
    - is ldconfig OK with non-existing paths (to system paths)? also, order matters (see the ldconfig sketch below)
    - Apptainer also uses ldconfig to figure out paths to required libraries
    - CUDA compat libs (could be avoided, only needed as a fallback)
    - last location: Apptainer libs
  - first step should be to get it working on the assumption that the GPU driver is sufficiently recent
  - Alan will look into planning a sync meeting on GPU support
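To make the "order matters" point concrete: a hedged sketch (not an agreed EESSI implementation; all paths and the drop-in file name are hypothetical) of writing candidate library locations in priority order to an ld.so.conf.d drop-in and refreshing the linker cache.

```python
# Hedged sketch of the "order matters" idea for ldconfig-based GPU support: list
# candidate library locations in priority order -- host GPU driver libs first, CUDA
# compat libs only as a fallback, Apptainer-injected libs last -- then rebuild the
# cache. Requires root; all paths and the drop-in name are hypothetical.
import subprocess

SEARCH_PATHS = [
    "/usr/lib64/nvidia",                                         # host GPU driver libs (preferred)
    "/cvmfs/software.eessi.io/host_injections/nvidia/compat",    # CUDA compat libs (fallback)
    "/.singularity.d/libs",                                      # Apptainer-injected libs (last)
]

def write_ld_conf(dropin: str = "/etc/ld.so.conf.d/99-eessi-gpu.conf") -> None:
    # Directories that do not exist are listed too; whether ldconfig happily skips
    # them is exactly the open question raised in the notes above.
    with open(dropin, "w") as fh:
        fh.write("\n".join(SEARCH_PATHS) + "\n")
    subprocess.run(["ldconfig"], check=True)   # rebuild /etc/ld.so.cache

if __name__ == "__main__":
    write_ld_conf()
```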
- previous meetings:
  - https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-Software-layer-(2023-09-26)
  - https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-Software-layer-(2023-09-20)
  - https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-Software-layer-(2023-09-12)
  - https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-Software-layer-(2023-09-05)