Sync meeting on EESSI test suite (2024 07 25)

EESSI test suite sync meetings

Planning

every 2 weeks on Thursday at 14:00 CE(S)T
next meetings:
- Thu 8 Aug'24 14:00 CEST (Caspar, Satish, Sam (maybe), Lara/Kenneth?)

Meeting (2024-07-25)

Attending: Sam Moors, Caspar van Leeuwen

Caspar worked on a tutorial for writing a portable test
- Based on the mpi4py all-reduce example
- Substantial progress, but not finished yet. We now have a standard ReFrame test and are discussing the steps needed to make it portable.
Three releases (0.3.0, 0.3.1, 0.3.2) were made by the end of June and used for the MultiXscale deliverable
- Still to do: update the docs:
  - Describe ESPResSo test cases (Satish)
  - Update tag names (Satish)
  - Add small section on debugging if a test doesn't succeed
    - Where to find the full logs
    - How to run manually
Apply memory limits to all tests using the memory hook
- Caspar will go through the tests and update them where needed
- Suggestion: run top and dump its output to figure out the maximum memory usage. More info: https://www.tecmint.com/save-top-command-output-to-a-file/
  - System-wide: for i in {1..4}; do sleep 0.1 && top -b -n1 | grep "MiB Mem" ; done > cron.txt
  - For a specific process: for i in {1..4}; do sleep 0.1 && top -b -n1 -p <pid>; done > cron.txt
- Suggestion 2: get it directly from /proc/<pid>/status
- Suggestion 3: get the maximum usage from the cgroup at the end of the job: cat /sys/fs/cgroup/memory/$(</proc/self/cpuset)/memory.max_usage_in_bytes
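
A minimal sketch of Suggestion 2, assuming the field of interest is VmHWM (the peak resident set size that Linux reports in /proc/<pid>/status). This is not code from the test suite, and the function names parse_vm_hwm_kib and peak_rss_kib are hypothetical, chosen for illustration only:

```python
import re
from pathlib import Path
from typing import Optional


def parse_vm_hwm_kib(status_text: str) -> Optional[int]:
    """Return the VmHWM value (peak resident set size, in kB) found in the
    contents of a /proc/<pid>/status file, or None if the field is absent
    (e.g. for kernel threads)."""
    match = re.search(r"^VmHWM:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def peak_rss_kib(pid: str = "self") -> Optional[int]:
    """Read /proc/<pid>/status (Linux only) and return the peak RSS in kB.
    The default 'self' refers to the calling process."""
    return parse_vm_hwm_kib(Path(f"/proc/{pid}/status").read_text())


if __name__ == "__main__":
    # Assumption: run on a Linux node, e.g. at the end of a job script
    print(f"Peak RSS of this process: {peak_rss_kib()} kB")
```

Note that VmHWM covers a single process, so for a multi-process job the values would have to be collected per rank, or the cgroup counter from Suggestion 3 used instead.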
OpenFOAM test
- Satish has a test case that works, but there is no ReFrame test yet, so no progress to report
Merged PRs
- Improve hook to allow launching 1 task per physical CPU or hardware thread, depending on what makes sense for the application #160
- Added LJ test to ESPResSo #155
- Fix memory request units for ESPResSo #158
Open PRs
- PyTorch: Caspar still needs to set OMP_NUM_THREADS, then Sam will look at it again
- CP2K:
  - OOM on Snellius for the 1/8th node test (16 cores). Caspar will rerun to see if it is consistent; if so, he will increase the memory request until the test succeeds.
  - Caspar will rerun on Karolina to see if the failures on 16 nodes are consistent
- LAMMPS:
  - Seems to be ready for review, but Lara isn't here to check
  - Caspar will try to run it on Snellius / Karolina, maybe Vega
  - Sam will try to have a look too
