
Sync meeting on EESSI test suite (2024 07 25)

Caspar van Leeuwen edited this page Jul 25, 2024 · 2 revisions

EESSI test suite sync meetings

Planning

  • every 2 weeks on Thursday at 14:00 CE(S)T
  • next meetings:
    • Thu 8 Aug'24 14:00 CEST (Caspar, Satish, Sam (maybe), Lara/Kenneth?)

Meeting (2024-07-25)

Attending: Sam Moors, Caspar van Leeuwen

  • Caspar worked on a tutorial for writing a portable test

    • Based on mpi4py all-reduce example
    • Substantial progress, but not finished yet. Now at the stage where we have a standard ReFrame test and are discussing the steps needed to make it portable.
  • Three releases (0.3.0, 0.3.1, 0.3.2) were made by the end of June and used for the deliverable in MultiXscale

    • Still to do: update docs:
      • Describe ESPResSo test cases (Satish)
      • Update tag names (Satish)
      • Add small section on debugging if a test doesn't succeed
        • Where to find the full logs
        • How to run manually
  • Apply memory limits using memory hook for all tests

    • Caspar will go through the tests and update them where needed
    • Suggestion 1: run top in a loop and dump its output to a file to determine the maximum memory usage
      • For the whole node: for i in {1..4}; do sleep 0.1 && top -b -n1 | grep "MiB Mem" ; done > cron.txt
      • For a specific process: for i in {1..4}; do sleep 0.1 && top -b -n1 -p <pid>; done > cron.txt
      • More info: https://www.tecmint.com/save-top-command-output-to-a-file/
    • Suggestion 2: read the memory usage directly from /proc/<pid>/status
    • Suggestion 3: get the maximum usage from the cgroup at the end of the job: cat /sys/fs/cgroup/memory/$(</proc/self/cpuset)/memory.max_usage_in_bytes
  • OpenFOAM test

    • Satish has a working test, but it has not been turned into a ReFrame test yet => no progress
  • Merged PRs

    • Improve hook to allow launching 1 task per physical CPU or hardware thread, depending on what makes sense for the application #160
    • Added LJ test to ESPResSo #155
    • Fix memory request units for ESPResSo #158
  • Open PRs

    • PyTorch: Caspar still needs to set OMP_NUM_THREADS, then Sam will look at it again

    • CP2K:

      • OOM on Snellius for the 1/8th node test (16 cores). Caspar will rerun to see if it is consistent; if so, he will increase the memory request until the test succeeds.
      • Caspar will rerun on Karolina to see if the failures on 16 nodes are consistent
    • LAMMPS:

      • Seems to be ready for review, but Lara isn't here to check
      • Caspar will try to run it on Snellius / Karolina, maybe Vega
      • Sam will try to have a look too
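
Sketch for the memory-limit item above (Suggestion 2, reading /proc/<pid>/status): a minimal Python example, assuming a Linux system where the status file exposes the VmHWM field (the peak resident set size, reported in kB). The function names parse_peak_rss_kib and peak_rss_kib are hypothetical helpers and are not part of the EESSI test suite or ReFrame.

```python
import os
import re
from typing import Optional


def parse_peak_rss_kib(status_text: str) -> Optional[int]:
    """Extract VmHWM (peak resident set size, in kB) from the text of a
    /proc/<pid>/status file. Returns None if the field is missing,
    which is the case for kernel threads, for example."""
    match = re.search(r"^VmHWM:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def peak_rss_kib(pid: Optional[int] = None) -> Optional[int]:
    """Read /proc/<pid>/status (Linux only) and return VmHWM in kB.
    Defaults to the calling process when no pid is given."""
    pid = os.getpid() if pid is None else pid
    with open(f"/proc/{pid}/status") as handle:
        return parse_peak_rss_kib(handle.read())
```

The status file disappears when the process exits, so peak_rss_kib(pid) has to be called while the application is still running, for instance just before it finishes. The value covers a single process only; for a multi-process job the cgroup counter from Suggestion 3 is likely the better source.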