
Add test for QuantumESPRESSO (pw.x) #128

Merged: 17 commits merged into EESSI:main on May 29, 2024

Conversation

@Crivella (Contributor) commented Mar 19, 2024

This test requires the PR to ReFrame's hpctestlib to be merged first.

I based this on the GROMACS test and adjusted from there.
One thing that should still be discussed is what machine configurations to run the test on.

The current input file and parameters generate 4 configurations that can take from ~15 s to ~15 min to run on 4 performance cores of an i9-13900K.

It is possible to increase:

  • ecut (size of the FFT grids): I would not go above 200~250, as it is highly unlikely anyone would ever set it higher (in an actual calculation, typical values range around 50~150)
  • nbnd (size of the matrices to diagonalize): it should be tested how high this can be pushed without changing the input file (i.e. without increasing the number of atoms), as working with too many empty states could cause problems

My idea when writing the ReFrame test was to have something mid-weight that can also be run in a reasonable time on a single workstation, while giving control to stress either the FFT or the diagonalization algorithms using ecut and nbnd respectively (see the sketch below).
If we want to stress multi-node runs further, then we could also think about using the input files from https://gitlab.com/QEF/q-e/-/tree/develop/test-suite/benchmarks/pw
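
For reference, here is a minimal sketch of how these two knobs could be exposed as ReFrame test parameters. This is only an illustration: the class layout and placeholder attributes are assumptions (in the actual test the parameters may come from hpctestlib's QEspressoPWCheck base class), and the values are taken from the test variants that appear later in this thread.

import reframe as rfm
from reframe.core.builtins import parameter


class EESSI_QuantumESPRESSO_PW(rfm.RunOnlyRegressionTest):
    # Placeholders so the sketch is a complete ReFrame test skeleton
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'pw.x'
    executable_opts = ['-in', 'Si.scf.in']

    # ecut (in Ry) controls the size of the FFT grids, nbnd the number of bands
    # (i.e. the size of the matrices to diagonalize); 2 x 2 values give 4 configurations.
    ecut = parameter([50, 150])
    nbnd = parameter([10, 200])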

@casparvl (Collaborator)

@Crivella thanks for the PR! Some general explanation (since we are still lacking proper docs on the hooks and constants we implemented):

hooks.filter_valid_systems_by_device_type(self, required_device_type=self.nb_impl)

is probably not what you wanted to do: self.nb_impl was a GROMACS-specific test parameter. Does QE have GPU support? If so, you probably want to define a

    compute_device = parameter([DEVICE_TYPES[CPU], DEVICE_TYPES[GPU]])

in the class body, and then

hooks.filter_valid_systems_by_device_type(self, required_device_type=self.compute_device)

What this will do is essentially create two test variants of your test: one intended for CPUs, and one intended for GPUs. A call to

hooks.filter_valid_systems_by_device_type(self, required_device_type=self.compute_device)

for test variants where self.compute_device=DEVICE_TYPES[GPU] will make sure that such a test variant is only run when two conditions are met:

  1. The ReFrame configuration for the partition indicates that it has GPUs (see e.g. here)
  2. The module found through find_modules('QuantumESPRESSO') was built with GPU support (this is inferred from the -CUDA suffix that such modules typically have).

Now, at this stage, you might be wondering: why do I need separate test variants for CPU and GPU? For some codes, you might need to alter the command line arguments to tell the code to use GPUs. But even if codes are intelligent enough, the number of tasks you want to spawn is typically very different for CPUs and GPUs. A common case would be that on a CPU node, for a pure MPI code, you'd spawn 1 MPI rank per CPU core. On a GPU node, you typically spawn 1 MPI rank per GPU.

That is exactly what the assign_tasks_per_compute_unit hook can then do for you:

        if self.compute_device == DEVICE_TYPES[GPU]:
            hooks.assign_tasks_per_compute_unit(test=self, compute_unit=COMPUTE_UNIT[GPU])
        else:
            hooks.assign_tasks_per_compute_unit(test=self, compute_unit=COMPUTE_UNIT[CPU])

All in all, this would

  • Create the CPU variant of the test (i.e. one that starts 1 MPI rank per core) if a partition indicates it has a CPU feature
  • Create the GPU variant of the test case (i.e. one that starts 1 MPI rank per GPU) if a partition indicates it has the GPU feature

Note that this also provides an easy way to skip running CPU tests on GPU nodes: in our ReFrame config, we simply don't add a CPU feature.

Now, all of this is under the premise that QE has GPU support, and that this test case would support it as well, of course :) If it doesn't, then probably just calling

hooks.filter_valid_systems_by_device_type(self, required_device_type=DEVICE_TYPES[CPU])

and

hooks.assign_tasks_per_compute_unit(test=self, compute_unit=COMPUTE_UNIT[CPU])

(i.e. without defining a self.compute_device parameter) makes the most sense.
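
Pulling the CPU-only pieces together, here is a hedged sketch of how this could look inside the test class; the import paths, constant names and hook phases are assumptions based on the snippets above, not the actual PR code.

import reframe as rfm
from reframe.core.builtins import run_after

from eessi.testsuite import hooks
from eessi.testsuite.constants import COMPUTE_UNIT, CPU, DEVICE_TYPES


class EESSI_QuantumESPRESSO_PW(rfm.RunOnlyRegressionTest):
    @run_after('init')
    def filter_systems(self):
        # Only generate test cases for partitions that advertise a CPU device type
        hooks.filter_valid_systems_by_device_type(self, required_device_type=DEVICE_TYPES[CPU])

    @run_after('setup')
    def assign_tasks(self):
        # Pure MPI: one task per core of the allocated partition
        hooks.assign_tasks_per_compute_unit(test=self, compute_unit=COMPUTE_UNIT[CPU])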

I'll make an official review in which I'll include a few changes that I made to your test, after which I was able to run it successfully.

@casparvl (Collaborator)

Maybe also good to check if the generated job scripts make sense to you. E.g. on a partition with 192 cores, I get:

#!/bin/bash
#SBATCH --job-name="rfm_EESSI_QuantumESPRESSO_PW_50bf6178"
#SBATCH --ntasks=192
#SBATCH --ntasks-per-node=192
#SBATCH --cpus-per-task=1
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p genoa
#SBATCH --export=None
source /cvmfs/software.eessi.io/versions/2023.06/init/bash
module load QuantumESPRESSO/7.2-foss-2022b
export OMP_NUM_THREADS=1
wget -q http://pseudopotentials.quantum-espresso.org/upf_files/Si.pbe-n-kjpaw_psl.1.0.0.UPF
mpirun -np 192 pw.x -in Si.scf.in

And on half of a 128-core node:

#!/bin/bash
#SBATCH --job-name="rfm_EESSI_QuantumESPRESSO_PW_07ac3074"
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=1
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p rome
#SBATCH --export=None
source /cvmfs/software.eessi.io/versions/2023.06/init/bash
module load QuantumESPRESSO/7.2-foss-2022b
export OMP_NUM_THREADS=1
wget -q http://pseudopotentials.quantum-espresso.org/upf_files/Si.pbe-n-kjpaw_psl.1.0.0.UPF
mpirun -np 64 pw.x -in Si.scf.in

Does that make sense? Is this how you'd run it when you'd write a batch script yourself?

@Crivella (Contributor, Author) commented Apr 3, 2024

Hi @casparvl, thank you for the detailed guide!

QE does have GPU support, but for now there isn't a build recipe for it.
I think it would make sense to start implementing it once the CMake easyblock PR is merged (in the ConfigureMake version, the configure command needs the compute capability and toolkit runtime version specified manually).

For now, do you think it would make sense to implement the changes you suggested for CPU only and add the GPU support afterwards?

Or maybe go hybrid and implement all changes but set

compute_device = parameter([DEVICE_TYPES[CPU]])

@casparvl (Collaborator) commented Apr 3, 2024

Or maybe go hybrid and implement all changes but set

Up to you, but I have a small preference for the hybrid approach. It shouldn't be too complicated right now, and it allows for easier extension once we have a GPU-capable build. (You could even comment out the parts that aren't relevant yet, to make enabling them later even easier.)
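
Concretely, the commented-out variant could look something like this (a sketch; the import paths are assumed from the snippets above):

from reframe.core.builtins import parameter
from eessi.testsuite.constants import CPU, DEVICE_TYPES

# In the test class body: CPU-only for now; uncomment the GPU entry once a
# CUDA-capable QuantumESPRESSO build is available.
# compute_device = parameter([DEVICE_TYPES[CPU], DEVICE_TYPES[GPU]])
compute_device = parameter([DEVICE_TYPES[CPU]])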

Out of curiosity: if you had a 128 core node, how would you write the batch script normally? (e.g. would you use pure MPI? Or hybrid MPI/OpenMP? Would you do binding?)

@Crivella (Contributor, Author) commented Apr 3, 2024

Up to you, but I have a small preference for the hybrid approach. It shouldn't be too complicated right now, and it allows for easier extension once we have a GPU-capable build. (You could even comment out the parts that aren't relevant yet, to make enabling them later even easier.)

OK, I will work on it.

Out of curiosity: if you had a 128 core node, how would you write the batch script normally? (e.g. would you use pure MPI? Or hybrid MPI/OpenMP? Would you do binding?)

In general it is preferable to go MPI-only with QE, especially for large calculations.
OpenMP scales up to 4~8 threads (usually not as well as MPI).
If we also want to run the smallest version of the test and it crashes, then we could do a 48-task / 4-thread hybrid run to make it fit.
Another reason I've seen OpenMP used with QE is either on GPU nodes, to try to use all the CPUs available (if oversubscribing with MPI tasks is not required), or for memory reasons (the most efficient parallelization in QE is often over the k-points, but it also scales the RAM used roughly linearly).
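
To illustrate that fallback, here is a hypothetical way to express a 48-task x 4-thread hybrid layout in ReFrame terms; the attribute names are standard ReFrame (env_vars assumes a recent ReFrame version), and the 48 x 4 split assumes a 192-core node, it is not part of this PR.

import reframe as rfm
from reframe.core.builtins import run_after


class HybridPWExample(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'pw.x'
    executable_opts = ['-in', 'Si.scf.in']

    @run_after('setup')
    def set_hybrid_layout(self):
        # Hypothetical fallback: 48 MPI ranks x 4 OpenMP threads on a 192-core node
        self.num_tasks = 48
        self.num_tasks_per_node = 48
        self.num_cpus_per_task = 4
        self.env_vars['OMP_NUM_THREADS'] = str(self.num_cpus_per_task)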

@Crivella (Contributor, Author) commented Apr 4, 2024

@casparvl
I was able to simulate a 192-core run on my workstation by using --host 127.0.0.1:192 (although it was very slow, since I only have 24 real cores / 32 with hyperthreading).

So as far as that goes, I think the test should be able to run MPI-only.
I was wondering whether you think I should also add the CPU_SOCKET compute unit to test hybrid parallelization (or whether some other way is preferable for testing MPI+OpenMP).


Tested with QE v7.1 / v7.2 / v7.3: they also handle the smaller test case well, especially since, after 7.0, the code started guessing the best resource subdivision if none is specified.
In this case it automatically sets -npools 12, leaving 16 tasks per pool, which should be fine considering that for ecut = 50 the FFT grid Z size is 36. (In theory the number of tasks per pool should not exceed this number for the code to scale, and in my experience once it exceeds about half this number the scaling starts to deviate from linear.)

     Dense  grid:    12627 G-vectors     FFT dimensions: (  36,  36,  36)
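
If one ever wanted to pin that choice instead of relying on QE's auto-selection, the pools could be passed explicitly on the pw.x command line; in the test this might look like the following hook (hypothetical, the test as written lets pw.x decide):

from reframe.core.builtins import run_after

# Inside the test class (hypothetical):
@run_after('setup')
def set_pool_parallelization(self):
    # Force 12 k-point pools, i.e. 16 tasks per pool when running 192 tasks
    self.executable_opts += ['-npools', '12']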

@casparvl (Collaborator) commented May 7, 2024

Sorry for the super long silence, I was on leave (a long one).

So as far as that goes, I think the test should be able to run MPI-only.
I was wondering whether you think I should also add the CPU_SOCKET compute unit to test hybrid parallelization (or whether some other way is preferable for testing MPI+OpenMP).

Since you mentioned that QE's implementation of OpenMP parallelism doesn't typically scale beyond 8 cores, and knowing that most HPC CPU sockets have way more than that (64 cores or more per socket is no exception nowadays), I'd say let's keep it pure MPI. It's also slightly simpler, so that's a plus. The goal of the test suite is to be able to spot performance regressions for realistic use cases. Since you as an expert would run this as pure MPI, I think that makes it a realistic case.

On a side note: I am introducing NUMA_NODE as a compute unit in #130. Once that is merged, if you are curious, you could locally adapt your test to run with NUMA_NODE as compute_unit. A NUMA node is typically smaller (although already 16-24 cores on modern-day HPC CPUs). If that performs reasonably well, we could consider adding it later - but let's save that for a follow-up PR since it depends on #130 anyway.
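
For reference, that local experiment would presumably be a one-line change in the hook call; COMPUTE_UNIT[NUMA_NODE] is the constant described for #130 and does not exist until that PR is merged.

# Hypothetical follow-up once #130 is in: group tasks per NUMA domain
# (for hybrid MPI+OpenMP) instead of one MPI rank per core
hooks.assign_tasks_per_compute_unit(test=self, compute_unit=COMPUTE_UNIT[NUMA_NODE])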

@casparvl (Collaborator) commented May 7, 2024

I ran with

reframe -t CI -c test-suite/eessi/testsuite/tests/apps/QuantumESPRESSO.py --run --system=snellius:rome

on CPU nodes that have 128 cores per node (i.e. the 1_node scale will launch 128 tasks on a single node here). I've included the output here below. Could you have a look and see if this looks reasonable to you? I'm not familiar with the different computational parts that are timed. Clearly the PWSCF part doesn't scale great, I'm assuming that's because it is a very small test case. Seems the fftw part does scale really well.

I will also run without the CI tag to make sure the other configurations run successfully as well, but I won't copy all the output here - it's too much, and if it makes sense for the CI case, I'm happy to assume it also makes sense for the other cases ;-) If all of that completes, I think this test is ready to be merged :)

Output:
[       OK ] ( 1/13) EESSI_QuantumESPRESSO_PW %ecut=50 %nbnd=10 %scale=1_2_node %module_name=QuantumESPRESSO/7.2-foss-2022b %compute_device=cpu /07ac3074 @snellius:rome+default
P: extract_report_time: 0 s (r:0, l:None, u:None)
P: PWSCF_cpu: 2.8 s (r:0, l:None, u:None)
P: PWSCF_wall: 7.94 s (r:0, l:None, u:None)
P: electrons_cpu: 1.42 s (r:0, l:None, u:None)
P: electrons_wall: 3.9 s (r:0, l:None, u:None)
P: c_bands_cpu: 1.0 s (r:0, l:None, u:None)
P: c_bands_wall: 1.05 s (r:0, l:None, u:None)
P: cegterg_cpu: 0.82 s (r:0, l:None, u:None)
P: cegterg_wall: 0.86 s (r:0, l:None, u:None)
P: calbec_cpu: 0.04 s (r:0, l:None, u:None)
P: calbec_wall: 0.04 s (r:0, l:None, u:None)
P: fft_cpu: 0.22 s (r:0, l:None, u:None)
P: fft_wall: 0.23 s (r:0, l:None, u:None)
P: ffts_cpu: 0.0 s (r:0, l:None, u:None)
P: ffts_wall: 0.0 s (r:0, l:None, u:None)
P: fftw_cpu: 0.62 s (r:0, l:None, u:None)
P: fftw_wall: 0.65 s (r:0, l:None, u:None)
[       OK ] ( 2/13) EESSI_QuantumESPRESSO_PW %ecut=50 %nbnd=10 %scale=8_nodes %module_name=QuantumESPRESSO/7.2-foss-2022b %compute_device=cpu /3f65de7d @snellius:rome+default
P: extract_report_time: 0 s (r:0, l:None, u:None)
P: PWSCF_cpu: 21.3 s (r:0, l:None, u:None)
P: PWSCF_wall: 34.49 s (r:0, l:None, u:None)
P: electrons_cpu: 1.74 s (r:0, l:None, u:None)
P: electrons_wall: 2.0 s (r:0, l:None, u:None)
P: c_bands_cpu: 0.52 s (r:0, l:None, u:None)
P: c_bands_wall: 0.53 s (r:0, l:None, u:None)
P: cegterg_cpu: 0.06 s (r:0, l:None, u:None)
P: cegterg_wall: 0.06 s (r:0, l:None, u:None)
P: calbec_cpu: 0.0 s (r:0, l:None, u:None)
P: calbec_wall: 0.0 s (r:0, l:None, u:None)
P: fft_cpu: 0.22 s (r:0, l:None, u:None)
P: fft_wall: 0.23 s (r:0, l:None, u:None)
P: ffts_cpu: 0.02 s (r:0, l:None, u:None)
P: ffts_wall: 0.02 s (r:0, l:None, u:None)
P: fftw_cpu: 0.05 s (r:0, l:None, u:None)
P: fftw_wall: 0.05 s (r:0, l:None, u:None)
[       OK ] ( 3/13) EESSI_QuantumESPRESSO_PW %ecut=50 %nbnd=10 %scale=4_nodes %module_name=QuantumESPRESSO/7.2-foss-2022b %compute_device=cpu /aaf5a37a @snellius:rome+default
P: extract_report_time: 0 s (r:0, l:None, u:None)
P: PWSCF_cpu: 10.88 s (r:0, l:None, u:None)
P: PWSCF_wall: 15.95 s (r:0, l:None, u:None)
P: electrons_cpu: 0.69 s (r:0, l:None, u:None)
P: electrons_wall: 1.04 s (r:0, l:None, u:None)
P: c_bands_cpu: 0.25 s (r:0, l:None, u:None)
P: c_bands_wall: 0.26 s (r:0, l:None, u:None)
P: cegterg_cpu: 0.11 s (r:0, l:None, u:None)
P: cegterg_wall: 0.12 s (r:0, l:None, u:None)
P: calbec_cpu: 0.01 s (r:0, l:None, u:None)
P: calbec_wall: 0.01 s (r:0, l:None, u:None)
P: fft_cpu: 0.24 s (r:0, l:None, u:None)
P: fft_wall: 0.25 s (r:0, l:None, u:None)
P: ffts_cpu: 0.0 s (r:0, l:None, u:None)
P: ffts_wall: 0.0 s (r:0, l:None, u:None)
P: fftw_cpu: 0.08 s (r:0, l:None, u:None)
P: fftw_wall: 0.09 s (r:0, l:None, u:None)
[       OK ] ( 4/13) EESSI_QuantumESPRESSO_PW %ecut=50 %nbnd=10 %scale=2_nodes %module_name=QuantumESPRESSO/7.2-foss-2022b %compute_device=cpu /84dec31f @snellius:rome+default
P: extract_report_time: 0 s (r:0, l:None, u:None)
P: PWSCF_cpu: 14.09 s (r:0, l:None, u:None)
P: PWSCF_wall: 17.3 s (r:0, l:None, u:None)
P: electrons_cpu: 1.26 s (r:0, l:None, u:None)
P: electrons_wall: 1.78 s (r:0, l:None, u:None)
P: c_bands_cpu: 0.42 s (r:0, l:None, u:None)
P: c_bands_wall: 0.44 s (r:0, l:None, u:None)
P: cegterg_cpu: 0.22 s (r:0, l:None, u:None)
P: cegterg_wall: 0.23 s (r:0, l:None, u:None)
P: calbec_cpu: 0.01 s (r:0, l:None, u:None)
P: calbec_wall: 0.01 s (r:0, l:None, u:None)
P: fft_cpu: 0.23 s (r:0, l:None, u:None)
P: fft_wall: 0.23 s (r:0, l:None, u:None)
P: ffts_cpu: 0.0 s (r:0, l:None, u:None)
P: ffts_wall: 0.0 s (r:0, l:None, u:None)
P: fftw_cpu: 0.16 s (r:0, l:None, u:None)
P: fftw_wall: 0.18 s (r:0, l:None, u:None)
[       OK ] ( 5/13) EESSI_QuantumESPRESSO_PW %ecut=50 %nbnd=10 %scale=1_node %module_name=QuantumESPRESSO/7.2-foss-2022b %compute_device=cpu /50bf6178 @snellius:rome+default
P: extract_report_time: 0 s (r:0, l:None, u:None)
P: PWSCF_cpu: 4.82 s (r:0, l:None, u:None)
P: PWSCF_wall: 9.79 s (r:0, l:None, u:None)
P: electrons_cpu: 1.2 s (r:0, l:None, u:None)
P: electrons_wall: 1.51 s (r:0, l:None, u:None)
P: c_bands_cpu: 0.6 s (r:0, l:None, u:None)
P: c_bands_wall: 0.62 s (r:0, l:None, u:None)
P: cegterg_cpu: 0.41 s (r:0, l:None, u:None)
P: cegterg_wall: 0.43 s (r:0, l:None, u:None)
P: calbec_cpu: 0.02 s (r:0, l:None, u:None)
P: calbec_wall: 0.02 s (r:0, l:None, u:None)
P: fft_cpu: 0.23 s (r:0, l:None, u:None)
P: fft_wall: 0.23 s (r:0, l:None, u:None)
P: ffts_cpu: 0.0 s (r:0, l:None, u:None)
P: ffts_wall: 0.0 s (r:0, l:None, u:None)
P: fftw_cpu: 0.32 s (r:0, l:None, u:None)
P: fftw_wall: 0.33 s (r:0, l:None, u:None)
[       OK ] ( 6/13) EESSI_QuantumESPRESSO_PW %ecut=50 %nbnd=10 %scale=1_4_node %module_name=QuantumESPRESSO/7.2-foss-2022b %compute_device=cpu /df278d21 @snellius:rome+default
P: extract_report_time: 0 s (r:0, l:None, u:None)
P: PWSCF_cpu: 3.32 s (r:0, l:None, u:None)
P: PWSCF_wall: 9.12 s (r:0, l:None, u:None)
P: electrons_cpu: 2.37 s (r:0, l:None, u:None)
P: electrons_wall: 2.86 s (r:0, l:None, u:None)
P: c_bands_cpu: 1.87 s (r:0, l:None, u:None)
P: c_bands_wall: 1.95 s (r:0, l:None, u:None)
P: cegterg_cpu: 1.69 s (r:0, l:None, u:None)
P: cegterg_wall: 1.77 s (r:0, l:None, u:None)
P: calbec_cpu: 0.08 s (r:0, l:None, u:None)
P: calbec_wall: 0.08 s (r:0, l:None, u:None)
P: fft_cpu: 0.24 s (r:0, l:None, u:None)
P: fft_wall: 0.25 s (r:0, l:None, u:None)
P: ffts_cpu: 0.0 s (r:0, l:None, u:None)
P: ffts_wall: 0.0 s (r:0, l:None, u:None)
P: fftw_cpu: 1.26 s (r:0, l:None, u:None)
P: fftw_wall: 1.32 s (r:0, l:None, u:None)
[       OK ] ( 7/13) EESSI_QuantumESPRESSO_PW %ecut=50 %nbnd=10 %scale=16_nodes %module_name=QuantumESPRESSO/7.2-foss-2022b %compute_device=cpu /4ab41a78 @snellius:rome+default
P: extract_report_time: 0 s (r:0, l:None, u:None)
P: PWSCF_cpu: 83.03 s (r:0, l:None, u:None)
P: PWSCF_wall: 85.74 s (r:0, l:None, u:None)
P: electrons_cpu: 15.64 s (r:0, l:None, u:None)
P: electrons_wall: 15.96 s (r:0, l:None, u:None)
P: c_bands_cpu: 6.22 s (r:0, l:None, u:None)
P: c_bands_wall: 6.26 s (r:0, l:None, u:None)
P: cegterg_cpu: 0.66 s (r:0, l:None, u:None)
P: cegterg_wall: 0.66 s (r:0, l:None, u:None)
P: calbec_cpu: 0.05 s (r:0, l:None, u:None)
P: calbec_wall: 0.05 s (r:0, l:None, u:None)
P: fft_cpu: 0.36 s (r:0, l:None, u:None)
P: fft_wall: 0.37 s (r:0, l:None, u:None)
P: ffts_cpu: 0.21 s (r:0, l:None, u:None)
P: ffts_wall: 0.21 s (r:0, l:None, u:None)
P: fftw_cpu: 0.57 s (r:0, l:None, u:None)
P: fftw_wall: 0.57 s (r:0, l:None, u:None)
[       OK ] ( 8/13) EESSI_QuantumESPRESSO_PW %ecut=50 %nbnd=10 %scale=1_8_node %module_name=QuantumESPRESSO/7.2-foss-2022b %compute_device=cpu /b16b6ac9 @snellius:rome+default
P: extract_report_time: 0 s (r:0, l:None, u:None)
P: PWSCF_cpu: 12.58 s (r:0, l:None, u:None)
P: PWSCF_wall: 15.52 s (r:0, l:None, u:None)
P: electrons_cpu: 11.78 s (r:0, l:None, u:None)
P: electrons_wall: 13.05 s (r:0, l:None, u:None)
P: c_bands_cpu: 11.05 s (r:0, l:None, u:None)
P: c_bands_wall: 12.17 s (r:0, l:None, u:None)
P: cegterg_cpu: 10.71 s (r:0, l:None, u:None)
P: cegterg_wall: 11.79 s (r:0, l:None, u:None)
P: calbec_cpu: 1.31 s (r:0, l:None, u:None)
P: calbec_wall: 1.47 s (r:0, l:None, u:None)
P: fft_cpu: 0.22 s (r:0, l:None, u:None)
P: fft_wall: 0.23 s (r:0, l:None, u:None)
P: ffts_cpu: 0.0 s (r:0, l:None, u:None)
P: ffts_wall: 0.0 s (r:0, l:None, u:None)
P: fftw_cpu: 6.11 s (r:0, l:None, u:None)
P: fftw_wall: 6.6 s (r:0, l:None, u:None)
[       OK ] ( 9/13) EESSI_QuantumESPRESSO_PW %ecut=50 %nbnd=10 %scale=1_cpn_4_nodes %module_name=QuantumESPRESSO/7.2-foss-2022b %compute_device=cpu /7ba896b1 @snellius:rome+default
P: extract_report_time: 0 s (r:0, l:None, u:None)
P: PWSCF_cpu: 9.62 s (r:0, l:None, u:None)
P: PWSCF_wall: 26.34 s (r:0, l:None, u:None)
P: electrons_cpu: 8.15 s (r:0, l:None, u:None)
P: electrons_wall: 8.44 s (r:0, l:None, u:None)
P: c_bands_cpu: 6.52 s (r:0, l:None, u:None)
P: c_bands_wall: 6.71 s (r:0, l:None, u:None)
P: cegterg_cpu: 6.13 s (r:0, l:None, u:None)
P: cegterg_wall: 6.3 s (r:0, l:None, u:None)
P: calbec_cpu: 0.2 s (r:0, l:None, u:None)
P: calbec_wall: 0.2 s (r:0, l:None, u:None)
P: fft_cpu: 0.27 s (r:0, l:None, u:None)
P: fft_wall: 0.32 s (r:0, l:None, u:None)
P: ffts_cpu: 0.01 s (r:0, l:None, u:None)
P: ffts_wall: 0.01 s (r:0, l:None, u:None)
P: fftw_cpu: 5.02 s (r:0, l:None, u:None)
P: fftw_wall: 5.25 s (r:0, l:None, u:None)
[       OK ] (10/13) EESSI_QuantumESPRESSO_PW %ecut=50 %nbnd=10 %scale=1_cpn_2_nodes %module_name=QuantumESPRESSO/7.2-foss-2022b %compute_device=cpu /42db3ef7 @snellius:rome+default
P: extract_report_time: 0 s (r:0, l:None, u:None)
P: PWSCF_cpu: 16.09 s (r:0, l:None, u:None)
P: PWSCF_wall: 20.8 s (r:0, l:None, u:None)
P: electrons_cpu: 14.3 s (r:0, l:None, u:None)
P: electrons_wall: 14.62 s (r:0, l:None, u:None)
P: c_bands_cpu: 11.44 s (r:0, l:None, u:None)
P: c_bands_wall: 11.68 s (r:0, l:None, u:None)
P: cegterg_cpu: 10.72 s (r:0, l:None, u:None)
P: cegterg_wall: 10.92 s (r:0, l:None, u:None)
P: calbec_cpu: 0.29 s (r:0, l:None, u:None)
P: calbec_wall: 0.3 s (r:0, l:None, u:None)
P: fft_cpu: 0.25 s (r:0, l:None, u:None)
P: fft_wall: 0.28 s (r:0, l:None, u:None)
P: ffts_cpu: 0.0 s (r:0, l:None, u:None)
P: ffts_wall: 0.01 s (r:0, l:None, u:None)
P: fftw_cpu: 9.17 s (r:0, l:None, u:None)
P: fftw_wall: 9.34 s (r:0, l:None, u:None)
[       OK ] (11/13) EESSI_QuantumESPRESSO_PW %ecut=50 %nbnd=10 %scale=4_cores %module_name=QuantumESPRESSO/7.2-foss-2022b %compute_device=cpu /a8f9ad60 @snellius:rome+default
P: extract_report_time: 0 s (r:0, l:None, u:None)
P: PWSCF_cpu: 8.63 s (r:0, l:None, u:None)
P: PWSCF_wall: 10.21 s (r:0, l:None, u:None)
P: electrons_cpu: 7.72 s (r:0, l:None, u:None)
P: electrons_wall: 8.08 s (r:0, l:None, u:None)
P: c_bands_cpu: 6.16 s (r:0, l:None, u:None)
P: c_bands_wall: 6.34 s (r:0, l:None, u:None)
P: cegterg_cpu: 5.79 s (r:0, l:None, u:None)
P: cegterg_wall: 5.96 s (r:0, l:None, u:None)
P: calbec_cpu: 0.18 s (r:0, l:None, u:None)
P: calbec_wall: 0.18 s (r:0, l:None, u:None)
P: fft_cpu: 0.26 s (r:0, l:None, u:None)
P: fft_wall: 0.39 s (r:0, l:None, u:None)
P: ffts_cpu: 0.0 s (r:0, l:None, u:None)
P: ffts_wall: 0.01 s (r:0, l:None, u:None)
P: fftw_cpu: 4.76 s (r:0, l:None, u:None)
P: fftw_wall: 4.91 s (r:0, l:None, u:None)
[       OK ] (12/13) EESSI_QuantumESPRESSO_PW %ecut=50 %nbnd=10 %scale=2_cores %module_name=QuantumESPRESSO/7.2-foss-2022b %compute_device=cpu /b0433671 @snellius:rome+default
P: extract_report_time: 0 s (r:0, l:None, u:None)
P: PWSCF_cpu: 15.07 s (r:0, l:None, u:None)
P: PWSCF_wall: 36.31 s (r:0, l:None, u:None)
P: electrons_cpu: 13.81 s (r:0, l:None, u:None)
P: electrons_wall: 14.15 s (r:0, l:None, u:None)
P: c_bands_cpu: 11.0 s (r:0, l:None, u:None)
P: c_bands_wall: 11.26 s (r:0, l:None, u:None)
P: cegterg_cpu: 10.29 s (r:0, l:None, u:None)
P: cegterg_wall: 10.52 s (r:0, l:None, u:None)
P: calbec_cpu: 0.3 s (r:0, l:None, u:None)
P: calbec_wall: 0.31 s (r:0, l:None, u:None)
P: fft_cpu: 0.25 s (r:0, l:None, u:None)
P: fft_wall: 0.28 s (r:0, l:None, u:None)
P: ffts_cpu: 0.0 s (r:0, l:None, u:None)
P: ffts_wall: 0.0 s (r:0, l:None, u:None)
P: fftw_cpu: 8.66 s (r:0, l:None, u:None)
P: fftw_wall: 8.86 s (r:0, l:None, u:None)
[       OK ] (13/13) EESSI_QuantumESPRESSO_PW %ecut=50 %nbnd=10 %scale=1_core %module_name=QuantumESPRESSO/7.2-foss-2022b %compute_device=cpu /14550fd1 @snellius:rome+default
P: extract_report_time: 0 s (r:0, l:None, u:None)
P: PWSCF_cpu: 24.81 s (r:0, l:None, u:None)
P: PWSCF_wall: 27.25 s (r:0, l:None, u:None)
P: electrons_cpu: 22.72 s (r:0, l:None, u:None)
P: electrons_wall: 23.14 s (r:0, l:None, u:None)
P: c_bands_cpu: 17.82 s (r:0, l:None, u:None)
P: c_bands_wall: 18.08 s (r:0, l:None, u:None)
P: cegterg_cpu: 16.43 s (r:0, l:None, u:None)
P: cegterg_wall: 16.65 s (r:0, l:None, u:None)
P: calbec_cpu: 0.49 s (r:0, l:None, u:None)
P: calbec_wall: 0.5 s (r:0, l:None, u:None)
P: fft_cpu: 0.26 s (r:0, l:None, u:None)
P: fft_wall: 0.27 s (r:0, l:None, u:None)
P: ffts_cpu: 0.0 s (r:0, l:None, u:None)
P: ffts_wall: 0.0 s (r:0, l:None, u:None)
P: fftw_cpu: 13.62 s (r:0, l:None, u:None)
P: fftw_wall: 13.81 s (r:0, l:None, u:None)

@Crivella (Contributor, Author) commented May 7, 2024

I ran with

reframe -t CI -c test-suite/eessi/testsuite/tests/apps/QuantumESPRESSO.py --run --system=snellius:rome

on CPU nodes that have 128 cores per node (i.e. the 1_node scale will launch 128 tasks on a single node here). I've included the output here below. Could you have a look and see if this looks reasonable to you? I'm not familiar with the different computational parts that are timed. Clearly the PWSCF part doesn't scale great, I'm assuming that's because it is a very small test case. Seems the fftw part does scale really well.

I will also run without the CI tag to make sure the other configurations run successfully as well, but I won't copy all the output here - it's too much, and if it makes sense for the CI case, I'm happy to assume it also makes sense for the other cases ;-) If all of that completes, I think this test is ready to be merged :)

I think it makes sense: the CI run is extremely small, so I would expect it to stop scaling linearly after 1 node and to stop scaling at all around 2~4 nodes.

The 16-node run is definitely overkill, and most likely all the extra time is spent establishing the communication between nodes plus in the serial part of the code; that is why it takes longer. (It is probably overkill also for the larger test case implemented; if needed, we could add a test with ecut=250 and nbnd=300~400 to properly test at that scale.)

The PWSCF timer represents the total time taken by the run.
The others are the times taken by specific routines.

@casparvl (Collaborator) commented May 7, 2024

Ok, I don't mind that the tests run beyond where the problem scales - I imagine we'll still be able to see (further) performance degradation if we hit some regression.

All of the runs but one completed successfully:

EESSI_QuantumESPRESSO_PW %ecut=150 %nbnd=200 %scale=1_core %module_name=QuantumESPRESSO/7.2-foss-2022b %compute_device=cpu /617f261f @snellius:rome+default

timed out. Understandable: it's the heaviest case, on the smallest allocation (1 core). We should probably add some hook that makes it easy to naively scale the runtime with (the inverse of) the allocation size, or even use some coarse performance model, but lacking that... maybe the easiest fix is just a small addition that increases the max walltime to 1 hour for the largest test case, i.e. something like:

@run_after('init')
def set_increased_walltime(self):
    max_ecut = max(QEspressoPWCheck.ecut.values)
    max_nbnd = max(QEspressoPWCheck.nbnd.values)
    if self.ecut == max_ecut and self.nbnd == max_nbnd:
        self.time_limit = '60m'

(and then hope that it's enough - on 2 cores this case took 1475s). It's not a very elegant solution, but we can always refine later.

@Crivella (Contributor, Author) commented May 7, 2024

I've added the change.

In terms of scaling it should be ~N^2 with ecut and ~N log(N) with nbnd, which should be reflected in the fftw and c_bands reports respectively, but it would be good to have more data, especially for the larger benchmarks. E.g. the FFTs are very demanding in terms of MPI communication, so they will likely slow down if their parallelization cannot fit on a single node, even though I do not think we will hit this limit with 128 CPUs per node unless we craft an input specifically to push it.

@casparvl (Collaborator)

Ok, perfect:

[----------] start processing checks
[ RUN      ] EESSI_QuantumESPRESSO_PW %ecut=150 %nbnd=200 %scale=1_core %module_name=QuantumESPRESSO/7.2-foss-2022b %compute_device=cpu /617f261f @snellius:rome+default
[       OK ] (1/1) EESSI_QuantumESPRESSO_PW %ecut=150 %nbnd=200 %scale=1_core %module_name=QuantumESPRESSO/7.2-foss-2022b %compute_device=cpu /617f261f @snellius:rome+default
P: extract_report_time: 0 s (r:0, l:None, u:None)
P: PWSCF_cpu: 2355.5 s (r:0, l:None, u:None)
P: PWSCF_wall: 2403.99 s (r:0, l:None, u:None)
P: electrons_cpu: 2243.11 s (r:0, l:None, u:None)
P: electrons_wall: 2265.3 s (r:0, l:None, u:None)
P: c_bands_cpu: 2027.97 s (r:0, l:None, u:None)
P: c_bands_wall: 2048.6 s (r:0, l:None, u:None)
P: cegterg_cpu: 2017.43 s (r:0, l:None, u:None)
P: cegterg_wall: 2037.68 s (r:0, l:None, u:None)
P: calbec_cpu: 26.3 s (r:0, l:None, u:None)
P: calbec_wall: 26.41 s (r:0, l:None, u:None)
P: fft_cpu: 0.51 s (r:0, l:None, u:None)
P: fft_wall: 0.51 s (r:0, l:None, u:None)
P: ffts_cpu: 0.01 s (r:0, l:None, u:None)
P: ffts_wall: 0.01 s (r:0, l:None, u:None)
P: fftw_cpu: 969.17 s (r:0, l:None, u:None)
P: fftw_wall: 974.35 s (r:0, l:None, u:None)
[----------] all spawned checks have finished

[  PASSED  ] Ran 1/1 test case(s) from 1 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Thu May  9 12:38:56 2024+0200

@casparvl (Collaborator) left a comment

Looks good to me! Will do some final tests on two more HPC systems just to make sure, but I don't see any reason why they would pose problems. I'll post results here and merge once successful.

@casparvl (Collaborator)

Hm, one thing I didn't yet take into account is memory requirements. On one of the two HPC clusters (Vega), it defaults to 1 GB/CPU core, and one of the simulations (ecut=150, nbnd=200, with 1 or 2 cores) crashed because of that:

$ sacct -j 26926584 -o JobID,MaxRSS,MaxRSSNode,MaxRSSTask,AllocCPUS,ReqMem,ReqTRES,State,ExitCode
JobID            MaxRSS MaxRSSNode MaxRSSTask  AllocCPUS     ReqMem    ReqTRES      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- --------
26926584                                               2      1000M billing=1+ OUT_OF_ME+    0:125
26926584.ba+   2236316K     cn0775          0          2                       OUT_OF_ME+    0:125
26926584.ex+          0     cn0775          0          2                        COMPLETED      0:0

I've checked a few of the smaller jobs (1_core, 2_cores, 4_cores): the required memory seems to be about 2 GB / 2 GB / 4 GB for those cases. The 256-core case requires about 64 GB.

A quick & dirty solution would be to request at least 8 GB for the 1/2/4-core runs when ecut=150, nbnd=200. Here is an example of how memory is requested for the OSU tests, where we also ran into OOM issues. One downside of the OSU approach is that it defines a fixed amount of memory for all test cases: it always requests the 12 GB defined in the test, for each test case. That's ok(-ish) for the OSU tests, as 12 GB is sufficient for all test cases, yet not so excessive that you are allocating many more resources than are being used. For QE, the requested memory per node would depend on the core count per node, i.e. we cannot hardcode a constant. We could make it a function that returns the maximum of 2 GB/task and 4 GB. But for full-node jobs, that will likely exceed the maximum available memory per node on some systems. E.g. on Vega, I have 256 GB/node and 256 cores/node. A full-node run would request 256 tasks; with 2 GB/task, that would result in --mem=512GB. SLURM will probably complain that there are no nodes that can satisfy this memory request, as there are no nodes with 512 GB/node.

I'm wondering what the best solution is here. Ideally, we would query the maximum amount of memory available per node and cap the request at that. Otherwise, we could just resolve the issue for the smallest core counts (set --mem=8GB for any core count up to 4 cores per node).

Anyway, here we really start to see how difficult it is to make a portable test, i.e. a test that is portable across systems where you don't know the available resources in advance :)

@smoors @satishskamath any ideas?

@smoors (Collaborator) commented May 13, 2024

E.g. On Vega, I have 256 GB/node, 256 cores/node. A full node run would request 256 tasks. With 2 GB/task, that would result in --mem=512GB

Does that mean you can only run this test with <=128 cores per node on Vega?

@Crivella (Contributor, Author)

@casparvl I think it should be possible to set the minimum required memory to 8 GB and then scale it as 1 GB/CPU afterwards?

In theory, as a user, if I hit a memory limit when using QE I would adapt the parallelization, e.g. using some OpenMP and reducing the number of pools (the required memory scales roughly linearly with the number of pools). You lose some speed, but you make your job fit in the resources you have.
I'm not sure it makes sense to have an ad-hoc parallelization for every system in the test suite (also, some users might just use the parallelization set by QE by default).

@casparvl (Collaborator)

@smoors Well, if I requested a --mem equal to 2 GB/core * core count: yes. Which is silly, because 64 GB is enough to run a full node (i.e. 256 tasks). That's why I don't particularly like that solution. But I don't really see an elegant solution here :\ I guess my main question is: what do you think is the least bad solution? Just setting --mem=8GB for the 1/2/4-core cases (and not modifying the others)?

This is going to be a more general problem we need to solve. We could also consider querying the memory amount from the config (which would mean users would have to add it to the config, since it is not auto-detected as far as I know). Then, I would set it to 8GB for the lowest core counts, but to a proportional amount of memory for the higher core counts (e.g. if the core count of the test is 64, and we know from the config that a node has 128 cores & 256 GB memory, you could simply request --mem=(256*64/128)G). The downside is: it is yet another config item that people will have to set. And we'd have to decide where in the config to set it (through an extras field, maybe?).

@smoors (Collaborator) commented May 13, 2024

For QE, it would depend on the core count per node what the requested memory per node should be. I.e. we cannot hardcode a constant. We could make it a function that returns the maximum of 2GB/task and 4GB.

Makes sense to me, and I think this should be the default approach: scale linearly with the number of cores, with a fixed minimum amount. For partitions that do not have enough memory, we can reduce the number of cores requested to still satisfy the required memory per core. This does require having access to the maximum amount of memory that can be requested per node; on Hydra this is determined by MaxMemPerNode in the Slurm config file. Otherwise, we can hardcode it in the ReFrame settings file per partition.

$ scontrol show partition |grep -e PartitionName -e MaxMemPerNode
PartitionName=ampere_gpu
   DefMemPerCPU=7810 MaxMemPerNode=251000
PartitionName=broadwell
   DefMemPerCPU=8600 MaxMemPerNode=241000
PartitionName=broadwell_himem
   DefMemPerCPU=35800 MaxMemPerNode=1434000
PartitionName=pascal_gpu
   DefMemPerCPU=10300 MaxMemPerNode=248000
PartitionName=skylake
   DefMemPerCPU=4500 MaxMemPerNode=188000
PartitionName=skylake_mpi
   DefMemPerCPU=4500 MaxMemPerNode=188000
PartitionName=zen4
   DefMemPerCPU=5871 MaxMemPerNode=375800

@casparvl (Collaborator)

As a first start, how about something like this?

'extras': {
    # Available memory per node in GB
    'available_memory_per_node': 256
}

and then in the test something like:

    @run_after('init')
    def set_mem(self):
        # Suppose I know how my memory requirement scales with ntasks, e.g. 2 GB per task, plus 2 GB constant
        application_mem_requirement = self.num_tasks_per_node * 2 + 2
        # Skip the test if the nodes in this partition don't have enough memory
        self.skip_if(
            self.current_partition.extras['available_memory_per_node'] < application_mem_requirement,
            "Skipping test: nodes in this partition don't have sufficient memory to run it"
        )
        # Else, just request this amount of memory
        self.extra_resources = {'memory': {'size': '%sGB' % application_mem_requirement}}

Optionally, we could make it fancier by requesting the maximum of the application_mem_requirement and the proportional amount of memory (though there is no real downside to asking for less than the proportional amount of memory - enough is enough). This way, the application memory requirement would act as a 'minimum' amount of required memory (but if more is available, we ask for more):

    cpu_fraction = self.num_tasks_per_node * self.num_cpus_per_task / self.current_partition.processor.num_cpus
    default_memory = cpu_fraction * self.current_partition.extras['available_memory_per_node']
    self.extra_resources = {'memory': {'size': '%sGB' % max(application_mem_requirement, default_memory)}}

@casparvl (Collaborator) commented May 13, 2024

I also remember a discussion with the ReFrame devs about doing partition selection based on inequality comparisons for extras. I.e. right now you can do:

valid_systems = ['%memory=application_mem_requirement']

but if we could do

valid_systems = ['%memory>application_mem_requirement']

it would be useful for memory (we don't need the memory to be exactly equal to the application requirement, we want it to be more than that). It'd save us a skip_if, which means these test instances wouldn't even be generated.

I don't think the ReFrame devs ever got into this though, and I guess it is quite easy to polish this later on: simply replace the skip_if with an append to valid_systems.

It is another reason to use the partition extras to define this available memory though :)

@casparvl (Collaborator)

N.B. I've run this on the Karolina HPC cluster, all was successful there:

[  PASSED  ] Ran 52/52 test case(s) from 52 check(s) (0 failure(s), 0 skipped, 0 aborted)

So we only need to resolve the memory issue for Vega.

@smoors (Collaborator) commented May 13, 2024

As a first start, how about something like this?

'extras': {
    # Available memory per node in GB
    'available_memory_per_node': 256
}

and then in the test something like:

    @run_after('init')
    def set_mem(self):
        # Suppose I know how my memory requirement scales with ntasks, e.g. 2 GB per task, plus 2 GB constant
        application_mem_requirement = self.num_tasks_per_node * 2 + 2
        # Skip the test if the nodes in this partition don't have enough memory
        self.skip_if(
            self.current_partition.extras['available_memory_per_node'] < application_mem_requirement,
            "Skipping test: nodes in this partition don't have sufficient memory to run it"
        )
        # Else, just request this amount of memory
        self.extra_resources = {'memory': {'size': '%sGB' % application_mem_requirement}}

Optionally, we could make it fancier by requesting the maximum of the application_mem_requirement and the proportional amount of memory (though there is no real downside to asking for less than the proportional amount of memory - enough is enough). This way, the application memory requirement would act as a 'minimum' amount of required memory (but if more is available, we ask for more):

    cpu_fraction = self.num_tasks_per_node * self.num_cpus_per_task / self.current_partition.processor.num_cpus
    default_memory = cpu_fraction * self.current_partition.extras['available_memory_per_node']
    self.extra_resources = {'memory': {'size': '%sGB' % max(application_mem_requirement, default_memory)}}

That looks all good to me (apart from the terribly long variable names :). Indeed, better to skip if the requirement is not met than to reduce the number of cores. It probably needs to go after the setup phase though (after assign_tasks_per_compute_unit).
We could have a set_mem(test, app_mem_req) function in the hooks that takes application_mem_requirement as an argument, so we can reuse it.

@casparvl (Collaborator)

We could have a set_mem(test, app_mem_req) function in the hooks that takes application_mem_requirement as an argument, so we can reuse it.

Good point, it makes sense to have something that is easily reusable. I'll try to make something and do a PR to @Crivella's feature branch :)

@casparvl (Collaborator)

@Crivella This PR should do the trick.

The one thing you could still change (after you merge my PR into your feature branch) is the actual memory required to run your use cases at different sizes. The current function on this line seems to do the trick for all use cases, but it is probably a poor estimate of the true memory requirements. It also doesn't distinguish between the different ecut and nbnd values.

I'm even fine leaving it the way it is now, which means it'll request 4 GB plus 0.9 GB per task, regardless of ecut and nbnd. All HPC systems I know of have at least 1 GB per core available, so this should basically fit on any HPC system. Only if a user wants to run it on a local system (a laptop, or maybe some cloud VM with a lot of cores and not much memory) might they not have enough. Even then, it's not a disaster, since the test will just be skipped. Then again, the smallest use case can (probably) do with far less, and making that distinction would at least allow anyone to run that use case, even with very little memory...
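
In other words, the current request behaves roughly like the following helper; this is just a restatement of the numbers above as hypothetical Python, not the actual code in the PR.

def requested_mem_gb(num_tasks):
    # Rough request: 4 GB constant plus 0.9 GB per task, independent of ecut/nbnd
    return 4.0 + 0.9 * num_tasks


# e.g. a full 128-core (128-task) node would request about 119 GB
print(requested_mem_gb(128))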

casparvl and others added 5 commits May 23, 2024 14:03
Fix CI issues

Co-authored-by: Davide Grassano <34096612+Crivella@users.noreply.github.com>
@Crivella (Contributor, Author)

@casparvl Merged!

Regarding the formula for the memory, I would leave it as is.
In general it is never easy to estimate QE memory consumption a priori, as it will also depend on the parallelization that QE sets automatically. What users usually do is run the input with the desired parallelization scheme for a few seconds, until QE gives an estimate of the required memory (which in my experience should generally be multiplied by 1.5~2 if you want to make sure your calculation finishes).
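
For context, pw.x prints such an estimate near the top of its output ('Estimated max dynamical RAM per process'); a small helper along these lines could extract it, though the exact wording and units can vary between QE versions, so treat the pattern as an assumption.

import re
from typing import Optional


def qe_estimated_ram_mb(pw_output: str) -> Optional[float]:
    # Return pw.x's own per-process RAM estimate in MB, or None if the line is absent
    m = re.search(r'Estimated max dynamical RAM per process\s*>\s*([\d.]+)\s*(MB|GB)', pw_output)
    if m is None:
        return None
    value, unit = float(m.group(1)), m.group(2)
    return value * 1024.0 if unit == 'GB' else value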

@casparvl (Collaborator)

Perfect. With all the final changes in, I will run the test one final time before I press merge...

@laraPPr (Collaborator) commented May 24, 2024

@Crivella I opened Crivella#2 to also update the config/vsc_hortense.py file.

@casparvl (Collaborator)

Forgot to press merge after my runs, but I see it allowed Lara to get some final config changes in. All looking good to me. I had a few incidental failures in some runs, but I am 99.9% sure those were MPI issues on the respective systems, and thus unrelated to this test implementation.

Time to merge! Thanks @Crivella for your patience and hard work! :)

@casparvl merged commit b0c91e4 into EESSI:main on May 29, 2024
10 checks passed
@casparvl mentioned this pull request on May 29, 2024
@Crivella deleted the feature-QEpw_test branch on May 29, 2024 13:33
@boegel changed the title from "Added test for QE pw.x" to "Add test for QuantumESPRESSO (pw.x)" on Jun 27, 2024