Add test for QuantumESPRESSO (pw.x) #128
Conversation
@Crivella thanks for the PR! Some general explanation (since we are still lacking proper docs on the hooks and constants we implemented):
is probably not what you wanted to do:
in the class body, and then
What this will do is essentially create two variants of your test: one intended for CPUs, and one intended for GPUs. A call to
for test variants where
Now, at this stage, you might be wondering: why do I need separate test variants for CPU and GPU? For some codes, you might need to alter the command line arguments to tell the code to use GPUs. But even if codes are intelligent enough, the number of tasks you want to spawn is typically very different for CPUs and GPUs. A common case would be that on a CPU node, for a pure MPI code, you'd spawn 1 MPI rank per CPU. On a GPU node, you typically spawn 1 MPI rank per GPU. That is exactly what the
All in all, this would
Note that this also provides an easy way to skip running CPU tests on GPU nodes: in our ReFrame config, we simply don't add a CPU feature. Now, all of this is under the premise that QE has GPU support, and that this test case would support it as well, of course :) If it doesn't, then probably just calling
and
(i.e. without defining a
I'll make an official review in which I'll include a few changes that I made to your test, after which I was able to run it successfully.
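To make the above concrete, here is a minimal sketch of that pattern. The hook and constant names (`hooks.filter_valid_systems_by_device_type`, `hooks.assign_tasks_per_compute_unit`, `DEVICE_TYPES`, `COMPUTE_UNIT`) follow the EESSI test-suite conventions as I understand them, so treat the exact spellings and signatures as assumptions rather than verbatim API:

```python
import reframe as rfm

# Assumed module/constant layout of the EESSI test suite
from eessi.testsuite import hooks
from eessi.testsuite.constants import COMPUTE_UNIT, CPU, DEVICE_TYPES, GPU


@rfm.simple_test
class QEspressoPWCheck(rfm.RunOnlyRegressionTest):
    # Minimal boilerplate so the sketch is a valid ReFrame test
    valid_prog_environs = ['default']
    valid_systems = ['*']
    executable = 'pw.x'

    # Creates two variants of the test: one for CPU partitions, one for GPU partitions
    device_type = parameter([DEVICE_TYPES[CPU], DEVICE_TYPES[GPU]])

    @run_after('init')
    def filter_partitions(self):
        # Only keep partitions whose features match the requested device type
        hooks.filter_valid_systems_by_device_type(
            self, required_device_type=self.device_type)

    @run_after('setup')
    def set_task_count(self):
        # 1 MPI rank per core on CPU partitions, 1 rank per GPU on GPU partitions
        if self.device_type == DEVICE_TYPES[CPU]:
            hooks.assign_tasks_per_compute_unit(self, COMPUTE_UNIT[CPU])
        else:
            hooks.assign_tasks_per_compute_unit(self, COMPUTE_UNIT[GPU])
```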
Maybe also good to check if the generated job scripts make sense to you. E.g. on a partition with 192 cores, I get:
And on half of a 128-core node:
Does that make sense? Is this how you'd run it when you'd write a batch script yourself?
Hi @casparvl, thank you for the detailed guide! QE does have GPU support, but for now there isn't a recipe for it. Do you think it would make sense to implement the changes you suggested for the
Or maybe go hybrid and implement all changes but set
Up to you, but I have a small preference for the hybrid approach. It shouldn't be too complicated right now, and it allows for easier extension once we have a GPU-capable build. (You could even comment out the parts that aren't relevant yet, to make enabling it later even easier.) Out of curiosity: if you had a 128-core node, how would you write the batch script normally? (E.g. would you use pure MPI? Or hybrid MPI/OpenMP? Would you do binding?)
Ok, will work on it.
In general it is preferable to go MPI-only with QE, especially for large calculations.
@casparvl So as far as it goes, I think the test should be able to run MPI-only. Tested with QE v7.1 / v7.2 / v7.3 and they also handle the smaller test case well, especially since after 7.0 the code started guessing the best resource subdivision if none is specified.
Sorry for the super long silence, I was on leave (a long one).
Since you mentioned that QE's OpenMP parallelism doesn't typically scale beyond 8 cores, and knowing that most HPC CPU sockets have way more than that (64 cores or more per socket is no exception nowadays), I'd say let's keep it pure MPI. It's also slightly simpler, so that's a plus. The goal of the test suite is to be able to spot performance regressions for realistic use cases. Since you as an expert would run this as pure MPI, I think that makes it a realistic case. On a side note: I am introducing
I ran with
on CPU nodes that have 128 cores per node (i.e. the
I will also run without the output
I think it makes sense; the CI run is extremely small. I would expect it to stop scaling linearly after 1 node, and it probably should stop scaling around 2~4 nodes. The 16-node run is definitely overkill, and most likely all the extra time is spent establishing the communication between nodes plus the serial part of the code, which is why it takes longer. (It is probably overkill also for the larger test implemented; in that case we could add a test with ecut=250 and nbnd=300~400 if we need to properly test that.) The PWSCF timer represents the total time taken by the run.
Ok, I don't mind that the tests run beyond where the problem scales: I imagine we'll still be able to see (further) performance degradation if we hit some regression. All of the runs but one completed successfully:
timed out. Understandable: it's the heaviest case on the smallest allocation (1 core). We should probably add some hook that makes it easy to naively scale the runtime with (the inverse of) the allocation size or something (or even use some coarse performance model), but lacking that, maybe the easiest is just to make a small addition to increase the max walltime to 1 hour for the largest test case. I.e. something like:
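For illustration, the kind of addition I have in mind would look something like the sketch below; the `bench_name` attribute and the `'large'` label are hypothetical placeholders, not the actual names used in this test:

```python
@run_after('setup')
def set_time_limit(self):
    # Give only the heaviest input a longer walltime; the others keep the default
    if self.bench_name == 'large':  # hypothetical label for the largest test case
        self.time_limit = '60m'
```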
(And then hope that it's enough: on 2 cores this case took 1475 s.) It's not a very elegant solution, but we can always refine later.
I've added the change. In terms of scaling it should be ~N^2 with
Ok, perfect:
Looks good to me! Will do some final tests on two more HPC systems just to make sure, but I don't see any reason why they would cause problems. I'll post results here and merge once successful.
Hm, one thing I didn't yet take into account is memory requirements. On one of the two HPC clusters (Vega), the default is 1 GB per CPU core, and one of the simulations (
I've checked a few of the smaller jobs (1_core, 2_cores, 4_cores); the required memory seems to be about 2 GB / 2 GB / 4 GB for those cases. The 256-core run requires about 64 GB. A quick & dirty solution would be to request at least 8 GB for the 1/2/4 core runs when
I'm thinking about what would be the best solution here. Really, we would want to query the maximum amount of memory available per node, and cap at that. Otherwise, we could just resolve the issue for the smallest core counts (set
Anyway, here we really start to see how difficult it is to make a portable test, i.e. a test that is portable across systems where you don't know the available resources in advance :) @smoors @satishskamath any ideas?
Does that mean that you can only run this test with <=128 cores per node on Vega?
@casparvl I think it should be possible to set the minimum required memory to 8 GB and then scale it as 1 GB/CPU afterwards? In theory, as a user, if I hit a memory limit when using QE I would adapt the parallelization, e.g. using some OpenMP and reducing the number of pools (the number of pools basically increases the required memory linearly). You lose some speed, but you make your job fit in the resources you have.
@smoors well, if I would request a
This is going to be a more general problem we need to solve. We could also consider querying the memory amount from the config (which would mean users would have to add it to the config, since it is not auto-detected as far as I know). Then, I would set it to 8 GB for the lowest core counts, but to a proportional amount of memory for the higher core counts (e.g. if the core count of the test is 64, and we know from the config that a node has 128 cores & 256 GB memory, you could simply request 64/128 × 256 GB = 128 GB).
Makes sense to me, and I think this should be the default approach: scale linearly with the number of cores, with a fixed minimum amount. For partitions that do not have enough memory, we can reduce the number of cores requested to still satisfy the required memory per core. This indeed requires having access to the maximum amount of memory that can be requested per node. In Hydra this is determined by
As a first start, how about something like this?
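Something along these lines, i.e. annotating each partition in the ReFrame site config with the memory available to jobs. The `mem_per_node` key and its unit are just a proposal here (this is not auto-detected by ReFrame), and the partition shown is only an illustrative excerpt:

```python
# Excerpt of a ReFrame site configuration: the partition advertises how much
# memory (in GiB) a job can request per node via the free-form 'extras' field.
'partitions': [
    {
        'name': 'cpu_partition',
        'scheduler': 'slurm',
        'launcher': 'mpirun',
        'access': ['-p cpu'],
        'environs': ['default'],
        'features': ['cpu'],
        'extras': {
            'mem_per_node': 256,  # GiB available to jobs on this node type
        },
    },
],
```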
and then in the test something like:
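Roughly like this, as a method added to the test class. This sketch assumes the `mem_per_node` extra from the previous snippet and a `memory` resource in the partition config that maps onto the scheduler's memory request; the 8 GB floor and the 1 GB-per-task slope are placeholder numbers:

```python
@run_after('setup')
def request_memory(self):
    # Estimate what the application needs for this task count (placeholder numbers)
    app_mem_req = max(8, 1.0 * self.num_tasks_per_node)  # GiB per node

    # Skip (rather than fail) on partitions that cannot provide enough memory
    mem_per_node = self.current_partition.extras.get('mem_per_node')  # GiB, from the config
    if mem_per_node is not None:
        self.skip_if(mem_per_node < app_mem_req,
                     f'Not enough memory per node ({mem_per_node} GiB < {app_mem_req:.1f} GiB)')

    # Translate the requirement into a scheduler request via a 'memory' resource
    # (proper rounding/unit handling is omitted in this sketch)
    self.extra_resources = {'memory': {'size': f'{int(app_mem_req)}GB'}}
```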
Optionally, we could make it more fancy by requesting the maximum of the application_mem_requirement and the proportional amount of memory (though there is no real downside to asking for less than the proportional amount of memory: enough is enough). This way, the application memory requirement would act as a 'minimum' amount of required memory (but if more is available, we ask for more:
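That variant could be as simple as taking the maximum of the two quantities, e.g. inside the `request_memory` sketch above (where `app_mem_req` and `mem_per_node` are the assumed names from that sketch):

```python
# Proportional share of the node's memory for the cores we actually use,
# but never less than what the application is estimated to need.
proportional_mem = mem_per_node * self.num_tasks_per_node / self.current_partition.processor.num_cpus
mem_to_request = max(app_mem_req, proportional_mem)
```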
I also remember a discussion with the ReFrame devs on doing partition selection based on inequality comparisons for extras. I.e. now you can do:
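For example, selecting on an exact extras value (as far as I recall, this equality-style constraint is what ReFrame supports today; `mem_per_node` is again our assumed key):

```python
# Select only partitions whose 'extras' define mem_per_node equal to 256 (GiB)
valid_systems = ['%mem_per_node=256']
```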
but if we could do
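Purely hypothetical syntax, not something ReFrame actually supports:

```python
# Hypothetical: select any partition advertising at least 256 GiB per node
valid_systems = ['%mem_per_node>=256']
```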
it would be useful for memory (we don't need the memory to be exactly equal to the application requirement, we want it to be more than that). It'd save us a
I don't think the ReFrame devs ever got into this though, and I guess it is quite easy to polish this later on: simply replace the
It is another reason to use the partition
N.B. I've run this on the Karolina HPC cluster, all was successful there:
So we only need to resolve the memory issue for Vega.
That looks all good to me (apart from the terribly long variable names :). Indeed better to skip if the requirement is not met than to reduce the number of cores. It probably needs to go after the setup phase though (after
Good point, it makes sense to have something that is easily reusable. I'll try to make something and do a PR to @Crivella's feature branch :)
…t slurm) don't accept fractional memory requests. Rounding in MB we introduce less error
…SLURM takes gigabytes (base-10). Torque supposedly takes mebibytes (base-2).
…e for large task counts
@Crivella This PR should do the trick. The one thing you could still change (after you merge my PR into your feature branch) is the actual memory required to run your use cases at different sizes. The current function on this line seems to do the trick for all use cases, but is probably a poor estimate of the true memory requirements. It also doesn't distinguish between the different
I'm even fine leaving it the way it is now, which means it'll request 4 GB plus 0.9 GB per task, regardless of
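For reference, the kind of estimate described here could be written as something like the following; this is only a sketch based on the "4 GB plus 0.9 GB per task" behaviour mentioned above, and the function name is made up:

```python
def estimate_mem_per_node_gb(num_tasks_per_node: int) -> float:
    """Crude per-node memory estimate for the pw.x runs: a 4 GB baseline plus
    roughly 0.9 GB per MPI task, independent of the input size."""
    return 4.0 + 0.9 * num_tasks_per_node
```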
Fix CI issues Co-authored-by: Davide Grassano <34096612+Crivella@users.noreply.github.com>
Apply fixes
Set mem hook
@casparvl Merged! Regarding the formula for the memory, I would leave it as is.
Perfect. With all the final changes in, I will run the test one final time before I press merge...
@Crivella I opened Crivella#2 to also update the Hortense config.
update hortense config
Forgot to press merge after my runs, but I see it allowed Lara to get some final config changes in. All looking good to me. I had a few incidental failures in some runs, but I am 99.9% sure those were MPI issues on the respective systems, and thus unrelated to this test implementation. Time to merge! Thanks @Crivella for your patience and hard work! :)
This test requires the PR to ReFrame's hpctestlib to be merged first.
I based this on the GROMACS test and adjusted from there.
One thing that should be discussed is which machine configurations to run the test on.
The current input file and parameters generate 4 configurations that take from ~15 s to ~15 min to run on 4 performance cores of an i9-13900K.
It is possible to increase:

- `ecut` (size of the FFT grids): I would not go above 200~250, as it is highly unlikely anyone would ever set it higher (in an actual calculation, typical values range around 50~150)
- `nbnd` (size of the matrices to diagonalize): it should be tested how high this can be pushed without changing the input file (i.e. without increasing the number of atoms), as working with too many empty states could cause problems

My idea when writing the ReFrame test was to have something mid-weight that could be run in a reasonable time also on a single workstation (while giving control to stress either the FFT or the diagonalization (DIAG) algorithms using `ecut` and `nbnd` respectively).

If we want to stress multi-node runs further, then we could also think about using the input files from https://gitlab.com/QEF/q-e/-/tree/develop/test-suite/benchmarks/pw
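For context, the two knobs could be exposed as ReFrame parameters along these lines; the class name and the concrete values below are illustrative only, not the actual ones in this PR:

```python
import reframe as rfm


class QEspressoPWCheck(rfm.RunOnlyRegressionTest):
    # Two values each for ecut and nbnd yield four input configurations:
    # ecut stresses the FFT part, nbnd the diagonalization part of pw.x.
    ecut = parameter([50, 150])   # kinetic-energy cutoff (Ry), sets the FFT grid size
    nbnd = parameter([10, 200])   # number of bands, sets the size of the matrices to diagonalize
```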