use reframe features + fixes for various issues #11
Conversation
I tested this (but only the default behaviour, no command-line overriding yet). One thing I noticed is that the default behaviour is different from the master branch. Given two GROMACS modules,
This PR only runs combinations 1 & 4 by default. Of course, we might wonder if running 2 & 3 is particularly useful. I'd argue yes, for the following reasons:
It's easy enough to adapt this PR to that same default behaviour, but I'm curious to know if you agree with my reasoning above :)

Edit: I see that it actually runs 1, 3, and 4 by default. I was confused by the listing, which only showed two checks:
But of course the checks for the individual partitions are only generated later, so when I run, I get:
Since we also briefly discussed how to reduce the number of GPUs used in the tests: I do indeed think that this should be done in the ReFrame config file. Simply specify a virtual ReFrame partition that only has 1 GPU per node. E.g. we have GPU nodes with 4 GPUs per node. I would then specify e.g.:
if I wanted the test to use the full GPU nodes, or
to 'fake' a partition with smaller nodes. In this example, one 'node' in this virtual partition would simply be a quarter of our real node. I tested this approach for single-node tests, and there it works fine. I think this approach will even work for multi-node tests, but I cannot test it: our system only allows partial node allocation for single-node jobs; multi-node jobs have to be exclusive.
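As an illustration of this idea, here is a minimal sketch of what the two partition definitions could look like in a ReFrame site config. The partition names, scheduler, launcher and access options are assumptions for illustration only, not taken from an actual config:

```python
# Hypothetical excerpt of the 'partitions' list of a system in a ReFrame site config.
'partitions': [
    {
        # real GPU partition: tests see full nodes with 4 GPUs each
        'name': 'gpu',
        'scheduler': 'slurm',
        'launcher': 'mpirun',
        'access': ['--partition=gpu'],
        'features': ['gpu'],
        'devices': [{'type': 'gpu', 'num_devices': 4}],
    },
    {
        # virtual partition: a 'node' here is a quarter of a real GPU node,
        # so tests running on it only ever request 1 GPU
        'name': 'gpu_quarter_node',
        'scheduler': 'slurm',
        'launcher': 'mpirun',
        'access': ['--partition=gpu'],
        'features': ['gpu'],
        'devices': [{'type': 'gpu', 'num_devices': 1}],
    },
],
```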
Ok, so I also tested the
That generated the job:
Which looks fine.
thanks for the review. this is indeed a change I forgot to discuss with you. on the other hand, users (hopefully) rarely do this (unless for benchmarking), and we actively discourage it at VUB. I changed the code so you can now choose the behavior per partition in the config file, see 0f77a7a
what do you think of this solution?
yes sure. it would still be nice to also have a cmd line option for this, but that can be added later.
Yeah, makes a lot of sense to me. I like it if this behaviour can be controlled from the config side, as it means you can 'flip' that switch in one go for all tests. I've noticed I'm thinking very much about 'what do we want in the EESSI CI?' (probably: test everything), so it's super useful that you also provide the point of view of 'what would an end-user/sysadmin from an HPC system want?' (i.e. more limited testing to save resources) :) I'll give this another try to check the latest changes.
I refactored the logic a bit for the set of default test combinations. As I did so, I found out I actually forgot one combination in my listing above:

Why would we want to test this? It could be that for certain versions of GROMACS in the EESSI stack, we only have a CUDA-aware build (simply because no one bothered to do the non-CUDA-aware build). If a researcher wants to use that particular version, but only has access to a CPU machine, that's perfectly fine and should work. Therefore, it should be tested. So, all in all, on a system with 2 modules (one CUDA-aware, one non-CUDA-aware) and 2 partitions (one with GPUs, one without), you'd get five tests by default:
I'll do a PR with my changed logic to your branch. I also stripped out the 'skip' hooks, as those are no longer needed now that we use the features for this :)
yeah, you're probably right.
i'm ok with it being the default, but I also want an easy way to filter those cases out without having to manually specify every combination that I do want to test. i have another idea that could work. i'll first merge your PR into mine, and then add my idea on top.
@casparvl i've kept the logic that you proposed and added 3 optional variables that users can set on the cmd line: this ensures that all valid combinations are tested by default, and at the same time gives a lot of flexibility to run a subset of those valid combinations. i think this is a good strategy to follow.

for example, if you want to run only with non-bonded forces on the GPU (which implies only GPU partitions and CUDA modules), you can add:
if you want to test only the CUDA modules, you can add:
if you want to run only in non-GPU partitions and only with non-CUDA modules, you can add (this works when both 'cpu' and 'gpu' features are set on the GPU partitions):
of course, if you don't have the 'cpu' feature set on GPU partitions, the last example becomes simpler:
there are many other possibilities, such as selecting (or skipping) one or a set of toolchain generations, e.g.:

if you agree with this approach, i hope we can merge this soon. i will then work on moving as much logic as possible out of the gromacs test, for maximal reuse in other tests.
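to make this concrete, a hypothetical invocation that restricts the run to a single module and a single partition, using two of the variables added in this PR (the config/test file names, module version and system/partition names are made up for illustration):

```
reframe -C settings.py -c gromacs.py -r \
    --setvar modules=GROMACS/2021.3-foss-2021a \
    --setvar valid_systems=mysystem:gpu
```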
note that i chose a generic variable name for
Unless I'm missing something, any of those selections can already be made with the
I.e. all parametrisation information gets included in the name of the test, and can thus be filtered on with
and
result in the same set of tests:
Similarly, the equivalent of
would be
And the equivalent of
would be
Also, I tested
and this also works as expected. In fact, the documentation states this explicit case can also be achieved with the more readable
as ReFrame selects a test if it matches any of the

Admittedly, the syntax of doing things with
, but it does correctly select the test with

I'd personally not be in favour of introducing more variables just to achieve a slightly simpler selection syntax. Or am I overlooking something, and can you make selections with these variables that you can't do with
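To make the comparison concrete: ReFrame's -n / --name option matches a regular expression against a test's display name, which includes all parameter values, and the documentation also describes a %param=value shorthand for matching on a single parameter. A couple of hypothetical invocations (the module version and the parameter name nb_impl are made up for illustration; the actual parameter names in this test may differ):

```
# list only the checks whose display name mentions the CUDA module
reframe -C settings.py -c gromacs.py -l -n 'GROMACS.2021.3.*CUDA'

# list only the checks with a given parameter value, using the shorthand
reframe -C settings.py -c gromacs.py -l -n '%nb_impl=gpu'
```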
oh, i didn't know this, thanks for pointing it out! so the full display name in your example is this, right? i'll check if those options indeed cover all use cases that we care about, but they probably will, as all info is in the name.
with the latest update, the test:
note that you should no longer explicitly set

please test :)
the last commit changes the default behavior of num_cpus_per_task to be equal to the total number of cpus per node divided by the total number of gpus per node. for example, for a node with 48 cores and 4 gpus, you get 48 / 4 = 12 cpus per task.

i think this is a more sensible default: if you use only part of the GPUs you will probably also use only part of the CPUs.
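to illustrate the idea (not the PR's exact code), a minimal sketch of how such a default could be derived in a ReFrame hook, using the partition's processor and device info; the class and hook names are assumptions:

```python
import reframe as rfm
import reframe.core.builtins as builtins


class GROMACSLikeTest(rfm.RunOnlyRegressionTest):
    # ... parameters, executable, sanity checks, etc. omitted ...

    @builtins.run_after('setup')
    def set_default_num_cpus_per_task(self):
        '''Divide the cores of a node evenly over its GPUs by default.'''
        proc = self.current_partition.processor
        gpus_per_node = sum(dev.num_devices
                            for dev in self.current_partition.devices
                            if dev.device_type == 'gpu')
        if self.num_cpus_per_task is None and gpus_per_node > 0:
            # e.g. 48 cores and 4 GPUs per node -> 12 CPUs per task
            self.num_cpus_per_task = proc.num_cpus // gpus_per_node
```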
Ok, I tested a number of combinations. For context, our GPU nodes have 4 GPUs per node, and 72 CPU cores per node.
All good so far. But now, what about CPU tests:
I haven't had the time yet to figure out where this is set exactly, and why it's not sensible. Note that leaving everything default
does work fine. I'm guessing this was not the result of your latest commits - it probably was like that ever since we added the ability to override these options. But maybe we should just fix that right away as well...? Anyway, nice work on the overrides for the GPU counts :)
indeed, my original idea was that num_cpus_per_task and num_tasks_per_node should both be set at the same time, but we can do better and make sure that by default the total number of cores per node is always requested (unless both are set, of course). the last change should fix this.
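continuing the sketch from above (same assumed class and imports, not the PR's actual code), the fix could look roughly like this:

```python
    @builtins.run_after('setup')
    def set_tasks_and_cpus(self):
        '''If only one of num_tasks_per_node / num_cpus_per_task is given,
        derive the other one so that a full node is still requested.'''
        cores_per_node = self.current_partition.processor.num_cpus
        if self.num_tasks_per_node and not self.num_cpus_per_task:
            self.num_cpus_per_task = cores_per_node // self.num_tasks_per_node
        elif self.num_cpus_per_task and not self.num_tasks_per_node:
            self.num_tasks_per_node = cores_per_node // self.num_cpus_per_task
```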
Looks good to me. Nice milestone @smoors, finally a merge of this PR! :)
FYI: I tested all the above combinations again, and all produce the expected and desired results now :)
Joining a bit late to this party. Sorry for that. I have been testing the recent merge and am not exactly getting the same results as both of you are.
with
@satishskamath currently, if you specify

the reason is that unfortunately, combining

now that i think about it, maybe it's possible to check the features of the partitions that are specified on the cmd line, and filter the tests based on that? something to try and/or discuss..
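as a rough sketch of that idea (only an idea, not code from this PR; the hook name, the handling of valid_systems and the skip message are all assumptions), something like this might work:

```python
import reframe as rfm
import reframe.core.builtins as builtins
import reframe.core.runtime as rt


class GROMACSLikeTest(rfm.RunOnlyRegressionTest):
    # ... rest of the test omitted ...

    @builtins.run_after('init')
    def skip_if_no_gpu_partition_selected(self):
        '''Skip GPU variants when none of the selected partitions has the
        'gpu' feature.'''
        partitions = rt.runtime().system.partitions
        selected = [p for p in partitions
                    if '*' in self.valid_systems or p.fullname in self.valid_systems]
        self.skip_if(not any('gpu' in p.features for p in selected),
                     'no GPU partition among the selected valid_systems')
```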
i created an issue for this: #21
use system features support in ReFrame (#10)

changes:
- add features 'cpu' and/or 'gpu' to the partitions in the site config file
- variables that can be set on the cmd line:
  - --setvar modules=<modulename>
  - --setvar valid_systems=<comma-separated-list>
  - --setvar num_tasks_per_node=<x>
  - --setvar num_cpus_per_task=<y>
  - --setvar num_gpus_per_node=<z>
  - --setvar env_vars=<envar>:<value>
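for example, a hypothetical run that combines several of these overrides (file names, system name and values are illustrative only):

```
reframe -C settings.py -c gromacs.py -r \
    --setvar valid_systems=mysystem:gpu \
    --setvar num_gpus_per_node=1 \
    --setvar num_tasks_per_node=1 \
    --setvar num_cpus_per_task=12
```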