-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support always requesting GPUs on partitions that require it #116
Conversation
…uest_gpus; update set_num_gpus_per_node in osu test
@smoors can you merge main into this branch? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small changes. My most fundamental issue is that I'm not convinced the skip_if
in set_num_gpus_per_node
of osu.py
is really needed. And if it is, I think we can at least catch some cases earlier, so that we can avoid generating the tests (rather than generating them, and then skipping).
Testing results
The test requests all the |
we can indeed filter out the |
that's indeed intended. i assume you are requesting all the cores in the node (?), and some sites may require requesting all the GPUs if all the cores are requested.
yeah, i did not check the collective tests at all. |
@casparvl i created a filter for scales with < 2 GPUs, see @satishskamath i suggest that we tackle the collective test in another PR, do you agree? i also couldn’t resist doing a bit more reorganizing and cleaning up (no changed functionality), i hope you don’t mind :) |
commands in a @run_before('setup') hook if not equal to 'cpu'. | ||
Therefore, we must set device_buffers *before* the @run_before('setup') hooks. | ||
""" | ||
if self.device_type == DEVICE_TYPES[GPU]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: checking for device_type
is enough here, as the is_cude_module
check is already done in hooks.filter_valid_systems_by_device_type
Ok, so I changed my local config so that our GPU partition now has:
The
The job script is:
Note that something strange is happening here: the
That is really strange. There should be two slots, but on 2 different nodes. I'd think this just works, no clue why it doesnt. On CPU, this test variant passes without issues:
I'd like to test this interactively, but right now, I'm unable to get a GPU node, let alone two... |
@casparvl and @smoors see https://github.com/EESSI/test-suite/pull/116/files#r1495856364 . That is the reason that the function right below it is not removing the |
Hm, I can imagine this is why there is But, I'll wait for this order to be fixed before trying to debug an issue that might be solved by that... |
you're absolutely right, fixed in c1f3a89 |
Latest results with Sam's fix:
Output:
So the error is indeed gone with disappearance of |
@casparvl the following job worked for me (using
|
Sorry for the delay here. I really want to try and have another look today. I tried last week, and think I got things to pass then as well, but didn't have time to really assess properly... |
I checked scales |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from that comment which can also be changed later, I approve this PR. Waiting for @casparvl .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about my request to prepend check_always_request_gpus
with an underscore? :) I got a thumps up on that, but I don't think it was changed, right?
Btw, I tested, and everything runs fine now. I also see the correct number of GPUs requested, i.e. for
While for
That is as intended, and conforms to the |
forgot to actually do it, fixed in 82891ba |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm!
fixes #115
changes:
always_request_gpus
for partitions that require it, and check for it in hookcheck_always_request_gpus
, called at end of hookassign_tasks_per_compute_unit
to ensure it is run after task assignment2_cores
for device_typegpu
(in addition to skipping single-node tests with only 1 GPU present in the node), as this scale has only 1 GPU (pt2pt test)i did not add the new feature to the config files in the repo as there are currently no partitions that have both feature
cpu
andgpu
, although it does not hurt to add it to all GPU partitions that do not have featurecpu
(it should have no effect).