
support always requesting GPUs on partitions that require it #116

Merged: 14 commits into EESSI:main on Feb 29, 2024

Conversation

@smoors (Collaborator) commented Feb 12, 2024:

fixes #115

changes:

  • add a feature always_request_gpus for partitions that require it, and check for it in the hook check_always_request_gpus, which is called at the end of the hook assign_tasks_per_compute_unit to ensure it runs after task assignment
  • small hook updates to allow simplifying the OSU test (done for the pt2pt test, not yet for the collective test)
  • filter out the 2_cores scale for device_type gpu (in addition to skipping single-node tests when only 1 GPU is present in the node), as this scale provides only 1 GPU (pt2pt test)
  • add flake8 configuration to setup.cfg

I did not add the new feature to the config files in the repo, as there are currently no partitions that have both the cpu and gpu features, although it does not hurt to add it to all GPU partitions that do not have the cpu feature (there it should have no effect). A minimal sketch of a partition configuration advertising the new feature is shown below.
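
For illustration, enabling the feature on a partition would look roughly like the ReFrame site-configuration snippet below. This is a minimal sketch based on the config excerpt Caspar posts further down in this thread; the partition name, scheduler, launcher and access options are placeholders, and only the 'features' list is the point of the example.

# Sketch of a GPU partition entry in a ReFrame settings file; everything
# except the 'features' list is a placeholder. Assumes FEATURES, GPU, CPU and
# ALWAYS_REQUEST_GPUS are importable from eessi.testsuite.constants.
from eessi.testsuite.constants import FEATURES, GPU, CPU, ALWAYS_REQUEST_GPUS

example_gpu_partition = {
    'name': 'gpu',
    'scheduler': 'slurm',
    'launcher': 'mpirun',
    'access': ['-p gpu'],
    'environs': ['default'],
    'features': [
        FEATURES[GPU],
        FEATURES[CPU],                  # the partition also runs CPU-only tests
        FEATURES[ALWAYS_REQUEST_GPUS],  # the site requires GPUs to be requested on this partition
    ],
}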

@satishskamath self-requested a review February 12, 2024 09:53
@smoors marked this pull request as draft February 13, 2024 12:38
@smoors marked this pull request as ready for review February 14, 2024 08:50
@satishskamath (Collaborator) commented:

@smoors can you merge main into this branch?

@casparvl (Collaborator) left a review comment:

Some small changes. My most fundamental issue is that I'm not convinced the skip_if in set_num_gpus_per_node of osu.py is really needed. And if it is, I think we can at least catch some cases earlier, so that we can avoid generating the tests (rather than generating them, and then skipping).

@satishskamath (Collaborator) commented Feb 16, 2024:

Testing results

  • Without the feature always_request_gpus:
#!/bin/bash
#SBATCH --job-name="rfm_EESSI_OSU_Micro_Benchmarks_pt2pt_73ba759f"
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p gpu
#SBATCH --export=None
#SBATCH --mem=12GB
module load 2023
module load OSU-Micro-Benchmarks/7.2-gompi-2023a-CUDA-12.1.1
mpirun -np 2 osu_latency -m 8 -x 10 -i 1000 -c
  • With the feature always_request_gpus:
#!/bin/bash
#SBATCH --job-name="rfm_EESSI_OSU_Micro_Benchmarks_pt2pt_80d6becf"
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=36
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p gpu
#SBATCH --export=None
#SBATCH --mem=12GB
#SBATCH --gpus-per-node=4
module load 2023
module load OSU-Micro-Benchmarks/7.2-gompi-2023a-CUDA-12.1.1
mpirun -np 2 osu_latency -m 8 -x 10 -i 1000 -c

The test requests all the GPUs in the node. @smoors, not sure if that was intended, since the CPU test can work even with 1 GPU. The GPU is not being used at all.
Apart from that, with this PR I cannot list any collective tests using ReFrame, even after I put in filter_supported_scales. Not sure how that got broken.

@satishskamath (Collaborator) commented:

@smoors and @casparvl
It seems it is not just this PR: the collective tests do not get listed on the main branch either. :D I am creating a new PR to fix this.

@smoors (Collaborator, Author) commented Feb 16, 2024:

> Some small changes. My most fundamental issue is that I'm not convinced the skip_if in set_num_gpus_per_node of osu.py is really needed. And if it is, I think we can at least catch some cases earlier, so that we can avoid generating the tests (rather than generating them, and then skipping).

We can indeed filter out the 2_cores scale early on to avoid skipping.
We do still need the skip for the 1_node scale in case only 1 GPU is present in the node (which is indeed a rare case in HPC); see the runtime-skip sketch below.
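
For the 1_node case, the runtime skip could look roughly like this inside the OSU test class. This is an illustrative sketch: get_max_avail_gpus_per_node is a hypothetical helper name, and the real hook in the test suite may differ.

import reframe as rfm
from eessi.testsuite.constants import DEVICE_TYPES, CPU, GPU

class EESSI_OSU_pt2pt_sketch(rfm.RunOnlyRegressionTest):
    # device_type is assumed to be a test parameter, as in the real OSU test
    device_type = parameter([DEVICE_TYPES[CPU], DEVICE_TYPES[GPU]])

    @run_after('setup')
    def skip_if_not_enough_gpus(self):
        # the number of GPUs per node is only known once a partition has been
        # selected, so this case cannot be filtered out at generation time
        if self.device_type == DEVICE_TYPES[GPU]:
            max_gpus = get_max_avail_gpus_per_node(self)  # hypothetical helper
            self.skip_if(max_gpus < 2,
                         'the pt2pt GPU test needs at least 2 GPUs in the node')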

@smoors (Collaborator, Author) commented Feb 16, 2024:

> The test requests all the GPUs in the node. @smoors, not sure if that was intended, since the CPU test can work even with 1 GPU. The GPU is not being used at all.

That's indeed intended. I assume you are requesting all the cores in the node(?), and some sites may require requesting all the GPUs if all the cores are requested.

> Apart from that, with this PR I cannot list any collective tests using ReFrame, even after I put in filter_supported_scales. Not sure how that got broken.

Yeah, I did not check the collective tests at all.
I first wanted to make sure we are 100% happy with the pt2pt test; then it should be relatively easy to make the equivalent changes to the collective test.

@smoors (Collaborator, Author) commented Feb 17, 2024:

@casparvl I created a filter for scales with fewer than 2 GPUs, see filter_scales_2gpus (a rough sketch of the idea follows right after this comment), but kept the skip_if just in case someone wants to run this on their laptop.

@satishskamath I suggest that we tackle the collective test in another PR, do you agree?

I also couldn't resist doing a bit more reorganizing and cleaning up (no functional changes), I hope you don't mind :)
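
For reference, the idea behind such a filter can be sketched as below. This assumes SCALES in eessi.testsuite.constants maps scale names to dicts with entries like node_part, num_nodes and num_cpus_per_node; the real filter_scales_2gpus may use different criteria.

from eessi.testsuite.constants import SCALES

def filter_scales_2gpus_sketch():
    # One possible heuristic: keep scales that span at least a full node or
    # multiple nodes, since partial-node scales such as 2_cores may end up
    # with a single GPU, which is not enough for a 2-task pt2pt GPU test.
    return [
        name for name, scale in SCALES.items()
        if scale.get('node_part') == 1 or scale.get('num_nodes', 1) > 1
    ]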

Code excerpt under review (inline review thread):

commands in a @run_before('setup') hook if not equal to 'cpu'.
Therefore, we must set device_buffers *before* the @run_before('setup') hooks.
"""
if self.device_type == DEVICE_TYPES[GPU]:
@smoors (Collaborator, Author) commented on this excerpt:

note: checking for device_type is enough here, as the is_cuda_module check is already done in hooks.filter_valid_systems_by_device_type
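
For context, the constraint described in the excerpt boils down to something like the following (a sketch of the shape of the logic, not the literal code in osu.py):

from eessi.testsuite.constants import DEVICE_TYPES, GPU

def set_device_buffers(test):
    # Sketch: derive the OSU device_buffers value from the test's device_type
    # early (in __init__), i.e. before any @run_before('setup') hook reads it.
    # Whether the loaded module is a CUDA build has already been verified by
    # hooks.filter_valid_systems_by_device_type, so device_type is sufficient.
    if test.device_type == DEVICE_TYPES[GPU]:
        test.device_buffers = 'cuda'
    else:
        test.device_buffers = 'cpu'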

@casparvl (Collaborator) commented:

Ok, so I changed my local config so that our GPU partition now has:

                    'features': [
                        FEATURES[GPU],
                        FEATURES[CPU],
                        FEATURES[ALWAYS_REQUEST_GPUS],
                    ] + valid_scales_snellius_gpu,

The 2_cores and 1_node scales run fine. The 2_nodes scale fails though. I.e.

[     FAIL ] (21/22) EESSI_OSU_Micro_Benchmarks_pt2pt %benchmark_info=mpi.pt2pt.osu_latency %scale=2_nodes %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %device_type=cpu /01344f7d @snellius:gpu+default
==> test failed during 'sanity': test staged in '/scratch-shared/casparl/reframe_output/staging/snellius/gpu/default/EESSI_OSU_Micro_Benchmarks_pt2pt_01344f7d'
[     FAIL ] (22/22) EESSI_OSU_Micro_Benchmarks_pt2pt %benchmark_info=mpi.pt2pt.osu_bw %scale=2_nodes %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %device_type=cpu /83c9ecfe @snellius:gpu+default
==> test failed during 'sanity': test staged in '/scratch-shared/casparl/reframe_output/staging/snellius/gpu/default/EESSI_OSU_Micro_Benchmarks_pt2pt_83c9ecfe'

The job script is:

#!/bin/bash
#SBATCH --job-name="rfm_EESSI_OSU_Micro_Benchmarks_pt2pt_83c9ecfe"
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=72
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p gpu
#SBATCH --export=None
#SBATCH --mem=12GB
#SBATCH --gpus-per-node=4
source /cvmfs/software.eessi.io/versions/2023.06/init/bash
module load OSU-Micro-Benchmarks/7.1-1-gompi-2023a
mpirun -np 2 osu_bw -m 4194304 -x 10 -i 1000 -c D D

Note that something strange is happening here: the benchmark is invoked with D D (device buffers), but the test name suggests device_type=cpu. Something seems to have gone wrong there. However, that is NOT what this test fails on:

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  osu_bw

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which Open MPI processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.

That is really strange. There should be two slots, but on 2 different nodes. I'd think this just works, no clue why it doesn't. On CPU, this test variant passes without issues:

[       OK ] (14/22) EESSI_OSU_Micro_Benchmarks_pt2pt %benchmark_info=mpi.pt2pt.osu_bw %scale=1_cpn_2_nodes %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %device_type=cpu /339d8b1f @snellius:genoa+default
P: bandwidth: 28108.81 MB/s (r:0, l:None, u:None)

I'd like to test this interactively, but right now, I'm unable to get a GPU node, let alone two...

@satishskamath (Collaborator) commented Feb 20, 2024:

@casparvl and @smoors, see https://github.com/EESSI/test-suite/pull/116/files#r1495856364. That is the reason the function right below it is not removing the D D option from the executable options, which is what causes your error.
This most likely occurred after the clean-up. A sketch of the kind of ordering issue involved follows below.
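
The kind of ordering issue being pointed out can be illustrated with a minimal ReFrame sketch: a later hook strips the CUDA device-buffer arguments for CPU runs, so the hook that sets device_buffers must have run first. This illustrates the bug class only; the class, hooks and options below are not the actual code in this PR.

import reframe as rfm

@rfm.simple_test
class OrderingSketch(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    device_type = parameter(['cpu', 'gpu'])
    executable = 'osu_bw'
    # the default options include the CUDA device-buffer arguments '-d cuda D D'
    executable_opts = ['-m', '4194304', '-x', '10', '-i', '1000', '-c',
                       '-d', 'cuda', 'D', 'D']

    @run_after('init')
    def set_device_buffers(self):
        # must run before any hook that reads device_buffers
        self.device_buffers = 'cuda' if self.device_type == 'gpu' else 'cpu'

    @run_before('setup')
    def strip_device_buffer_opts(self):
        # relies on device_buffers already being set: CPU runs drop the
        # CUDA-specific arguments, GPU runs keep them
        if self.device_buffers == 'cpu':
            self.executable_opts = [
                opt for opt in self.executable_opts
                if opt not in ('-d', 'cuda', 'D')
            ]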

@casparvl (Collaborator) commented:

Hm, I can imagine this is why there is D D instead of H H. What I cannot imagine is that the mpirun command would downright fail with the complaint of having too few slots...

But, I'll wait for this order to be fixed before trying to debug an issue that might be solved by that...

@smoors (Collaborator, Author) commented Feb 20, 2024:

> @casparvl and @smoors, see https://github.com/EESSI/test-suite/pull/116/files#r1495856364. That is the reason the function right below it is not removing the D D option from the executable options, which is what causes your error. This most likely occurred after the clean-up.

You're absolutely right, fixed in c1f3a89.

@satishskamath (Collaborator) commented Feb 20, 2024:

> Hm, I can imagine this is why there is D D instead of H H. What I cannot imagine is that the mpirun command would downright fail with the complaint of having too few slots...

> But, I'll wait for this order to be fixed before trying to debug an issue that might be solved by that...

Latest results with Sam's fix:
@casparvl and @smoors

#!/bin/bash
#SBATCH --job-name="rfm_EESSI_OSU_Micro_Benchmarks_pt2pt_83c9ecfe"
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=72
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p gpu
#SBATCH --export=None
#SBATCH --mem=12GB
#SBATCH --gpus-per-node=4
module load 2023
module load OSU-Micro-Benchmarks/7.1-1-gompi-2023a
mpirun -np 2 osu_bw -m 4194304 -x 10 -i 1000 -c

Output:

# OSU MPI Bandwidth Test v7.1
# Size      Bandwidth (MB/s)        Validation
# Datatype: MPI_CHAR.
1                       5.60              Pass
2                      11.24              Pass
4                      22.40              Pass
8                      44.92              Pass
16                     89.70              Pass
32                    179.78              Pass
64                    345.09              Pass
128                   684.48              Pass
256                  1258.13              Pass
512                  2317.53              Pass
1024                 3827.23              Pass
2048                 6659.48              Pass
4096                11999.44              Pass
8192                16906.25              Pass
16384               18146.99              Pass
32768               32736.49              Pass
65536               37084.64              Pass
131072              39673.78              Pass
262144              46274.14              Pass
524288              48739.29              Pass
1048576             48940.37              Pass
2097152             49176.59              Pass
4194304             49298.15              Pass

JOB STATISTICS
==============
Job ID: 5303230
Cluster: snellius
User/Group: satishk/satishk
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 72
CPU Utilized: 00:02:06
CPU Efficiency: 0.64% of 05:28:48 core-walltime
Job Wall-clock time: 00:02:17
Memory Utilized: 275.01 MB
Memory Efficiency: 1.12% of 24.00 GB

So the error is indeed gone with the disappearance of D D.

@smoors (Collaborator, Author) commented Feb 24, 2024:

@casparvl the following job worked for me (using srun), so your failed job seems specific to mpirun or your cluster:

#!/bin/bash
#SBATCH --job-name="rfm_EESSI_OSU_Micro_Benchmarks_pt2pt_cd750be1"
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH --partition=ampere_gpu
#SBATCH --mem=12GB
#SBATCH --gpus-per-node=2
module load OSU-Micro-Benchmarks/7.2-gompi-2023a-CUDA-12.1.1
srun --cpus-per-task=32 osu_latency -m 8 -x 10 -i 1000 -c -d cuda D D

@casparvl (Collaborator) commented:

Sorry for the delay here. I really want to try and have another look today. I tried last week, and think I got things to pass then as well, but didn't have time to really assess properly...

@satishskamath (Collaborator) commented:

I checked scales 1_node, 1_cpn_2_nodes, 2_cores and 2_nodes for OSU and it seems to be working well for all with the latest commit.

@satishskamath (Collaborator) left a review comment:

Apart from that comment, which can also be changed later, I approve this PR. Waiting for @casparvl.

@casparvl (Collaborator) left a review comment:

What about my request to prepend check_always_request_gpus with an underscore? :) I got a thumbs up on that, but I don't think it was changed, right?

@casparvl (Collaborator) commented:

Btw, I tested, and everything runs fine now. I also see the correct number of GPUs requested, i.e. for 2_cores I get:

$ cat /scratch-shared/casparl/reframe_output/staging/snellius/gpu/default/EESSI_OSU_Micro_Benchmarks_pt2pt_9a736b93/rfm_job.sh
#!/bin/bash
#SBATCH --job-name="rfm_EESSI_OSU_Micro_Benchmarks_pt2pt_9a736b93"
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p gpu
#SBATCH --export=None
#SBATCH --mem=12GB
#SBATCH --gpus-per-node=1
source /cvmfs/software.eessi.io/versions/2023.06/init/bash
module load OSU-Micro-Benchmarks/7.1-1-gompi-2023a
mpirun -np 2 osu_bw -m 4194304 -x 10 -i 1000 -c

While for 2_nodes I get:

$ cat /scratch-shared/casparl/reframe_output/staging/snellius/gpu/default/EESSI_OSU_Micro_Benchmarks_pt2pt_83c9ecfe/rfm_job.sh
#!/bin/bash
#SBATCH --job-name="rfm_EESSI_OSU_Micro_Benchmarks_pt2pt_83c9ecfe"
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=72
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p gpu
#SBATCH --export=None
#SBATCH --mem=12GB
#SBATCH --gpus-per-node=4
source /cvmfs/software.eessi.io/versions/2023.06/init/bash
module load OSU-Micro-Benchmarks/7.1-1-gompi-2023a
mpirun -np 2 osu_bw -m 4194304 -x 10 -i 1000 -c

That is as intended, and conforms to the node_part for those scales.
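
For reference, the proportional behaviour visible in these job scripts can be pictured with a small sketch. The numbers match the Snellius examples above (72-core nodes with 4 GPUs); the actual hook may compute this differently.

import math

def gpus_to_request(cpus_requested_per_node, cpus_per_node, gpus_per_node):
    # Sketch: request GPUs in proportion to the share of the node's cores the
    # test uses, with a minimum of 1 GPU (an illustration of the observed
    # behaviour, not the literal hook implementation).
    node_fraction = cpus_requested_per_node / cpus_per_node
    return max(1, math.ceil(gpus_per_node * node_fraction))

print(gpus_to_request(2, 72, 4))    # 2_cores scale -> --gpus-per-node=1
print(gpus_to_request(72, 72, 4))   # full node     -> --gpus-per-node=4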

@smoors (Collaborator, Author) commented Feb 29, 2024:

> What about my request to prepend check_always_request_gpus with an underscore? :) I got a thumbs up on that, but I don't think it was changed, right?

Forgot to actually do it, fixed in 82891ba.

@casparvl (Collaborator) left a review comment:

Lgtm!

@casparvl merged commit ba35eb2 into EESSI:main Feb 29, 2024
9 checks passed
Closes: handle running non-GPU jobs on GPU partitions (#115)