idist.initialize fails in Slurm when using --ntasks-per-gpu #3259

nowtryz opened this issue Jun 26, 2024 · 6 comments

nowtryz commented Jun 26, 2024

🐛 Bug description

When launching a Slurm step with multiple tasks and assigning GPUs with the --ntasks-per-gpu flag instead of --ntasks-per-node (as seems to have been intended), ignite uses the SLURM_LOCALID environment variable as the local rank and uses it as the device id, even though --ntasks-per-gpu already binds each MPI process to a GPU. This causes the call torch.cuda.set_device(self._local_rank) to fail.

To reproduce:

srun --ntasks-per-gpu=1 --nodes=2 --gpus-per-node=4 python -c "import ignite.distributed as idist; idist.initialize(backend='nccl')"

Which produces the following output:

    idist.initialize(backend="nccl")
  File ".../python3.11/site-packages/ignite/distributed/utils.py", line 577, in initialize
    _set_model(comp_model_cls(backend, **kwargs))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../python3.11/site-packages/ignite/distributed/comp_models/native.py", line 92, in __init__
    self._create_from_backend(
  File ".../python3.11/site-packages/ignite/distributed/comp_models/native.py", line 127, in _create_from_backend
    torch.cuda.set_device(self._local_rank)
  File ".../python3.11/site-packages/torch/cuda/__init__.py", line 408, in set_device
    torch._C._cuda_setDevice(device)

Intended behaviour:
Either

  • Detect the presence of the --ntasks-per-gpu flag, which does not seem to be possible
  • Detect that only one GPU is available and use it
  • Allow the local id or the device to use to be set explicitly, even though idist is initialized in a Slurm environment
  • Allow the local rank to be overridden with idist.set_local_rank(), which is currently never considered when SLURM_JOB_ID is detected (see the sketch after this list)
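
For illustration, this is the kind of call sequence the last two bullets would enable. It is only a sketch of desired usage, assuming idist.set_local_rank() were honoured under Slurm (today it is ignored once SLURM_JOB_ID is detected):

    import ignite.distributed as idist

    # hypothetical usage: with --ntasks-per-gpu every task sees exactly one GPU,
    # so cuda:0 would be the right device for each task
    idist.set_local_rank(0)
    idist.initialize(backend="nccl")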

Environment

  • PyTorch Version (e.g., 1.4): 2.2
  • Ignite Version (e.g., 0.3.0): 0.5.0.post2
  • OS (e.g., Linux): Linux
  • How you installed Ignite (conda, pip, source): pip
  • Python version: 3.11
  • Any other relevant information: slurm 23.02.7

nowtryz commented Jun 26, 2024

Also, would it be possible not to warn when only MASTER_ADDR and MASTER_PORT are provided, as they are used and seem to be intended to be provided?

elif any(defined_pth_ddp_env_vars):
    # Let's warn user about PTH env variables that we could not taken into account
    warnings.warn(
        "We detected the following env variables: "
        f"{[(k, v) for k, v in pth_ddp_env_vars.items() if v is not None]},\n"
        "but will not take them into account as the following env vars are missing:"
        f"{[k for k, v in pth_ddp_env_vars.items() if v is None]},\n"
    )

@vfdev-5 vfdev-5 changed the title idist.initialize fails in Slurn when using --ntasks-per-gpu idist.initialize fails in Slurm when using --ntasks-per-gpu Jun 26, 2024

vfdev-5 commented Jun 26, 2024

Thanks for reporting the issue @nowtryz! Let me see what can be done here.

Detect the presence of the --ntasks-per-gpu flag, which does not seem to be possible

Is there no env var corresponding to this argument? By the way, why is it necessary to set it, and what's the typical value, 1?

Allow to override local rank with idist.set_local_rank(), which is never considered when SLURM_JOB_ID is detected

Yes, this is unfortunate. IIRC, we rely on SLURM_LOCALID as the single local rank provider...
In this case, if idist.set_local_rank() were working, how would you set it? Is it possible to know which MPI process is bound to which GPU?

nowtryz commented Jul 1, 2024

Hi,

There is no env var responsible for that this argument ? By the way, why this is necessary to set it and what's the typical value, 1 ?

From what I see, there is only an input environment variable for this flag and no output one. Yes, --ntasks-per-gpu would typically be used with a defined number of GPUs (--gpus=N), which spawns a process per GPU for a given number of GPUs rather than for a given number of nodes. It also makes it possible to use spare resources from multiple nodes, whereas --nodes=N --ntasks-per-node=M --gpus-per-node=M requires nodes to be completely free.

In this case, if idist.set_local_rank() were working, how would you set it ? Is it possible to know how which MPI process is binded to which GPU ?

I would simply use idist.set_local_rank(0), which would use cuda:0. I don't know exactly how the binding is done, but it seems similar to having CUDA_VISIBLE_DEVICES set to the proper GPU, and cuda:0 always points to the correct device when --ntasks-per-gpu is used.
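
For example, a quick probe like this (just my own check, not ignite code) shows what each task can see when launched with --ntasks-per-gpu:

    import os
    import torch

    # each Slurm task prints its Slurm identity and the GPUs it can see; with
    # --ntasks-per-gpu every task should report a single visible device
    print(
        f"SLURM_PROCID={os.environ.get('SLURM_PROCID')}",
        f"SLURM_LOCALID={os.environ.get('SLURM_LOCALID')}",
        f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')}",
        f"torch.cuda.device_count()={torch.cuda.device_count()}",
    )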

vfdev-5 commented Jul 2, 2024

Following https://slurm.schedmd.com/sbatch.html, SLURM_NTASKS_PER_GPU is set when --ntasks-per-gpu is specified.

@nowtryz can you provide the full traceback to get the exact error message?

If I understand the problem correctly, each process sees a single GPU rather than all GPUs, so torch.cuda.set_device(self._local_rank) fails for processes with local rank > 0.

As for the fix, IMO there are two things to be done here:

  1. Check here:

         if torch.cuda.is_available():
             torch.cuda.set_device(self._local_rank)

     the number of available GPUs with torch.cuda.device_count():

         if torch.cuda.is_available():
             lrank = self._local_rank if self._local_rank < torch.cuda.device_count() else 0
             torch.cuda.set_device(lrank)

     @nowtryz would you like to help solve this and check if this fix works?

  2. Allow overriding the local rank with idist.set_local_rank(), which is never considered when SLURM_JOB_ID is detected (a combined sketch follows below)
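
Roughly, the combination of 1. and 2. could look something like this (just a sketch to discuss, not the actual ignite code; the helper name is made up):

    import torch

    def _pick_device_index(slurm_local_rank, user_local_rank=None):
        # hypothetical helper: a user-provided local rank wins over SLURM_LOCALID (item 2)
        lrank = user_local_rank if user_local_rank is not None else slurm_local_rank
        # with --ntasks-per-gpu each task only sees one GPU, so clamp to the visible devices (item 1)
        if torch.cuda.is_available() and lrank >= torch.cuda.device_count():
            lrank = 0
        return lrank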

nowtryz commented Aug 7, 2024

Hi @vfdev-5,

Following https://slurm.schedmd.com/sbatch.html, SLURM_NTASKS_PER_GPU is set when --ntasks-per-gpu is specified.

SLURM_NTASKS_PER_GPU is just set by sbatch as an output variable so that it can be used as input by srun, so it may not be reliable. In my case, --ntasks-per-gpu was set on the srun command directly, inside the script submitted to sbatch.

@nowtryz can you provide the full traceback to get the exact error message?

Sure, I will get the traceback ASAP

As for the fix, IMO there are two things to be done here:

  1. The first solution seems like a simple workaround; I will try it.
  2. I think this is the correct solution. Even more, I think ignite should still allow user configuration on Slurm, as the Slurm job may not be configured the way ignite expects.
     Could it be possible to use the environment only for information not specified by the user?
     Let me explain: in my case, for example, my sbatch script sets MASTER_PORT and MASTER_ADDR, since the address of node 0 can easily be retrieved from the submitted script (which runs on node 0), but they are completely ignored by ignite.

vfdev-5 commented Aug 12, 2024

It can be tricky to verify and get a correct DDP configuration when we mix Slurm env vars with PyTorch DDP env vars (e.g. MASTER_PORT, MASTER_ADDR, RANK, LOCAL_RANK, WORLD_SIZE).

Could it be possible to use environment only for information not specified by the user?

What do you mean exactly here, which environment?

Here is where we translate slurm vars into pth env:

def _setup_ddp_vars_from_slurm_env(environ: Dict[str, str]) -> Dict[str, Union[str, int]]:
    """Method to setup DDP env vars required by PyTorch from SLURM env"""
    # 1) Tools like enroot can have hooks to translate slurm env vars to RANK, LOCAL_RANK, WORLD_SIZE etc
    # See https://github.com/NVIDIA/enroot/blob/v3.1.0/conf/hooks/extra/50-slurm-pytorch.sh
    # 2) User can use torch.distributed.launch tool to schedule on N local GPUs using 1 node, 1 task by SLURM
    # To cover case 1), let's ensure that defined RANK == SLURM_PROCID, LOCAL_RANK == SLURM_LOCALID,
    # WORLD_SIZE == SLURM_NTASKS. We will use defined MASTER_ADDR and MASTER_PORT instead of defining
    # them by our means
    # To cover case 2), let's check that defined RANK >= SLURM_PROCID, LOCAL_RANK >= SLURM_LOCALID,
    # WORLD_SIZE >= SLURM_NTASKS, SLURM_JOB_NUM_NODES == 1
    ddp_vars: Dict[str, Union[str, int, None]] = {
        "RANK": int(environ["SLURM_PROCID"]),
        "LOCAL_RANK": int(environ["SLURM_LOCALID"]),
        "WORLD_SIZE": int(environ["SLURM_NTASKS"]),
        "MASTER_ADDR": None,
        "MASTER_PORT": None,
    }
    pth_ddp_env_vars = {key: environ.get(key, None) for key in ddp_vars}
    defined_pth_ddp_env_vars = [v is not None for v in pth_ddp_env_vars.values()]

    if all(defined_pth_ddp_env_vars):
        nnodes = int(environ["SLURM_JOB_NUM_NODES"])
        if nnodes > 1:
            # ensure that all pth_ddp_env_vars are consistent with slurm vars
            for key in ["RANK", "LOCAL_RANK", "WORLD_SIZE"]:
                slurm_var = cast(int, ddp_vars[key])
                pth_var = int(cast(str, pth_ddp_env_vars[key]))
                if slurm_var != pth_var:
                    raise RuntimeError(
                        "Environment variable defined for PyTorch Distributed context is inconsistent with "
                        f"equivalent SLURM env variable. {key}: {pth_var} vs {slurm_var}\n"
                        f"SLURM vars: {ddp_vars}\n"
                        f"PTH vars: {pth_ddp_env_vars}\n"
                    )
        else:
            # ensure that PTH RANK >= SLURM_PROCID, PTH LOCAL_RANK >= SLURM_LOCALID,
            # PTH WORLD_SIZE >= SLURM_NTASKS
            for key in ["RANK", "LOCAL_RANK", "WORLD_SIZE"]:
                slurm_var = cast(int, ddp_vars[key])
                pth_var = int(cast(str, pth_ddp_env_vars[key]))
                if pth_var < slurm_var:
                    raise RuntimeError(
                        "Environment variable defined for PyTorch Distributed context is "
                        "inconsistent with equivalent SLURM env variable. "
                        f"We expect that {key}: {pth_var} >= {slurm_var}\n"
                        f"SLURM vars: {ddp_vars}\n"
                        f"PTH vars: {pth_ddp_env_vars}\n"
                    )
                ddp_vars[key] = pth_var
        # set up MASTER_ADDR and MASTER_PORT from PTH
        ddp_vars["MASTER_ADDR"] = cast(str, pth_ddp_env_vars["MASTER_ADDR"])
        ddp_vars["MASTER_PORT"] = int(cast(str, pth_ddp_env_vars["MASTER_PORT"]))
    elif any(defined_pth_ddp_env_vars):
        # Let's warn user about PTH env variables that we could not taken into account
        warnings.warn(
            "We detected the following env variables: "
            f"{[(k, v) for k, v in pth_ddp_env_vars.items() if v is not None]},\n"
            "but will not take them into account as the following env vars are missing:"
            f"{[k for k, v in pth_ddp_env_vars.items() if v is None]},\n"
        )

    if ddp_vars["MASTER_ADDR"] is None:
        nodelist = environ["SLURM_JOB_NODELIST"]
        try:
            # use scontrol to expand hostname list
            hostnames = subprocess.check_output(["scontrol", "show", "hostnames", nodelist])
            method = "scontrol"
        except FileNotFoundError:
            # expand hostname list as scontrol
            hostnames = " ".join(_expand_hostlist(nodelist)).encode("utf-8")
            method = "ignite"
        # at least one hostname should be defined
        hostname_list = hostnames.split()
        if len(hostname_list) < 1:
            raise RuntimeError(f"No hostname detected in SLURM_JOB_NODELIST by {method} (nodelist={nodelist})")
        # master address is the first hostname of nodes list
        ddp_vars["MASTER_ADDR"] = str(hostname_list[0].decode("utf-8"))

    if ddp_vars["MASTER_PORT"] is None:
        # port should be the same over all process
        slurm_port = environ["SLURM_JOB_ID"]
        slurm_port = slurm_port[-4:]
        ddp_vars["MASTER_PORT"] = int(slurm_port) + 15000

    return cast(Dict[str, Union[str, int]], ddp_vars)

Maybe, we could relax this part:

    ddp_vars: Dict[str, Union[str, int, None]] = {
        "RANK": int(environ["SLURM_PROCID"]),
        "LOCAL_RANK": int(environ["SLURM_LOCALID"]),
        "WORLD_SIZE": int(environ["SLURM_NTASKS"]),
        "MASTER_ADDR": None,
        "MASTER_PORT": None,
    }

and use MASTER_PORT, MASTER_ADDR, RANK, LOCAL_RANK, WORLD_SIZE from the env if the user has provided them. The problem would then be verifying that there is no inconsistency between the Slurm env vars and the user-provided PTH env vars...
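
For example, something along these lines (an untested sketch, not a patch; the helper name is made up) is what I have in mind:

    from typing import Dict, Union

    def _relaxed_slurm_to_pth(environ: Dict[str, str]) -> Dict[str, Union[str, int, None]]:
        # sketch only: prefer user-provided PTH env vars and fall back to the SLURM values,
        # instead of always overwriting RANK / LOCAL_RANK / WORLD_SIZE from SLURM;
        # how to verify consistency between the two sources is still the open question
        return {
            "RANK": int(environ.get("RANK", environ["SLURM_PROCID"])),
            "LOCAL_RANK": int(environ.get("LOCAL_RANK", environ["SLURM_LOCALID"])),
            "WORLD_SIZE": int(environ.get("WORLD_SIZE", environ["SLURM_NTASKS"])),
            "MASTER_ADDR": environ.get("MASTER_ADDR", None),
            "MASTER_PORT": environ.get("MASTER_PORT", None),
        }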

For info, some time ago @sdesrozis wrote these notes on how to use ignite with Slurm: https://github.com/sdesrozis/why-ignite/tree/main/basics/2_slurm
