DDP initialization breaks on HPC system at larger scale #10216
Comments
@proutrc Hi, thanks for reporting. You mention LSF. Could you verify that
@awaelchli Something is indeed happening when going from 13 to 14 nodes on our system. See below.

13-node snippet (what we expect):

14-node snippet:

At 14 nodes it shows
It looks like I can confirm I see The variable to use is perhaps Somewhat dated reference, but I bet it is still relevant: https://www.ibm.com/support/pages/limitation-environment-variable-lsbhosts
Thank you @proutrc for helping out here. Quote:
Their example there shows the format:
So, should we default to LSB_HOSTS in our plugin and, if it is not set, try to get the hosts from LSB_MCPU_HOSTS instead? When this happens, we can log a debug message. The fact that the variable contains the word "CPU" is a bit strange, but I think for us this does not matter.
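As a rough illustration of the fallback being proposed (the helper name `read_hosts_with_fallback` is hypothetical, not part of the plugin), the two variables can be reconciled into one list of one entry per slot:

```python
import os


def read_hosts_with_fallback():
    """Hypothetical sketch: prefer LSB_HOSTS, otherwise parse LSB_MCPU_HOSTS."""
    if os.environ.get("LSB_HOSTS"):
        # LSB_HOSTS lists one host name per slot, e.g. "host0 host0 host1 host1"
        return os.environ["LSB_HOSTS"].split()
    # LSB_MCPU_HOSTS alternates host name and slot count, e.g. "host0 2 host1 2"
    tokens = os.environ.get("LSB_MCPU_HOSTS", "").split()
    if not tokens or len(tokens) % 2 != 0:
        raise ValueError("Cannot parse hosts from LSB_MCPU_HOSTS")
    hosts = []
    for name, count in zip(tokens[::2], tokens[1::2]):
        hosts.extend([name] * int(count))  # expand back to one entry per slot
    return hosts
```

With this shape, both variables yield the same per-slot host list, so downstream rank assignment would not care which one was present.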
Could you help me verify that this fixes the issue by creating this custom cluster env:

```python
import os

from pytorch_lightning.plugins.environments import ClusterEnvironment


class NewLSFEnvironment(LSFEnvironment):
    @staticmethod
    def is_using_lsf() -> bool:
        required_env_vars = ("LSB_JOBID", "LSB_MCPU_HOSTS", "JSM_NAMESPACE_LOCAL_RANK", "JSM_NAMESPACE_SIZE")
        return all(v in os.environ for v in required_env_vars)

    @staticmethod
    def _read_hosts():
        hosts_config = os.environ.get("LSB_MCPU_HOSTS", "")
        if not hosts_config:
            raise ValueError("Could not find hosts in environment variable LSB_MCPU_HOSTS")
        host_config = hosts_config.split()
        if len(host_config) % 2 != 0:
            raise ValueError(
                "Cannot parse hosts from LSB_MCPU_HOSTS environment variable. Expected format:"
                ' "<node0_name> <node0_num_procs> <node1_name> ..."'
            )
        return host_config[::2]

    def _get_master_address(self):
        return self._read_hosts()[0]
```

and adding it to your trainer like so:

```python
trainer = Trainer(plugins=NewLSFEnvironment())
```

The main change I made is switching to that env variable and adjusting the parsing. If things break down on your side, could you let me know the values of the two environment variables LSB_MCPU_HOSTS and LSB_HOSTS.
When trying this override in my small program (listed above) it complains about LSFEnvironment not being defined. I thought maybe you meant
Thank you very much for trying, and for your patience.

```python
import os

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision.transforms import transforms

import pytorch_lightning as pl
from pl_examples.basic_examples.mnist_datamodule import MNIST
from pytorch_lightning.plugins import DDPPlugin
from pytorch_lightning.plugins.environments import LSFEnvironment


class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

    def forward(self, x):
        # in lightning, forward defines the prediction/inference actions
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # It is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        # Logging to TensorBoard by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


class NewLSFEnvironment(LSFEnvironment):
    @staticmethod
    def is_using_lsf() -> bool:
        required_env_vars = ("LSB_JOBID", "LSB_MCPU_HOSTS", "JSM_NAMESPACE_LOCAL_RANK", "JSM_NAMESPACE_SIZE")
        return all(v in os.environ for v in required_env_vars)

    @staticmethod
    def _read_hosts():
        hosts_config = os.environ.get("LSB_MCPU_HOSTS", "")
        if not hosts_config:
            raise ValueError("Could not find hosts in environment variable LSB_MCPU_HOSTS")
        host_config = hosts_config.split()
        if len(host_config) % 2 != 0:
            raise ValueError(
                "Cannot parse hosts from LSB_MCPU_HOSTS environment variable. Expected format:"
                ' "<node0_name> <node0_num_procs> <node1_name> ..."'
            )
        return host_config[::2]

    def _get_master_address(self):
        return self._read_hosts()[0]


if __name__ == "__main__":
    # define datasets/dataloaders
    dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
    train_dl = DataLoader(dataset)

    # simulate locally
    # os.environ["LSB_JOBID"] = "1234"
    # os.environ["LSB_MCPU_HOSTS"] = "localhost 2 localhost 2"
    # os.environ["JSM_NAMESPACE_LOCAL_RANK"] = "1"
    # os.environ["JSM_NAMESPACE_RANK"] = "1"
    # os.environ["JSM_NAMESPACE_SIZE"] = "4"

    # train
    model = LitAutoEncoder()

    # simulate locally
    # trainer = pl.Trainer(
    #     num_processes=2,
    #     num_nodes=2,
    #     max_epochs=3,
    #     plugins=DDPPlugin(cluster_environment=NewLSFEnvironment()),
    # )
    trainer = pl.Trainer(
        gpus="0,1,2,3,4,5",
        auto_select_gpus=True,
        num_nodes=14,
        max_epochs=3,
        plugins=DDPPlugin(cluster_environment=NewLSFEnvironment()),
    )
    print("LSB_MCPU_HOSTS:", os.environ["LSB_MCPU_HOSTS"])
    trainer.fit(model, train_dl)
```
It looks like we are switching to the DDPPlugin now. Error with the DDPPlugin method:

This seems mostly self-explanatory. Just wanted to confirm this is the recommended path, as it appears to be changing some underlying things. On another note, when I do this (not using DDPPlugin) it hangs and never runs the model (but it does seem to initialize all ranks on the GPUs beyond our previous limit):

If the latter method, without DDPPlugin, is also an option, please let me know.

Thanks!
@awaelchli Hopefully my last question made sense. Ultimately, it appeared the new env variable and parsing worked: I could see it seemingly initialize the GPUs if I don't use the DDPPlugin, while switching to the DDPPlugin gave the error shown above. Please let me know if I can do or provide anything else.
@proutrc Apologies for the delay. I'm struggling to simulate what you are seeing because I can't try it on an LSF cluster myself. @ajtritt, who is the original contributor of the LSF environment plugin, could you review the changes in my PR over here, #10354? If you still have access to LSF and could verify that I didn't break anything, that would be very appreciated.
Dear @proutrc,

Any chance you would be open to a pair-debugging session with the Lightning Team (@awaelchli) so we can resolve this issue?

Best,
Hi @proutrc, I have been using a custom ClusterEnvironment to run on Summit: https://github.com/exabiome/deep-taxon/blob/master/src/exabiome/nn/lsf_environment.py

It's been some time since I updated things though. Here's what my environment looks like:

- PyTorch Lightning Version: 1.4.3

The rest of the environment is cloned from open-ce-1.2.0-py38-0.

I've been able to run on 128 nodes successfully; I haven't tried anything past that. Let me know what that turns up.

Andrew
My apologies for the delay this time :)

It seems the provided LSFEnvironment in pytorch-lightning could reflect your example? I see you use

Here is my trainer, FWIW:

Getting me this for the 32 nodes x 6 GPUs/node (192 ranks):

And
Dear @proutrc,

It seems your

Best,
@tchaton I am happy to help however I can, but I can't claim the implementation. @ajtritt is the one who provided the implementation for the LSFEnvironment that worked for me; I just tested it within my little example. I may have goofed something with @awaelchli's fix. That method may be valid too, but the other one is what worked for me. I think I might see what those guys think too.
We can certainly add the LSB_RANK_HOSTFILE functionality. However, according to the documentation, this environment variable is not set by default, so we probably still want to fall back to LSB_HOSTS or LSB_MCPU_HOSTS in case it is not defined. Since I don't have the environment to properly test and debug my PR #10354, I would definitely go for what is confirmed here to work. If @ajtritt has the time, we would be happy to receive a contribution of his improved LSF code 😃 |
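The fallback order described above (LSB_RANK_HOSTFILE if present, otherwise the LSB_* host variables) can be sketched as follows; the helper name `read_hosts_from_rankfile` and the de-duplication choice are illustrative assumptions, not the plugin's actual code:

```python
import os


def read_hosts_from_rankfile():
    """Hypothetical sketch: read one host per line from the file named by
    LSB_RANK_HOSTFILE; fall back to LSB_HOSTS / LSB_MCPU_HOSTS when unset."""
    rankfile = os.environ.get("LSB_RANK_HOSTFILE")
    if rankfile:
        with open(rankfile) as f:
            lines = [line.strip() for line in f if line.strip()]
        # the rank file has one line per rank; keep unique hosts in first-seen order
        return list(dict.fromkeys(lines))
    if os.environ.get("LSB_HOSTS"):
        # one host name per slot
        return list(dict.fromkeys(os.environ["LSB_HOSTS"].split()))
    # "<host> <num_procs> <host> <num_procs> ..." pairs; keep every other token
    return os.environ.get("LSB_MCPU_HOSTS", "").split()[::2]
```

Logging a debug message on each fallback step, as suggested earlier in the thread, would make it obvious which variable was actually used on a given cluster.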
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team! |
Would you mind sharing your LSF submit script? I am new to LSF and would love to see how you configure your script for submission to the cluster.
We are experiencing an issue with DDP at larger scales on our HPC system (Summit at OLCF, which uses the LSF scheduler). The specific threshold is at 14 nodes, where things suddenly aren't able to initialize anymore. It appears there is all of a sudden an inability to set up ranks across nodes properly, as depicted below in the output.

Each node has 6 GPUs, so in total we are trying to use 84 GPUs when things are suddenly unable to initialize. At 78 GPUs (13 nodes) things work as expected.
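For context, each jsrun-launched process on Summit sees its rank through the JSM_NAMESPACE_* variables referenced elsewhere in this thread; a minimal sketch (the function name `describe_rank` is hypothetical) of how a cluster environment can derive rank information from them:

```python
import os


def describe_rank():
    """Hypothetical sketch: derive DDP rank info from jsrun's environment."""
    world_size = int(os.environ["JSM_NAMESPACE_SIZE"])        # total ranks, e.g. 14 nodes * 6 GPUs = 84
    global_rank = int(os.environ["JSM_NAMESPACE_RANK"])       # rank across all nodes
    local_rank = int(os.environ["JSM_NAMESPACE_LOCAL_RANK"])  # rank within this node
    return global_rank, local_rank, world_size
```

If any rank derives a wrong world size or host list at scale, initialization hangs waiting for peers that never join, which matches the symptom described above.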
Initialization output at 13 nodes (78 GPUs - this works as expected):
Failed initialization at 14 nodes (84 GPUs - this hangs at this point):
Here is the code:
- PyTorch Lightning Version: 1.4.9
- PyTorch Version: 1.9
- Python version: 3.7
- How you installed PyTorch (conda, pip, source): source

cc @awaelchli @rohitgr7