NCCL error when trying to train on 2 nodes #19544

Open
waynemystir opened this issue Feb 27, 2024 · 1 comment
Labels: bug (Something isn't working), needs triage (Waiting to be triaged by maintainers)

Comments

waynemystir commented Feb 27, 2024

Bug description

I am trying to run a very simple training script for 2 nodes and I always get this error:

Output:

(ve) root@442a8ba5c0c6:~/ptl# . wr.sh 
Start fitting...
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
[W Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function getCvarBool)
442a8ba5c0c6:126284:126284 [0] NCCL INFO cudaDriverVersion 12030
442a8ba5c0c6:126284:126284 [0] NCCL INFO Bootstrap : Using eth0:172.18.0.2<0>
442a8ba5c0c6:126284:126284 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
442a8ba5c0c6:126284:126284 [0] NCCL INFO init.cc:1627 Cuda Host Alloc Size 4 pointer 0x7f1756e00000
442a8ba5c0c6:126284:126391 [0] NCCL INFO Failed to open libibverbs.so[.1]
442a8ba5c0c6:126284:126391 [0] NCCL INFO NET/Socket : Using [0]eth0:172.18.0.2<0>
442a8ba5c0c6:126284:126391 [0] NCCL INFO Using non-device net plugin version 0
442a8ba5c0c6:126284:126391 [0] NCCL INFO Using network Socket

442a8ba5c0c6:126284:126391 [0] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to 192.168.240.2<39817> failed : Software caused connection abort
442a8ba5c0c6:126284:126391 [0] NCCL INFO misc/socket.cc:565 -> 2
442a8ba5c0c6:126284:126391 [0] NCCL INFO misc/socket.cc:619 -> 2
442a8ba5c0c6:126284:126391 [0] NCCL INFO bootstrap.cc:274 -> 2
442a8ba5c0c6:126284:126391 [0] NCCL INFO init.cc:1388 -> 2
442a8ba5c0c6:126284:126391 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
442a8ba5c0c6:126284:126284 [0] NCCL INFO group.cc:418 -> 2
442a8ba5c0c6:126284:126284 [0] NCCL INFO group.cc:95 -> 2
[rank1]:[W Module.cpp:156] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

Traceback (most recent call last):
  File "/root/ptl/wrain.py", line 68, in <module>
    main()
  File "/root/ptl/wrain.py", line 62, in main
    trainer.fit(model, train_dataloaders=train_data)
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 943, in _run
    self.__setup_profiler()
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1078, in __setup_profiler
    self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1248, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/root/ptl/ve/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/root/ptl/ve/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2411, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/root/ptl/ve/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/root/ptl/ve/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1910, in broadcast
    work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:
socketStartConnect: Connect to 192.168.240.2<39817> failed : Software caused connection abort
Exception raised from getNCCLComm at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691 (most recent call first):
C++ CapturedTraceback:
#4 c10::Error::Error(c10::SourceLocation, std::string) from ??:0
#5 c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) [clone .cold] from ProcessGroupNCCL.cpp:0
#6 c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) from ??:0
#7 c10d::ops::(anonymous namespace)::broadcast_CUDA(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long) from Ops.cpp:0
#8 std::decay<c10::guts::infer_function_traits<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long> > >::type::return_type>::type c10::impl::call_functor_with_args_from_stack_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long> >, false, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long>(c10::OperatorKernel*, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul>, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long>*) from :0
#9 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
#10 torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
#11 void c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from VariableFallbackKernel.cpp:0
#12 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long) from :0
#13 c10d::ProcessGroup::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) from :0
#14 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from :0
#15 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#16 PyObject_CallFunctionObjArgs from ??:0
#17 _PyObject_MakeTpCall from ??:0
#18 PyMethod_New from ??:0
#19 _PyEval_EvalFrameDefault from ??:0
#20 _PyFunction_Vectorcall from ??:0
#21 PyObject_Call from ??:0
#22 _PyEval_EvalFrameDefault from ??:0
#23 _PyFunction_Vectorcall from ??:0
#24 _PyEval_EvalFrameDefault from ??:0
#25 _PyFunction_Vectorcall from ??:0
#26 PyObject_Call from ??:0
#27 _PyEval_EvalFrameDefault from ??:0
#28 _PyFunction_Vectorcall from ??:0
#29 _PyEval_EvalFrameDefault from ??:0
#30 _PyFunction_Vectorcall from ??:0
#31 _PyEval_EvalFrameDefault from ??:0
#32 PyObject_IsTrue from ??:0
#33 _PyObject_GenericGetAttrWithDict from ??:0
#34 PyObject_GetAttr from ??:0
#35 _PyEval_EvalFrameDefault from ??:0
#36 _PyFunction_Vectorcall from ??:0
#37 _PyEval_EvalFrameDefault from ??:0
#38 PyMethod_New from ??:0
#39 _PyEval_EvalFrameDefault from ??:0
#40 PyMethod_New from ??:0
#41 _PyEval_EvalFrameDefault from ??:0
#42 PyMethod_New from ??:0
#43 PyObject_Call from ??:0
#44 _PyEval_EvalFrameDefault from ??:0
#45 _PyFunction_Vectorcall from ??:0
#46 _PyEval_EvalFrameDefault from ??:0
#47 PyMethod_New from ??:0
#48 _PyEval_EvalFrameDefault from ??:0
#49 _PyFunction_Vectorcall from ??:0
#50 _PyEval_EvalFrameDefault from ??:0
#51 PyEval_EvalCode from ??:0
#52 PyEval_EvalCode from ??:0
#53 PyUnicode_Tailmatch from ??:0
#54 PyInit__collections from ??:0
#55 PyUnicode_Tailmatch from ??:0
#56 _PyRun_SimpleFileObject from ??:0
#57 _PyRun_AnyFileObject from ??:0
#58 Py_RunMain from ??:0
#59 Py_BytesMain from ??:0
#60 __libc_init_first from ??:0
#61 __libc_start_main from ??:0
#62 _start from ??:0

What version are you seeing the problem on?

v2.2

How to reproduce the bug

I'm trying to run a very simple script for 2 nodes:

`export MASTER_ADDR=194.26.196.214
export MASTER_PORT=31421
export WORLD_SIZE=2
export NODE_RANK=1
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_SHOW_CPP_STACKTRACES=1
export TORCH_NCCL_BLOCKING_WAIT=1

export NCCL_P2P_DISABLE=1
export NCCL_P2P_LEVEL=NVL
export NCCL_IB_GID_INDEX=3
python train.py`

Where train.py is this:

`import os
import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer

class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x): 
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        print("batch_idx=", batch_idx)
        loss = self(batch).sum()
        self.log(
            "train_loss",
            loss,
            on_step=True,
            on_epoch=True,
            prog_bar=True,
            logger=True,
        )
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def main():
    train_data = DataLoader(RandomDataset(32, 64000), batch_size=8000)
    model = BoringModel()
    trainer = Trainer(
        devices=1,
        num_nodes=2,
        accelerator="gpu",
        strategy="ddp",
        limit_train_batches=100000,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        log_every_n_steps=1,
    )   
    print(f"Start fitting...")
    trainer.fit(model, train_dataloaders=train_data)

    print("DONE")


if __name__ == "__main__":
    main()`
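
The environment above is for the rank-1 node. With WORLD_SIZE=2, the run presumably requires the same train.py to be launched on the other node as well, with NODE_RANK=0 and everything else unchanged. A minimal sketch of that second node's setup (an assumption inferred from the exports above, not something posted in the issue):

`# Hypothetical rank-0 counterpart of the launch script above; only NODE_RANK differs.
export MASTER_ADDR=194.26.196.214
export MASTER_PORT=31421
export WORLD_SIZE=2
export NODE_RANK=0
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_SHOW_CPP_STACKTRACES=1
export TORCH_NCCL_BLOCKING_WAIT=1

export NCCL_P2P_DISABLE=1
export NCCL_P2P_LEVEL=NVL
export NCCL_IB_GID_INDEX=3
python train.py`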

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

waynemystir added the bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) labels on Feb 27, 2024

p208p2002 commented Feb 29, 2024

Did you set a different NODE_RANK on each node?
I'm currently running multi-node training with Lightning v2.2.0 + DeepSpeed on Azure's GPU cluster successfully, without manually setting any environment variables (maybe they are set by the cluster system).
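
As an illustrative aside (not something discussed in the thread): the failure in the log is a plain TCP connect (socketStartConnect to 192.168.240.2<39817>), so besides checking NODE_RANK it can help to confirm basic reachability between the nodes, for example that each node can reach the configured MASTER_ADDR and MASTER_PORT. A minimal sketch of such a check, assuming netcat is installed on the nodes:

`# Run on each node; host/port taken from the MASTER_ADDR/MASTER_PORT exports above.
nc -vz 194.26.196.214 31421

# Pure-Python fallback if netcat is not available:
python -c "import socket; socket.create_connection(('194.26.196.214', 31421), timeout=5); print('reachable')"`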
