Encounter NCCL error when running PyTorch example code #1504

Open
Noblezhong opened this issue Nov 3, 2024 · 5 comments
@Noblezhong

Hi! When I try to run a Python script for LLM inference with pipeline parallelism on a single server with multiple GPUs, it fails with NCCL-related errors.

Here is my development environment:

(pippy) root@678c7278cb2d:/zt/code/my_dev# pip list
Package                  Version
------------------------ -----------
accelerate               1.0.1
aiohappyeyeballs         2.4.0
aiohttp                  3.10.5
aiosignal                1.2.0
async-timeout            4.0.3
attrs                    24.2.0
autocommand              2.2.2
backports.tarfile        1.2.0
Bottleneck               1.3.7
Brotli                   1.0.9
certifi                  2024.8.30
charset-normalizer       3.3.2
datasets                 2.19.1
dill                     0.3.8
filelock                 3.13.1
frozenlist               1.4.0
fsspec                   2024.10.0
huggingface_hub          0.24.6
idna                     3.7
importlib_metadata       8.0.0
importlib_resources      6.4.0
inflect                  7.3.1
jaraco.context           5.3.0
jaraco.functools         4.0.1
jaraco.text              3.12.1
Jinja2                   3.1.4
MarkupSafe               3.0.2
mkl_fft                  1.3.10
mkl_random               1.2.7
mkl-service              2.4.0
more-itertools           10.3.0
mpmath                   1.3.0
multidict                6.0.4
multiprocess             0.70.15
networkx                 3.4.2
numexpr                  2.8.7
numpy                    1.26.4
nvidia-cublas-cu12       12.4.5.8
nvidia-cuda-cupti-cu12   12.4.127
nvidia-cuda-nvrtc-cu12   12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12        9.1.0.70
nvidia-cufft-cu12        11.2.1.3
nvidia-curand-cu12       10.3.5.147
nvidia-cusolver-cu12     11.6.1.9
nvidia-cusparse-cu12     12.3.1.170
nvidia-nccl-cu12         2.21.5
nvidia-nvjitlink-cu12    12.4.127
nvidia-nvtx-cu12         12.4.127
ordered-set              4.1.0
packaging                24.1
pandas                   2.2.2
pip                      24.2
platformdirs             4.2.2
psutil                   6.1.0
pyarrow                  16.1.0
PySocks                  1.7.1
python-dateutil          2.9.0.post0
pytz                     2024.1
PyYAML                   6.0.2
regex                    2024.9.11
requests                 2.32.3
safetensors              0.4.5
setuptools               72.1.0
six                      1.16.0
sympy                    1.13.1
tokenizers               0.15.2
tomli                    2.0.1
torch                    2.5.0
torchpippy               0.2.0
tqdm                     4.66.5
transformers             4.36.2
triton                   3.1.0
typeguard                4.3.0
typing_extensions        4.11.0
tzdata                   2023.3
urllib3                  2.2.3
wheel                    0.44.0
xxhash                   2.0.2
yarl                     1.11.0
zipp                     3.19.2

Here is the command I use to run the script:

torchrun --nproc-per-node 4 pippy_llama.py
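For debugging a hang or NCCL failure like the one below, it can help to re-run the same command with extra diagnostics enabled. These are standard NCCL/PyTorch environment knobs, not anything specific to this repo, so treat this as a suggested sketch:

```shell
# Set before re-running `torchrun --nproc-per-node 4 pippy_llama.py`:
export NCCL_DEBUG=INFO                 # NCCL logs its topology and the failing step
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra c10d collective consistency checks
export CUDA_LAUNCH_BLOCKING=1          # report CUDA asserts at the faulting op, not later
```

With `CUDA_LAUNCH_BLOCKING=1`, the device-side assert shown at the end of the log should be attributed to the Python line that launched the bad kernel instead of surfacing as a downstream NCCL timeout.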

Here is the error output:

W1103 07:04:16.199000 21556 site-packages/torch/distributed/run.py:793] 
W1103 07:04:16.199000 21556 site-packages/torch/distributed/run.py:793] *****************************************
W1103 07:04:16.199000 21556 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1103 07:04:16.199000 21556 site-packages/torch/distributed/run.py:793] *****************************************
/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/utils/generic.py:441: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  _torch_pytree._register_pytree_node(
(the same `_register_pytree_node` FutureWarning from transformers/utils/generic.py is repeated by each of the four ranks)
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:14<00:00,  7.35s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:14<00:00,  7.39s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:14<00:00,  7.39s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:19<00:00,  9.69s/it]
layers_per_rank = 8
layers_per_rank = 8
layers_per_rank = 8
layers_per_rank = 8
[rank0]:[W1103 07:04:58.102756460 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
[rank3]:[E1103 07:34:57.813334593 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=1800000) ran for 1800007 milliseconds before timing out.
[rank3]:[E1103 07:34:57.815173839 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1103 07:34:57.823033599 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=1800000) ran for 1800014 milliseconds before timing out.
[rank3]: Traceback (most recent call last):
[rank3]:   File "/zt/code/my_dev/pippy_llama.py", line 59, in <module>
[rank3]:     output = schedule.step(args)
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 615, in step
[rank3]:     self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 702, in _step_microbatches
[rank3]:     works = _sorted_batch_p2p(ops, desc="fwd_recv")
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 549, in _sorted_batch_p2p
[rank3]:     work_by_peer[peer] = _batch_p2p(ops, desc=desc)
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 524, in _batch_p2p
[rank3]:     return dist.batch_isend_irecv(p2p_ops).pop()
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2370, in batch_isend_irecv
[rank3]:     with _coalescing_manager(group, device, async_ops=True) as cm:
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/contextlib.py", line 142, in __exit__
[rank3]:     next(self.gen)
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2317, in _coalescing_manager
[rank3]:     work = group._end_coalescing(device)
[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:4409, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
[rank3]: ncclInternalError: Internal check failed.
[rank3]: Last error:

[rank1]:[E1103 07:34:57.823718519 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1103 07:34:57.825270404 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=1800000) ran for 1800015 milliseconds before timing out.
[rank2]:[E1103 07:34:57.826580065 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]: Traceback (most recent call last):
[rank2]:   File "/zt/code/my_dev/pippy_llama.py", line 59, in <module>
[rank2]:     output = schedule.step(args)
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 615, in step
[rank2]:     self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 702, in _step_microbatches
[rank2]:     works = _sorted_batch_p2p(ops, desc="fwd_recv")
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 549, in _sorted_batch_p2p
[rank2]:     work_by_peer[peer] = _batch_p2p(ops, desc=desc)
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 524, in _batch_p2p
[rank2]:     return dist.batch_isend_irecv(p2p_ops).pop()
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2370, in batch_isend_irecv
[rank2]:     with _coalescing_manager(group, device, async_ops=True) as cm:
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/contextlib.py", line 142, in __exit__
[rank2]:     next(self.gen)
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2317, in _coalescing_manager
[rank2]:     work = group._end_coalescing(device)
[rank2]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:4409, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
[rank2]: ncclInternalError: Internal check failed.
[rank2]: Last error:

[rank2]:[E1103 07:34:57.096569447 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 2] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1103 07:34:57.096590567 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1103 07:34:57.096597157 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1103 07:34:57.097694314 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=1800000) ran for 1800015 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e974fc45446 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7e970542a762 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e9705431ba3 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e970543360d in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7e974fdac5c0 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7e9750669ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7e97506fabf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [64,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [65,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [66,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [67,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [68,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [69,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [70,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [71,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [72,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [73,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [74,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [75,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [76,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [77,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [78,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [79,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [80,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [81,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [82,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [83,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [84,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [85,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [86,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [87,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [88,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [89,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [90,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [91,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [92,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [93,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [94,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [95,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [96,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [97,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [98,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [99,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [100,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [101,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [102,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [103,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [104,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [105,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [106,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [107,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [108,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [109,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [110,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [111,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [112,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [113,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [114,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [115,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [116,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [117,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [118,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [119,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [120,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [121,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [122,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [123,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [124,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [125,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [126,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [127,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [0,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [1,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [2,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [3,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [4,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [5,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [6,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [7,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [8,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [9,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [10,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [11,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [12,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [13,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [14,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
[... the same assertion repeated for threads [15,0,0] through [63,0,0] ...]
[rank1]: Traceback (most recent call last):
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/stage.py", line 580, in forward_one_chunk
[rank1]:     output = self.forward_maybe_with_nosync(*composite_args, **composite_kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/stage.py", line 468, in forward_maybe_with_nosync
[rank1]:     out_val = self.submod(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/graph_module.py", line 784, in call_wrapped
[rank1]:     return self._wrapped_call(self, *args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/graph_module.py", line 361, in __call__
[rank1]:     raise e
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/graph_module.py", line 348, in __call__
[rank1]:     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "<eval_with_key>.263", line 5, in forward
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/export/unflatten.py", line 136, in forward
[rank1]:     return torch.fx.Interpreter(self, graph=self.graph).run(
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 146, in run
[rank1]:     self.env[node] = self.run_node(node)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 203, in run_node
[rank1]:     return getattr(self, n.op)(n.target, args, kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 320, in call_module
[rank1]:     return submod(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/export/unflatten.py", line 136, in forward
[rank1]:     return torch.fx.Interpreter(self, graph=self.graph).run(
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 146, in run
[rank1]:     self.env[node] = self.run_node(node)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 203, in run_node
[rank1]:     return getattr(self, n.op)(n.target, args, kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 320, in call_module
[rank1]:     return submod(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/export/unflatten.py", line 136, in forward
[rank1]:     return torch.fx.Interpreter(self, graph=self.graph).run(
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 146, in run
[rank1]:     self.env[node] = self.run_node(node)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 203, in run_node
[rank1]:     return getattr(self, n.op)(n.target, args, kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 275, in call_function
[rank1]:     return target(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/_ops.py", line 716, in __call__
[rank1]:     return self._op(*args, **kwargs)
[rank1]: RuntimeError: CUDA error: device-side assert triggered
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


[rank1]: While executing %neg_default : [num_users=1] = call_function[target=torch.ops.aten.neg.default](args = (%slice_tensor_3,), kwargs = {})
[rank1]: Original traceback:
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1181, in forward
[rank1]:     outputs = self.model(
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1068, in forward
[rank1]:     layer_outputs = decoder_layer(
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/_IR.py", line 1160, in _split_before_forward
[rank1]:     return self._orig_forward(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 796, in forward
[rank1]:     hidden_states, self_attn_weights, present_key_value = self.self_attn(
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 704, in forward
[rank1]:     query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 234, in apply_rotary_pos_emb
[rank1]:     q_embed = (q * cos) + (rotate_half(q) * sin)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 208, in rotate_half
[rank1]:     return torch.cat((-x2, x1), dim=-1)


[rank1]: While executing %self_attn : [num_users=3] = call_module[target=self_attn](args = (%unsqueeze, %input_layernorm), kwargs = {})
[rank1]: Original traceback:
[rank1]: None

[rank1]: While executing %layers_8 : [num_users=3] = call_module[target=layers.8](args = (%unsqueeze, %add_47), kwargs = {})
[rank1]: Original traceback:
[rank1]: None

[rank1]: The above exception was the direct cause of the following exception:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/zt/code/my_dev/pippy_llama.py", line 59, in <module>
[rank1]:     output = schedule.step(args)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 615, in step
[rank1]:     self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 706, in _step_microbatches
[rank1]:     output = self._stage.forward_one_chunk(i, arg_mbs[i], kwarg_mbs[i])  # type: ignore[index]
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/stage.py", line 588, in forward_one_chunk
[rank1]:     raise RuntimeError(exc_msg) from e
[rank1]: RuntimeError: 
[rank1]:             [Stage 1] failed to run forward:
[rank1]:             args: ('Tensor(torch.Size([2, 4, 4096]), grad=False, dtype=torch.float32)', 'Tensor(torch.Size([1, 4]), grad=False, dtype=torch.int64)')
[rank1]:             kwargs: {}
[rank1]:             
[rank1]:[E1103 07:34:57.460801783 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1103 07:34:57.460843251 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1103 07:34:57.460851231 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E1103 07:34:57.462234381 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=1800000) ran for 1800014 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x79993284a446 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7998e802a762 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7998e8031ba3 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7998e803360d in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7999329b15c0 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x79993326eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7999332ffbf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

W1103 07:34:58.055000 21556 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 21559 closing signal SIGTERM
W1103 07:34:58.056000 21556 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 21561 closing signal SIGTERM
E1103 07:34:58.688000 21556 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 2 (pid: 21560) of binary: /root/miniconda3/envs/pippy/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/pippy/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.5.0', 'console_scripts', 'torchrun')())
  File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
pippy_llama.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-03_07:34:58
  host      : 678c7278cb2d
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 21560)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 21560
======================================================

I have no experience with CUDA programming or NCCL internals, so it's hard for me to debug this on my own. Can anyone help? Thanks!
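For what it's worth, the device-side assert at `IndexKernel.cu:93` (`index < sizes[i]`) usually fires before any NCCL problem: it means an out-of-range index reached an embedding or gather kernel, and the NCCL timeout is just the other ranks waiting on the crashed one. A minimal sanity check, using a hypothetical `vocab_size` and example token ids (in the real script, take `model.config.vocab_size` from the loaded Llama checkpoint):

```python
import torch

# Hypothetical values for illustration; substitute the real tokenizer output
# and model.config.vocab_size from the loaded checkpoint.
vocab_size = 32000
input_ids = torch.tensor([[1, 15043, 3186, 2]])  # an example (1, 4) batch

# The assert "-sizes[i] <= index && index < sizes[i]" means some index fed to
# an embedding/gather op was negative or >= the table size.
out_of_range = (input_ids < 0) | (input_ids >= vocab_size)
if out_of_range.any():
    print("bad token ids at positions:", out_of_range.nonzero().tolist())
else:
    print("all token ids in range")
```

Running the job with `CUDA_LAUNCH_BLOCKING=1` (as the traceback suggests) also makes the failing op show up at the correct stack frame instead of a later, unrelated call.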

@kiskra-nvidia
Copy link
Member

It looks like PyTorch times out for some reason. Rerunning the job with NCCL_DEBUG=INFO might provide more information.
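For example (launch command is a sketch; adjust the GPU count to your setup — the script name is taken from the traceback above):

```shell
# Synchronous kernel launches plus verbose NCCL logging
NCCL_DEBUG=INFO CUDA_LAUNCH_BLOCKING=1 \
torchrun --nproc_per_node=4 pippy_llama.py
```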

@Noblezhong
Copy link
Author

Thanks for responding to my issue! I tried this option, and the shell output is:

678c7278cb2d:26023:26023 [0] NCCL INFO Bootstrap : Using eth0:172.17.2.7<0>
678c7278cb2d:26023:26023 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
678c7278cb2d:26023:26023 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
678c7278cb2d:26023:26023 [0] NCCL INFO NET/Plugin: Using internal network plugin.
678c7278cb2d:26023:26023 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
678c7278cb2d:26024:26024 [1] NCCL INFO cudaDriverVersion 12040
678c7278cb2d:26024:26024 [1] NCCL INFO Bootstrap : Using eth0:172.17.2.7<0>
678c7278cb2d:26024:26024 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
678c7278cb2d:26024:26024 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
678c7278cb2d:26024:26024 [1] NCCL INFO NET/Plugin: Using internal network plugin.
678c7278cb2d:26025:26025 [2] NCCL INFO cudaDriverVersion 12040
678c7278cb2d:26026:26026 [3] NCCL INFO cudaDriverVersion 12040
678c7278cb2d:26025:26025 [2] NCCL INFO Bootstrap : Using eth0:172.17.2.7<0>
678c7278cb2d:26026:26026 [3] NCCL INFO Bootstrap : Using eth0:172.17.2.7<0>
678c7278cb2d:26025:26025 [2] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
678c7278cb2d:26025:26025 [2] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
678c7278cb2d:26025:26025 [2] NCCL INFO NET/Plugin: Using internal network plugin.
678c7278cb2d:26026:26026 [3] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
678c7278cb2d:26026:26026 [3] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
678c7278cb2d:26026:26026 [3] NCCL INFO NET/Plugin: Using internal network plugin.
678c7278cb2d:26023:27091 [0] NCCL INFO Failed to open libibverbs.so[.1]
678c7278cb2d:26023:27091 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.2.7<0>
678c7278cb2d:26023:27091 [0] NCCL INFO Using non-device net plugin version 0
678c7278cb2d:26023:27091 [0] NCCL INFO Using network Socket
678c7278cb2d:26024:27092 [1] NCCL INFO Failed to open libibverbs.so[.1]
678c7278cb2d:26024:27092 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.2.7<0>
678c7278cb2d:26024:27092 [1] NCCL INFO Using non-device net plugin version 0
678c7278cb2d:26024:27092 [1] NCCL INFO Using network Socket
678c7278cb2d:26025:27093 [2] NCCL INFO Failed to open libibverbs.so[.1]
678c7278cb2d:26026:27094 [3] NCCL INFO Failed to open libibverbs.so[.1]
678c7278cb2d:26025:27093 [2] NCCL INFO NET/Socket : Using [0]eth0:172.17.2.7<0>
678c7278cb2d:26026:27094 [3] NCCL INFO NET/Socket : Using [0]eth0:172.17.2.7<0>
678c7278cb2d:26025:27093 [2] NCCL INFO Using non-device net plugin version 0
678c7278cb2d:26026:27094 [3] NCCL INFO Using non-device net plugin version 0
678c7278cb2d:26025:27093 [2] NCCL INFO Using network Socket
678c7278cb2d:26026:27094 [3] NCCL INFO Using network Socket
678c7278cb2d:26026:27094 [3] NCCL INFO ncclCommInitRank comm 0xc937cb0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 61000 commId 0xd666edfc49b56902 - Init START
678c7278cb2d:26025:27093 [2] NCCL INFO ncclCommInitRank comm 0xc14e730 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 41000 commId 0xd666edfc49b56902 - Init START
678c7278cb2d:26023:27091 [0] NCCL INFO ncclCommInitRank comm 0x1aacdee0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xd666edfc49b56902 - Init START
678c7278cb2d:26024:27092 [1] NCCL INFO ncclCommInitRank comm 0xad38910 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 25000 commId 0xd666edfc49b56902 - Init START
678c7278cb2d:26024:27092 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
678c7278cb2d:26024:27092 [1] NCCL INFO NVLS multicast support is not available on dev 1
678c7278cb2d:26026:27094 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
678c7278cb2d:26026:27094 [3] NCCL INFO NVLS multicast support is not available on dev 3
678c7278cb2d:26025:27093 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
678c7278cb2d:26025:27093 [2] NCCL INFO NVLS multicast support is not available on dev 2
678c7278cb2d:26023:27091 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
678c7278cb2d:26023:27091 [0] NCCL INFO NVLS multicast support is not available on dev 0
678c7278cb2d:26023:27091 [0] NCCL INFO comm 0x1aacdee0 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
678c7278cb2d:26026:27094 [3] NCCL INFO comm 0xc937cb0 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
678c7278cb2d:26025:27093 [2] NCCL INFO comm 0xc14e730 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
678c7278cb2d:26023:27091 [0] NCCL INFO Channel 00/04 :    0   1   2   3
678c7278cb2d:26024:27092 [1] NCCL INFO comm 0xad38910 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
678c7278cb2d:26023:27091 [0] NCCL INFO Channel 01/04 :    0   1   2   3
678c7278cb2d:26023:27091 [0] NCCL INFO Channel 02/04 :    0   1   2   3
678c7278cb2d:26023:27091 [0] NCCL INFO Channel 03/04 :    0   1   2   3
678c7278cb2d:26026:27094 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2
678c7278cb2d:26023:27091 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
678c7278cb2d:26025:27093 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1
678c7278cb2d:26026:27094 [3] NCCL INFO P2P Chunksize set to 131072
678c7278cb2d:26023:27091 [0] NCCL INFO P2P Chunksize set to 131072
678c7278cb2d:26025:27093 [2] NCCL INFO P2P Chunksize set to 131072
678c7278cb2d:26024:27092 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
678c7278cb2d:26024:27092 [1] NCCL INFO P2P Chunksize set to 131072
678c7278cb2d:26023:27091 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
678c7278cb2d:26026:27094 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/CUMEM
678c7278cb2d:26024:27092 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM
678c7278cb2d:26025:27093 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM
678c7278cb2d:26023:27091 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
678c7278cb2d:26026:27094 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/CUMEM
678c7278cb2d:26024:27092 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM
678c7278cb2d:26025:27093 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM
678c7278cb2d:26023:27091 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
678c7278cb2d:26026:27094 [3] NCCL INFO Channel 02/0 : 3[3] -> 0[0] via P2P/CUMEM
678c7278cb2d:26024:27092 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM
678c7278cb2d:26025:27093 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM
678c7278cb2d:26023:27091 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
678c7278cb2d:26026:27094 [3] NCCL INFO Channel 03/0 : 3[3] -> 0[0] via P2P/CUMEM
678c7278cb2d:26024:27092 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM
678c7278cb2d:26025:27093 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM
678c7278cb2d:26024:27092 [1] NCCL INFO Connected all rings
678c7278cb2d:26025:27093 [2] NCCL INFO Connected all rings
678c7278cb2d:26026:27094 [3] NCCL INFO Connected all rings
678c7278cb2d:26026:27094 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM
678c7278cb2d:26023:27091 [0] NCCL INFO Connected all rings
678c7278cb2d:26026:27094 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM
678c7278cb2d:26026:27094 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/CUMEM
678c7278cb2d:26026:27094 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/CUMEM
678c7278cb2d:26024:27092 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
678c7278cb2d:26025:27093 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM
678c7278cb2d:26024:27092 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
678c7278cb2d:26025:27093 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM
678c7278cb2d:26024:27092 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
678c7278cb2d:26025:27093 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/CUMEM
678c7278cb2d:26024:27092 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
678c7278cb2d:26025:27093 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM
678c7278cb2d:26023:27091 [0] NCCL INFO Connected all trees
678c7278cb2d:26023:27091 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
678c7278cb2d:26023:27091 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
678c7278cb2d:26024:27092 [1] NCCL INFO Connected all trees
678c7278cb2d:26024:27092 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
678c7278cb2d:26024:27092 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
678c7278cb2d:26026:27094 [3] NCCL INFO Connected all trees
678c7278cb2d:26026:27094 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
678c7278cb2d:26026:27094 [3] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
678c7278cb2d:26025:27093 [2] NCCL INFO Connected all trees
678c7278cb2d:26025:27093 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
678c7278cb2d:26025:27093 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
678c7278cb2d:26023:27091 [0] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
678c7278cb2d:26026:27094 [3] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
678c7278cb2d:26024:27092 [1] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
678c7278cb2d:26025:27093 [2] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
678c7278cb2d:26023:27091 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
678c7278cb2d:26026:27094 [3] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
678c7278cb2d:26024:27092 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
678c7278cb2d:26025:27093 [2] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
678c7278cb2d:26023:27091 [0] NCCL INFO ncclCommInitRank comm 0x1aacdee0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xd666edfc49b56902 - Init COMPLETE
678c7278cb2d:26026:27094 [3] NCCL INFO ncclCommInitRank comm 0xc937cb0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 61000 commId 0xd666edfc49b56902 - Init COMPLETE
678c7278cb2d:26024:27092 [1] NCCL INFO ncclCommInitRank comm 0xad38910 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 25000 commId 0xd666edfc49b56902 - Init COMPLETE
678c7278cb2d:26025:27093 [2] NCCL INFO ncclCommInitRank comm 0xc14e730 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 41000 commId 0xd666edfc49b56902 - Init COMPLETE
678c7278cb2d:26023:27107 [0] NCCL INFO Channel 02/1 : 0[0] -> 1[1] via P2P/CUMEM
678c7278cb2d:26023:27107 [0] NCCL INFO Channel 03/1 : 0[0] -> 1[1] via P2P/CUMEM
678c7278cb2d:26023:27111 [0] NCCL INFO Channel 01/1 : 0[0] -> 2[2] via P2P/CUMEM
678c7278cb2d:26023:27111 [0] NCCL INFO Channel 02/1 : 0[0] -> 2[2] via P2P/CUMEM
678c7278cb2d:26023:27112 [0] NCCL INFO Channel 00/1 : 0[0] -> 3[3] via P2P/CUMEM
678c7278cb2d:26023:27112 [0] NCCL INFO Channel 03/1 : 0[0] -> 3[3] via P2P/CUMEM
[rank0]:[W1106 04:13:02.107228334 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
678c7278cb2d:26023:27095 [0] NCCL INFO [Service thread] Connection closed by localRank 0
678c7278cb2d:26023:27115 [0] NCCL INFO comm 0x1aacdee0 rank 0 nranks 4 cudaDev 0 busId 1000 - Abort COMPLETE
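As a side note, the PyTorch 2.4 warning above recommends calling `dist.destroy_process_group()` explicitly before the script exits. A minimal sketch of that teardown pattern (illustrative only — it uses a single-process `gloo` group and placeholder rank/address values; the real `pippy_llama.py` would use the `nccl` backend under `torchrun`):

```python
import os
import torch.distributed as dist

# Minimal single-process illustration of explicit process-group teardown.
# The real pipeline script would use backend="nccl" and run under torchrun;
# the rank/world_size/address values here are placeholders for a local demo.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group(backend="gloo", rank=0, world_size=1)
try:
    pass  # schedule.step(...) and the rest of the pipeline work would go here
finally:
    # Explicit teardown, as the ProcessGroupNCCL warning recommends, so no
    # pending collective work is left behind when the process exits.
    dist.destroy_process_group()
```

This does not fix the timeout itself, but it removes the shutdown warning on rank 0.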
[rank1]:[E1106 04:43:01.018941581 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=1800000) ran for 1800039 milliseconds before timing out.
[rank1]:[E1106 04:43:01.020724568 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1106 04:43:01.036680786 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=1800000) ran for 1800056 milliseconds before timing out.
[rank3]:[E1106 04:43:01.036737646 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=1800000) ran for 1800057 milliseconds before timing out.
[rank2]:[E1106 04:43:01.037171455 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
678c7278cb2d:26025:27113 [2] NCCL INFO bootstrap.cc:506 -> 3
678c7278cb2d:26025:27113 [2] NCCL INFO bootstrap.cc:522 -> 3
678c7278cb2d:26025:27113 [2] NCCL INFO transport.cc:140 -> 3
678c7278cb2d:26025:27113 [2] NCCL INFO group.cc:110 -> 3
678c7278cb2d:26025:27113 [2] NCCL INFO group.cc:64 -> 3 [Async thread]
678c7278cb2d:26025:26025 [0] NCCL INFO group.cc:418 -> 3
678c7278cb2d:26025:26025 [0] NCCL INFO group.cc:95 -> 3
[rank3]:[E1106 04:43:01.038072093 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
678c7278cb2d:26026:27114 [3] NCCL INFO bootstrap.cc:506 -> 3
678c7278cb2d:26026:27114 [3] NCCL INFO bootstrap.cc:522 -> 3
678c7278cb2d:26026:27114 [3] NCCL INFO transport.cc:140 -> 3
678c7278cb2d:26026:27114 [3] NCCL INFO group.cc:110 -> 3
678c7278cb2d:26026:27114 [3] NCCL INFO group.cc:64 -> 3 [Async thread]
678c7278cb2d:26026:26026 [0] NCCL INFO group.cc:418 -> 3
678c7278cb2d:26026:26026 [0] NCCL INFO group.cc:95 -> 3
[rank2]: Traceback (most recent call last):
[rank2]:   File "/zt/code/my_dev/pippy_llama.py", line 59, in <module>
[rank2]:     output = schedule.step(args)
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 615, in step
[rank2]:     self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 702, in _step_microbatches
[rank2]:     works = _sorted_batch_p2p(ops, desc="fwd_recv")
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 549, in _sorted_batch_p2p
[rank2]:     work_by_peer[peer] = _batch_p2p(ops, desc=desc)
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 524, in _batch_p2p
[rank2]:     return dist.batch_isend_irecv(p2p_ops).pop()
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2370, in batch_isend_irecv
[rank2]:     with _coalescing_manager(group, device, async_ops=True) as cm:
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/contextlib.py", line 142, in __exit__
[rank2]:     next(self.gen)
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2317, in _coalescing_manager
[rank2]:     work = group._end_coalescing(device)
[rank2]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:4409, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
[rank2]: ncclInternalError: Internal check failed.
[rank2]: Last error:

[rank3]: Traceback (most recent call last):
[rank3]:   File "/zt/code/my_dev/pippy_llama.py", line 59, in <module>
[rank3]:     output = schedule.step(args)
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 615, in step
[rank3]:     self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 702, in _step_microbatches
[rank3]:     works = _sorted_batch_p2p(ops, desc="fwd_recv")
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 549, in _sorted_batch_p2p
[rank3]:     work_by_peer[peer] = _batch_p2p(ops, desc=desc)
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 524, in _batch_p2p
[rank3]:     return dist.batch_isend_irecv(p2p_ops).pop()
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2370, in batch_isend_irecv
[rank3]:     with _coalescing_manager(group, device, async_ops=True) as cm:
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/contextlib.py", line 142, in __exit__
[rank3]:     next(self.gen)
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2317, in _coalescing_manager
[rank3]:     work = group._end_coalescing(device)
[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:4409, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
[rank3]: ncclInternalError: Internal check failed.
[rank3]: Last error:

678c7278cb2d:26024:27098 [1] NCCL INFO [Service thread] Connection closed by localRank 1
678c7278cb2d:26025:27097 [2] NCCL INFO [Service thread] Connection closed by localRank 2
678c7278cb2d:26026:27096 [3] NCCL INFO [Service thread] Connection closed by localRank 3
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [32,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
[... the identical assertion repeats for every thread [0,0,0] through [127,0,0] of block [0,0,0]; 127 duplicate lines trimmed ...]
[rank1]: Traceback (most recent call last):
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/stage.py", line 580, in forward_one_chunk
[rank1]:     output = self.forward_maybe_with_nosync(*composite_args, **composite_kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/stage.py", line 468, in forward_maybe_with_nosync
[rank1]:     out_val = self.submod(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/graph_module.py", line 784, in call_wrapped
[rank1]:     return self._wrapped_call(self, *args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/graph_module.py", line 361, in __call__
[rank1]:     raise e
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/graph_module.py", line 348, in __call__
[rank1]:     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "<eval_with_key>.263", line 5, in forward
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/export/unflatten.py", line 136, in forward
[rank1]:     return torch.fx.Interpreter(self, graph=self.graph).run(
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 146, in run
[rank1]:     self.env[node] = self.run_node(node)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 203, in run_node
[rank1]:     return getattr(self, n.op)(n.target, args, kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 320, in call_module
[rank1]:     return submod(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/export/unflatten.py", line 136, in forward
[rank1]:     return torch.fx.Interpreter(self, graph=self.graph).run(
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 146, in run
[rank1]:     self.env[node] = self.run_node(node)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 203, in run_node
[rank1]:     return getattr(self, n.op)(n.target, args, kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 320, in call_module
[rank1]:     return submod(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/export/unflatten.py", line 136, in forward
[rank1]:     return torch.fx.Interpreter(self, graph=self.graph).run(
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 146, in run
[rank1]:     self.env[node] = self.run_node(node)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 203, in run_node
[rank1]:     return getattr(self, n.op)(n.target, args, kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 275, in call_function
[rank1]:     return target(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/_ops.py", line 716, in __call__
[rank1]:     return self._op(*args, **kwargs)
[rank1]: RuntimeError: CUDA error: device-side assert triggered
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


[rank1]: While executing %neg_default : [num_users=1] = call_function[target=torch.ops.aten.neg.default](args = (%slice_tensor_3,), kwargs = {})
[rank1]: Original traceback:
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1181, in forward
[rank1]:     outputs = self.model(
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1068, in forward
[rank1]:     layer_outputs = decoder_layer(
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/_IR.py", line 1160, in _split_before_forward
[rank1]:     return self._orig_forward(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 796, in forward
[rank1]:     hidden_states, self_attn_weights, present_key_value = self.self_attn(
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 704, in forward
[rank1]:     query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 234, in apply_rotary_pos_emb
[rank1]:     q_embed = (q * cos) + (rotate_half(q) * sin)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 208, in rotate_half
[rank1]:     return torch.cat((-x2, x1), dim=-1)


[rank1]: While executing %self_attn : [num_users=3] = call_module[target=self_attn](args = (%unsqueeze, %input_layernorm), kwargs = {})
[rank1]: Original traceback:
[rank1]: None

[rank1]: While executing %layers_8 : [num_users=3] = call_module[target=layers.8](args = (%unsqueeze, %add_47), kwargs = {})
[rank1]: Original traceback:
[rank1]: None

[rank1]: The above exception was the direct cause of the following exception:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/zt/code/my_dev/pippy_llama.py", line 59, in <module>
[rank1]:     output = schedule.step(args)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 615, in step
[rank1]:     self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 706, in _step_microbatches
[rank1]:     output = self._stage.forward_one_chunk(i, arg_mbs[i], kwarg_mbs[i])  # type: ignore[index]
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/stage.py", line 588, in forward_one_chunk
[rank1]:     raise RuntimeError(exc_msg) from e
[rank1]: RuntimeError: 
[rank1]:             [Stage 1] failed to run forward:
[rank1]:             args: ('Tensor(torch.Size([2, 4, 4096]), grad=False, dtype=torch.float32)', 'Tensor(torch.Size([1, 4]), grad=False, dtype=torch.int64)')
[rank1]:             kwargs: {}
[rank1]:             
678c7278cb2d:26026:26045 [0] NCCL INFO comm 0xc937cb0 rank 3 nranks 4 cudaDev 3 busId 61000 - Abort COMPLETE
[rank3]:[E1106 04:43:01.274671839 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 3] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1106 04:43:01.274684379 ProcessGroupNCCL.cpp:630] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E1106 04:43:01.274695519 ProcessGroupNCCL.cpp:636] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E1106 04:43:01.275795006 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=1800000) ran for 1800057 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x771134f6c446 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7710ea82a762 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7710ea831ba3 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7710ea83360d in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7711350d35c0 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x771135990ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x771135a21bf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)


678c7278cb2d:26024:27098 [1] include/alloc.h:125 NCCL WARN Cuda failure 710 'device-side assert triggered'
678c7278cb2d:26024:27098 [1] NCCL INFO include/alloc.h:246 -> 1

(the preceding two NCCL WARN/INFO lines repeated 16 more times)
678c7278cb2d:26024:27098 [1] NCCL INFO transport/net.cc:541 -> 1
678c7278cb2d:26024:27098 [1] NCCL INFO transport/net.cc:944 -> 1
678c7278cb2d:26024:27098 [1] NCCL INFO proxy.cc:984 -> 1
678c7278cb2d:26024:27098 [1] NCCL INFO proxy.cc:1000 -> 1

678c7278cb2d:26024:26048 [1] include/alloc.h:125 NCCL WARN Cuda failure 710 'device-side assert triggered'
678c7278cb2d:26024:26048 [1] NCCL INFO include/alloc.h:246 -> 1
678c7278cb2d:26024:26048 [1] NCCL INFO transport/p2p.cc:541 -> 1
678c7278cb2d:26024:26048 [1] NCCL INFO channel.cc:158 -> 1
678c7278cb2d:26024:26048 [1] NCCL INFO init.cc:210 -> 1
678c7278cb2d:26024:26048 [1] NCCL INFO init.cc:1986 -> 1

678c7278cb2d:26024:26048 [1] init.cc:2118 NCCL WARN commReclaim: cleanup comm 0xad38910 rank 1 failed in destroy/abort, error 1
678c7278cb2d:26024:26048 [1] NCCL INFO comm 0xad38910 rank 1 nranks 4 cudaDev 1 busId 25000 - Abort COMPLETE
[rank1]:[E1106 04:43:01.594405207 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1106 04:43:01.594419138 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1106 04:43:01.594426118 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E1106 04:43:01.595582125 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=1800000) ran for 1800039 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e87f228b446 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7e87a7a2a762 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e87a7a31ba3 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e87a7a3360d in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7e87f23f25c0 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7e87f2cafac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7e87f2d40bf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

W1106 04:43:02.065000 26021 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 26024 closing signal SIGTERM
W1106 04:43:02.068000 26021 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 26025 closing signal SIGTERM
E1106 04:43:02.651000 26021 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 3 (pid: 26026) of binary: /root/miniconda3/envs/pippy/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/pippy/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.5.0', 'console_scripts', 'torchrun')())
  File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
pippy_llama.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-06_04:43:02
  host      : 678c7278cb2d
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 26026)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 26026
======================================================

The message suggests compiling with `TORCH_USE_CUDA_DSA` to enable device-side assertions. I suspect the root cause is that I run this code inside a Docker container, so the NCCL dependencies may be missing or broken.
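One note: `TORCH_USE_CUDA_DSA` is a compile-time flag, so it only takes effect if you rebuild PyTorch from source; it does nothing as a runtime environment variable. What you can do without rebuilding is rerun with synchronous kernel launches and verbose NCCL logging so the device-side assert surfaces at its real call site. A sketch, assuming the same 4-GPU `torchrun` launch of `pippy_llama.py` used above:

```shell
# CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the Python
# stack trace points at the op that actually triggered the assert.
# NCCL_DEBUG=INFO prints NCCL init/teardown details.
CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO TORCH_SHOW_CPP_STACKTRACES=1 \
  torchrun --nproc-per-node 4 pippy_llama.py
```

With blocking launches the `neg.default` failure above should be attributed to the exact kernel that asserted, which helps distinguish an indexing bug (e.g. bad `position_ids` reaching `apply_rotary_pos_emb`) from a genuine NCCL problem.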

@kiskra-nvidia
Member

It's a little hard to tell from that log whether there's an NCCL issue here or not. Is it the case that this application never actually fully initializes and simply times out after 1800 s?

Would you happen to know when the following two lines are printed:

678c7278cb2d:26023:27095 [0] NCCL INFO [Service thread] Connection closed by localRank 0
678c7278cb2d:26023:27115 [0] NCCL INFO comm 0x1aacdee0 rank 0 nranks 4 cudaDev 0 busId 1000 - Abort COMPLETE

Specifically, I'm wondering if they are printed at startup or around the time when the 1800s timeout expires?

Does NCCL currently run on this system at all? Have you tried compiling and running our nccl-tests?
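For reference, nccl-tests can be built and run roughly as follows (assuming CUDA is installed under `/usr/local/cuda` and `-g 4` matches the 4 GPUs visible in the log; adjust paths and GPU count for your container):

```shell
# Build the standard NCCL performance/correctness tests
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make

# All-reduce across 4 local GPUs, 8 B to 128 MB message sizes;
# NCCL_DEBUG=INFO prints communicator setup details
NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
```

If this fails to complete, the problem is in the NCCL/CUDA setup itself rather than in the PyTorch pipeline-parallel code.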

@Noblezhong
Author

  1. The lines are printed around the time the 1800 s timeout expires.
  2. I ran the nccl-tests code in my development environment and it also fails, so I think the NCCL components in my env are broken or were never installed. Reinstalling NCCL should fix this.
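Before reinstalling, it may be worth checking what PyTorch itself sees, since the pip wheels bundle their own NCCL and do not require a system-wide install. A quick sanity check (these are standard `torch` APIs):

```shell
# Print the NCCL version bundled with this PyTorch build, and whether the
# NCCL distributed backend is available at all
python -c "import torch; print(torch.cuda.nccl.version()); print(torch.distributed.is_nccl_available())"
```

If this reports a version and `True` but nccl-tests still fails, the issue is more likely the container's GPU/driver setup (e.g. missing `--gpus all` or shared-memory limits) than a missing NCCL package.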
