ncclInternalError: Internal check failed #1499
Comments
We'd need to see the output of …
The WARN is likely … This happens when each NCCL rank within a node does not see the same intra-node topology, or when different ranks are run with different parameters. Seeing a different node topology can sometimes happen inside VMs. Having the log with …
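The truncated comments above are presumably asking for NCCL's debug log. For reference, NCCL's documented debug output can be enabled through environment variables before the process group is created; a minimal way to do that from inside the training script might look like the sketch below (the log file path is only an example).

```python
import os
import torch.distributed as dist

# Verbose NCCL logs for communicator init and topology detection; these must be
# set before the first NCCL communicator is created.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,GRAPH")
# Optional: one log file per process; %h and %p are expanded by NCCL to hostname and pid.
os.environ.setdefault("NCCL_DEBUG_FILE", "/tmp/nccl.%h.%p.log")

dist.init_process_group(backend="nccl")
```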
@AddyLaddy @sjeaugey Thank you very much for your answers.
🐛 Describe the bug
I hit an error when using torchrun for 4-GPU training with the 'nccl' backend (it runs perfectly when I use 'gloo'). The environment is Python 3.9 + PyTorch 2.3.0 + CUDA 12.1. We used uftrace to trace the DLRM code on 4 GPUs launched by torchrun; the command is as follows:
torchrun --nproc_per_node=4 ./multi-uftrace.py
The multi-uftrace.py file content is as follows:
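(The full content of multi-uftrace.py does not appear in this copy of the issue. As a rough stand-in, a minimal torchrun-launched script that exercises the NCCL backend might look like the sketch below; the Linear model and tensor sizes are placeholders, not the actual DLRM code.)

```python
# Minimal stand-in for a torchrun-launched NCCL training script; the model and
# sizes are placeholders, not the original DLRM code from this issue.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every worker process.
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)
    dist.init_process_group(backend="nccl")

    model = DDP(torch.nn.Linear(128, 128).to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        x = torch.randn(32, 128, device=device)
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across ranks via NCCL here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with `torchrun --nproc_per_node=4 ./multi-uftrace.py`, the NCCL communicator is created when DDP wraps the model, and every backward pass then performs an NCCL all-reduce.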
The resulting error is the ncclInternalError: Internal check failed shown in the issue title.
To capture the underlying PyTorch function calls, we compiled PyTorch as the pg version. With that build, the above error occurs on 4 GPUs but not on 2 GPUs. When we instead compile the develop version, it runs correctly. Is there any way to prevent such errors when running the pg version on 4 GPUs?
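Since the same code reportedly runs fine with the 'gloo' backend, one temporary way to keep the pg build usable while the NCCL failure is investigated is to make the backend selectable at launch time. This is only a debugging workaround sketch; DIST_BACKEND is an invented variable name for this example, not a PyTorch setting.

```python
import os
import torch.distributed as dist

# DIST_BACKEND is a local convention for this sketch, not a PyTorch/NCCL setting.
backend = os.environ.get("DIST_BACKEND", "nccl")
dist.init_process_group(backend=backend)
# e.g.  DIST_BACKEND=gloo torchrun --nproc_per_node=4 ./multi-uftrace.py
```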
Versions
GPU: 4 × A100 80GB
Driver version: 530.30.02
CUDA version: 12.1
OS: Ubuntu 22.04
Python: 3.9
PyTorch: v2.3.0
NCCL: v2.20.5