ncclInternalError: Internal check failed #1499

Open
whiteyn opened this issue Oct 30, 2024 · 3 comments

@whiteyn

whiteyn commented Oct 30, 2024

🐛 Describe the bug

I get an error when I use torchrun for a 4-GPU run with the 'nccl' backend (it runs fine with 'gloo'). The environment is Python 3.9 + PyTorch 2.3.0 + CUDA 12.1. We tried to use uftrace to trace the DLRM code on 4 GPUs launched by torchrun. The command is as follows:

torchrun --nproc_per_node=4 ./multi-uftrace.py
The multi-uftrace.py file content is as follows:

import subprocess

# Run the DLRM script under uftrace; each torchrun worker executes this file.
cmd = [
    '/mnt/yuanningbai/local/uftrace/bin/uftrace', '-e', 'record',
    '/mnt/yuanningbai/dlrm/dlrm_s_pytorch.py',
    '--mini-batch-size=4', '--test-mini-batch-size=16384', '--test-num-workers=0',
    '--num-batches=1', '--data-generation=random', '--arch-mlp-bot=512-512-64',
    '--arch-sparse-feature-size=64',
    '--arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000',
    '--num-indices-per-lookup=100', '--arch-interaction-op=dot', '--print-freq=1',
    '--print-time', '--use-gpu', '--inference-only', '--dist-backend=nccl',
]
try:
    # check=True raises CalledProcessError on a non-zero exit status.
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
except subprocess.CalledProcessError as e:
    print("error code :", e.returncode)
    print("error info :", e.output)  # captured stdout only; stderr is in e.stderr

The error message is as follows:

W1029 16:52:13.175227 140626680026112 torch/distributed/run.py:757]
W1029 16:52:13.175227 140626680026112 torch/distributed/run.py:757] *****************************************
W1029 16:52:13.175227 140626680026112 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1029 16:52:13.175227 140626680026112 torch/distributed/run.py:757] *****************************************
error code : 1
error info : world size: 4, current rank: 1, local rank: 1
error code : 1
error info : world size: 4, current rank: 3, local rank: 3
error code : 1
error info : Running on 4 ranks using nccl backend
fail to enable all_to_all_single primitive: NCCL error in: /mnt/yuanningbai/pytorch-2.3.0/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, internal error - please report this issue to the NCCL developers, NCCL version 2.20.5
ncclInternalError: Internal check failed.
Last error:
Error : ring 8 does not loop back to start (1 != 0)
world size: 4, current rank: 0, local rank: 0
Using 1 GPU(s)...
-*-*-*-*-*-*nn.EmbeddingBag-*-*-*-*-*-*
-*-*-*-*-*-*nn.EmbeddingBag-*-*-*-*-*-*
error code : 1
error info : world size: 4, current rank: 2, local rank: 2

To capture the underlying functions of PyTorch, we compiled PyTorch as the pg version. The error above occurs with 4 GPUs but not with 2 GPUs. When we instead compile it as the develop version, it runs correctly. Is there any way to avoid this error with 4 GPUs on the pg version?

Versions

GPU: 4 x A100 80G
Driver version: 530.30.02
CUDA version: 12.1
OS: Ubuntu 22.04
Python: 3.9
PyTorch: v2.3.0
NCCL: v2.20.5

@AddyLaddy
Collaborator

We'd need to see the output of a run with NCCL_DEBUG=INFO set (export NCCL_DEBUG=INFO) to be able to analyze that failure.
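
Since the reproduction launches the workload through a Python wrapper, one possible way to make sure the variable reaches the child process is to pass it explicitly. This is only a sketch: the helper names `nccl_debug_env` and `run_with_nccl_debug` are hypothetical, and the command in the usage part is a harmless placeholder rather than the real uftrace invocation. Exporting the variable in the shell before calling torchrun should have the same effect, because child processes inherit the environment.

```python
import os
import subprocess
import sys


def nccl_debug_env(base=None, level="INFO"):
    """Return a copy of the environment with NCCL_DEBUG set to `level`."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = level
    return env


def run_with_nccl_debug(cmd):
    """Run `cmd` with NCCL_DEBUG=INFO; return (returncode, stdout, stderr)."""
    proc = subprocess.run(cmd, env=nccl_debug_env(), capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    # Placeholder command (an assumption for illustration): it only echoes
    # the variable back so the sketch can be run without GPUs.
    code, out, err = run_with_nccl_debug(
        [sys.executable, "-c", "import os; print(os.environ['NCCL_DEBUG'])"]
    )
    print(code, out.strip())
```

Unlike the original wrapper, this version does not use check=True, so the captured output is available whether or not the command fails.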

@sjeaugey
Member

The WARN is likely Error : ring 8 does not loop back to start (1 != 0).

This happens when each NCCL rank within a node does not see the same intra-node topology, or if different ranks are run with different parameters. Seeing a different node topology can happen inside VMs sometimes.
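
To illustrate why disagreeing ranks produce this message, here is a simplified sketch of a ring consistency check. It is not NCCL's actual code: `merge_views` and `check_ring` are hypothetical names, and the model assumes each rank contributes only its own successor in the ring, taken from the ordering that rank computed locally.

```python
def merge_views(orders):
    """Build next_rank[r] from each rank's own view of the ring order.

    orders[r] is the ring ordering as computed by rank r. Each rank only
    contributes its own successor, so differing views can yield a broken ring.
    """
    next_rank = []
    for rank, order in enumerate(orders):
        i = order.index(rank)
        next_rank.append(order[(i + 1) % len(order)])
    return next_rank


def check_ring(next_rank, start):
    """Walk the ring from `start`; raise ValueError if it is inconsistent."""
    nranks = len(next_rank)
    seen = {start}
    current = start
    for _ in range(nranks - 1):
        current = next_rank[current]
        seen.add(current)
    # One more hop must bring us back to where we started.
    current = next_rank[current]
    if current != start:
        raise ValueError(f"ring does not loop back to start ({current} != {start})")
    missing = sorted(set(range(nranks)) - seen)
    if missing:
        raise ValueError(f"ring does not contain rank {missing[0]}")


if __name__ == "__main__":
    # All ranks agree on the order 0 1 2 3: the ring closes.
    check_ring(merge_views([[0, 1, 2, 3]] * 4), start=0)

    # Ranks 2 and 3 computed a different order than ranks 0 and 1.
    mixed = merge_views([[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 3, 2], [0, 1, 3, 2]])
    try:
        check_ring(mixed, start=0)
    except ValueError as exc:
        print(exc)
```

In the mixed case each rank's successor is valid in that rank's own view, yet the merged successors no longer form a single cycle through all four ranks, which is the kind of inconsistency the warning reports.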

Having the log with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=GRAPH,ENV,INIT would indeed help a lot.
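
Once such a log is available, a quick way to see whether the ranks computed the same graph is to compare the "Pattern ... nChannels" lines per process. The following is a rough sketch under assumptions: the function names are hypothetical, the regular expression assumes the `host:pid:tid [dev] NCCL INFO ...` line prefix, and only the first value seen per process and pattern is kept. A mismatch between processes would suggest that the ranks did not arrive at the same search result.

```python
import re
from collections import defaultdict

# Matches e.g. "host:123:456 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 14, ..."
PATTERN_RE = re.compile(
    r"^[^\s:]+:(?P<pid>\d+):\d+ \[\d+\] NCCL INFO "
    r"Pattern (?P<pattern>\d+),.*?nChannels (?P<nch>\d+)"
)


def channels_by_process(log_text):
    """Return {pid: {pattern: nChannels}} using the first value seen per pid."""
    result = defaultdict(dict)
    for line in log_text.splitlines():
        m = PATTERN_RE.match(line.strip())
        if m:
            result[int(m["pid"])].setdefault(int(m["pattern"]), int(m["nch"]))
    return {pid: dict(per_pid) for pid, per_pid in result.items()}


def disagreements(summary):
    """Return {pattern: sorted nChannels values} where processes differ."""
    by_pattern = defaultdict(set)
    for per_pid in summary.values():
        for pattern, nch in per_pid.items():
            by_pattern[pattern].add(nch)
    return {p: sorted(v) for p, v in by_pattern.items() if len(v) > 1}


if __name__ == "__main__":
    sample = (
        "testpc115159:2736633:2737365 [1] NCCL INFO Pattern 4, crossNic 0, "
        "nChannels 14, bw 20.000000/20.000000, type NVL/PIX, sameChannels 0\n"
        "testpc115159:2736635:2737364 [2] NCCL INFO Pattern 4, crossNic 0, "
        "nChannels 16, bw 20.000000/20.000000, type NVL/PIX, sameChannels 0\n"
    )
    summary = channels_by_process(sample)
    print(summary)
    print(disagreements(summary))
```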

@whiteyn
Author

whiteyn commented Oct 30, 2024

@AddyLaddy @sjeaugey Thank you very much for your answers.
Here is the output with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=GRAPH,ENV,INIT:


W1030 14:04:53.263317 139798532572160 torch/distributed/run.py:757]
W1030 14:04:53.263317 139798532572160 torch/distributed/run.py:757] *****************************************
W1030 14:04:53.263317 139798532572160 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1030 14:04:53.263317 139798532572160 torch/distributed/run.py:757] *****************************************
error code : 1
error info : testpc115159:2736633:2736633 [1] NCCL INFO cudaDriverVersion 12040
testpc115159:2736633:2736633 [1] NCCL INFO Bootstrap : Using eno8303:109.105.115.159<0>
testpc115159:2736633:2736633 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
testpc115159:2736633:2737365 [1] NCCL INFO NET/IB : No device found.
testpc115159:2736633:2737365 [1] NCCL INFO NET/Socket : Using [0]eno8303:109.105.115.159<0> [1]vethf4fabc7:fe80::54a7:2aff:fe92:27be%vethf4fabc7<0>
testpc115159:2736633:2737365 [1] NCCL INFO Using non-device net plugin version 0
testpc115159:2736633:2737365 [1] NCCL INFO Using network Socket
testpc115159:2736633:2737365 [1] NCCL INFO comm 0x556acd8b8800 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 41000 commId 0xd57beaec7aafe641 - Init START
testpc115159:2736633:2737365 [1] NCCL INFO === System : maxBw 80.0 totalBw 240.0 ===
testpc115159:2736633:2737365 [1] NCCL INFO CPU/1 (1/2/-1)
testpc115159:2736633:2737365 [1] NCCL INFO + PCI[5000.0] - NIC/0
testpc115159:2736633:2737365 [1] NCCL INFO + PCI[24.0] - GPU/1000 (0)
testpc115159:2736633:2737365 [1] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736633:2737365 [1] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736633:2737365 [1] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736633:2737365 [1] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736633:2737365 [1] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736633:2737365 [1] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736633:2737365 [1] NCCL INFO CPU/0 (1/2/-1)
testpc115159:2736633:2737365 [1] NCCL INFO + PCI[24.0] - GPU/41000 (1)
testpc115159:2736633:2737365 [1] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736633:2737365 [1] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736633:2737365 [1] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736633:2737365 [1] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736633:2737365 [1] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736633:2737365 [1] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736633:2737365 [1] NCCL INFO CPU/3 (1/2/-1)
testpc115159:2736633:2737365 [1] NCCL INFO + PCI[24.0] - GPU/81000 (2)
testpc115159:2736633:2737365 [1] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736633:2737365 [1] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736633:2737365 [1] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736633:2737365 [1] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736633:2737365 [1] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736633:2737365 [1] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736633:2737365 [1] NCCL INFO CPU/2 (1/2/-1)
testpc115159:2736633:2737365 [1] NCCL INFO + PCI[24.0] - GPU/C1000 (3)
testpc115159:2736633:2737365 [1] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736633:2737365 [1] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736633:2737365 [1] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736633:2737365 [1] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736633:2737365 [1] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736633:2737365 [1] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736633:2737365 [1] NCCL INFO + PCI[0.4] - NIC/E1000
testpc115159:2736633:2737365 [1] NCCL INFO ==========================================
testpc115159:2736633:2737365 [1] NCCL INFO GPU/1000 :GPU/1000 (0/5000.000000/LOC) GPU/41000 (1/80.000000/NVL) GPU/81000 (1/80.000000/NVL) GPU/C1000 (1/80.000000/NVL) CPU/1 (1/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736633:2737365 [1] NCCL INFO GPU/41000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (0/5000.000000/LOC) GPU/81000 (1/80.000000/NVL) GPU/C1000 (1/80.000000/NVL) CPU/1 (2/24.000000/PHB) CPU/0 (1/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736633:2737365 [1] NCCL INFO GPU/81000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (1/80.000000/NVL) GPU/81000 (0/5000.000000/LOC) GPU/C1000 (1/80.000000/NVL) CPU/1 (2/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (1/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736633:2737365 [1] NCCL INFO GPU/C1000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (1/80.000000/NVL) GPU/81000 (1/80.000000/NVL) GPU/C1000 (0/5000.000000/LOC) CPU/1 (2/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (1/24.000000/PHB)
testpc115159:2736633:2737365 [1] NCCL INFO Setting affinity for GPU 1 to ffffff,00000000,00000000,00ffffff
testpc115159:2736633:2737365 [1] NCCL INFO NVLS multicast support is not available on dev 1
testpc115159:2736633:2737365 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 14, bw 20.000000/20.000000, type NVL/PIX, sameChannels 0
testpc115159:2736633:2737365 [1] NCCL INFO  0 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736633:2737365 [1] NCCL INFO  1 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736633:2737365 [1] NCCL INFO  2 : GPU/0 GPU/3 GPU/1 GPU/2
testpc115159:2736633:2737365 [1] NCCL INFO  3 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736633:2737365 [1] NCCL INFO  4 : GPU/0 GPU/3 GPU/2 GPU/1
testpc115159:2736633:2737365 [1] NCCL INFO  5 : GPU/0 GPU/2 GPU/1 GPU/3
testpc115159:2736633:2737365 [1] NCCL INFO  6 : GPU/0 GPU/2 GPU/3 GPU/1
testpc115159:2736633:2737365 [1] NCCL INFO  7 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736633:2737365 [1] NCCL INFO  8 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736633:2737365 [1] NCCL INFO  9 : GPU/0 GPU/3 GPU/1 GPU/2
testpc115159:2736633:2737365 [1] NCCL INFO 10 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736633:2737365 [1] NCCL INFO 11 : GPU/0 GPU/3 GPU/2 GPU/1
testpc115159:2736633:2737365 [1] NCCL INFO 12 : GPU/0 GPU/2 GPU/1 GPU/3
testpc115159:2736633:2737365 [1] NCCL INFO 13 : GPU/0 GPU/2 GPU/3 GPU/1
testpc115159:2736633:2737365 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 14, bw 40.000000/20.000000, type NVL/PIX, sameChannels 0
testpc115159:2736633:2737365 [1] NCCL INFO  0 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736633:2737365 [1] NCCL INFO  1 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736633:2737365 [1] NCCL INFO  2 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736633:2737365 [1] NCCL INFO  3 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736633:2737365 [1] NCCL INFO  4 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736633:2737365 [1] NCCL INFO  5 : GPU/0 GPU/3 GPU/2 GPU/1
testpc115159:2736633:2737365 [1] NCCL INFO  6 : GPU/0 GPU/3 GPU/2 GPU/1
testpc115159:2736633:2737365 [1] NCCL INFO  7 : GPU/2 GPU/0 GPU/3 GPU/1
testpc115159:2736633:2737365 [1] NCCL INFO  8 : GPU/3 GPU/1 GPU/2 GPU/0
testpc115159:2736633:2737365 [1] NCCL INFO  9 : GPU/1 GPU/0 GPU/2 GPU/3
testpc115159:2736633:2737365 [1] NCCL INFO 10 : GPU/3 GPU/2 GPU/0 GPU/1
testpc115159:2736633:2737365 [1] NCCL INFO 11 : GPU/2 GPU/0 GPU/3 GPU/1
testpc115159:2736633:2737365 [1] NCCL INFO 12 : GPU/3 GPU/0 GPU/2 GPU/1
testpc115159:2736633:2737365 [1] NCCL INFO 13 : GPU/3 GPU/0 GPU/1 GPU/2
testpc115159:2736633:2737365 [1] NCCL INFO comm 0x556acd8b8800 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
testpc115159:2736633:2737365 [1] NCCL INFO Tree 0 : 0 -> 1 -> 2/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 12 : 0 -> 1 -> 2/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 1 : 0 -> 1 -> 2/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 13 : 0 -> 1 -> 2/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 2 : 0 -> 1 -> 3/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 14 : 0 -> 1 -> 3/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 3 : 0 -> 1 -> 3/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 15 : 0 -> 1 -> 3/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 4 : 0 -> 1 -> 3/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 16 : 0 -> 1 -> 3/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 5 : 2 -> 1 -> -1/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 17 : 2 -> 1 -> -1/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 6 : 2 -> 1 -> -1/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 18 : 2 -> 1 -> -1/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 7 : 3 -> 1 -> -1/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 19 : 3 -> 1 -> -1/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 8 : 3 -> 1 -> 2/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 20 : 3 -> 1 -> 2/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 10 : 0 -> 1 -> -1/-1/-1
testpc115159:2736633:2737365 [1] NCCL INFO Tree 22 : 0 -> 1 -> -1/-1/-1

testpc115159:2736633:2737365 [1] graph/rings.cc:38 NCCL WARN Error : ring 2 does not loop back to start (0 != 1)
testpc115159:2736633:2737365 [1] NCCL INFO graph/connect.cc:479 -> 3
testpc115159:2736633:2737365 [1] NCCL INFO init.cc:1169 -> 3
testpc115159:2736633:2737365 [1] NCCL INFO init.cc:1501 -> 3
testpc115159:2736633:2737365 [1] NCCL INFO group.cc:64 -> 3 [Async thread]
testpc115159:2736633:2736633 [1] NCCL INFO group.cc:418 -> 3
testpc115159:2736633:2736633 [1] NCCL INFO group.cc:95 -> 3
testpc115159:2736633:2739611 [1] NCCL INFO Using non-device net plugin version 0
testpc115159:2736633:2739611 [1] NCCL INFO Using network Socket
testpc115159:2736633:2739611 [1] NCCL INFO comm 0x556ad86da000 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 41000 commId 0xa319625d1be869a3 - Init START
testpc115159:2736633:2739611 [1] NCCL INFO === System : maxBw 80.0 totalBw 240.0 ===
testpc115159:2736633:2739611 [1] NCCL INFO CPU/1 (1/2/-1)
testpc115159:2736633:2739611 [1] NCCL INFO + PCI[5000.0] - NIC/0
testpc115159:2736633:2739611 [1] NCCL INFO + PCI[24.0] - GPU/1000 (0)
testpc115159:2736633:2739611 [1] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736633:2739611 [1] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736633:2739611 [1] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736633:2739611 [1] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736633:2739611 [1] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736633:2739611 [1] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736633:2739611 [1] NCCL INFO CPU/0 (1/2/-1)
testpc115159:2736633:2739611 [1] NCCL INFO + PCI[24.0] - GPU/41000 (1)
testpc115159:2736633:2739611 [1] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736633:2739611 [1] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736633:2739611 [1] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736633:2739611 [1] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736633:2739611 [1] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736633:2739611 [1] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736633:2739611 [1] NCCL INFO CPU/3 (1/2/-1)
testpc115159:2736633:2739611 [1] NCCL INFO + PCI[24.0] - GPU/81000 (2)
testpc115159:2736633:2739611 [1] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736633:2739611 [1] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736633:2739611 [1] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736633:2739611 [1] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736633:2739611 [1] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736633:2739611 [1] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736633:2739611 [1] NCCL INFO CPU/2 (1/2/-1)
testpc115159:2736633:2739611 [1] NCCL INFO + PCI[24.0] - GPU/C1000 (3)
testpc115159:2736633:2739611 [1] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736633:2739611 [1] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736633:2739611 [1] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736633:2739611 [1] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736633:2739611 [1] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736633:2739611 [1] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736633:2739611 [1] NCCL INFO + PCI[0.4] - NIC/E1000
testpc115159:2736633:2739611 [1] NCCL INFO ==========================================
testpc115159:2736633:2739611 [1] NCCL INFO GPU/1000 :GPU/1000 (0/5000.000000/LOC) GPU/41000 (1/80.000000/NVL) GPU/81000 (1/80.000000/NVL) GPU/C1000 (1/80.000000/NVL) CPU/1 (1/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736633:2739611 [1] NCCL INFO GPU/41000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (0/5000.000000/LOC) GPU/81000 (1/80.000000/NVL) GPU/C1000 (1/80.000000/NVL) CPU/1 (2/24.000000/PHB) CPU/0 (1/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736633:2739611 [1] NCCL INFO GPU/81000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (1/80.000000/NVL) GPU/81000 (0/5000.000000/LOC) GPU/C1000 (1/80.000000/NVL) CPU/1 (2/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (1/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736633:2739611 [1] NCCL INFO GPU/C1000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (1/80.000000/NVL) GPU/81000 (1/80.000000/NVL) GPU/C1000 (0/5000.000000/LOC) CPU/1 (2/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (1/24.000000/PHB)
testpc115159:2736633:2739611 [1] NCCL INFO Setting affinity for GPU 1 to ffffff,00000000,00000000,00ffffff
testpc115159:2736633:2739611 [1] NCCL INFO NVLS multicast support is not available on dev 1
testpc115159:2736633:2739611 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 14, bw 20.000000/20.000000, type NVL/PIX, sameChannels 0
testpc115159:2736633:2739611 [1] NCCL INFO  0 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736633:2739611 [1] NCCL INFO  1 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736633:2739611 [1] NCCL INFO  2 : GPU/0 GPU/2 GPU/1 GPU/3
testpc115159:2736633:2739611 [1] NCCL INFO  3 : GPU/1 GPU/0 GPU/2 GPU/3
testpc115159:2736633:2739611 [1] NCCL INFO  4 : GPU/2 GPU/0 GPU/3 GPU/1
testpc115159:2736633:2739611 [1] NCCL INFO  5 : GPU/1 GPU/0 GPU/3 GPU/2
testpc115159:2736633:2739611 [1] NCCL INFO  6 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736633:2739611 [1] NCCL INFO  7 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736633:2739611 [1] NCCL INFO  8 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736633:2739611 [1] NCCL INFO  9 : GPU/0 GPU/2 GPU/1 GPU/3
testpc115159:2736633:2739611 [1] NCCL INFO 10 : GPU/1 GPU/0 GPU/2 GPU/3
testpc115159:2736633:2739611 [1] NCCL INFO 11 : GPU/2 GPU/0 GPU/3 GPU/1
testpc115159:2736633:2739611 [1] NCCL INFO 12 : GPU/1 GPU/0 GPU/3 GPU/2
testpc115159:2736633:2739611 [1] NCCL INFO 13 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736633:2739611 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 14, bw 40.000000/20.000000, type NVL/PIX, sameChannels 0
testpc115159:2736633:2739611 [1] NCCL INFO  0 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736633:2739611 [1] NCCL INFO  1 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736633:2739611 [1] NCCL INFO  2 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736633:2739611 [1] NCCL INFO  3 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736633:2739611 [1] NCCL INFO  4 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736633:2739611 [1] NCCL INFO  5 : GPU/0 GPU/3 GPU/2 GPU/1
testpc115159:2736633:2739611 [1] NCCL INFO  6 : GPU/0 GPU/3 GPU/2 GPU/1
testpc115159:2736633:2739611 [1] NCCL INFO  7 : GPU/1 GPU/3 GPU/2 GPU/0
testpc115159:2736633:2739611 [1] NCCL INFO  8 : GPU/3 GPU/1 GPU/0 GPU/2
testpc115159:2736633:2739611 [1] NCCL INFO  9 : GPU/3 GPU/1 GPU/0 GPU/2
testpc115159:2736633:2739611 [1] NCCL INFO 10 : GPU/2 GPU/3 GPU/0 GPU/1
testpc115159:2736633:2739611 [1] NCCL INFO 11 : GPU/2 GPU/0 GPU/3 GPU/1
testpc115159:2736633:2739611 [1] NCCL INFO 12 : GPU/2 GPU/1 GPU/3 GPU/0
testpc115159:2736633:2739611 [1] NCCL INFO 13 : GPU/3 GPU/0 GPU/1 GPU/2
testpc115159:2736633:2739611 [1] NCCL INFO comm 0x556ad86da000 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
testpc115159:2736633:2739611 [1] NCCL INFO Tree 0 : 0 -> 1 -> 2/-1/-1
testpc115159:2736633:2739611 [1] NCCL INFO Tree 12 : 0 -> 1 -> 2/-1/-1
testpc115159:2736633:2739611 [1] NCCL INFO Tree 1 : 0 -> 1 -> 2/-1/-1
testpc115159:2736633:2739611 [1] NCCL INFO Tree 13 : 0 -> 1 -> 2/-1/-1
testpc115159:2736633:2739611 [1] NCCL INFO Tree 2 : 0 -> 1 -> 2/-1/-1
testpc115159:2736633:2739611 [1] NCCL INFO Tree 14 : 0 -> 1 -> 2/-1/-1
testpc115159:2736633:2739611 [1] NCCL INFO Tree 8 : 3 -> 1 -> 0/-1/-1
testpc115159:2736633:2739611 [1] NCCL INFO Tree 20 : 3 -> 1 -> 0/-1/-1
testpc115159:2736633:2739611 [1] NCCL INFO Tree 11 : 3 -> 1 -> -1/-1/-1
testpc115159:2736633:2739611 [1] NCCL INFO Tree 23 : 3 -> 1 -> -1/-1/-1

testpc115159:2736633:2739611 [1] graph/rings.cc:51 NCCL WARN Error : ring 1 does not contain rank 2
testpc115159:2736633:2739611 [1] NCCL INFO graph/connect.cc:479 -> 3
testpc115159:2736633:2739611 [1] NCCL INFO init.cc:1169 -> 3
testpc115159:2736633:2739611 [1] NCCL INFO init.cc:1501 -> 3
testpc115159:2736633:2739611 [1] NCCL INFO group.cc:64 -> 3 [Async thread]
testpc115159:2736633:2736633 [1] NCCL INFO group.cc:418 -> 3
testpc115159:2736633:2736633 [1] NCCL INFO group.cc:95 -> 3
testpc115159:2736633:2736633 [1] NCCL INFO comm 0x556ad86da000 rank 1 nranks 4 cudaDev 1 busId 41000 - Abort COMPLETE
world size: 4, current rank: 1, local rank: 1
testpc115159:2736633:2739628 [0] NCCL INFO comm 0x556acd8b8800 rank 1 nranks 4 cudaDev 1 busId 41000 - Abort COMPLETE

error code : 1
error info : testpc115159:2736635:2736635 [2] NCCL INFO cudaDriverVersion 12040
testpc115159:2736635:2736635 [2] NCCL INFO Bootstrap : Using eno8303:109.105.115.159<0>
testpc115159:2736635:2736635 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
testpc115159:2736635:2737364 [2] NCCL INFO NET/IB : No device found.
testpc115159:2736635:2737364 [2] NCCL INFO NET/Socket : Using [0]eno8303:109.105.115.159<0> [1]vethf4fabc7:fe80::54a7:2aff:fe92:27be%vethf4fabc7<0>
testpc115159:2736635:2737364 [2] NCCL INFO Using non-device net plugin version 0
testpc115159:2736635:2737364 [2] NCCL INFO Using network Socket
testpc115159:2736635:2737364 [2] NCCL INFO comm 0x560c1ce0a800 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 81000 commId 0xd57beaec7aafe641 - Init START
testpc115159:2736635:2737364 [2] NCCL INFO === System : maxBw 80.0 totalBw 240.0 ===
testpc115159:2736635:2737364 [2] NCCL INFO CPU/1 (1/2/-1)
testpc115159:2736635:2737364 [2] NCCL INFO + PCI[5000.0] - NIC/0
testpc115159:2736635:2737364 [2] NCCL INFO + PCI[24.0] - GPU/1000 (0)
testpc115159:2736635:2737364 [2] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736635:2737364 [2] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736635:2737364 [2] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736635:2737364 [2] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736635:2737364 [2] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736635:2737364 [2] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736635:2737364 [2] NCCL INFO CPU/0 (1/2/-1)
testpc115159:2736635:2737364 [2] NCCL INFO + PCI[24.0] - GPU/41000 (1)
testpc115159:2736635:2737364 [2] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736635:2737364 [2] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736635:2737364 [2] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736635:2737364 [2] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736635:2737364 [2] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736635:2737364 [2] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736635:2737364 [2] NCCL INFO CPU/3 (1/2/-1)
testpc115159:2736635:2737364 [2] NCCL INFO + PCI[24.0] - GPU/81000 (2)
testpc115159:2736635:2737364 [2] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736635:2737364 [2] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736635:2737364 [2] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736635:2737364 [2] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736635:2737364 [2] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736635:2737364 [2] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736635:2737364 [2] NCCL INFO CPU/2 (1/2/-1)
testpc115159:2736635:2737364 [2] NCCL INFO + PCI[24.0] - GPU/C1000 (3)
testpc115159:2736635:2737364 [2] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736635:2737364 [2] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736635:2737364 [2] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736635:2737364 [2] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736635:2737364 [2] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736635:2737364 [2] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736635:2737364 [2] NCCL INFO + PCI[0.4] - NIC/E1000
testpc115159:2736635:2737364 [2] NCCL INFO ==========================================
testpc115159:2736635:2737364 [2] NCCL INFO GPU/1000 :GPU/1000 (0/5000.000000/LOC) GPU/41000 (1/80.000000/NVL) GPU/81000 (1/80.000000/NVL) GPU/C1000 (1/80.000000/NVL) CPU/1 (1/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736635:2737364 [2] NCCL INFO GPU/41000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (0/5000.000000/LOC) GPU/81000 (1/80.000000/NVL) GPU/C1000 (1/80.000000/NVL) CPU/1 (2/24.000000/PHB) CPU/0 (1/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736635:2737364 [2] NCCL INFO GPU/81000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (1/80.000000/NVL) GPU/81000 (0/5000.000000/LOC) GPU/C1000 (1/80.000000/NVL) CPU/1 (2/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (1/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736635:2737364 [2] NCCL INFO GPU/C1000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (1/80.000000/NVL) GPU/81000 (1/80.000000/NVL) GPU/C1000 (0/5000.000000/LOC) CPU/1 (2/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (1/24.000000/PHB)
testpc115159:2736635:2737364 [2] NCCL INFO Setting affinity for GPU 2 to ffffff00,00000000,00000000,ffffff00,00000000,00000000
testpc115159:2736635:2737364 [2] NCCL INFO NVLS multicast support is not available on dev 2
testpc115159:2736635:2737364 [2] NCCL INFO Pattern 4, crossNic 0, nChannels 16, bw 20.000000/20.000000, type NVL/PIX, sameChannels 0
testpc115159:2736635:2737364 [2] NCCL INFO  0 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736635:2737364 [2] NCCL INFO  1 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736635:2737364 [2] NCCL INFO  2 : GPU/3 GPU/0 GPU/2 GPU/1
testpc115159:2736635:2737364 [2] NCCL INFO  3 : GPU/0 GPU/3 GPU/2 GPU/1
testpc115159:2736635:2737364 [2] NCCL INFO  4 : GPU/3 GPU/2 GPU/0 GPU/1
testpc115159:2736635:2737364 [2] NCCL INFO  5 : GPU/2 GPU/3 GPU/1 GPU/0
testpc115159:2736635:2737364 [2] NCCL INFO  6 : GPU/3 GPU/1 GPU/0 GPU/2
testpc115159:2736635:2737364 [2] NCCL INFO  7 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736635:2737364 [2] NCCL INFO  8 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736635:2737364 [2] NCCL INFO  9 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736635:2737364 [2] NCCL INFO 10 : GPU/3 GPU/0 GPU/2 GPU/1
testpc115159:2736635:2737364 [2] NCCL INFO 11 : GPU/0 GPU/3 GPU/2 GPU/1
testpc115159:2736635:2737364 [2] NCCL INFO 12 : GPU/3 GPU/2 GPU/0 GPU/1
testpc115159:2736635:2737364 [2] NCCL INFO 13 : GPU/2 GPU/3 GPU/1 GPU/0
testpc115159:2736635:2737364 [2] NCCL INFO 14 : GPU/3 GPU/1 GPU/0 GPU/2
testpc115159:2736635:2737364 [2] NCCL INFO 15 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736635:2737364 [2] NCCL INFO Pattern 1, crossNic 0, nChannels 16, bw 40.000000/20.000000, type NVL/PIX, sameChannels 0
testpc115159:2736635:2737364 [2] NCCL INFO  0 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736635:2737364 [2] NCCL INFO  1 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736635:2737364 [2] NCCL INFO  2 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736635:2737364 [2] NCCL INFO  3 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736635:2737364 [2] NCCL INFO  4 : GPU/0 GPU/2 GPU/3 GPU/1
testpc115159:2736635:2737364 [2] NCCL INFO  5 : GPU/0 GPU/2 GPU/3 GPU/1
testpc115159:2736635:2737364 [2] NCCL INFO  6 : GPU/0 GPU/2 GPU/1 GPU/3
testpc115159:2736635:2737364 [2] NCCL INFO  7 : GPU/1 GPU/0 GPU/2 GPU/3
testpc115159:2736635:2737364 [2] NCCL INFO  8 : GPU/2 GPU/0 GPU/3 GPU/1
testpc115159:2736635:2737364 [2] NCCL INFO  9 : GPU/2 GPU/1 GPU/0 GPU/3
testpc115159:2736635:2737364 [2] NCCL INFO 10 : GPU/2 GPU/0 GPU/1 GPU/3
testpc115159:2736635:2737364 [2] NCCL INFO 11 : GPU/3 GPU/2 GPU/0 GPU/1
testpc115159:2736635:2737364 [2] NCCL INFO 12 : GPU/2 GPU/1 GPU/0 GPU/3
testpc115159:2736635:2737364 [2] NCCL INFO 13 : GPU/3 GPU/1 GPU/0 GPU/2
testpc115159:2736635:2737364 [2] NCCL INFO 14 : GPU/2 GPU/1 GPU/0 GPU/3
testpc115159:2736635:2737364 [2] NCCL INFO 15 : GPU/3 GPU/0 GPU/2 GPU/1
testpc115159:2736635:2737364 [2] NCCL INFO comm 0x560c1ce0a800 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
testpc115159:2736635:2737364 [2] NCCL INFO Tree 5 : 0 -> 2 -> 3/-1/-1
testpc115159:2736635:2737364 [2] NCCL INFO Tree 17 : 0 -> 2 -> 3/-1/-1
testpc115159:2736635:2737364 [2] NCCL INFO Tree 7 : 0 -> 2 -> 3/-1/-1
testpc115159:2736635:2737364 [2] NCCL INFO Tree 19 : 0 -> 2 -> 3/-1/-1
testpc115159:2736635:2737364 [2] NCCL INFO Tree 9 : -1 -> 2 -> 1/-1/-1
testpc115159:2736635:2737364 [2] NCCL INFO Tree 21 : -1 -> 2 -> 1/-1/-1

testpc115159:2736635:2737364 [2] graph/rings.cc:38 NCCL WARN Error : ring 2 does not loop back to start (1 != 2)
testpc115159:2736635:2737364 [2] NCCL INFO graph/connect.cc:479 -> 3
testpc115159:2736635:2737364 [2] NCCL INFO init.cc:1169 -> 3
testpc115159:2736635:2737364 [2] NCCL INFO init.cc:1501 -> 3
testpc115159:2736635:2737364 [2] NCCL INFO group.cc:64 -> 3 [Async thread]
testpc115159:2736635:2736635 [2] NCCL INFO group.cc:418 -> 3
testpc115159:2736635:2736635 [2] NCCL INFO group.cc:95 -> 3
testpc115159:2736635:2739610 [2] NCCL INFO Using non-device net plugin version 0
testpc115159:2736635:2739610 [2] NCCL INFO Using network Socket
testpc115159:2736635:2739610 [2] NCCL INFO comm 0x560c27c51400 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 81000 commId 0xa319625d1be869a3 - Init START
testpc115159:2736635:2739610 [2] NCCL INFO === System : maxBw 80.0 totalBw 240.0 ===
testpc115159:2736635:2739610 [2] NCCL INFO CPU/1 (1/2/-1)
testpc115159:2736635:2739610 [2] NCCL INFO + PCI[5000.0] - NIC/0
testpc115159:2736635:2739610 [2] NCCL INFO + PCI[24.0] - GPU/1000 (0)
testpc115159:2736635:2739610 [2] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736635:2739610 [2] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736635:2739610 [2] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736635:2739610 [2] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736635:2739610 [2] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736635:2739610 [2] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736635:2739610 [2] NCCL INFO CPU/0 (1/2/-1)
testpc115159:2736635:2739610 [2] NCCL INFO + PCI[24.0] - GPU/41000 (1)
testpc115159:2736635:2739610 [2] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736635:2739610 [2] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736635:2739610 [2] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736635:2739610 [2] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736635:2739610 [2] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736635:2739610 [2] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736635:2739610 [2] NCCL INFO CPU/3 (1/2/-1)
testpc115159:2736635:2739610 [2] NCCL INFO + PCI[24.0] - GPU/81000 (2)
testpc115159:2736635:2739610 [2] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736635:2739610 [2] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736635:2739610 [2] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736635:2739610 [2] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736635:2739610 [2] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736635:2739610 [2] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736635:2739610 [2] NCCL INFO CPU/2 (1/2/-1)
testpc115159:2736635:2739610 [2] NCCL INFO + PCI[24.0] - GPU/C1000 (3)
testpc115159:2736635:2739610 [2] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736635:2739610 [2] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736635:2739610 [2] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736635:2739610 [2] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736635:2739610 [2] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736635:2739610 [2] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736635:2739610 [2] NCCL INFO + PCI[0.4] - NIC/E1000
testpc115159:2736635:2739610 [2] NCCL INFO ==========================================
testpc115159:2736635:2739610 [2] NCCL INFO GPU/1000 :GPU/1000 (0/5000.000000/LOC) GPU/41000 (1/80.000000/NVL) GPU/81000 (1/80.000000/NVL) GPU/C1000 (1/80.000000/NVL) CPU/1 (1/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736635:2739610 [2] NCCL INFO GPU/41000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (0/5000.000000/LOC) GPU/81000 (1/80.000000/NVL) GPU/C1000 (1/80.000000/NVL) CPU/1 (2/24.000000/PHB) CPU/0 (1/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736635:2739610 [2] NCCL INFO GPU/81000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (1/80.000000/NVL) GPU/81000 (0/5000.000000/LOC) GPU/C1000 (1/80.000000/NVL) CPU/1 (2/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (1/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736635:2739610 [2] NCCL INFO GPU/C1000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (1/80.000000/NVL) GPU/81000 (1/80.000000/NVL) GPU/C1000 (0/5000.000000/LOC) CPU/1 (2/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (1/24.000000/PHB)
testpc115159:2736635:2739610 [2] NCCL INFO Setting affinity for GPU 2 to ffffff00,00000000,00000000,ffffff00,00000000,00000000
testpc115159:2736635:2739610 [2] NCCL INFO NVLS multicast support is not available on dev 2
testpc115159:2736635:2739610 [2] NCCL INFO Pattern 4, crossNic 0, nChannels 16, bw 20.000000/20.000000, type NVL/PIX, sameChannels 0
testpc115159:2736635:2739610 [2] NCCL INFO  0 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736635:2739610 [2] NCCL INFO  1 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736635:2739610 [2] NCCL INFO  2 : GPU/0 GPU/2 GPU/3 GPU/1
testpc115159:2736635:2739610 [2] NCCL INFO  3 : GPU/3 GPU/1 GPU/2 GPU/0
testpc115159:2736635:2739610 [2] NCCL INFO  4 : GPU/1 GPU/0 GPU/2 GPU/3
testpc115159:2736635:2739610 [2] NCCL INFO  5 : GPU/1 GPU/0 GPU/3 GPU/2
testpc115159:2736635:2739610 [2] NCCL INFO  6 : GPU/1 GPU/3 GPU/0 GPU/2
testpc115159:2736635:2739610 [2] NCCL INFO  7 : GPU/3 GPU/2 GPU/0 GPU/1
testpc115159:2736635:2739610 [2] NCCL INFO  8 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736635:2739610 [2] NCCL INFO  9 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736635:2739610 [2] NCCL INFO 10 : GPU/0 GPU/2 GPU/3 GPU/1
testpc115159:2736635:2739610 [2] NCCL INFO 11 : GPU/3 GPU/1 GPU/2 GPU/0
testpc115159:2736635:2739610 [2] NCCL INFO 12 : GPU/1 GPU/0 GPU/2 GPU/3
testpc115159:2736635:2739610 [2] NCCL INFO 13 : GPU/1 GPU/0 GPU/3 GPU/2
testpc115159:2736635:2739610 [2] NCCL INFO 14 : GPU/1 GPU/3 GPU/0 GPU/2
testpc115159:2736635:2739610 [2] NCCL INFO 15 : GPU/3 GPU/2 GPU/0 GPU/1
testpc115159:2736635:2739610 [2] NCCL INFO Pattern 1, crossNic 0, nChannels 16, bw 30.000000/20.000000, type NVL/PIX, sameChannels 0
testpc115159:2736635:2739610 [2] NCCL INFO  0 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736635:2739610 [2] NCCL INFO  1 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736635:2739610 [2] NCCL INFO  2 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736635:2739610 [2] NCCL INFO  3 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736635:2739610 [2] NCCL INFO  4 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736635:2739610 [2] NCCL INFO  5 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736635:2739610 [2] NCCL INFO  6 : GPU/0 GPU/2 GPU/1 GPU/3
testpc115159:2736635:2739610 [2] NCCL INFO  7 : GPU/0 GPU/2 GPU/1 GPU/3
testpc115159:2736635:2739610 [2] NCCL INFO  8 : GPU/0 GPU/2 GPU/1 GPU/3
testpc115159:2736635:2739610 [2] NCCL INFO  9 : GPU/1 GPU/3 GPU/0 GPU/2
testpc115159:2736635:2739610 [2] NCCL INFO 10 : GPU/1 GPU/3 GPU/0 GPU/2
testpc115159:2736635:2739610 [2] NCCL INFO 11 : GPU/1 GPU/0 GPU/3 GPU/2
testpc115159:2736635:2739610 [2] NCCL INFO 12 : GPU/1 GPU/0 GPU/3 GPU/2
testpc115159:2736635:2739610 [2] NCCL INFO 13 : GPU/2 GPU/0 GPU/3 GPU/1
testpc115159:2736635:2739610 [2] NCCL INFO 14 : GPU/2 GPU/0 GPU/3 GPU/1
testpc115159:2736635:2739610 [2] NCCL INFO 15 : GPU/2 GPU/0 GPU/3 GPU/1
testpc115159:2736635:2739610 [2] NCCL INFO comm 0x560c27c51400 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
testpc115159:2736635:2739610 [2] NCCL INFO Tree 3 : 1 -> 2 -> 3/-1/-1
testpc115159:2736635:2739610 [2] NCCL INFO Tree 15 : 1 -> 2 -> 3/-1/-1
testpc115159:2736635:2739610 [2] NCCL INFO Tree 4 : 3 -> 2 -> -1/-1/-1
testpc115159:2736635:2739610 [2] NCCL INFO Tree 16 : 3 -> 2 -> -1/-1/-1
testpc115159:2736635:2739610 [2] NCCL INFO Tree 5 : 3 -> 2 -> -1/-1/-1
testpc115159:2736635:2739610 [2] NCCL INFO Tree 17 : 3 -> 2 -> -1/-1/-1
testpc115159:2736635:2739610 [2] NCCL INFO Tree 7 : 0 -> 2 -> 1/-1/-1
testpc115159:2736635:2739610 [2] NCCL INFO Tree 19 : 0 -> 2 -> 1/-1/-1
testpc115159:2736635:2739610 [2] NCCL INFO Tree 9 : 0 -> 2 -> -1/-1/-1
testpc115159:2736635:2739610 [2] NCCL INFO Tree 21 : 0 -> 2 -> -1/-1/-1

testpc115159:2736635:2739610 [2] graph/rings.cc:38 NCCL WARN Error : ring 1 does not loop back to start (-1 != 2)
testpc115159:2736635:2739610 [2] NCCL INFO graph/connect.cc:479 -> 3
testpc115159:2736635:2739610 [2] NCCL INFO init.cc:1169 -> 3
testpc115159:2736635:2739610 [2] NCCL INFO init.cc:1501 -> 3
testpc115159:2736635:2739610 [2] NCCL INFO group.cc:64 -> 3 [Async thread]
testpc115159:2736635:2736635 [2] NCCL INFO group.cc:418 -> 3
testpc115159:2736635:2736635 [2] NCCL INFO group.cc:95 -> 3
testpc115159:2736635:2736635 [2] NCCL INFO comm 0x560c27c51400 rank 2 nranks 4 cudaDev 2 busId 81000 - Abort COMPLETE
world size: 4, current rank: 2, local rank: 2
testpc115159:2736635:2739629 [0] NCCL INFO comm 0x560c1ce0a800 rank 2 nranks 4 cudaDev 2 busId 81000 - Abort COMPLETE

error code : 1
error info : testpc115159:2736637:2736637 [3] NCCL INFO cudaDriverVersion 12040
testpc115159:2736637:2736637 [3] NCCL INFO Bootstrap : Using eno8303:109.105.115.159<0>
testpc115159:2736637:2736637 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
testpc115159:2736637:2737363 [3] NCCL INFO NET/IB : No device found.
testpc115159:2736637:2737363 [3] NCCL INFO NET/Socket : Using [0]eno8303:109.105.115.159<0> [1]vethf4fabc7:fe80::54a7:2aff:fe92:27be%vethf4fabc7<0>
testpc115159:2736637:2737363 [3] NCCL INFO Using non-device net plugin version 0
testpc115159:2736637:2737363 [3] NCCL INFO Using network Socket
testpc115159:2736637:2737363 [3] NCCL INFO comm 0x5587025ac800 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId c1000 commId 0xd57beaec7aafe641 - Init START
testpc115159:2736637:2737363 [3] NCCL INFO === System : maxBw 80.0 totalBw 240.0 ===
testpc115159:2736637:2737363 [3] NCCL INFO CPU/1 (1/2/-1)
testpc115159:2736637:2737363 [3] NCCL INFO + PCI[5000.0] - NIC/0
testpc115159:2736637:2737363 [3] NCCL INFO + PCI[24.0] - GPU/1000 (0)
testpc115159:2736637:2737363 [3] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736637:2737363 [3] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736637:2737363 [3] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736637:2737363 [3] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736637:2737363 [3] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736637:2737363 [3] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736637:2737363 [3] NCCL INFO CPU/0 (1/2/-1)
testpc115159:2736637:2737363 [3] NCCL INFO + PCI[24.0] - GPU/41000 (1)
testpc115159:2736637:2737363 [3] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736637:2737363 [3] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736637:2737363 [3] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736637:2737363 [3] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736637:2737363 [3] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736637:2737363 [3] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736637:2737363 [3] NCCL INFO CPU/3 (1/2/-1)
testpc115159:2736637:2737363 [3] NCCL INFO + PCI[24.0] - GPU/81000 (2)
testpc115159:2736637:2737363 [3] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736637:2737363 [3] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736637:2737363 [3] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736637:2737363 [3] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736637:2737363 [3] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736637:2737363 [3] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736637:2737363 [3] NCCL INFO CPU/2 (1/2/-1)
testpc115159:2736637:2737363 [3] NCCL INFO + PCI[24.0] - GPU/C1000 (3)
testpc115159:2736637:2737363 [3] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736637:2737363 [3] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736637:2737363 [3] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736637:2737363 [3] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736637:2737363 [3] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736637:2737363 [3] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736637:2737363 [3] NCCL INFO + PCI[0.4] - NIC/E1000
testpc115159:2736637:2737363 [3] NCCL INFO ==========================================
testpc115159:2736637:2737363 [3] NCCL INFO GPU/1000 :GPU/1000 (0/5000.000000/LOC) GPU/41000 (1/80.000000/NVL) GPU/81000 (1/80.000000/NVL) GPU/C1000 (1/80.000000/NVL) CPU/1 (1/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736637:2737363 [3] NCCL INFO GPU/41000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (0/5000.000000/LOC) GPU/81000 (1/80.000000/NVL) GPU/C1000 (1/80.000000/NVL) CPU/1 (2/24.000000/PHB) CPU/0 (1/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736637:2737363 [3] NCCL INFO GPU/81000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (1/80.000000/NVL) GPU/81000 (0/5000.000000/LOC) GPU/C1000 (1/80.000000/NVL) CPU/1 (2/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (1/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736637:2737363 [3] NCCL INFO GPU/C1000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (1/80.000000/NVL) GPU/81000 (1/80.000000/NVL) GPU/C1000 (0/5000.000000/LOC) CPU/1 (2/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (1/24.000000/PHB)
testpc115159:2736637:2737363 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00000000,000000ff,ffff0000,00000000
testpc115159:2736637:2737363 [3] NCCL INFO NVLS multicast support is not available on dev 3
testpc115159:2736637:2737363 [3] NCCL INFO Pattern 4, crossNic 0, nChannels 14, bw 20.000000/20.000000, type NVL/PIX, sameChannels 0
testpc115159:2736637:2737363 [3] NCCL INFO  0 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736637:2737363 [3] NCCL INFO  1 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736637:2737363 [3] NCCL INFO  2 : GPU/0 GPU/2 GPU/1 GPU/3
testpc115159:2736637:2737363 [3] NCCL INFO  3 : GPU/3 GPU/2 GPU/1 GPU/0
testpc115159:2736637:2737363 [3] NCCL INFO  4 : GPU/3 GPU/1 GPU/0 GPU/2
testpc115159:2736637:2737363 [3] NCCL INFO  5 : GPU/1 GPU/3 GPU/0 GPU/2
testpc115159:2736637:2737363 [3] NCCL INFO  6 : GPU/0 GPU/3 GPU/1 GPU/2
testpc115159:2736637:2737363 [3] NCCL INFO  7 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736637:2737363 [3] NCCL INFO  8 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736637:2737363 [3] NCCL INFO  9 : GPU/0 GPU/2 GPU/1 GPU/3
testpc115159:2736637:2737363 [3] NCCL INFO 10 : GPU/3 GPU/2 GPU/1 GPU/0
testpc115159:2736637:2737363 [3] NCCL INFO 11 : GPU/3 GPU/1 GPU/0 GPU/2
testpc115159:2736637:2737363 [3] NCCL INFO 12 : GPU/1 GPU/3 GPU/0 GPU/2
testpc115159:2736637:2737363 [3] NCCL INFO 13 : GPU/0 GPU/3 GPU/1 GPU/2
testpc115159:2736637:2737363 [3] NCCL INFO Pattern 1, crossNic 0, nChannels 14, bw 40.000000/20.000000, type NVL/PIX, sameChannels 0
testpc115159:2736637:2737363 [3] NCCL INFO  0 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736637:2737363 [3] NCCL INFO  1 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736637:2737363 [3] NCCL INFO  2 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736637:2737363 [3] NCCL INFO  3 : GPU/0 GPU/2 GPU/3 GPU/1
testpc115159:2736637:2737363 [3] NCCL INFO  4 : GPU/0 GPU/2 GPU/3 GPU/1
testpc115159:2736637:2737363 [3] NCCL INFO  5 : GPU/0 GPU/2 GPU/1 GPU/3
testpc115159:2736637:2737363 [3] NCCL INFO  6 : GPU/0 GPU/3 GPU/2 GPU/1
testpc115159:2736637:2737363 [3] NCCL INFO  7 : GPU/2 GPU/0 GPU/3 GPU/1
testpc115159:2736637:2737363 [3] NCCL INFO  8 : GPU/2 GPU/0 GPU/3 GPU/1
testpc115159:2736637:2737363 [3] NCCL INFO  9 : GPU/1 GPU/3 GPU/2 GPU/0
testpc115159:2736637:2737363 [3] NCCL INFO 10 : GPU/1 GPU/0 GPU/3 GPU/2
testpc115159:2736637:2737363 [3] NCCL INFO 11 : GPU/1 GPU/0 GPU/3 GPU/2
testpc115159:2736637:2737363 [3] NCCL INFO 12 : GPU/3 GPU/2 GPU/1 GPU/0
testpc115159:2736637:2737363 [3] NCCL INFO 13 : GPU/1 GPU/3 GPU/0 GPU/2
testpc115159:2736637:2737363 [3] NCCL INFO comm 0x5587025ac800 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
testpc115159:2736637:2737363 [3] NCCL INFO Tree 6 : 0 -> 3 -> 2/-1/-1
testpc115159:2736637:2737363 [3] NCCL INFO Tree 18 : 0 -> 3 -> 2/-1/-1
testpc115159:2736637:2737363 [3] NCCL INFO Tree 8 : 0 -> 3 -> 1/-1/-1
testpc115159:2736637:2737363 [3] NCCL INFO Tree 20 : 0 -> 3 -> 1/-1/-1
testpc115159:2736637:2737363 [3] NCCL INFO Tree 10 : 0 -> 3 -> 2/-1/-1
testpc115159:2736637:2737363 [3] NCCL INFO Tree 22 : 0 -> 3 -> 2/-1/-1
testpc115159:2736637:2737363 [3] NCCL INFO Tree 11 : 0 -> 3 -> 2/-1/-1
testpc115159:2736637:2737363 [3] NCCL INFO Tree 23 : 0 -> 3 -> 2/-1/-1

testpc115159:2736637:2737363 [3] graph/rings.cc:38 NCCL WARN Error : ring 2 does not loop back to start (0 != 3)
testpc115159:2736637:2737363 [3] NCCL INFO graph/connect.cc:479 -> 3
testpc115159:2736637:2737363 [3] NCCL INFO init.cc:1169 -> 3
testpc115159:2736637:2737363 [3] NCCL INFO init.cc:1501 -> 3
testpc115159:2736637:2737363 [3] NCCL INFO group.cc:64 -> 3 [Async thread]
testpc115159:2736637:2736637 [3] NCCL INFO group.cc:418 -> 3
testpc115159:2736637:2736637 [3] NCCL INFO group.cc:95 -> 3
testpc115159:2736637:2739612 [3] NCCL INFO Using non-device net plugin version 0
testpc115159:2736637:2739612 [3] NCCL INFO Using network Socket
testpc115159:2736637:2739612 [3] NCCL INFO comm 0x55870d3ce000 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId c1000 commId 0xa319625d1be869a3 - Init START
testpc115159:2736637:2739612 [3] NCCL INFO === System : maxBw 80.0 totalBw 240.0 ===
testpc115159:2736637:2739612 [3] NCCL INFO CPU/1 (1/2/-1)
testpc115159:2736637:2739612 [3] NCCL INFO + PCI[5000.0] - NIC/0
testpc115159:2736637:2739612 [3] NCCL INFO + PCI[24.0] - GPU/1000 (0)
testpc115159:2736637:2739612 [3] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736637:2739612 [3] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736637:2739612 [3] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736637:2739612 [3] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736637:2739612 [3] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736637:2739612 [3] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736637:2739612 [3] NCCL INFO CPU/0 (1/2/-1)
testpc115159:2736637:2739612 [3] NCCL INFO + PCI[24.0] - GPU/41000 (1)
testpc115159:2736637:2739612 [3] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736637:2739612 [3] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736637:2739612 [3] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736637:2739612 [3] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736637:2739612 [3] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736637:2739612 [3] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736637:2739612 [3] NCCL INFO CPU/3 (1/2/-1)
testpc115159:2736637:2739612 [3] NCCL INFO + PCI[24.0] - GPU/81000 (2)
testpc115159:2736637:2739612 [3] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736637:2739612 [3] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736637:2739612 [3] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736637:2739612 [3] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736637:2739612 [3] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736637:2739612 [3] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736637:2739612 [3] NCCL INFO CPU/2 (1/2/-1)
testpc115159:2736637:2739612 [3] NCCL INFO + PCI[24.0] - GPU/C1000 (3)
testpc115159:2736637:2739612 [3] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736637:2739612 [3] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736637:2739612 [3] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736637:2739612 [3] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736637:2739612 [3] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736637:2739612 [3] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736637:2739612 [3] NCCL INFO + PCI[0.4] - NIC/E1000
testpc115159:2736637:2739612 [3] NCCL INFO ==========================================
testpc115159:2736637:2739612 [3] NCCL INFO GPU/1000 :GPU/1000 (0/5000.000000/LOC) GPU/41000 (1/80.000000/NVL) GPU/81000 (1/80.000000/NVL) GPU/C1000 (1/80.000000/NVL) CPU/1 (1/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736637:2739612 [3] NCCL INFO GPU/41000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (0/5000.000000/LOC) GPU/81000 (1/80.000000/NVL) GPU/C1000 (1/80.000000/NVL) CPU/1 (2/24.000000/PHB) CPU/0 (1/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736637:2739612 [3] NCCL INFO GPU/81000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (1/80.000000/NVL) GPU/81000 (0/5000.000000/LOC) GPU/C1000 (1/80.000000/NVL) CPU/1 (2/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (1/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736637:2739612 [3] NCCL INFO GPU/C1000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (1/80.000000/NVL) GPU/81000 (1/80.000000/NVL) GPU/C1000 (0/5000.000000/LOC) CPU/1 (2/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (1/24.000000/PHB)
testpc115159:2736637:2739612 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00000000,000000ff,ffff0000,00000000
testpc115159:2736637:2739612 [3] NCCL INFO NVLS multicast support is not available on dev 3
testpc115159:2736637:2739612 [3] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 0
testpc115159:2736637:2739612 [3] NCCL INFO  0 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736637:2739612 [3] NCCL INFO  1 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736637:2739612 [3] NCCL INFO  2 : GPU/3 GPU/1 GPU/0 GPU/2
testpc115159:2736637:2739612 [3] NCCL INFO  3 : GPU/2 GPU/1 GPU/3 GPU/0
testpc115159:2736637:2739612 [3] NCCL INFO  4 : GPU/2 GPU/1 GPU/0 GPU/3
testpc115159:2736637:2739612 [3] NCCL INFO  5 : GPU/1 GPU/2 GPU/0 GPU/3
testpc115159:2736637:2739612 [3] NCCL INFO  6 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736637:2739612 [3] NCCL INFO  7 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736637:2739612 [3] NCCL INFO  8 : GPU/3 GPU/1 GPU/0 GPU/2
testpc115159:2736637:2739612 [3] NCCL INFO  9 : GPU/2 GPU/1 GPU/3 GPU/0
testpc115159:2736637:2739612 [3] NCCL INFO 10 : GPU/2 GPU/1 GPU/0 GPU/3
testpc115159:2736637:2739612 [3] NCCL INFO 11 : GPU/1 GPU/2 GPU/0 GPU/3
testpc115159:2736637:2739612 [3] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/20.000000, type NVL/PIX, sameChannels 0
testpc115159:2736637:2739612 [3] NCCL INFO  0 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736637:2739612 [3] NCCL INFO  1 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736637:2739612 [3] NCCL INFO  2 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736637:2739612 [3] NCCL INFO  3 : GPU/0 GPU/2 GPU/3 GPU/1
testpc115159:2736637:2739612 [3] NCCL INFO  4 : GPU/0 GPU/2 GPU/1 GPU/3
testpc115159:2736637:2739612 [3] NCCL INFO  5 : GPU/0 GPU/3 GPU/2 GPU/1
testpc115159:2736637:2739612 [3] NCCL INFO  6 : GPU/1 GPU/0 GPU/3 GPU/2
testpc115159:2736637:2739612 [3] NCCL INFO  7 : GPU/1 GPU/3 GPU/0 GPU/2
testpc115159:2736637:2739612 [3] NCCL INFO  8 : GPU/3 GPU/1 GPU/0 GPU/2
testpc115159:2736637:2739612 [3] NCCL INFO  9 : GPU/1 GPU/3 GPU/2 GPU/0
testpc115159:2736637:2739612 [3] NCCL INFO 10 : GPU/0 GPU/3 GPU/1 GPU/2
testpc115159:2736637:2739612 [3] NCCL INFO 11 : GPU/2 GPU/0 GPU/3 GPU/1
testpc115159:2736637:2739612 [3] NCCL INFO comm 0x55870d3ce000 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
testpc115159:2736637:2739612 [3] NCCL INFO Tree 6 : 0 -> 3 -> 2/-1/-1
testpc115159:2736637:2739612 [3] NCCL INFO Tree 18 : 0 -> 3 -> 2/-1/-1
testpc115159:2736637:2739612 [3] NCCL INFO Tree 8 : -1 -> 3 -> 1/-1/-1
testpc115159:2736637:2739612 [3] NCCL INFO Tree 20 : -1 -> 3 -> 1/-1/-1
testpc115159:2736637:2739612 [3] NCCL INFO Tree 9 : 1 -> 3 -> 2/-1/-1
testpc115159:2736637:2739612 [3] NCCL INFO Tree 21 : 1 -> 3 -> 2/-1/-1
testpc115159:2736637:2739612 [3] NCCL INFO Tree 10 : 0 -> 3 -> 1/-1/-1
testpc115159:2736637:2739612 [3] NCCL INFO Tree 22 : 0 -> 3 -> 1/-1/-1
testpc115159:2736637:2739612 [3] NCCL INFO Tree 11 : 0 -> 3 -> 1/-1/-1
testpc115159:2736637:2739612 [3] NCCL INFO Tree 23 : 0 -> 3 -> 1/-1/-1

testpc115159:2736637:2739612 [3] graph/rings.cc:51 NCCL WARN Error : ring 1 does not contain rank 2
testpc115159:2736637:2739612 [3] NCCL INFO graph/connect.cc:479 -> 3
testpc115159:2736637:2739612 [3] NCCL INFO init.cc:1169 -> 3
testpc115159:2736637:2739612 [3] NCCL INFO init.cc:1501 -> 3
testpc115159:2736637:2739612 [3] NCCL INFO group.cc:64 -> 3 [Async thread]
testpc115159:2736637:2736637 [3] NCCL INFO group.cc:418 -> 3
testpc115159:2736637:2736637 [3] NCCL INFO group.cc:95 -> 3
testpc115159:2736637:2736637 [3] NCCL INFO comm 0x55870d3ce000 rank 3 nranks 4 cudaDev 3 busId c1000 - Abort COMPLETE
world size: 4, current rank: 3, local rank: 3
testpc115159:2736637:2739627 [0] NCCL INFO comm 0x5587025ac800 rank 3 nranks 4 cudaDev 3 busId c1000 - Abort COMPLETE

error code : 1
error info : testpc115159:2736632:2736632 [0] NCCL INFO Bootstrap : Using eno8303:109.105.115.159<0>
testpc115159:2736632:2736632 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
testpc115159:2736632:2736632 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.20.5+cuda12.1
testpc115159:2736632:2737362 [0] NCCL INFO NET/IB : No device found.
testpc115159:2736632:2737362 [0] NCCL INFO NET/Socket : Using [0]eno8303:109.105.115.159<0> [1]vethf4fabc7:fe80::54a7:2aff:fe92:27be%vethf4fabc7<0>
testpc115159:2736632:2737362 [0] NCCL INFO Using non-device net plugin version 0
testpc115159:2736632:2737362 [0] NCCL INFO Using network Socket
testpc115159:2736632:2737362 [0] NCCL INFO comm 0x55a0bbb78800 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xd57beaec7aafe641 - Init START
testpc115159:2736632:2737362 [0] NCCL INFO === System : maxBw 80.0 totalBw 240.0 ===
testpc115159:2736632:2737362 [0] NCCL INFO CPU/1 (1/2/-1)
testpc115159:2736632:2737362 [0] NCCL INFO + PCI[5000.0] - NIC/0
testpc115159:2736632:2737362 [0] NCCL INFO + PCI[24.0] - GPU/1000 (0)
testpc115159:2736632:2737362 [0] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736632:2737362 [0] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736632:2737362 [0] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736632:2737362 [0] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736632:2737362 [0] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736632:2737362 [0] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736632:2737362 [0] NCCL INFO CPU/0 (1/2/-1)
testpc115159:2736632:2737362 [0] NCCL INFO + PCI[24.0] - GPU/41000 (1)
testpc115159:2736632:2737362 [0] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736632:2737362 [0] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736632:2737362 [0] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736632:2737362 [0] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736632:2737362 [0] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736632:2737362 [0] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736632:2737362 [0] NCCL INFO CPU/3 (1/2/-1)
testpc115159:2736632:2737362 [0] NCCL INFO + PCI[24.0] - GPU/81000 (2)
testpc115159:2736632:2737362 [0] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736632:2737362 [0] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736632:2737362 [0] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736632:2737362 [0] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736632:2737362 [0] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736632:2737362 [0] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736632:2737362 [0] NCCL INFO CPU/2 (1/2/-1)
testpc115159:2736632:2737362 [0] NCCL INFO + PCI[24.0] - GPU/C1000 (3)
testpc115159:2736632:2737362 [0] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736632:2737362 [0] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736632:2737362 [0] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736632:2737362 [0] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736632:2737362 [0] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736632:2737362 [0] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736632:2737362 [0] NCCL INFO + PCI[0.4] - NIC/E1000
testpc115159:2736632:2737362 [0] NCCL INFO ==========================================
testpc115159:2736632:2737362 [0] NCCL INFO GPU/1000 :GPU/1000 (0/5000.000000/LOC) GPU/41000 (1/80.000000/NVL) GPU/81000 (1/80.000000/NVL) GPU/C1000 (1/80.000000/NVL) CPU/1 (1/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736632:2737362 [0] NCCL INFO GPU/41000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (0/5000.000000/LOC) GPU/81000 (1/80.000000/NVL) GPU/C1000 (1/80.000000/NVL) CPU/1 (2/24.000000/PHB) CPU/0 (1/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736632:2737362 [0] NCCL INFO GPU/81000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (1/80.000000/NVL) GPU/81000 (0/5000.000000/LOC) GPU/C1000 (1/80.000000/NVL) CPU/1 (2/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (1/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736632:2737362 [0] NCCL INFO GPU/C1000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (1/80.000000/NVL) GPU/81000 (1/80.000000/NVL) GPU/C1000 (0/5000.000000/LOC) CPU/1 (2/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (1/24.000000/PHB)
testpc115159:2736632:2737362 [0] NCCL INFO Setting affinity for GPU 0 to ffff,ff000000,00000000,0000ffff,ff000000
testpc115159:2736632:2737362 [0] NCCL INFO NVLS multicast support is not available on dev 0
testpc115159:2736632:2737362 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 0
testpc115159:2736632:2737362 [0] NCCL INFO  0 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736632:2737362 [0] NCCL INFO  1 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736632:2737362 [0] NCCL INFO  2 : GPU/0 GPU/2 GPU/3 GPU/1
testpc115159:2736632:2737362 [0] NCCL INFO  3 : GPU/0 GPU/2 GPU/1 GPU/3
testpc115159:2736632:2737362 [0] NCCL INFO  4 : GPU/0 GPU/3 GPU/1 GPU/2
testpc115159:2736632:2737362 [0] NCCL INFO  5 : GPU/0 GPU/3 GPU/2 GPU/1
testpc115159:2736632:2737362 [0] NCCL INFO  6 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736632:2737362 [0] NCCL INFO  7 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736632:2737362 [0] NCCL INFO  8 : GPU/0 GPU/2 GPU/3 GPU/1
testpc115159:2736632:2737362 [0] NCCL INFO  9 : GPU/0 GPU/2 GPU/1 GPU/3
testpc115159:2736632:2737362 [0] NCCL INFO 10 : GPU/0 GPU/3 GPU/1 GPU/2
testpc115159:2736632:2737362 [0] NCCL INFO 11 : GPU/0 GPU/3 GPU/2 GPU/1
testpc115159:2736632:2737362 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/20.000000, type NVL/PIX, sameChannels 0
testpc115159:2736632:2737362 [0] NCCL INFO  0 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736632:2737362 [0] NCCL INFO  1 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736632:2737362 [0] NCCL INFO  2 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736632:2737362 [0] NCCL INFO  3 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736632:2737362 [0] NCCL INFO  4 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736632:2737362 [0] NCCL INFO  5 : GPU/2 GPU/1 GPU/0 GPU/3
testpc115159:2736632:2737362 [0] NCCL INFO  6 : GPU/3 GPU/1 GPU/0 GPU/2
testpc115159:2736632:2737362 [0] NCCL INFO  7 : GPU/1 GPU/2 GPU/3 GPU/0
testpc115159:2736632:2737362 [0] NCCL INFO  8 : GPU/3 GPU/1 GPU/0 GPU/2
testpc115159:2736632:2737362 [0] NCCL INFO  9 : GPU/1 GPU/2 GPU/0 GPU/3
testpc115159:2736632:2737362 [0] NCCL INFO 10 : GPU/1 GPU/3 GPU/0 GPU/2
testpc115159:2736632:2737362 [0] NCCL INFO 11 : GPU/3 GPU/0 GPU/2 GPU/1
testpc115159:2736632:2737362 [0] NCCL INFO comm 0x55a0bbb78800 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
testpc115159:2736632:2737362 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1
testpc115159:2736632:2737362 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1
testpc115159:2736632:2737362 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1
testpc115159:2736632:2737362 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1
testpc115159:2736632:2737362 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1
testpc115159:2736632:2737362 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1
testpc115159:2736632:2737362 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1
testpc115159:2736632:2737362 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1
testpc115159:2736632:2737362 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1
testpc115159:2736632:2737362 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1
testpc115159:2736632:2737362 [0] NCCL INFO Tree 11 : 3 -> 0 -> 2/-1/-1
testpc115159:2736632:2737362 [0] NCCL INFO Tree 23 : 3 -> 0 -> 2/-1/-1
testpc115159:2736632:2737362 [0] NCCL INFO Channel 00/24 :    0   1   2   3
testpc115159:2736632:2737362 [0] NCCL INFO Channel 01/24 :    0   1   3   2
testpc115159:2736632:2737362 [0] NCCL INFO Channel 02/24 :    0   2   1   0

testpc115159:2736632:2737362 [0] graph/rings.cc:38 NCCL WARN Error : ring 2 does not loop back to start (2 != 0)
testpc115159:2736632:2737362 [0] NCCL INFO graph/connect.cc:479 -> 3
testpc115159:2736632:2737362 [0] NCCL INFO init.cc:1169 -> 3
testpc115159:2736632:2737362 [0] NCCL INFO init.cc:1501 -> 3
testpc115159:2736632:2737362 [0] NCCL INFO group.cc:64 -> 3 [Async thread]
testpc115159:2736632:2736632 [0] NCCL INFO group.cc:418 -> 3
testpc115159:2736632:2736632 [0] NCCL INFO group.cc:95 -> 3
testpc115159:2736632:2739609 [0] NCCL INFO Using non-device net plugin version 0
testpc115159:2736632:2739609 [0] NCCL INFO Using network Socket
testpc115159:2736632:2739609 [0] NCCL INFO comm 0x55a0c69ac000 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xa319625d1be869a3 - Init START
testpc115159:2736632:2739609 [0] NCCL INFO === System : maxBw 80.0 totalBw 240.0 ===
testpc115159:2736632:2739609 [0] NCCL INFO CPU/1 (1/2/-1)
testpc115159:2736632:2739609 [0] NCCL INFO + PCI[5000.0] - NIC/0
testpc115159:2736632:2739609 [0] NCCL INFO + PCI[24.0] - GPU/1000 (0)
testpc115159:2736632:2739609 [0] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736632:2739609 [0] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736632:2739609 [0] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736632:2739609 [0] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736632:2739609 [0] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736632:2739609 [0] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736632:2739609 [0] NCCL INFO CPU/0 (1/2/-1)
testpc115159:2736632:2739609 [0] NCCL INFO + PCI[24.0] - GPU/41000 (1)
testpc115159:2736632:2739609 [0] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736632:2739609 [0] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736632:2739609 [0] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736632:2739609 [0] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736632:2739609 [0] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736632:2739609 [0] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736632:2739609 [0] NCCL INFO CPU/3 (1/2/-1)
testpc115159:2736632:2739609 [0] NCCL INFO + PCI[24.0] - GPU/81000 (2)
testpc115159:2736632:2739609 [0] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736632:2739609 [0] NCCL INFO               + NVL[80.0] - GPU/C1000
testpc115159:2736632:2739609 [0] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736632:2739609 [0] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736632:2739609 [0] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736632:2739609 [0] NCCL INFO + SYS[16.0] - CPU/2
testpc115159:2736632:2739609 [0] NCCL INFO CPU/2 (1/2/-1)
testpc115159:2736632:2739609 [0] NCCL INFO + PCI[24.0] - GPU/C1000 (3)
testpc115159:2736632:2739609 [0] NCCL INFO               + NVL[80.0] - GPU/41000
testpc115159:2736632:2739609 [0] NCCL INFO               + NVL[80.0] - GPU/1000
testpc115159:2736632:2739609 [0] NCCL INFO               + NVL[80.0] - GPU/81000
testpc115159:2736632:2739609 [0] NCCL INFO + SYS[16.0] - CPU/1
testpc115159:2736632:2739609 [0] NCCL INFO + SYS[16.0] - CPU/0
testpc115159:2736632:2739609 [0] NCCL INFO + SYS[16.0] - CPU/3
testpc115159:2736632:2739609 [0] NCCL INFO + PCI[0.4] - NIC/E1000
testpc115159:2736632:2739609 [0] NCCL INFO ==========================================
testpc115159:2736632:2739609 [0] NCCL INFO GPU/1000 :GPU/1000 (0/5000.000000/LOC) GPU/41000 (1/80.000000/NVL) GPU/81000 (1/80.000000/NVL) GPU/C1000 (1/80.000000/NVL) CPU/1 (1/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736632:2739609 [0] NCCL INFO GPU/41000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (0/5000.000000/LOC) GPU/81000 (1/80.000000/NVL) GPU/C1000 (1/80.000000/NVL) CPU/1 (2/24.000000/PHB) CPU/0 (1/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736632:2739609 [0] NCCL INFO GPU/81000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (1/80.000000/NVL) GPU/81000 (0/5000.000000/LOC) GPU/C1000 (1/80.000000/NVL) CPU/1 (2/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (1/24.000000/PHB) CPU/2 (2/24.000000/PHB)
testpc115159:2736632:2739609 [0] NCCL INFO GPU/C1000 :GPU/1000 (1/80.000000/NVL) GPU/41000 (1/80.000000/NVL) GPU/81000 (1/80.000000/NVL) GPU/C1000 (0/5000.000000/LOC) CPU/1 (2/24.000000/PHB) CPU/0 (2/24.000000/PHB) CPU/3 (2/24.000000/PHB) CPU/2 (1/24.000000/PHB)
testpc115159:2736632:2739609 [0] NCCL INFO Setting affinity for GPU 0 to ffff,ff000000,00000000,0000ffff,ff000000
testpc115159:2736632:2739609 [0] NCCL INFO NVLS multicast support is not available on dev 0
testpc115159:2736632:2739609 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 16, bw 20.000000/20.000000, type NVL/PIX, sameChannels 0
testpc115159:2736632:2739609 [0] NCCL INFO  0 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736632:2739609 [0] NCCL INFO  1 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736632:2739609 [0] NCCL INFO  2 : GPU/0 GPU/2 GPU/3 GPU/1
testpc115159:2736632:2739609 [0] NCCL INFO  3 : GPU/1 GPU/3 GPU/0 GPU/2
testpc115159:2736632:2739609 [0] NCCL INFO  4 : GPU/0 GPU/2 GPU/3 GPU/1
testpc115159:2736632:2739609 [0] NCCL INFO  5 : GPU/2 GPU/3 GPU/1 GPU/0
testpc115159:2736632:2739609 [0] NCCL INFO  6 : GPU/1 GPU/3 GPU/2 GPU/0
testpc115159:2736632:2739609 [0] NCCL INFO  7 : GPU/2 GPU/1 GPU/0 GPU/3
testpc115159:2736632:2739609 [0] NCCL INFO  8 : GPU/2 GPU/0 GPU/3 GPU/1
testpc115159:2736632:2739609 [0] NCCL INFO  9 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736632:2739609 [0] NCCL INFO 10 : GPU/0 GPU/1 GPU/3 GPU/2
testpc115159:2736632:2739609 [0] NCCL INFO 11 : GPU/0 GPU/2 GPU/3 GPU/1
testpc115159:2736632:2739609 [0] NCCL INFO 12 : GPU/1 GPU/3 GPU/0 GPU/2
testpc115159:2736632:2739609 [0] NCCL INFO 13 : GPU/0 GPU/2 GPU/3 GPU/1
testpc115159:2736632:2739609 [0] NCCL INFO 14 : GPU/2 GPU/3 GPU/1 GPU/0
testpc115159:2736632:2739609 [0] NCCL INFO 15 : GPU/1 GPU/3 GPU/2 GPU/0
testpc115159:2736632:2739609 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 16, bw 40.000000/20.000000, type NVL/PIX, sameChannels 0
testpc115159:2736632:2739609 [0] NCCL INFO  0 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736632:2739609 [0] NCCL INFO  1 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736632:2739609 [0] NCCL INFO  2 : GPU/0 GPU/1 GPU/2 GPU/3
testpc115159:2736632:2739609 [0] NCCL INFO  3 : GPU/0 GPU/2 GPU/3 GPU/1
testpc115159:2736632:2739609 [0] NCCL INFO  4 : GPU/0 GPU/2 GPU/1 GPU/3
testpc115159:2736632:2739609 [0] NCCL INFO  5 : GPU/0 GPU/2 GPU/1 GPU/3
testpc115159:2736632:2739609 [0] NCCL INFO  6 : GPU/0 GPU/3 GPU/1 GPU/2
testpc115159:2736632:2739609 [0] NCCL INFO  7 : GPU/2 GPU/0 GPU/3 GPU/1
testpc115159:2736632:2739609 [0] NCCL INFO  8 : GPU/1 GPU/3 GPU/2 GPU/0
testpc115159:2736632:2739609 [0] NCCL INFO  9 : GPU/3 GPU/2 GPU/0 GPU/1
testpc115159:2736632:2739609 [0] NCCL INFO 10 : GPU/3 GPU/0 GPU/2 GPU/1
testpc115159:2736632:2739609 [0] NCCL INFO 11 : GPU/3 GPU/1 GPU/0 GPU/2
testpc115159:2736632:2739609 [0] NCCL INFO 12 : GPU/1 GPU/0 GPU/3 GPU/2
testpc115159:2736632:2739609 [0] NCCL INFO 13 : GPU/3 GPU/1 GPU/0 GPU/2
testpc115159:2736632:2739609 [0] NCCL INFO 14 : GPU/3 GPU/1 GPU/0 GPU/2
testpc115159:2736632:2739609 [0] NCCL INFO 15 : GPU/3 GPU/2 GPU/0 GPU/1
testpc115159:2736632:2739609 [0] NCCL INFO comm 0x55a0c69ac000 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
testpc115159:2736632:2739609 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1
testpc115159:2736632:2739609 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1
testpc115159:2736632:2739609 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1
testpc115159:2736632:2739609 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1
testpc115159:2736632:2739609 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1
testpc115159:2736632:2739609 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1
testpc115159:2736632:2739609 [0] NCCL INFO Tree 3 : -1 -> 0 -> 2/-1/-1
testpc115159:2736632:2739609 [0] NCCL INFO Tree 15 : -1 -> 0 -> 2/-1/-1
testpc115159:2736632:2739609 [0] NCCL INFO Tree 4 : -1 -> 0 -> 2/-1/-1
testpc115159:2736632:2739609 [0] NCCL INFO Tree 16 : -1 -> 0 -> 2/-1/-1
testpc115159:2736632:2739609 [0] NCCL INFO Tree 5 : -1 -> 0 -> 2/-1/-1
testpc115159:2736632:2739609 [0] NCCL INFO Tree 17 : -1 -> 0 -> 2/-1/-1
testpc115159:2736632:2739609 [0] NCCL INFO Tree 6 : -1 -> 0 -> 3/-1/-1
testpc115159:2736632:2739609 [0] NCCL INFO Tree 18 : -1 -> 0 -> 3/-1/-1
testpc115159:2736632:2739609 [0] NCCL INFO Tree 7 : 2 -> 0 -> 3/-1/-1
testpc115159:2736632:2739609 [0] NCCL INFO Tree 19 : 2 -> 0 -> 3/-1/-1
testpc115159:2736632:2739609 [0] NCCL INFO Tree 10 : 3 -> 0 -> 2/-1/-1
testpc115159:2736632:2739609 [0] NCCL INFO Tree 22 : 3 -> 0 -> 2/-1/-1
testpc115159:2736632:2739609 [0] NCCL INFO Channel 00/24 :    0   1   2   3
testpc115159:2736632:2739609 [0] NCCL INFO Channel 01/24 :    0   1   3  -1

testpc115159:2736632:2739609 [0] graph/rings.cc:51 NCCL WARN Error : ring 1 does not contain rank 2
testpc115159:2736632:2739609 [0] NCCL INFO graph/connect.cc:479 -> 3
testpc115159:2736632:2739609 [0] NCCL INFO init.cc:1169 -> 3
testpc115159:2736632:2739609 [0] NCCL INFO init.cc:1501 -> 3
testpc115159:2736632:2739609 [0] NCCL INFO group.cc:64 -> 3 [Async thread]
testpc115159:2736632:2736632 [0] NCCL INFO group.cc:418 -> 3
testpc115159:2736632:2736632 [0] NCCL INFO group.cc:95 -> 3
testpc115159:2736632:2736632 [0] NCCL INFO comm 0x55a0c69ac000 rank 0 nranks 4 cudaDev 0 busId 1000 - Abort COMPLETE
Running on 4 ranks using nccl backend
fail to enable all_to_all_single primitive: NCCL error in: /mnt/yuanningbai/pytorch-2.3.0/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, internal error - please report this issue to the NCCL developers, NCCL version 2.20.5
ncclInternalError: Internal check failed.
Last error:
Error : ring 2 does not loop back to start (2 != 0)
world size: 4, current rank: 0, local rank: 0
Using 1 GPU(s)...
-*-*-*-*-*-*nn.EmbeddingBag-*-*-*-*-*-*
-*-*-*-*-*-*nn.EmbeddingBag-*-*-*-*-*-*
testpc115159:2736632:2739631 [0] NCCL INFO comm 0x55a0bbb78800 rank 0 nranks 4 cudaDev 0 busId 1000 - Abort COMPLETE
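
For reference, the two warnings in the log above (`ring 2 does not loop back to start (2 != 0)` and `ring 1 does not contain rank 2`) describe consistency checks on a ring ordering. The Python below is only a rough sketch of what such checks look like, not NCCL's actual code: `build_ring` and the `next_rank` table are hypothetical names, and the sample table is chosen purely to reproduce the `0 2 1 0` ordering printed for Channel 02. Unlike the log, which stops at the first failure, this sketch collects every problem it finds.

```python
def build_ring(next_rank, start):
    """Walk the 'next' pointers from `start` for len(next_rank) steps.

    next_rank[i] is the rank that rank i sends to (hypothetical layout).
    Returns (ring, errors): the visited order and a list of problems found.
    """
    nranks = len(next_rank)
    ring = []
    current = start
    for _ in range(nranks):
        ring.append(current)
        current = next_rank[current]

    errors = []
    # After nranks hops a valid ring must arrive back at its starting rank.
    if current != start:
        errors.append(f"ring does not loop back to start ({current} != {start})")
    # Every rank must appear somewhere in the ring.
    for rank in range(nranks):
        if rank not in ring:
            errors.append(f"ring does not contain rank {rank}")
    return ring, errors


if __name__ == "__main__":
    # Illustrative table only: rank 3 points back to 0, but nobody points to 3.
    ring, errors = build_ring([2, 0, 1, 0], start=0)
    print(ring)
    for message in errors:
        print("Error :", message)
```

With a well-formed table such as `[1, 2, 3, 0]`, the same function returns the ordering `0 1 2 3` and an empty error list.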

