nccl test failed when using gdr #78

Open
wangshaochuang opened this issue Nov 2, 2020 · 2 comments


wangshaochuang commented Nov 2, 2020

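P2P and SHM transports are disabled (NCCL_P2P_LEVEL=LOC, NCCL_SHM_DISABLE=1) so that traffic between the two GPUs goes over the network: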

./all_reduce_perf -g 2
nThread 1 nGpus 2 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1

Using devices
Rank 0 Pid 85517 on k69a05298 device 0 [0x52] A100-SXM4-40GB
Rank 1 Pid 85517 on k69a05298 device 1 [0x57] A100-SXM4-40GB
k69a05298:85517:85517 [0] NCCL INFO Bootstrap : Using [0]bond0:100.82.131.167<0> [1]bond1:11.22.33.61<0> [2]bond2:11.22.33.62<0> [3]bond3:11.22.33.63<0> [4]bond4:11.22.33.64<0> [5]bond5:11.22.33.65<0> [6]bond6:11.22.33.66<0> [7]bond7:11.22.33.67<0> [8]bond8:11.22.33.68<0> [9]br-bb9003a7ecb2:192.168.10.1<0>
k69a05298:85517:85517 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
k69a05298:85517:85517 [0] NCCL INFO NET/IB : Using [0]mlx5_bond_8:1/RoCE [1]mlx5_bond_7:1/RoCE [2]mlx5_bond_6:1/RoCE [3]mlx5_bond_5:1/RoCE [4]mlx5_bond_4:1/RoCE [5]mlx5_bond_3:1/RoCE [6]mlx5_bond_2:1/RoCE [7]mlx5_bond_1:1/RoCE [8]mlx5_bond_0:1/RoCE ; OOB bond0:100.82.131.167<0>
k69a05298:85517:85517 [0] NCCL INFO Using network IB
NCCL version 2.7.8+cuda11.0
k69a05298:85517:85801 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
k69a05298:85517:85801 [0] NCCL INFO NCCL_SHM_DISABLE set by environment to 1.
k69a05298:85517:85802 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
k69a05298:85517:85801 [0] NCCL INFO Channel 00/02 : 0 1
k69a05298:85517:85801 [0] NCCL INFO Channel 01/02 : 0 1
k69a05298:85517:85802 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] 0/-1/-1->1->-1|-1->1->0/-1/-1
k69a05298:85517:85802 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff,ffffffff,ffffffff
k69a05298:85517:85801 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
k69a05298:85517:85801 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] -1/-1/-1->0->1|1->0->-1/-1/-1
k69a05298:85517:85801 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff,ffffffff
k69a05298:85517:85802 [1] NCCL INFO Channel 00 : 0[52000] -> 1[57000] [receive] via NET/IB/6/GDRDMA
k69a05298:85517:85801 [0] NCCL INFO Channel 00 : 1[57000] -> 0[52000] [receive] via NET/IB/7/GDRDMA
k69a05298:85517:85802 [1] NCCL INFO Channel 00 : 1[57000] -> 0[52000] [send] via NET/IB/6/GDRDMA
k69a05298:85517:85801 [0] NCCL INFO Channel 00 : 0[52000] -> 1[57000] [send] via NET/IB/7/GDRDMA
k69a05298:85517:85802 [1] NCCL INFO Channel 01 : 0[52000] -> 1[57000] [receive] via NET/IB/6/GDRDMA
k69a05298:85517:85801 [0] NCCL INFO Channel 01 : 1[57000] -> 0[52000] [receive] via NET/IB/7/GDRDMA
k69a05298:85517:85802 [1] NCCL INFO Channel 01 : 1[57000] -> 0[52000] [send] via NET/IB/6/GDRDMA
k69a05298:85517:85801 [0] NCCL INFO Channel 01 : 0[52000] -> 1[57000] [send] via NET/IB/7/GDRDMA
k69a05298:85517:85802 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
k69a05298:85517:85801 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
k69a05298:85517:85801 [0] NCCL INFO comm 0x7f6effc59800 rank 0 nranks 2 cudaDev 0 busId 52000 - Init COMPLETE
k69a05298:85517:85802 [1] NCCL INFO comm 0x7f6ef0000b60 rank 1 nranks 2 cudaDev 1 busId 57000 - Init COMPLETE

                                                 out-of-place                       in-place
   size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)

k69a05298:85517:85517 [0] NCCL INFO Launch mode Group/CGMD
mlx5: k69a05298.eu95sqa: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 02005104 080037a1 0000e4d2
mlx5: k69a05298.eu95sqa: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000003 00000000 00000000 00000000
00000000 02005104 08003c03 00004ed2

k69a05298:85517:85837 [0] transport/net_ib.cc:818 NCCL WARN NET/IB : Got completion with error 4, opcode 1, len 32622, vendor err 81
k69a05298:85517:85837 [0] NCCL INFO include/net.h:28 -> 2
k69a05298:85517:85837 [0] NCCL INFO transport/net.cc:310 -> 2
k69a05298:85517:85837 [0] NCCL INFO proxy.cc:198 -> 2 [Proxy Thread]

k69a05298:85517:85836 [0] transport/net_ib.cc:818 NCCL WARN NET/IB : Got completion with error 4, opcode 1, len 32622, vendor err 81
k69a05298:85517:85836 [0] NCCL INFO include/net.h:28 -> 2
k69a05298:85517:85836 [0] NCCL INFO transport/net.cc:310 -> 2
k69a05298:85517:85836 [0] NCCL INFO proxy.cc:198 -> 2 [Proxy Thread]

Environment: DGX-2
Hardware: A100 GPUs, Mellanox CX5 NIC
NVIDIA driver version: 450.51.05
CUDA version: 11.0
OFED version: 5.0
NCCL version: 2.7.8
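
For readers trying to interpret the `NCCL WARN NET/IB` line in the log above, here is a minimal sketch that maps the numeric fields to libibverbs names. It assumes NCCL prints the raw `status` and `opcode` fields of the `ibv_wc` work completion; the function name `decode_nccl_ib_warn` is made up for illustration and is not part of NCCL. The tables are only a subset of the enums in `verbs.h`.

```python
import re

# Subset of `enum ibv_wc_status` from libibverbs (verbs.h).
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
}

# Subset of `enum ibv_wc_opcode` from libibverbs (verbs.h).
IBV_WC_OPCODE = {
    0: "IBV_WC_SEND",
    1: "IBV_WC_RDMA_WRITE",
    2: "IBV_WC_RDMA_READ",
}

# Matches the warning text exactly as it appears in the log above.
WARN_RE = re.compile(
    r"Got completion with error (\d+), opcode (\d+), len (\d+), vendor err (\d+)"
)


def decode_nccl_ib_warn(line):
    """Hypothetical helper: decode one NCCL 'Got completion with error' line.

    Returns None if the line does not contain the warning.
    """
    match = WARN_RE.search(line)
    if match is None:
        return None
    status, opcode, length, vendor_err = (int(g) for g in match.groups())
    return {
        "status": IBV_WC_STATUS.get(status, "unknown ({})".format(status)),
        "opcode": IBV_WC_OPCODE.get(opcode, "unknown ({})".format(opcode)),
        "len": length,
        "vendor_err": vendor_err,
    }


line = ("k69a05298:85517:85837 [0] transport/net_ib.cc:818 NCCL WARN NET/IB : "
        "Got completion with error 4, opcode 1, len 32622, vendor err 81")
print(decode_nccl_ib_warn(line))
```

Under that assumption, error 4 corresponds to `IBV_WC_LOC_PROT_ERR` (a local protection error). The `ibv_poll_cq` documentation notes that when a completion fails, only `wr_id`, `status`, `qp_num`, and `vendor_err` are guaranteed valid, so the decoded opcode and length may not be meaningful.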

@OasisArtisan

I'm a user like you, but I had the same problem and solved it by disabling PCIe ACS.

I found this in NVIDIA/nccl#214, which seems to match your problem.

This forum post explains how to disable PCIe ACS:
https://forums.developer.nvidia.com/t/multi-gpu-peer-to-peer-access-failing-on-tesla-k80/39748/13

I'm not an expert, so take my suggestion with a grain of salt.
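
Before disabling ACS, it can help to check whether it is enabled at all. The sketch below scans the text produced by `sudo lspci -vvv` and lists devices whose `ACSCtl` line shows `SrcValid+`, which is the indicator NVIDIA's NCCL troubleshooting guide suggests looking for. The function name `find_acs_enabled` and the sample text are made up for illustration, and the parsing assumes the usual `lspci` layout: an unindented device header followed by indented capability lines.

```python
def find_acs_enabled(lspci_text):
    """Hypothetical helper: return PCI addresses whose ACSCtl shows SrcValid+.

    `lspci_text` is expected to be the output of `lspci -vvv`.
    """
    enabled = []
    current_device = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Unindented line: start of a new device, e.g. "00:01.0 PCI bridge: ..."
            current_device = line.split()[0]
        elif "ACSCtl:" in line and "SrcValid+" in line and current_device is not None:
            # ACSCtl is the control register; ACSCap only reports capability.
            enabled.append(current_device)
    return enabled


# Illustrative sample text, not taken from the machine in this issue.
sample = (
    "00:01.0 PCI bridge: Example bridge\n"
    "\tCapabilities: [148 v1] Access Control Services\n"
    "\t\tACSCap:\tSrcValid+ TransBlk+ ReqRedir+\n"
    "\t\tACSCtl:\tSrcValid+ TransBlk- ReqRedir+\n"
    "00:02.0 PCI bridge: Example bridge\n"
    "\tCapabilities: [148 v1] Access Control Services\n"
    "\t\tACSCap:\tSrcValid+ TransBlk+ ReqRedir+\n"
    "\t\tACSCtl:\tSrcValid- TransBlk- ReqRedir-\n"
)
print(find_acs_enabled(sample))
```

To use it on a real machine, capture the output of `sudo lspci -vvv` to a file and pass the file contents to the function. Any bridge it reports is a candidate for the ACS change described in the linked forum post.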
