
transport/net_ib.cu:788 NCCL WARN NET/IB : Got completion with error 4, opcode 1, len 21938, vendor err 81 - GPU Direct RDMA error when running NCCL-tests allreduce with nv_peer_mem #214

Closed
wavesj opened this issue May 3, 2019 · 8 comments

wavesj commented May 3, 2019

These errors occur when running the NCCL-tests allreduce bandwidth test with the nv_peer_mem kernel module loaded. When the nv_peer_mem kernel module is not loaded, the test completes with ~20 GB/s bandwidth (about half of what we should see if we enable GPU Direct RDMA). I have run it in two separate software environments and with two separate versions of NCCL. Hoping to find a solution to this.
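
A quick way to check which of the two cases you are in (standard module commands, sketched here for reference):

$ lsmod | grep nv_peer_mem        # loaded => NCCL can use GPU Direct RDMA
$ sudo modprobe nv_peer_mem       # load the module
$ sudo rmmod nv_peer_mem          # unload it to get the ~20 GB/s fallback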

Hardware Environment

Two nodes:

  • 8x NVIDIA V100 32 GB SXM2 Modules in NVLink
  • 4x Mellanox 100 Gb/s ConnectX-5 NICs with the latest firmware (16.24.1000)
  • Intel Skylake CPUs

Env 1 (Got completion with error 11, opcode 2, len 1048576, vendor err 137)

SW              Version
OS              Ubuntu 18.04
NCCL            2.3.7
NVIDIA Driver   410.48
CUDA            10.0
Linux Kernel    4.15.0-47-generic
OFED            MLNX_OFED_LINUX-4.5-1.0.1.0 (OFED-4.5-1.0.1)
OpenMPI         4.0.0
nv_peer_mem     1.0-8

Command

LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib64 mpirun --allow-run-as-root -np 16 --hostfile hostfile -mca btl_tcp_if_include enp129s0f1 -x NCCL_IB_HCA=mlx5_0,mlx5_2,mlx5_4,mlx5_6 -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO ./build/all_reduce_perf -b 16M -e 1G -f 2 -g 1 -t 1 -c 0 -w 10 -n 3

Error Excerpt:

node-8x-v100-nvlink-2:19258:19394 [2] transport/net_ib.cu:829 NCCL WARN NET/IB : Got completion with error 11, opcode 2, len 1048576, vendor err 137
node-8x-v100-nvlink-2:19258:19394 [2] NCCL INFO transport/net_ib.cu:805 -> 2
node-8x-v100-nvlink-2:19258:19394 [2] NCCL INFO include/net.h:29 -> 2
mlx5: node-8x-v100-nvlink-1: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008914 100000ca 00006fd2

node-8x-v100-nvlink-1:26175:26326 [0] transport/net_ib.cu:829 NCCL WARN NET/IB : Got completion with error 11, opcode 2, len 1048576, vendor err 137
node-8x-v100-nvlink-1:26175:26326 [0] NCCL INFO transport/net_ib.cu:805 -> 2
node-8x-v100-nvlink-1:26175:26326 [0] NCCL INFO include/net.h:29 -> 2

node-8x-v100-nvlink-2:19260:19396 [4] transport/net_ib.cu:829 NCCL WARN NET/IB : Got completion with error 5, opcode 2, len 32655, vendor err 249
node-8x-v100-nvlink-2:19260:19396 [4] NCCL INFO transport/net_ib.cu:805 -> 2
node-8x-v100-nvlink-2:19260:19396 [4] NCCL INFO include/net.h:29 -> 2
mlx5: node-8x-v100-nvlink-1: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008914 100000ca 00006fd2

node-8x-v100-nvlink-1:26177:26328 [2] transport/net_ib.cu:829 NCCL WARN NET/IB : Got completion with error 11, opcode 2, len 1048576, vendor err 137
node-8x-v100-nvlink-1:26177:26328 [2] NCCL INFO transport/net_ib.cu:805 -> 2
node-8x-v100-nvlink-1:26177:26328 [2] NCCL INFO include/net.h:29 -> 2

node-8x-v100-nvlink-1:26183:26324 [6] transport/net_ib.cu:829 NCCL WARN NET/IB : Got completion with error 5, opcode 2, len 32746, vendor err 249
node-8x-v100-nvlink-1:26183:26324 [6] NCCL INFO transport/net_ib.cu:805 -> 2
node-8x-v100-nvlink-1:26183:26324 [6] NCCL INFO include/net.h:29 -> 2

node-8x-v100-nvlink-2:19260:19396 [4] transport/net_ib.cu:829 NCCL WARN NET/IB : Got completion with error 5, opcode 2, len 32655, vendor err 249
node-8x-v100-nvlink-2:19260:19396 [4] NCCL INFO transport/net_ib.cu:805 -> 2
node-8x-v100-nvlink-2:19260:19396 [4] NCCL INFO include/net.h:29 -> 2

node-8x-v100-nvlink-2:19260:19396 [4] transport/net_ib.cu:829 NCCL WARN NET/IB : Got completion with error 5, opcode 2, len 0, vendor err 249
node-8x-v100-nvlink-2:19260:19396 [4] NCCL INFO transport/net_ib.cu:805 -> 2
node-8x-v100-nvlink-2:19260:19396 [4] NCCL INFO include/net.h:29 -> 2

Full Error Log:
https://gist.github.com/wavesj/258782634523281d238e99c4c4a79990

Env 2 (Got completion with error 4, opcode 1, len 21938, vendor err 81)

SW              Version
OS              Ubuntu 18.04
NCCL            2.4.2
NVIDIA Driver   418.40.04
CUDA            10.1
Linux Kernel    4.15.0-47-generic
OFED            MLNX_OFED_LINUX-4.5-1.0.1.0 (OFED-4.5-1.0.1)
OpenMPI         4.0.0
nv_peer_mem     1.0-8

Command

LD_LIBRARY_PATH=/usr/lib64 mpirun --allow-run-as-root -np 16 --hostfile hostfile -mca btl_tcp_if_include enp129s0f1 -x NCCL_IB_HCA=mlx5_0,mlx5_2,mlx5_4,mlx5_6 -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO ./build/all_reduce_perf -b 16M -e 1G -f 2 -g 1 -t 1 -c 0 -w 10 -n 3

Error Excerpt:

node-8x-v100-nvlink-1:39719:39719 [3] init.cu:1197 NCCL WARN Mismatched collective detected, please check your collective calls at and around rank 11. You can use NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=COLL to see the collective logs

node-8x-v100-nvlink-1:39727:39727 [7] init.cu:1197 NCCL WARN Mismatched collective detected, please check your collective calls at and around rank 15. You can use NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=COLL to see the collective logs

node-8x-v100-nvlink-1:39717:39717 [1] init.cu:1197 NCCL WARN Mismatched collective detected, please check your collective calls at and around rank 9. You can use NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=COLL to see the collective logs

node-8x-v100-nvlink-1:39721:39721 [5] init.cu:1197 NCCL WARN Mismatched collective detected, please check your collective calls at and around rank 13. You can use NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=COLL to see the collective logs
node-8x-v100-nvlink-2:17932:18078 [1] NCCL INFO comm 0x7fd694002360 rank 1 nranks 16 cudaDev 1 nvmlDev 1 - Init COMPLETE
node-8x-v100-nvlink-2:17947:18081 [7] NCCL INFO comm 0x7f1b44002360 rank 7 nranks 16 cudaDev 7 nvmlDev 7 - Init COMPLETE
node-8x-v100-nvlink-2:17931:18046 [0] NCCL INFO Trees [0] 2->0->3/-1/-1 [1] -1->0->2/8/-1 [2] 1->0->4/-1/-1 [3] 4->0->1/-1/-1 [4] 2->0->3/-1/-1 [5] 8->0->2/-1/-1 [6] 1->0->4/-1/-1 [7] 4->0->1/-1/-1
node-8x-v100-nvlink-2:17931:18046 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled for all sizes
node-8x-v100-nvlink-2:17943:18076 [6] NCCL INFO comm 0x7f67bc002360 rank 6 nranks 16 cudaDev 6 nvmlDev 6 - Init COMPLETE
node-8x-v100-nvlink-2:17936:18080 [3] NCCL INFO comm 0x7feb78002360 rank 3 nranks 16 cudaDev 3 nvmlDev 3 - Init COMPLETE
node-8x-v100-nvlink-2:17933:18077 [2] NCCL INFO comm 0x7f4e5c002360 rank 2 nranks 16 cudaDev 2 nvmlDev 2 - Init COMPLETE
node-8x-v100-nvlink-2:17931:18046 [0] NCCL INFO comm 0x7f5b40002360 rank 0 nranks 16 cudaDev 0 nvmlDev 0 - Init COMPLETE
node-8x-v100-nvlink-2:17939:18075 [4] NCCL INFO Ring 07 : 4 -> 12 [send] via NET/IB/3/GDRDMA
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
node-8x-v100-nvlink-2:17931:17931 [0] NCCL INFO Launch mode Parallel
node-8x-v100-nvlink-2:17939:18075 [4] NCCL INFO Ring 07 : 12 -> 4 [receive] via NET/IB/3/GDRDMA
node-8x-v100-nvlink-2:17939:18075 [4] NCCL INFO Trees [0] 5->4->7/-1/-1 [1] 7->4->6/-1/-1 [2] 0->4->5/-1/-1 [3] -1->4->0/12/-1 [4] 5->4->7/-1/-1 [5] 7->4->6/-1/-1 [6] 0->4->5/-1/-1 [7] 12->4->0/-1/-1
node-8x-v100-nvlink-2:17939:18075 [4] NCCL INFO comm 0x7f4378002360 rank 4 nranks 16 cudaDev 4 nvmlDev 4 - Init COMPLETE
mlx5: node-8x-v100-nvlink-2: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000007 00000000 00000000 00000000
00000000 02005104 090000b7 0000c3d2

node-8x-v100-nvlink-2:17939:18094 [0] transport/net_ib.cu:788 NCCL WARN NET/IB : Got completion with error 4, opcode 1, len 21938, vendor err 81
node-8x-v100-nvlink-2:17939:18094 [0] NCCL INFO include/net.h:34 -> 2
node-8x-v100-nvlink-2:17939:18094 [0] NCCL INFO transport/net.cu:478 -> 2
node-8x-v100-nvlink-2:17939:18094 [0] NCCL INFO transport.cu:163 -> 2 [Proxy Thread]

node-8x-v100-nvlink-2:17947:17947 [7] init.cu:1197 NCCL WARN Mismatched collective detected, please check your collective calls at and around rank 7. You can use NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=COLL to see the collective logs
node-8x-v100-nvlink-1:39719:39719 [3] NCCL INFO Destroyed comm 0x7f11a8002360 rank 11
node-8x-v100-nvlink-1: Test NCCL failure common.cu:345 'invalid usage'
 .. node-8x-v100-nvlink-1: Test failure common.cu:393
 .. node-8x-v100-nvlink-1: Test failure common.cu:492
 .. node-8x-v100-nvlink-1: Test failure all_reduce.cu:103
 .. node-8x-v100-nvlink-1: Test failure common.cu:518
 .. node-8x-v100-nvlink-1: Test failure common.cu:839
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
node-8x-v100-nvlink-1:39727:39727 [7] NCCL INFO Destroyed comm 0x7f7c40002360 rank 15
node-8x-v100-nvlink-1: Test NCCL failure common.cu:345 'invalid usage'
 .. node-8x-v100-nvlink-1: Test failure common.cu:393
 .. node-8x-v100-nvlink-1: Test failure common.cu:492
 .. node-8x-v100-nvlink-1: Test failure all_reduce.cu:103
 .. node-8x-v100-nvlink-1: Test failure common.cu:518
 .. node-8x-v100-nvlink-1: Test failure common.cu:839
node-8x-v100-nvlink-2:17947:17947 [7] NCCL INFO Destroyed comm 0x7f1b44002360 rank 7
node-8x-v100-nvlink-2: Test NCCL failure common.cu:345 'invalid usage'
 .. node-8x-v100-nvlink-2: Test failure common.cu:393
 .. node-8x-v100-nvlink-2: Test failure common.cu:492
 .. node-8x-v100-nvlink-2: Test failure all_reduce.cu:103
 .. node-8x-v100-nvlink-2: Test failure common.cu:518
 .. node-8x-v100-nvlink-2: Test failure common.cu:839
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Full Error Log:
https://gist.github.com/wavesj/c8be90e8716a6dc77cba1228bff53347

OpenMPI configuration (for those trying to reproduce)

Hostfile looks like this:

node-8x-v100-nvlink-1        slots=8
node-8x-v100-nvlink-2        slots=8

SSH config looks like this (~/.ssh/config):

Host node-8x-v100-nvlink-1
  HostName 10.1.10.15
  User root

Host node-8x-v100-nvlink-2
  HostName 10.1.10.169
  User root

Additional Notes

Mellanox said that vendor error 0x81 means the timeout and transport error counter was exceeded. RDMAMojo's description of the ibv_poll_cq function suggests that status code 4 corresponds to IBV_WC_LOC_PROT_ERR (local protection error).
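
The numeric status NCCL prints is the libibverbs ibv_wc_status enum, so the mapping can be double-checked against the installed header (path assumes the OFED/libibverbs development headers are present):

$ grep -n -A 20 'enum ibv_wc_status' /usr/include/infiniband/verbs.h

In that enum, status 4 is indeed IBV_WC_LOC_PROT_ERR; the statuses 5 and 11 from Env 1 resolve the same way (IBV_WC_WR_FLUSH_ERR and IBV_WC_REM_OP_ERR in my copy of the header).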

wavesj commented May 3, 2019

@sjeaugey should I cross-post this to nv_peer_mem or rdma-core?

sjeaugey commented May 3, 2019

No need to post it elsewhere for now.

I have a couple of questions:

  • Is this happening on bare metal or inside a container/virtual machine?
  • Is VT-d enabled on the CPU?
  • Can you explain why we're using mlx5_0/2/4/6 and not 0/1/2/3? Are those dual-port NICs, or did you duplicate interfaces somehow (e.g. with SR-IOV)?

wavesj commented May 3, 2019

Is this happening on bare metal or inside a container/virtual machine?

Both in a container and on bare metal.

Is VT-d enabled on the CPU?

It was enabled. After disabling it via the BIOS, Env 1 still shows the same error 11 and error 5, and Env 2 now shows that same error 11 + error 5 as well (instead of error 4), so both environments fail the same way.
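
(For anyone following along: after the BIOS change, the IOMMU state can be confirmed from the kernel side, roughly like this; the exact messages vary by kernel version.)

$ dmesg | grep -i -e DMAR -e IOMMU    # should no longer report the IOMMU as enabled
$ cat /proc/cmdline                   # check for intel_iommu= overrides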

mlx5: node-8x-v100-nvlink-2: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008914 10000090 000035d2
mlx5: node-8x-v100-nvlink-2: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008914 10000090 000035d2
node-8x-v100-nvlink-2:11267:11363 [1] NCCL INFO Ring 03 : 1[1] -> 3[3] via P2P/IPC
node-8x-v100-nvlink-2:11267:11363 [1] NCCL INFO comm 0x7fb96004a3f0 rank 1 nranks 16 - COMPLETE
node-8x-v100-nvlink-2:11269:11362 [3] NCCL INFO NET/IB: Dev 2 Port 1 qpn 142 mtu 5 LID 4
node-8x-v100-nvlink-2:11269:11362 [3] NCCL INFO Ring 03 : 3[3] -> 2[2] via P2P/IPC
node-8x-v100-nvlink-2:11269:11362 [3] NCCL INFO comm 0x7fddc804a3f0 rank 3 nranks 16 - COMPLETE
node-8x-v100-nvlink-2:11266:11328 [0] NCCL INFO Ring 02 : 0[0] -> 4[4] via P2P/IPC
node-8x-v100-nvlink-2:11266:11328 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/IPC
node-8x-v100-nvlink-2:11266:11328 [0] NCCL INFO comm 0x7f588004a3f0 rank 0 nranks 16 - COMPLETE
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
node-8x-v100-nvlink-2:11266:11266 [0] NCCL INFO Launch mode Parallel

node-8x-v100-nvlink-2:11266:11366 [0] transport/net_ib.cu:829 NCCL WARN NET/IB : Got completion with error 11, opcode 2, len 1048576, vendor err 137
node-8x-v100-nvlink-2:11266:11366 [0] NCCL INFO transport/net_ib.cu:805 -> 2
node-8x-v100-nvlink-2:11266:11366 [0] NCCL INFO include/net.h:29 -> 2

node-8x-v100-nvlink-2:11266:11366 [0] transport/net_ib.cu:829 NCCL WARN NET/IB : Got completion with error 5, opcode 2, len 32600, vendor err 249
node-8x-v100-nvlink-2:11266:11366 [0] NCCL INFO transport/net_ib.cu:805 -> 2
node-8x-v100-nvlink-2:11266:11366 [0] NCCL INFO include/net.h:29 -> 2

Can you explain why we're using mlx5_0/2/4/6 and not 0/1/2/3? Are those dual-port NICs, or did you duplicate interfaces somehow (e.g. with SR-IOV)?

Those are dual-port NICs. SR-IOV is not enabled.

sjeaugey commented May 3, 2019

Could you double-check that ACS is not enabled on either of the nodes?

sudo lspci -vvv | grep -i acsctl

wavesj commented May 3, 2019

Both nodes show this, but let me get the actual vendor names printed:

$ sudo lspci -vvv | grep -i acsctl
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
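
(To attribute each ACSCtl line to a device, a one-liner like this interleaves the lspci device headers with the ACS lines; just a sketch:)

$ sudo lspci -vvv | grep -i -e '^[0-9a-f]' -e acsctl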

sjeaugey commented May 3, 2019

It would be good to make sure all entries show SrcValid- (see https://devtalk.nvidia.com/default/topic/883054/cuda-programming-and-performance/multi-gpu-peer-to-peer-access-failing-on-tesla-k80-/2 and https://www.supermicro.com/support/faqs/faq.cfm?faq=20732 for examples).
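
For reference, clearing ACS at runtime looks roughly like this (a sketch: ECAP_ACS is the named extended capability in recent pciutils; older versions need the raw capability offset instead):

$ for BDF in $(lspci | awk '{print $1}'); do
>   # skip devices that do not expose the ACS capability
>   sudo lspci -s "$BDF" -vvv | grep -q ACSCtl || continue
>   # zero the ACS Control register (offset 6 in the ACS capability)
>   sudo setpci -s "$BDF" ECAP_ACS+0x6.w=0000
> done
$ sudo lspci -vvv | grep -i acsctl    # every line should now show SrcValid-

Note this does not persist across reboots, so it has to be re-applied at boot (or ACS disabled in BIOS/firmware) to stay fixed.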

wavesj commented May 3, 2019

This fixed it. Thank you, Sylvain.

wavesj closed this as completed May 3, 2019
kuenishi commented May 7, 2019

We hit the same issue, and @sjeaugey's advice fixed it for us too. Thanks for the tip, and thanks for reporting the issue, @wavesj.
