NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129 #426

Closed

NHZlX opened this issue Dec 1, 2020 · 13 comments

NHZlX commented Dec 1, 2020

OpenMPI 1.8.5
NCCL 2.8.3
CUDA 10.2
MLNX_OFED_LINUX-5.1-2.5.8.0

ibv_devinfo:

hca_id:	mlx5_bond_0
	transport:			InfiniBand (0)
	fw_ver:				16.28.2006
	node_guid:			0c42:a103:0023:ac92
	sys_image_guid:			0c42:a103:0023:ac92
	vendor_id:			0x02c9
	vendor_part_id:			4119
	hw_ver:				0x0
	board_id:			MT_0000000012
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		1024 (3)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet

This issue looks like https://github.com/NVIDIA/nccl/issues/214, but I have verified that ACS is not enabled on either of the nodes.

The following are the command and the error log:

mpirun --allow-run-as-root -np 2 --mca btl tcp,self --mca btl_tcp_if_exclude eth0 -host , -x CUDA_VISIBLE_DEVICES="0,2" -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x NCCL_IB_HCA=mlx5_bond_0:1 -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=0 /home/mingkun/enter/test/nccl-tests/build/all_reduce_perf -b 9 -e 128M -f 2 -g 1 -z 0

# nThread 1 nGpus 1 minBytes 9 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid  20109 on machine-17 device  0 [0x1a] Tesla V100-SXM2-32GB
#   Rank  1 Pid  70497 on machine-19 device  0 [0x1a] Tesla V100-SXM2-32GB
machine-17:20109:20109 [0] NCCL INFO Bootstrap : Using eth0:10.11.170.41<0>
machine-17:20109:20109 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
machine-17:20109:20109 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
machine-17:20109:20109 [0] NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE ; OOB eth0:10.11.170.41<0>
machine-17:20109:20109 [0] NCCL INFO Using network IB
machine-19:70497:70497 [0] NCCL INFO Bootstrap : Using eth0:10.11.170.43<0>
machine-19:70497:70497 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
machine-19:70497:70497 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
machine-19:70497:70497 [0] NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE ; OOB eth0:10.11.170.43<0>
machine-19:70497:70497 [0] NCCL INFO Using network IB
NCCL version 2.8.3+cuda10.2
machine-19:70497:70509 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
machine-19:70497:70509 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
machine-17:20109:20124 [0] NCCL INFO Channel 00/02 :    0   1
machine-17:20109:20124 [0] NCCL INFO Channel 01/02 :    0   1
machine-17:20109:20124 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
machine-17:20109:20124 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
machine-19:70497:70509 [0] NCCL INFO NCCL_SHM_DISABLE set by environment to 0.
machine-17:20109:20124 [0] NCCL INFO NCCL_SHM_DISABLE set by environment to 0.
machine-19:70497:70509 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1a000] [receive] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 00 : 1[1a000] -> 0[1a000] [receive] via NET/IB/0
machine-19:70497:70509 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[1a000] [receive] via NET/IB/0
machine-19:70497:70509 [0] NCCL INFO Channel 00 : 1[1a000] -> 0[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 01 : 1[1a000] -> 0[1a000] [receive] via NET/IB/0
machine-19:70497:70509 [0] NCCL INFO Channel 01 : 1[1a000] -> 0[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Connected all rings
machine-17:20109:20124 [0] NCCL INFO Connected all trees
machine-17:20109:20124 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
machine-17:20109:20124 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
machine-19:70497:70509 [0] NCCL INFO Connected all rings
machine-19:70497:70509 [0] NCCL INFO Connected all trees
machine-19:70497:70509 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
machine-19:70497:70509 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
machine-17:20109:20124 [0] NCCL INFO comm 0x468ee30 rank 0 nranks 2 cudaDev 0 busId 1a000 - Init COMPLETE
machine-19:70497:70509 [0] NCCL INFO comm 0x3cfcf70 rank 1 nranks 2 cudaDev 0 busId 1a000 - Init COMPLETE
#
#                                                     out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
machine-17:20109:20109 [0] NCCL INFO Launch mode Parallel

machine-19:70497:70516 [0] transport/net_ib.cc:839 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129
machine-19:70497:70516 [0] NCCL INFO include/net.h:28 -> 2
machine-19:70497:70516 [0] NCCL INFO transport/net.cc:404 -> 2
machine-19:70497:70516 [0] NCCL INFO proxy.cc:320 -> 2
machine-19:70497:70516 [0] NCCL INFO proxy.cc:367 -> 2 [Proxy Thread]

machine-17:20109:20140 [0] transport/net_ib.cc:839 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129
machine-17:20109:20140 [0] NCCL INFO include/net.h:28 -> 2
machine-17:20109:20140 [0] NCCL INFO transport/net.cc:404 -> 2
machine-17:20109:20140 [0] NCCL INFO proxy.cc:320 -> 2
machine-17:20109:20140 [0] NCCL INFO proxy.cc:367 -> 2 [Proxy Thread]
machine-19: Test NCCL failure common.cu:346 'unhandled system error'
 .. machine-19: Test failure common.cu:395
 .. machine-19: Test failure common.cu:494
 .. machine-19: Test failure all_reduce.cu:103
 .. machine-19: Test failure common.cu:520
 .. machine-19: Test failure common.cu:844
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[51283,1],1]
  Exit code:    3
@qianzhang613

-worker-1:946:988 [0] transport/net_ib.cc:839 NCCL WARN NET/IB : Got completion with error 12, opcode 1, len 0, vendor err 129
Same problem.

NHZlX (Author) commented Dec 3, 2020

Solved it by setting -x NCCL_IB_GID_INDEX=3.
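
For anyone searching later, this is just the original command from above with one more -x export added (host list omitted here as in the original command):

mpirun --allow-run-as-root -np 2 --mca btl tcp,self --mca btl_tcp_if_exclude eth0 \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    -x NCCL_IB_HCA=mlx5_bond_0:1 \
    -x NCCL_IB_GID_INDEX=3 \
    /home/mingkun/enter/test/nccl-tests/build/all_reduce_perf -b 9 -e 128M -f 2 -g 1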

NHZlX closed this as completed Dec 3, 2020
sjeaugey (Member) commented Dec 3, 2020

Sorry for having missed this. An error 12 is a timeout. When it happens right away, it usually means the NICs can't talk to each other using RoCE. The connection can be established because it's not done using RoCE, but then as soon as we start communicating through RoCE we get a timeout.

How to solve that is unfortunately often vendor-dependent. Switches can filter packets or fail to route them. I'd suggest running low-level RoCE tests first (OFED perftest) and making sure NCCL runs in the same conditions (GID index, traffic class, ...).
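
For example, a point-to-point RoCE bandwidth check with perftest could look like the following (a sketch using the device, port, and GID index from this thread's setup; adjust the values to your fabric):

# On one node, start the server:
ib_write_bw -d mlx5_bond_0 -i 1 -x 3 --report_gbits

# On the other node, connect to the server's IP:
ib_write_bw -d mlx5_bond_0 -i 1 -x 3 --report_gbits 10.11.170.41

If this already fails or hangs, the problem is in the fabric or the GID selection, not in NCCL.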

@tingweiwu

> Solved it by setting -x NCCL_IB_GID_INDEX=3.

@NHZlX Hi, what is the output of the show_gids command in your environment?

The NCCL_IB_GID_INDEX variable defines the Global ID index used in RoCE mode. See the InfiniBand show_gids command in order to set this value.

@ShoufaChen

NCCL_IB_GID_INDEX=3 solved my issue.

show_gids information:

DEV     PORT    INDEX   GID                                     IPv4            VER     DEV
---     ----    -----   ---                                     ------------    ---     ---
mlx5_2  1       0       fe80:0000:0000:0000:5054:00ff:fec2:7a7a                 v1      eth1
mlx5_2  1       1       fe80:0000:0000:0000:5054:00ff:fec2:7a7a                 v2      eth1
mlx5_2  1       2       0000:0000:0000:0000:0000:ffff:0bd8:591d 11.216.89.29    v1      eth1
mlx5_2  1       3       0000:0000:0000:0000:0000:ffff:0bd8:591d 11.216.89.29    v2      eth1
n_gids_found=4

Would you mind giving some explanation of NCCL_IB_GID_INDEX? Thanks.

sjeaugey (Member)

The GID index basically determines how IB packets are encapsulated over Ethernet or IP (v4 or v6). I'm no expert, but I think here GID 0 and 1 would use Ethernet, GID 2 and 3 would use IPv4, and if you had an IPv6 address configured on the interface you would have two more GID indexes. And each time you can choose between RoCEv1 and RoCEv2, which have different encapsulation and capabilities.

So choosing the GID index is key to whether packets can be routed through the fabric and to how QoS policies are applied to them, but all of this obviously depends on how the fabric is configured, which is outside the reach of NCCL.
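
If show_gids is not available, the same information can be read from sysfs (a minimal sketch using the device and port from the output above; unused GID slots read as all zeros and their type file may return an error):

# Print each GID of mlx5_2 port 1 together with its RoCE version:
for g in /sys/class/infiniband/mlx5_2/ports/1/gids/*; do
    idx=$(basename "$g")
    type=$(cat /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/"$idx" 2>/dev/null)
    echo "index $idx: $(cat "$g")  $type"
done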

corrtia commented Sep 4, 2022

> Solved it by setting -x NCCL_IB_GID_INDEX=3.

I use NCCL_IB_GID_INDEX=3 and still have this error.

@XiaoqingNLP

@NHZlX How did you solve this problem?

sjeaugey (Member)

There is no universal solution to this. Error 12 in IB terms is the same as "No route to host" with sockets.
It could be that you're not using the right interface (NCCL_IB_GID_INDEX), that your IP addressing is wrong, that the switch is down, or that the NICs are in different networks with no routing between them... it can be a lot of things. Basically it just says that two NICs could not talk to each other.
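
One NCCL-independent way to check that two NICs can reach each other over RDMA at all is rping from librdmacm-utils (a sketch; the address here is a placeholder, use the IP bound to the RDMA interface):

# Server side:
rping -s -a 10.11.170.41 -v -C 5

# Client side:
rping -c -a 10.11.170.41 -v -C 5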

@thelongestusernameofall

> NCCL_IB_GID_INDEX=3

export NCCL_IB_GID_INDEX=3 solved my problem. Thanks very much.

@Keep0828

It worked for me! Thanks!
I'm using DeepSpeed for DDP and hit this problem. I added the line NCCL_IB_GID_INDEX=3 to .deepspeed_env and the problem was solved!
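
For reference, .deepspeed_env is a plain line-delimited VAR=VALUE file (no export keyword) that the DeepSpeed launcher propagates to every node; a minimal sketch, where NCCL_DEBUG is just an illustrative extra and not required:

NCCL_IB_GID_INDEX=3
NCCL_DEBUG=INFO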

@haofanwang

Magic! Can anyone explain a bit here?

sjeaugey (Member)

With recent NCCL versions you should no longer need to set NCCL_IB_GID_INDEX=3, and doing so can actually work less well if the GID changes. So I would advise upgrading NCCL and removing that environment variable from your scripts.
