NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129 #426

Closed

NHZlX opened this issue Dec 1, 2020 · 13 comments

NHZlX commented Dec 1, 2020

OpenMPI 1.8.5
NCCL 2.8.3
CUDA 10.2
MLNX_OFED_LINUX-5.1-2.5.8.0

ibv_devinfo:

hca_id:	mlx5_bond_0
	transport:			InfiniBand (0)
	fw_ver:				16.28.2006
	node_guid:			0c42:a103:0023:ac92
	sys_image_guid:			0c42:a103:0023:ac92
	vendor_id:			0x02c9
	vendor_part_id:			4119
	hw_ver:				0x0
	board_id:			MT_0000000012
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		1024 (3)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet

This issue looks like https://github.com/NVIDIA/nccl/issues/214, but I have verified that ACS is not enabled on either of the nodes.

The following are the command and the error log:

mpirun --allow-run-as-root -np 2 --mca btl tcp,self --mca btl_tcp_if_exclude eth0 -host , -x CUDA_VISIBLE_DEVICES="0,2" -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x NCCL_IB_HCA=mlx5_bond_0:1 -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=0 /home/mingkun/enter/test/nccl-tests/build/all_reduce_perf -b 9 -e 128M -f 2 -g 1 -z 0

# nThread 1 nGpus 1 minBytes 9 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid  20109 on machine-17 device  0 [0x1a] Tesla V100-SXM2-32GB
#   Rank  1 Pid  70497 on machine-19 device  0 [0x1a] Tesla V100-SXM2-32GB
machine-17:20109:20109 [0] NCCL INFO Bootstrap : Using eth0:10.11.170.41<0>
machine-17:20109:20109 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
machine-17:20109:20109 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
machine-17:20109:20109 [0] NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE ; OOB eth0:10.11.170.41<0>
machine-17:20109:20109 [0] NCCL INFO Using network IB
machine-19:70497:70497 [0] NCCL INFO Bootstrap : Using eth0:10.11.170.43<0>
machine-19:70497:70497 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
machine-19:70497:70497 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
machine-19:70497:70497 [0] NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE ; OOB eth0:10.11.170.43<0>
machine-19:70497:70497 [0] NCCL INFO Using network IB
NCCL version 2.8.3+cuda10.2
machine-19:70497:70509 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
machine-19:70497:70509 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
machine-17:20109:20124 [0] NCCL INFO Channel 00/02 :    0   1
machine-17:20109:20124 [0] NCCL INFO Channel 01/02 :    0   1
machine-17:20109:20124 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
machine-17:20109:20124 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
machine-19:70497:70509 [0] NCCL INFO NCCL_SHM_DISABLE set by environment to 0.
machine-17:20109:20124 [0] NCCL INFO NCCL_SHM_DISABLE set by environment to 0.
machine-19:70497:70509 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1a000] [receive] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 00 : 1[1a000] -> 0[1a000] [receive] via NET/IB/0
machine-19:70497:70509 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[1a000] [receive] via NET/IB/0
machine-19:70497:70509 [0] NCCL INFO Channel 00 : 1[1a000] -> 0[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 01 : 1[1a000] -> 0[1a000] [receive] via NET/IB/0
machine-19:70497:70509 [0] NCCL INFO Channel 01 : 1[1a000] -> 0[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Connected all rings
machine-17:20109:20124 [0] NCCL INFO Connected all trees
machine-17:20109:20124 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
machine-17:20109:20124 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
machine-19:70497:70509 [0] NCCL INFO Connected all rings
machine-19:70497:70509 [0] NCCL INFO Connected all trees
machine-19:70497:70509 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
machine-19:70497:70509 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
machine-17:20109:20124 [0] NCCL INFO comm 0x468ee30 rank 0 nranks 2 cudaDev 0 busId 1a000 - Init COMPLETE
machine-19:70497:70509 [0] NCCL INFO comm 0x3cfcf70 rank 1 nranks 2 cudaDev 0 busId 1a000 - Init COMPLETE
#
#                                                     out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
machine-17:20109:20109 [0] NCCL INFO Launch mode Parallel

machine-19:70497:70516 [0] transport/net_ib.cc:839 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129
machine-19:70497:70516 [0] NCCL INFO include/net.h:28 -> 2
machine-19:70497:70516 [0] NCCL INFO transport/net.cc:404 -> 2
machine-19:70497:70516 [0] NCCL INFO proxy.cc:320 -> 2
machine-19:70497:70516 [0] NCCL INFO proxy.cc:367 -> 2 [Proxy Thread]

machine-17:20109:20140 [0] transport/net_ib.cc:839 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129
machine-17:20109:20140 [0] NCCL INFO include/net.h:28 -> 2
machine-17:20109:20140 [0] NCCL INFO transport/net.cc:404 -> 2
machine-17:20109:20140 [0] NCCL INFO proxy.cc:320 -> 2
machine-17:20109:20140 [0] NCCL INFO proxy.cc:367 -> 2 [Proxy Thread]
machine-19: Test NCCL failure common.cu:346 'unhandled system error'
 .. machine-19: Test failure common.cu:395
 .. machine-19: Test failure common.cu:494
 .. machine-19: Test failure all_reduce.cu:103
 .. machine-19: Test failure common.cu:520
 .. machine-19: Test failure common.cu:844
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[51283,1],1]
  Exit code:    3
@qianzhang613

-worker-1:946:988 [0] transport/net_ib.cc:839 NCCL WARN NET/IB : Got completion with error 12, opcode 1, len 0, vendor err 129
Same problem.

NHZlX (Author) commented Dec 3, 2020

Solved it by setting -x NCCL_IB_GID_INDEX=3.
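
For anyone searching later, this is just the original command from above with one more -x export added (host list omitted here as in the original command):

mpirun --allow-run-as-root -np 2 --mca btl tcp,self --mca btl_tcp_if_exclude eth0 \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    -x NCCL_IB_HCA=mlx5_bond_0:1 \
    -x NCCL_IB_GID_INDEX=3 \
    /home/mingkun/enter/test/nccl-tests/build/all_reduce_perf -b 9 -e 128M -f 2 -g 1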

NHZlX closed this as completed Dec 3, 2020
sjeaugey (Member) commented Dec 3, 2020

Sorry for having missed this. An error 12 is a timeout. When it happens right away, it usually means the NICs can't talk to each other using RoCE. The connection can be established because it's not done using RoCE, but then as soon as we start communicating through RoCE we get a timeout.

How to solve that is unfortunately often vendor-dependent. Switches can filter packets or fail to route them. I'd suggest running low-level RoCE tests first (OFED perftest) and making sure NCCL runs in the same conditions (GID index, traffic class, ...).
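
For example, a point-to-point RoCE bandwidth check with perftest could look like the following (a sketch using the device, port, and GID index from this thread's setup; adjust the values to your fabric):

# On one node, start the server:
ib_write_bw -d mlx5_bond_0 -i 1 -x 3 --report_gbits

# On the other node, connect to the server's IP:
ib_write_bw -d mlx5_bond_0 -i 1 -x 3 --report_gbits 10.11.170.41

If this already fails or hangs, the problem is in the fabric or the GID selection, not in NCCL.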

@tingweiwu

> Solved it by setting -x NCCL_IB_GID_INDEX=3.

@NHZlX Hi, what is the output of the show_gids command in your environment?

The NCCL_IB_GID_INDEX variable defines the Global ID index used in RoCE mode. See the InfiniBand show_gids command in order to set this value.

@ShoufaChen

NCCL_IB_GID_INDEX=3 solved my issue.

show_gids information:

DEV     PORT    INDEX   GID                                     IPv4            VER     DEV
---     ----    -----   ---                                     ------------    ---     ---
mlx5_2  1       0       fe80:0000:0000:0000:5054:00ff:fec2:7a7a                 v1      eth1
mlx5_2  1       1       fe80:0000:0000:0000:5054:00ff:fec2:7a7a                 v2      eth1
mlx5_2  1       2       0000:0000:0000:0000:0000:ffff:0bd8:591d 11.216.89.29    v1      eth1
mlx5_2  1       3       0000:0000:0000:0000:0000:ffff:0bd8:591d 11.216.89.29    v2      eth1
n_gids_found=4

Would you mind giving some explanation of NCCL_IB_GID_INDEX? Thanks.

sjeaugey (Member)

The GID index basically determines how IB packets are encapsulated over Ethernet or IP (v4 or v6). I'm no expert, but I think here GID 0 and 1 would use Ethernet, GID 2 and 3 would use IPv4, and if you had an IPv6 address configured on the interface you would have two more GID indexes. And each time you can choose between RoCEv1 and RoCEv2, which have different encapsulation and capabilities.

So choosing the GID index is key to whether packets can be routed through the fabric and to how QoS policies are applied to them, but all of this obviously depends on how the fabric is configured, which is outside the reach of NCCL.
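
If show_gids is not available, the same information can be read from sysfs (a minimal sketch using the device and port from the output above; unused GID slots read as all zeros and their type file may return an error):

# Print each GID of mlx5_2 port 1 together with its RoCE version:
for g in /sys/class/infiniband/mlx5_2/ports/1/gids/*; do
    idx=$(basename "$g")
    type=$(cat /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/"$idx" 2>/dev/null)
    echo "index $idx: $(cat "$g")  $type"
done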

corrtia commented Sep 4, 2022

> Solved it by setting -x NCCL_IB_GID_INDEX=3.

I use NCCL_IB_GID_INDEX=3 and still have this error.

@XiaoqingNLP

@NHZlX How did you solve this problem?

sjeaugey (Member)

There is no universal solution to this. Error 12 in IB terms is the same as "No route to host" with sockets.
It could be that you're not using the right interface (NCCL_IB_GID_INDEX), that your IP addressing is wrong, that the switch is down, or that the NICs are in different networks with no routing between them... it can be a lot of things. Basically it just says that two NICs could not talk to each other.
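
One NCCL-independent way to check that two NICs can reach each other over RDMA at all is rping from librdmacm-utils (a sketch; the address here is a placeholder, use the IP bound to the RDMA interface):

# Server side:
rping -s -a 10.11.170.41 -v -C 5

# Client side:
rping -c -a 10.11.170.41 -v -C 5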

@thelongestusernameofall

> NCCL_IB_GID_INDEX=3

export NCCL_IB_GID_INDEX=3 solved my problem. Thanks very much.

@Keep0828

It worked for me! Thanks!
I'm using DeepSpeed for DDP and hit this problem. I added the line NCCL_IB_GID_INDEX=3 to .deepspeed_env and the problem was solved!
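
For reference, .deepspeed_env is a plain line-delimited VAR=VALUE file (no export keyword) that the DeepSpeed launcher propagates to every node; a minimal sketch, where NCCL_DEBUG is just an illustrative extra and not required:

NCCL_IB_GID_INDEX=3
NCCL_DEBUG=INFO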

@haofanwang

Magic! Can anyone explain a bit here?

sjeaugey (Member)

With recent NCCL versions you should no longer need to set NCCL_IB_GID_INDEX=3, and doing so can actually work less well if the GID changes. So I would advise upgrading NCCL and removing that environment variable from your scripts.
