NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129 #426
worker-1:946:988 [0] transport/net_ib.cc:839 NCCL WARN NET/IB : Got completion with error 12, opcode 1, len 0, vendor err 129
Solved it by setting the NCCL_IB_GID_INDEX environment variable.
Sorry for having missed this. An error 12 is a timeout. When it happens right away, it usually means the NICs can't talk to each other using RoCE. The connection can be established because it's not done using RoCE, but then as soon as we start communicating through RoCE we get a timeout. How to solve that is unfortunately often vendor dependent. Switches can filter packets or fail to route them. I'd suggest running low-level RoCE tests first (OFED perftest) and making sure NCCL runs in the same conditions (GID index, traffic class, ...).
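To make that concrete, here is a sketch of such a test, assuming the HCA name mlx5_bond_0 from the command below and a GID index of 3; substitute the values for your own fabric (ibv_devinfo and show_gids will list them):

```
# Low-level RoCE check with OFED perftest, pinned to one HCA and GID index.
# Server node:
ib_write_bw -d mlx5_bond_0 -x 3 --report_gbits
# Client node (point at the server, same device and GID index):
ib_write_bw -d mlx5_bond_0 -x 3 --report_gbits <server-hostname>

# If the perftest passes, make NCCL run in the same conditions:
export NCCL_IB_GID_INDEX=3   # the GID index that worked in perftest
export NCCL_IB_TC=106        # traffic class, only if your fabric uses one
```

If ib_write_bw itself times out, the problem is in the fabric configuration rather than in NCCL.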
Would you mind giving some explanation about the GID index?
The GID index basically determines how IB packets are encapsulated over Ethernet or IP (v4 or v6). I'm no expert, but I think here GID 0 and 1 would use Ethernet, GID 2 and 3 would use IPv4, and if you had an IPv6 address configured on the interface you would have two more GID indexes. Each time you can choose between RoCEv1 and RoCEv2, which have different encapsulation and capabilities. So choosing the GID index is key to how packets can or cannot be routed through the fabric and how QoS policies are applied to them, but all of this obviously depends on how the fabric is configured, which is outside the reach of NCCL.
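As a sketch of what that table looks like on a host, you can read the GID entries and their RoCE versions from sysfs (device and port below are the ones from this issue; adjust for your HCA):

```
dev=mlx5_bond_0; port=1
# Each index has an address (gids/N) and a type (gid_attrs/types/N),
# e.g. "IB/RoCE v1" or "RoCE v2". MLNX_OFED also ships a show_gids script
# that prints the same table.
for i in 0 1 2 3; do
  gid=$(cat /sys/class/infiniband/$dev/ports/$port/gids/$i 2>/dev/null)
  typ=$(cat /sys/class/infiniband/$dev/ports/$port/gid_attrs/types/$i 2>/dev/null)
  echo "index $i: $gid ($typ)"
done
```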
I use NCCL_IB_GID_INDEX=3 as well.
@NHZlX How did you solve this problem?
There is no universal solution to this. Error 12 in IB terms is the same as "No route to host" with sockets. |
export NCCL_IB_GID_INDEX=3 solved my problem. Thanks very much. |
It worked for me! Thanks! |
What magic! Can anyone explain a little bit here?
With recent NCCL versions you should no longer need to set NCCL_IB_GID_INDEX=3, and doing so can actually work less well if the GID changes. So I would advise upgrading NCCL and removing that environment variable from your scripts in the future.
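A minimal way to check what an upgraded NCCL picks on its own (the nccl-tests binary path is an example):

```
unset NCCL_IB_GID_INDEX             # let NCCL choose the GID index itself
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET   # log the HCA/GID selection at startup
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```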
OpenMPI 1.8.5
NCCL 2.8.3
CUDA 10.2
MLNX_OFED_LINUX-5.1-2.5.8.0
ibv_devinfo:
This issue looks like https://github.com/NVIDIA/nccl/issues/214, but I have verified that ACS is not enabled on either of the nodes.
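For reference, one common way to verify that (a sketch; requires root, and bridge listings vary by machine):

```
# ACS control bits on PCI bridges; "SrcValid-" etc. means ACS is disabled,
# which is what GPU peer-to-peer traffic needs.
sudo lspci -vvv | grep -i acsctl
```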
The following are the command and the error log:
mpirun --allow-run-as-root -np 2 --mca btl tcp,self --mca btl_tcp_if_exclude eth0 -host , -x CUDA_VISIBLE_DEVICES="0,2" -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x NCCL_IB_HCA=mlx5_bond_0:1 -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=0 /home/mingkun/enter/test/nccl-tests/build/all_reduce_perf -b 9 -e 128M -f 2 -g 1 -z 0