
ibv_poll_cq error during all_reduce_perf with Open MPI running on two GPU servers #24

Closed
Keepmoving-ZXY opened this issue Jun 19, 2019 · 14 comments


@Keepmoving-ZXY

Today I ran all_reduce_perf with Open MPI and GPUDirect RDMA support. I have two GPU servers; each server has 4 NVIDIA V100 GPUs and a Mellanox MT27800 NIC, so the hardware definitely supports GPUDirect RDMA. The error message is:

gpu5:35917:35994 [0] transport/net_ib.cu:788 NCCL WARN NET/IB : Got completion with error 12, opcode 1, len 0, vendor err 129
gpu5:35917:35994 [0] NCCL INFO include/net.h:34 -> 2
gpu5:35917:35994 [0] NCCL INFO transport/net.cu:537 -> 2
gpu5:35917:35994 [0] NCCL INFO transport.cu:163 -> 2 [Proxy Thread]

gpu4:35902:35977 [0] transport/net_ib.cu:788 NCCL WARN NET/IB : Got completion with error 12, opcode 1, len 0, vendor err 129
gpu4:35902:35977 [0] NCCL INFO include/net.h:34 -> 2
gpu4:35902:35977 [0] NCCL INFO transport/net.cu:537 -> 2
gpu4:35902:35977 [0] NCCL INFO transport.cu:163 -> 2 [Proxy Thread]
gpu5:35917:35917 [0] NCCL INFO Destroyed comm 0x7f24c4001af0 rank 4
gpu4:35902:35902 [0] NCCL INFO Destroyed comm 0x7f1de8001af0 rank 0

gpu5:35917:35917 [1] init.cu:194 NCCL WARN Cuda failure 'an illegal memory access was encountered'
gpu5:35917:35917 [1] NCCL INFO init.cu:1143 -> 1
gpu5: Test NCCL failure common.cu:343 'unhandled cuda error'
 .. gpu5: Test failure common.cu:393
 .. gpu5: Test failure common.cu:492
 .. gpu5: Test failure all_reduce.cu:103
 .. gpu5: Test failure common.cu:518
 .. gpu5: Test failure common.cu:839

gpu4:35902:35902 [1] init.cu:194 NCCL WARN Cuda failure 'an illegal memory access was encountered'
gpu4:35902:35902 [1] NCCL INFO init.cu:1143 -> 1
gpu4: Test NCCL failure common.cu:343 'unhandled cuda error'
 .. gpu4: Test failure common.cu:393
 .. gpu4: Test failure common.cu:492
 .. gpu4: Test failure all_reduce.cu:103
 .. gpu4: Test failure common.cu:518
 .. gpu4: Test failure common.cu:839
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[36982,1],1]
  Exit code:    3
--------------------------------------------------------------------------

I checked line 788 in transport/net_ib.cu; it seems the work completion returned by ibv_poll_cq has a bad status.

The command to run all_reduce_perf:

/usr/local/openmpi/bin/mpirun -np 2 -host 10.0.21.2:1,10.0.23.4:1 \
	-x NCCL_SOCKET_IFNAME=rdma0 -x NCCL_DEBUG=INFO \
 	-x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=1 \
 	-mca orte_base_help_aggregate 0 \
	./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4 -c 0

I have searched for this error on Google and few people seem to hit it, so could you help me solve this problem?

@sjeaugey
Member

Error 12 is a timeout.

If it is intermittent, it might be that the network is congested, and the solution is to increase NCCL_IB_TIMEOUT.

But it usually happens right at the start of the test and just means the NICs are not able to communicate with each other. Making sure all NICs work fine using the ib_write_bw test with -d <nic> might help diagnose that.
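For reference, here is a minimal sketch of that check (the device name mlx5_2 and the IP 10.0.21.2 are the ones used elsewhere in this issue; adjust them to your setup, and note that --use_cuda requires perftest built with CUDA support):

# On one server (RoCE, hence -R), start the listener on the NIC under test:
ib_write_bw -d mlx5_2 -R
# On the other server, connect to the first server's RoCE interface IP:
ib_write_bw -d mlx5_2 -R 10.0.21.2
# Optionally repeat with --use_cuda=0 on both sides to exercise the
# GPU Direct RDMA path through GPU memory as well.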

Lastly, NCCL_IB_CUDA_SUPPORT=1 should not be set. It forces NCCL to use GPU Direct RDMA in all situations, even when it would result in lower performance. NCCL will use GPU Direct RDMA when it is available and when it would improve performance; you can check the logs with NCCL_DEBUG=INFO, there should be lines with using NET/IB/0/GDRDMA.
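For instance, a quick way to check that in this setup (a sketch based on the mpirun line from this issue) is to filter the transport-selection lines:

# "via NET/IB/0/GDRDMA" = IB transport with GPU Direct RDMA,
# "via NET/IB/0"        = IB transport without GDR,
# "via NET/Socket"      = plain TCP sockets.
/usr/local/openmpi/bin/mpirun -np 2 -host 10.0.21.2:1,10.0.23.4:1 \
	-x NCCL_SOCKET_IFNAME=rdma0 -x NCCL_DEBUG=INFO \
	./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4 -c 0 2>&1 | grep -E 'via (NET|P2P|direct)'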

@kwen2501

Also, NCCL_IB_CUDA_SUPPORT has been replaced by NCCL_NET_GDR_LEVEL since 2.4. See here
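A sketch of the newer knob (value semantics per the NCCL environment-variable documentation: 0 disables GPU Direct RDMA, larger values allow it across increasingly distant GPU-to-NIC PCI paths):

# Equivalent of the old NCCL_IB_CUDA_SUPPORT=0 (never use GDR):
export NCCL_NET_GDR_LEVEL=0
# Or pass it through mpirun like the other NCCL variables:
#   -x NCCL_NET_GDR_LEVEL=0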

@Keepmoving-ZXY
Author

Keepmoving-ZXY commented Jun 20, 2019

I ran ib_write_bw and it finished completely; the output is:

ubuntu@gpu4:~/nccl-tests/src$ ib_write_bw -d mlx5_2 -R 

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_2
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs	 : ON
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 Waiting for client rdma_cm QP to connect
 Please run the same command with the IB/RoCE interface IP
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0838 PSN 0x900570
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:00:21:02
 remote address: LID 0000 QPN 0x07fc PSN 0xd14453
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:00:23:04
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      5000             10783.15            10782.77		   0.172524
---------------------------------------------------------------------------------------
ubuntu@gpu5:~$ ib_write_bw -d mlx5_2 -R 10.0.21.2
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_2
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs	 : ON
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x07fc PSN 0xd14453
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:00:23:04
 remote address: LID 0000 QPN 0x0838 PSN 0x900570
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:00:21:02
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 800.288000 != 1547.256000. CPU Frequency is not max.
 65536      5000             10783.15            10782.77		   0.172524
---------------------------------------------------------------------------------------

I think the NIC link between the two servers is fine. When I forbid NCCL from using GDRDMA, all_reduce_perf runs successfully, but when GDRDMA is used the error is the same.
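Since the failure only shows up on the GDRDMA path, one additional thing I could check (a sketch; the module name depends on the driver/OFED version) is whether the peer-memory kernel module used for GPU Direct RDMA is loaded:

# Either nv_peer_mem (nvidia-peer-memory package) or nvidia_peermem (newer drivers)
# should show up here on both servers:
lsmod | grep -E 'nv_peer_mem|nvidia_peermem'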

@sjeaugey
Member

OK so you seem to imply this is GDR specific.

Is NCCL using GDRDMA by default (not setting NCCL_IB_CUDA_SUPPORT)? Any WARN in the log?

@Keepmoving-ZXY
Author

Does NCCL_IB_DISABLE=0 mean that IB and GDRDMA are forbidden? I added it to the mpirun arguments.

@Keepmoving-ZXY
Author

It seems that in the ib_write_bw output the GID index is 3, while the default GID index NCCL uses is 0, so I added NCCL_IB_GID_INDEX=3 to mpirun. The command is:

/usr/local/openmpi/bin/mpirun -np 2 -host 10.0.21.2:1,10.0.23.4:1 \
	-x NCCL_SOCKET_IFNAME=rdma0 -x NCCL_DEBUG=INFO \
 	-x NCCL_IB_DISABLE=0 \
 	-x NCCL_IB_HCA=mlx5_2:1 \
 	-x NCCL_IB_TIMEOUT=200 \
 	-x NCCL_IB_GID_INDEX=3 \
 	-mca orte_base_help_aggregate 0 \
	./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4 -c 0
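As an aside, a way to double-check which GID index carries the RoCE address (a sketch; the device mlx5_2 and port 1 match the ib_write_bw output above) is to read it from sysfs:

# List the GIDs on port 1 of mlx5_2; the RoCE entry for this interface shows
# up as an IPv4-mapped address such as ::ffff:10.0.21.2.
for g in /sys/class/infiniband/mlx5_2/ports/1/gids/*; do
    echo "$g: $(cat $g 2>/dev/null)"
done
# The RoCE version of a given index (here 3) is listed under gid_attrs:
cat /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/3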

With this command, a different error occurs:

#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
gpu4:5633:5633 [0] NCCL INFO Launch mode Group/CGMD
mlx5: gpu5: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000003 00000000 00000000 00000000
00000000 02005104 09000ac5 0000bfd2

gpu5:6252:6325 [0] transport/net_ib.cu:788 NCCL WARN NET/IB : Got completion with error 4, opcode 1, len 0, vendor err 81
gpu5:6252:6325 [0] NCCL INFO include/net.h:34 -> 2
gpu5:6252:6325 [0] NCCL INFO transport/net.cu:478 -> 2
gpu5:6252:6325 [0] NCCL INFO transport.cu:163 -> 2 [Proxy Thread]
mlx5: gpu4: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000009 00000000 00000000 00000000
00000000 02005104 09000b07 000076d2

gpu4:5633:5703 [0] transport/net_ib.cu:788 NCCL WARN NET/IB : Got completion with error 4, opcode 1, len 0, vendor err 81
gpu4:5633:5703 [0] NCCL INFO include/net.h:34 -> 2
gpu4:5633:5703 [0] NCCL INFO transport/net.cu:478 -> 2
gpu4:5633:5703 [0] NCCL INFO transport.cu:163 -> 2 [Proxy Thread]
gpu4:5633:5633 [0] NCCL INFO Destroyed comm 0x7f932c001af0 rank 0
gpu5:6252:6252 [0] NCCL INFO Destroyed comm 0x7fdd44001af0 rank 4

gpu4:5633:5633 [1] init.cu:194 NCCL WARN Cuda failure 'an illegal memory access was encountered'
gpu4:5633:5633 [1] NCCL INFO init.cu:1143 -> 1
gpu4: Test NCCL failure common.cu:343 'unhandled cuda error'
 .. gpu4: Test failure common.cu:393
 .. gpu4: Test failure common.cu:492
 .. gpu4: Test failure all_reduce.cu:103
 .. gpu4: Test failure common.cu:518
 .. gpu4: Test failure common.cu:839

gpu5:6252:6252 [1] init.cu:194 NCCL WARN Cuda failure 'an illegal memory access was encountered'
gpu5:6252:6252 [1] NCCL INFO init.cu:1143 -> 1
gpu5: Test NCCL failure common.cu:343 'unhandled cuda error'
 .. gpu5: Test failure common.cu:393
 .. gpu5: Test failure common.cu:492
 .. gpu5: Test failure all_reduce.cu:103
 .. gpu5: Test failure common.cu:518
 .. gpu5: Test failure common.cu:839

@sjeaugey
Member

Interesting ... this definitely looks like NVIDIA/nccl#214 where GDR is broken by an incorrect configuration of PCI switches (disabling ACS should fix it).
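For anyone who hits this later, a rough sketch of how to check for and clear ACS (the commands follow NVIDIA's GPUDirect troubleshooting guidance; the bus address below is a placeholder and the setpci write must be repeated after every reboot):

# Show which bridges/switch ports have Access Control Services enabled
# (any "SrcValid+" entries in ACSCtl mean ACS is redirecting P2P traffic):
sudo lspci -vvv | grep -i acsctl
# Clear the ACS control register on one such port, identified by its bus address
# (placeholder shown); requires a pciutils that knows the ECAP_ACS capability name:
sudo setpci -s 0000:3a:00.0 ECAP_ACS+0x6.w=0000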

@Keepmoving-ZXY
Author

OK, I will try it later.

@Keepmoving-ZXY
Author

Interesting ... this definitely looks like NVIDIA/nccl#214 where GDR is broken by an incorrect configuration of PCI switches (disabling ACS should fix it).

This fixed it, thank you very much for the help.

@Keepmoving-ZXY
Author

Following up on my earlier comment: I find that if I run the test on a GPU server without NVLink, this problem does not appear, while on the NVLink GPU server, changing the BIOS settings to disable ACS takes effect.

@sjeaugey
Member

Thanks for the feedback. Is there anything still not working?

@Keepmoving-ZXY
Author

No. I want to confirm that NCCL INFO Ring 00 : 7 -> 0 [receive] via NET/IB/0/GDRDMA in the message below:

gpu2:15416:15507 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
gpu2:15417:15513 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
gpu2:15415:15508 [0] NCCL INFO Ring 00 : 7 -> 0 [receive] via NET/IB/0/GDRDMA
gpu2:15418:15516 [3] NCCL INFO Ring 00 : 3 -> 4 [send] via NET/IB/0
gpu2:15415:15508 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
gpu3:16316:16407 [2] NCCL INFO Ring 00 : 6[2] -> 7[3] via P2P/IPC
gpu3:16315:16427 [1] NCCL INFO Ring 00 : 5[1] -> 6[2] via direct shared memory
gpu3:16314:16408 [0] NCCL INFO Ring 00 : 3 -> 4 [receive] via NET/IB/0/GDRDMA
gpu2:15417:15513 [2] NCCL INFO Ring 00 : 2[2] -> 1[1] via direct shared memory
gpu3:16317:16406 [3] NCCL INFO Ring 00 : 7 -> 0 [send] via NET/IB/0
gpu3:16314:16408 [0] NCCL INFO Ring 00 : 4[0] -> 5[1] via P2P/IPC

means messages are received over the RDMA network, not over TCP?

@sjeaugey
Member

NET/IB means we use RDMA and not TCP. Otherwise it would show NET/Socket.

GDRDMA only means data goes directly from the NIC to the GPU and vice versa, instead of going through CPU memory. It's kind of an implementation detail users should not have to worry about as NCCL should make the right decision on whether to use it or not.

If your NIC is connected to GPUs using a PCI switch, GDRDMA is important (and should be used), but if both are connected to the CPU, using GDRDMA is usually slightly slower so NCCL would not use it by default.
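A quick way to see which of the two cases applies on a given server (a sketch, not NCCL-specific) is the topology matrix from nvidia-smi, which also lists Mellanox NICs when they are present:

# Legend (printed by the tool): PIX/PXB mean GPU and NIC reach each other through
# PCIe switches/bridges (GDRDMA worthwhile), PHB/NODE/SYS mean the path crosses
# the CPU host bridge or the inter-socket link.
nvidia-smi topo -m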

@Keepmoving-ZXY
Author

OK, thank you.

This issue was closed.