
ibv_poll_cq error during all_reduce_perf with Open MPI running on two GPU servers #24

Closed
Keepmoving-ZXY opened this issue Jun 19, 2019 · 14 comments


@Keepmoving-ZXY

Today I ran all_reduce_perf with Open MPI and GPUDirect RDMA support. I have two GPU servers; each server has 4 NVIDIA V100 GPUs and a Mellanox MT27800 NIC, so the hardware definitely supports GPUDirect RDMA. The error message is:

gpu5:35917:35994 [0] transport/net_ib.cu:788 NCCL WARN NET/IB : Got completion with error 12, opcode 1, len 0, vendor err 129
gpu5:35917:35994 [0] NCCL INFO include/net.h:34 -> 2
gpu5:35917:35994 [0] NCCL INFO transport/net.cu:537 -> 2
gpu5:35917:35994 [0] NCCL INFO transport.cu:163 -> 2 [Proxy Thread]

gpu4:35902:35977 [0] transport/net_ib.cu:788 NCCL WARN NET/IB : Got completion with error 12, opcode 1, len 0, vendor err 129
gpu4:35902:35977 [0] NCCL INFO include/net.h:34 -> 2
gpu4:35902:35977 [0] NCCL INFO transport/net.cu:537 -> 2
gpu4:35902:35977 [0] NCCL INFO transport.cu:163 -> 2 [Proxy Thread]
gpu5:35917:35917 [0] NCCL INFO Destroyed comm 0x7f24c4001af0 rank 4
gpu4:35902:35902 [0] NCCL INFO Destroyed comm 0x7f1de8001af0 rank 0

gpu5:35917:35917 [1] init.cu:194 NCCL WARN Cuda failure 'an illegal memory access was encountered'
gpu5:35917:35917 [1] NCCL INFO init.cu:1143 -> 1
gpu5: Test NCCL failure common.cu:343 'unhandled cuda error'
 .. gpu5: Test failure common.cu:393
 .. gpu5: Test failure common.cu:492
 .. gpu5: Test failure all_reduce.cu:103
 .. gpu5: Test failure common.cu:518
 .. gpu5: Test failure common.cu:839

gpu4:35902:35902 [1] init.cu:194 NCCL WARN Cuda failure 'an illegal memory access was encountered'
gpu4:35902:35902 [1] NCCL INFO init.cu:1143 -> 1
gpu4: Test NCCL failure common.cu:343 'unhandled cuda error'
 .. gpu4: Test failure common.cu:393
 .. gpu4: Test failure common.cu:492
 .. gpu4: Test failure all_reduce.cu:103
 .. gpu4: Test failure common.cu:518
 .. gpu4: Test failure common.cu:839
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[36982,1],1]
  Exit code:    3
--------------------------------------------------------------------------

I checked line 788 in transport/net_ib.cu; it seems the work completion returned by ibv_poll_cq has a bad status.

The command to run all_reduce_perf:

/usr/local/openmpi/bin/mpirun -np 2 -host 10.0.21.2:1,10.0.23.4:1 \
	-x NCCL_SOCKET_IFNAME=rdma0 -x NCCL_DEBUG=INFO \
 	-x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=1 \
 	-mca orte_base_help_aggregate 0 \
	./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4 -c 0

I have searched for this error on Google and few people seem to hit it, so could you help me solve this problem?

@sjeaugey
Member

Error 12 is a timeout.

If it is intermittent, it might be that the network is congested, and the solution is to increase NCCL_IB_TIMEOUT.

But it usually happens right at the start of the test and just means the NICs are not able to communicate with each other. Making sure all NICs work fine using the ib_write_bw test with -d <nic> might help diagnose that.
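For reference, here is a minimal sketch of that check (the device name mlx5_2 and the IP 10.0.21.2 are the ones used elsewhere in this issue; adjust them to your setup, and note that --use_cuda requires perftest built with CUDA support):

# On one server (RoCE, hence -R), start the listener on the NIC under test:
ib_write_bw -d mlx5_2 -R
# On the other server, connect to the first server's RoCE interface IP:
ib_write_bw -d mlx5_2 -R 10.0.21.2
# Optionally repeat with --use_cuda=0 on both sides to exercise the
# GPU Direct RDMA path through GPU memory as well.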

Lastly, NCCL_IB_CUDA_SUPPORT=1 should not be set. It forces NCCL to use GPU Direct RDMA in all situations, even when it would result in lower performance. NCCL will use GPU Direct RDMA when it is available and when it would improve performance; you can check the logs with NCCL_DEBUG=INFO, there should be lines with using NET/IB/0/GDRDMA.
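For instance, a quick way to check that in this setup (a sketch based on the mpirun line from this issue) is to filter the transport-selection lines:

# "via NET/IB/0/GDRDMA" = IB transport with GPU Direct RDMA,
# "via NET/IB/0"        = IB transport without GDR,
# "via NET/Socket"      = plain TCP sockets.
/usr/local/openmpi/bin/mpirun -np 2 -host 10.0.21.2:1,10.0.23.4:1 \
	-x NCCL_SOCKET_IFNAME=rdma0 -x NCCL_DEBUG=INFO \
	./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4 -c 0 2>&1 | grep -E 'via (NET|P2P|direct)'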

@kwen2501

Also, NCCL_IB_CUDA_SUPPORT has been replaced by NCCL_NET_GDR_LEVEL since 2.4. See here
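A sketch of the newer knob (value semantics per the NCCL environment-variable documentation: 0 disables GPU Direct RDMA, larger values allow it across increasingly distant GPU-to-NIC PCI paths):

# Equivalent of the old NCCL_IB_CUDA_SUPPORT=0 (never use GDR):
export NCCL_NET_GDR_LEVEL=0
# Or pass it through mpirun like the other NCCL variables:
#   -x NCCL_NET_GDR_LEVEL=0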

@Keepmoving-ZXY
Author

Keepmoving-ZXY commented Jun 20, 2019

I ran ib_write_bw and it finished completely; the output is:

ubuntu@gpu4:~/nccl-tests/src$ ib_write_bw -d mlx5_2 -R 

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_2
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs	 : ON
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 Waiting for client rdma_cm QP to connect
 Please run the same command with the IB/RoCE interface IP
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0838 PSN 0x900570
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:00:21:02
 remote address: LID 0000 QPN 0x07fc PSN 0xd14453
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:00:23:04
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      5000             10783.15            10782.77		   0.172524
---------------------------------------------------------------------------------------
ubuntu@gpu5:~$ ib_write_bw -d mlx5_2 -R 10.0.21.2
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_2
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs	 : ON
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x07fc PSN 0xd14453
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:00:23:04
 remote address: LID 0000 QPN 0x0838 PSN 0x900570
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:00:21:02
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 800.288000 != 1547.256000. CPU Frequency is not max.
 65536      5000             10783.15            10782.77		   0.172524
---------------------------------------------------------------------------------------

I think the NIC link between the two servers is fine. When I forbid NCCL from using GDRDMA, all_reduce_perf runs successfully, but when GDRDMA is used the error is the same.
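Since the failure only shows up on the GDRDMA path, one additional thing I could check (a sketch; the module name depends on the driver/OFED version) is whether the peer-memory kernel module used for GPU Direct RDMA is loaded:

# Either nv_peer_mem (nvidia-peer-memory package) or nvidia_peermem (newer drivers)
# should show up here on both servers:
lsmod | grep -E 'nv_peer_mem|nvidia_peermem'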

@sjeaugey
Member

OK so you seem to imply this is GDR specific.

Is NCCL using GDRDMA by default (not setting NCCL_IB_CUDA_SUPPORT)? Any WARN in the log?

@Keepmoving-ZXY
Author

Does NCCL_IB_DISABLE=0 mean that IB and GDRDMA are forbidden? I added it to the mpirun arguments.

@Keepmoving-ZXY
Author

It seems that in the ib_write_bw output the GID index is 3, while the default GID index NCCL uses is 0, so I added NCCL_IB_GID_INDEX=3 to mpirun. The command is:

/usr/local/openmpi/bin/mpirun -np 2 -host 10.0.21.2:1,10.0.23.4:1 \
	-x NCCL_SOCKET_IFNAME=rdma0 -x NCCL_DEBUG=INFO \
 	-x NCCL_IB_DISABLE=0 \
 	-x NCCL_IB_HCA=mlx5_2:1 \
 	-x NCCL_IB_TIMEOUT=200 \
 	-x NCCL_IB_GID_INDEX=3 \
 	-mca orte_base_help_aggregate 0 \
	./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4 -c 0
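As an aside, a way to double-check which GID index carries the RoCE address (a sketch; the device mlx5_2 and port 1 match the ib_write_bw output above) is to read it from sysfs:

# List the GIDs on port 1 of mlx5_2; the RoCE entry for this interface shows
# up as an IPv4-mapped address such as ::ffff:10.0.21.2.
for g in /sys/class/infiniband/mlx5_2/ports/1/gids/*; do
    echo "$g: $(cat $g 2>/dev/null)"
done
# The RoCE version of a given index (here 3) is listed under gid_attrs:
cat /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/3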

With this command, a different error occurs:

#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
gpu4:5633:5633 [0] NCCL INFO Launch mode Group/CGMD
mlx5: gpu5: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000003 00000000 00000000 00000000
00000000 02005104 09000ac5 0000bfd2

gpu5:6252:6325 [0] transport/net_ib.cu:788 NCCL WARN NET/IB : Got completion with error 4, opcode 1, len 0, vendor err 81
gpu5:6252:6325 [0] NCCL INFO include/net.h:34 -> 2
gpu5:6252:6325 [0] NCCL INFO transport/net.cu:478 -> 2
gpu5:6252:6325 [0] NCCL INFO transport.cu:163 -> 2 [Proxy Thread]
mlx5: gpu4: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000009 00000000 00000000 00000000
00000000 02005104 09000b07 000076d2

gpu4:5633:5703 [0] transport/net_ib.cu:788 NCCL WARN NET/IB : Got completion with error 4, opcode 1, len 0, vendor err 81
gpu4:5633:5703 [0] NCCL INFO include/net.h:34 -> 2
gpu4:5633:5703 [0] NCCL INFO transport/net.cu:478 -> 2
gpu4:5633:5703 [0] NCCL INFO transport.cu:163 -> 2 [Proxy Thread]
gpu4:5633:5633 [0] NCCL INFO Destroyed comm 0x7f932c001af0 rank 0
gpu5:6252:6252 [0] NCCL INFO Destroyed comm 0x7fdd44001af0 rank 4

gpu4:5633:5633 [1] init.cu:194 NCCL WARN Cuda failure 'an illegal memory access was encountered'
gpu4:5633:5633 [1] NCCL INFO init.cu:1143 -> 1
gpu4: Test NCCL failure common.cu:343 'unhandled cuda error'
 .. gpu4: Test failure common.cu:393
 .. gpu4: Test failure common.cu:492
 .. gpu4: Test failure all_reduce.cu:103
 .. gpu4: Test failure common.cu:518
 .. gpu4: Test failure common.cu:839

gpu5:6252:6252 [1] init.cu:194 NCCL WARN Cuda failure 'an illegal memory access was encountered'
gpu5:6252:6252 [1] NCCL INFO init.cu:1143 -> 1
gpu5: Test NCCL failure common.cu:343 'unhandled cuda error'
 .. gpu5: Test failure common.cu:393
 .. gpu5: Test failure common.cu:492
 .. gpu5: Test failure all_reduce.cu:103
 .. gpu5: Test failure common.cu:518
 .. gpu5: Test failure common.cu:839

@sjeaugey
Member

Interesting ... this definitely looks like NVIDIA/nccl#214 where GDR is broken by an incorrect configuration of PCI switches (disabling ACS should fix it).
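For anyone who hits this later, a rough sketch of how to check for and clear ACS (the commands follow NVIDIA's GPUDirect troubleshooting guidance; the bus address below is a placeholder and the setpci write must be repeated after every reboot):

# Show which bridges/switch ports have Access Control Services enabled
# (any "SrcValid+" entries in ACSCtl mean ACS is redirecting P2P traffic):
sudo lspci -vvv | grep -i acsctl
# Clear the ACS control register on one such port, identified by its bus address
# (placeholder shown); requires a pciutils that knows the ECAP_ACS capability name:
sudo setpci -s 0000:3a:00.0 ECAP_ACS+0x6.w=0000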

@Keepmoving-ZXY
Author

OK, I will try it later.

@Keepmoving-ZXY
Author

Interesting ... this definitely looks like NVIDIA/nccl#214 where GDR is broken by an incorrect configuration of PCI switches (disabling ACS should fix it).

This fixed it, thank you very much for the help.

@Keepmoving-ZXY
Author

Following up on my earlier comment: I find that if I run the test on a GPU server without NVLink, this problem does not appear, while on the NVLink GPU server, changing the BIOS settings to disable ACS takes effect.

@sjeaugey
Member

Thanks for the feedback. Is there anything still not working?

@Keepmoving-ZXY
Author

No. I want to confirm that NCCL INFO Ring 00 : 7 -> 0 [receive] via NET/IB/0/GDRDMA in the message below:

gpu2:15416:15507 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
gpu2:15417:15513 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
gpu2:15415:15508 [0] NCCL INFO Ring 00 : 7 -> 0 [receive] via NET/IB/0/GDRDMA
gpu2:15418:15516 [3] NCCL INFO Ring 00 : 3 -> 4 [send] via NET/IB/0
gpu2:15415:15508 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
gpu3:16316:16407 [2] NCCL INFO Ring 00 : 6[2] -> 7[3] via P2P/IPC
gpu3:16315:16427 [1] NCCL INFO Ring 00 : 5[1] -> 6[2] via direct shared memory
gpu3:16314:16408 [0] NCCL INFO Ring 00 : 3 -> 4 [receive] via NET/IB/0/GDRDMA
gpu2:15417:15513 [2] NCCL INFO Ring 00 : 2[2] -> 1[1] via direct shared memory
gpu3:16317:16406 [3] NCCL INFO Ring 00 : 7 -> 0 [send] via NET/IB/0
gpu3:16314:16408 [0] NCCL INFO Ring 00 : 4[0] -> 5[1] via P2P/IPC

means messages are received over the RDMA network, not over TCP?

@sjeaugey
Member

NET/IB means we use RDMA and not TCP. Otherwise it would show NET/Socket.

GDRDMA only means data goes directly from the NIC to the GPU and vice versa, instead of going through CPU memory. It's kind of an implementation detail users should not have to worry about as NCCL should make the right decision on whether to use it or not.

If your NIC is connected to GPUs using a PCI switch, GDRDMA is important (and should be used), but if both are connected to the CPU, using GDRDMA is usually slightly slower so NCCL would not use it by default.
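A quick way to see which of the two cases applies on a given server (a sketch, not NCCL-specific) is the topology matrix from nvidia-smi, which also lists Mellanox NICs when they are present:

# Legend (printed by the tool): PIX/PXB mean GPU and NIC reach each other through
# PCIe switches/bridges (GDRDMA worthwhile), PHB/NODE/SYS mean the path crosses
# the CPU host bridge or the inter-socket link.
nvidia-smi topo -m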

@Keepmoving-ZXY
Author

OK, thank you.

This issue was closed.