ibv_poll_cq error during all_reduce_perf with openmpi running on two GPU server #24
Error 12 is a timeout. If it is intermittent, the network might be congested and the solution is to increase the timeout. But it usually happens right at the start of the test and just means the NICs are not able to communicate with each other. Make sure all NICs can talk to each other using a low-level IB test first.
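For context, the numeric code in NCCL's `ibv_poll_cq` error message is an `ibv_wc_status` value from the verbs API. A minimal sketch that decodes the common codes (the values mirror `enum ibv_wc_status` in `<infiniband/verbs.h>`; `wc_status_name` is a helper written for this sketch, not part of NCCL):

```shell
# Map an ibv_wc_status code (the "error N" in NCCL's poll_cq message)
# to its verbs name. Values mirror enum ibv_wc_status in
# <infiniband/verbs.h>; only a few common codes are listed here.
wc_status_name() {
  case "$1" in
    0)  echo "IBV_WC_SUCCESS" ;;
    5)  echo "IBV_WC_WR_FLUSH_ERR" ;;
    10) echo "IBV_WC_REM_ACCESS_ERR" ;;
    12) echo "IBV_WC_RETRY_EXC_ERR (transport retry exceeded, i.e. timeout)" ;;
    13) echo "IBV_WC_RNR_RETRY_EXC_ERR" ;;
    *)  echo "other ibv_wc_status code: $1" ;;
  esac
}

wc_status_name 12
```

Code 12 is the "transport retry counter exceeded" status, which is why it reads as a timeout here.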
I ran the test. I think the NIC link between the two servers is fine, and when I forbid NCCL from using GDRDMA, the problem does not occur.
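One way to confirm the raw IB link is healthy before involving NCCL is a point-to-point bandwidth test; this is a command fragment to run manually on real hardware, not a self-contained script (it assumes the `perftest` package is installed, and `mlx5_0` / `serverA` are placeholders for your device and hostname):

```shell
# On server A (starts the server side of the test):
ib_write_bw -d mlx5_0
# On server B (connects to server A and reports bandwidth):
ib_write_bw -d mlx5_0 serverA

# If the raw link passes but NCCL still times out intermittently,
# the IB QP timeout can be raised (default is 18; each increment
# roughly doubles the timeout):
export NCCL_IB_TIMEOUT=22
```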
OK, so you seem to imply this is GDR specific. Is NCCL using GDRDMA by default (i.e. without setting NCCL_IB_CUDA_SUPPORT)?
It seems that in the output of the test another error occurs:
Interesting... this definitely looks like NVIDIA/nccl#214, where GDR is broken by an incorrect configuration of PCI switches (disabling ACS should fix it).
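For reference, ACS can also be inspected and disabled from Linux rather than only in the BIOS. This is a command fragment requiring root and real hardware, not a runnable script; the `03:00.0` address is a placeholder for the PCI switch downstream port, and the `ECAP_ACS+0x6` offset is the ACS control register recipe from NVIDIA's GPUDirect documentation:

```shell
# List devices with ACS enabled; "SrcValid+" (and friends) under
# ACSCtl means ACS is active on that device.
lspci -vvv | grep -i "ACSCtl"

# Disable ACS on a specific device (placeholder address 03:00.0):
setpci -s 03:00.0 ECAP_ACS+0x6.w=0000

# Note: this does not persist across reboots; a boot script or the
# BIOS setting is needed to make it permanent.
```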
OK, I will try it later.
This fixed it, thank you very much for the help.
I also find that if I run the test on a GPU server without NVLink, this problem doesn't appear; on an NVLink GPU server, changing the BIOS settings (disabling ACS) fixes it.
Thanks for the feedback. Is there anything still not working?
No, I just want to confirm that this means messages are received via the RDMA network, not the TCP network?
GDRDMA only means data goes directly from the NIC to the GPU and vice versa, instead of going through CPU memory. It's kind of an implementation detail users should not have to worry about, as NCCL should make the right decision on whether to use it or not. If your NIC is connected to GPUs using a PCI switch, GDRDMA is important (and should be used), but if both are connected to the CPU, using GDRDMA is usually slightly slower, so NCCL would not use it by default.
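The distance-based decision described above can be observed and tuned through NCCL's environment variables; a config fragment (in recent NCCL the variable is `NCCL_NET_GDR_LEVEL`; older versions call it `NCCL_IB_GDR_LEVEL`):

```shell
# Restrict GDRDMA to GPU/NIC pairs that share a single PCI switch
# (PIX); other levels widen the allowed distance (PXB, PHB, SYS)
# or disable GDR entirely (LOC).
export NCCL_NET_GDR_LEVEL=PIX

# With debug logging on, the chosen path appears in the test output
# in per-channel lines ending in e.g. "via NET/IB/0/GDRDMA".
export NCCL_DEBUG=INFO
```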
OK, thank you.
Today I ran all_reduce_perf with Open MPI and GPUDirect RDMA support. I have two GPU servers; each server has 4 NVIDIA V100 GPUs and a Mellanox MT27800 NIC, so the hardware definitely supports GPUDirect RDMA. The error message is:
I checked line 788 in transport/net_ib.cu; it seems the return value of ibv_poll_cq is bad. The command I used to run all_reduce_perf:
I have searched for this error on Google and few people seem to have hit it, so can you help me solve this problem?