
Why is nv_peer_memory severely deteriorating all_reduce_perf result? #50

Open
EdwardZhang88 opened this issue Apr 16, 2019 · 2 comments

Comments


EdwardZhang88 commented Apr 16, 2019

I am running benchmarks with nccl-tests. I have 2 nodes connected via RoCE, and I have installed nv_peer_memory. However, once I turn on GPU Direct RDMA, the all_reduce_perf bandwidth gets dramatically worse than without GPU Direct RDMA. I am aware that the GPU PCIe topology matters, which is why I am only using GPU0 on both nodes, since GPU0 and the Mellanox HCA are connected to the same CPU.
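For reference, the kind of launch I mean looks roughly like the sketch below; the host names, the HCA name (mlx5_0), the GID index, and the binary path are placeholders, not my actual values:

```sh
# Hypothetical 2-node all_reduce_perf launch over RoCE, pinned to GPU0 on each node.
# node1/node2, mlx5_0 and the GID index are placeholders for this setup.
# NCCL_DEBUG=INFO makes NCCL print which transport it picks (and whether GDR is used).
mpirun -np 2 -H node1:1,node2:1 \
  -x CUDA_VISIBLE_DEVICES=0 \
  -x NCCL_DEBUG=INFO \
  -x NCCL_IB_HCA=mlx5_0 \
  -x NCCL_IB_GID_INDEX=3 \
  ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```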
The GPU topology is:
[Screenshot: GPU topology]
Without GPU Direct RDMA (just plain RoCE), GPU0 on node 1 <-> GPU0 on node 2:
[Screenshot: all_reduce_perf bandwidth without GPU Direct RDMA]

With GPU Direct RDMA over RoCE, GPU0 on node 1 <-> GPU0 on node 2:
[Screenshot: all_reduce_perf bandwidth with GPU Direct RDMA]

According to the suggested system support matrix, having a CPU in between the GPU and the Mellanox HCA will yield worse performance. But I never expected it to be this much worse.

At this point, I am wondering if there is any tool that can help debug nv_peer_mem to make sure it really takes effect. Or maybe there is something I misconfigured?
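Is something along these lines the right way to check it, assuming the standard nvidia_peer_memory packaging (a kernel module plus an init script)?

```sh
# Check that the nv_peer_mem kernel module is actually loaded
lsmod | grep nv_peer_mem

# The nvidia_peer_memory package also installs an init script
service nv_peer_mem status        # or: /etc/init.d/nv_peer_mem status

# Load the module by hand if it is missing
sudo modprobe nv_peer_mem
```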

Here are the details of my environment:
NVIDIA Tesla V100
CUDA 9.0
NCCL 2.2.13
OFED 4.2-1.2.0
Mellanox MT27710 ConnectX-4 Lx
nvidia_peer_memory 1.0-8

I notice that the log says 'No module present for GPU Direct RDMA'. When I check the module's status, this is what it looks like. Is this normal?
[Screenshot: nv_peer_mem status output]
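Would rerunning with NCCL_DEBUG=INFO and grepping the transport lines be enough to confirm it from NCCL's side? Something like this (the exact log wording presumably differs between NCCL versions):

```sh
# Re-run with NCCL debug logging and inspect which transport NCCL selects.
# When GPU Direct RDMA is in use, the IB/RoCE ring lines are tagged with GDRDMA;
# without it they show a plain NET/IB path (exact wording varies across NCCL versions).
mpirun -np 2 -H node1:1,node2:1 -x CUDA_VISIBLE_DEVICES=0 -x NCCL_DEBUG=INFO \
  ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1 2>&1 | tee nccl_debug.log
grep -i -E "NET/IB|GDR" nccl_debug.log
```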

@EdwardZhang88 (Author)

Even after I re-installed nv_peer_mem and the 'No module present for GPU Direct RDMA' message is gone, the performance still doesn't get any better in the GPU Direct RDMA case.
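For an A/B comparison I was thinking of flipping GPU Direct RDMA via NCCL's environment instead of reinstalling the module; I assume the knob for this NCCL generation is NCCL_IB_CUDA_SUPPORT (newer releases use NCCL_NET_GDR_LEVEL instead), so please correct me if that's wrong:

```sh
# A/B test: identical launch, only the GPU Direct RDMA knob changes.
# NCCL_IB_CUDA_SUPPORT is my assumption for NCCL 2.2.x; newer NCCL uses NCCL_NET_GDR_LEVEL.

# GDR disabled: NIC <-> pinned host buffer <-> GPU
mpirun -np 2 -H node1:1,node2:1 -x CUDA_VISIBLE_DEVICES=0 -x NCCL_DEBUG=INFO \
  -x NCCL_IB_CUDA_SUPPORT=0 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

# GDR enabled: NIC <-> GPU memory directly
mpirun -np 2 -H node1:1,node2:1 -x CUDA_VISIBLE_DEVICES=0 -x NCCL_DEBUG=INFO \
  -x NCCL_IB_CUDA_SUPPORT=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```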

@raph38130

See this post: https://devblogs.nvidia.com/benchmarking-gpudirect-rdma-on-modern-server-platforms/

An RDMA transfer from the NIC to GPU memory using GPUDirect is slower than an RDMA transfer from the NIC to pinned CPU memory followed by a cudaMemcpy from CPU memory to GPU memory.

This is a PCIe peer-to-peer (P2P) issue.

In my setup (ConnectX-5, Quadro P6000, RoCEv2) I get 97.4 Gb/s with the intermediate step in CPU memory, versus 71 Gb/s with GPUDirect.
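If it is specifically the read (send) direction through the CPU root complex that hurts, you can also try keeping GDR only on the write path; as far as I know NCCL exposes this via NCCL_NET_GDR_READ (behaviour may differ by NCCL version). Roughly:

```sh
# NCCL_NET_GDR_READ=0 stages sends through host memory while still allowing GPU Direct
# RDMA on the receive (write) path; =1 forces direct reads from GPU memory over PCIe.
# Handy for isolating whether P2P reads through the CPU are the bottleneck.
mpirun -np 2 -H node1:1,node2:1 -x CUDA_VISIBLE_DEVICES=0 -x NCCL_DEBUG=INFO \
  -x NCCL_NET_GDR_READ=0 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```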
