How does the memory copy work between CPU and GPU? #688

Open
ihchoi12 opened this issue May 31, 2022 · 4 comments

Comments

@ihchoi12

Hello, I want to understand how the memory copy works between CPU and GPU.

  1. CPU -> GPU memory copy (e.g., the CPU moves data to the GPU) is triggered by cudaMemcpy() called by the CPU.
    From my Nsight Systems profiling, I found that no GPU kernel is launched for this, meaning it is handled purely by the CPU.
    I wonder how the CPU can write to GPU memory without GPU involvement. Does the CPU perform a PCIe memory write transaction for this?

  2. GPU -> CPU memory copy (e.g., the GPU moves gradients to the CPU to perform an inter-node Allreduce) is triggered by NCCL.
    I saw (in NCCL memcpy time #213) that the NCCL kernels perform store/load operations to host memory. Does that mean the GPU performs those operations directly on host memory? Does the GPU have direct access to host memory? Is it a PCIe memory write transaction as well, or is it a DMA operation?

Any comments much appreciated!
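
For reference, a minimal sketch of the two copy directions being asked about (buffer names are illustrative, error checking omitted):

#include <cuda_runtime.h>
#include <stdlib.h>

int main() {
  const size_t size = 1 << 20;
  float *h_buf = (float *)malloc(size);   // pageable host memory
  float *d_buf;
  cudaMalloc((void **)&d_buf, size);      // device memory

  cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);  // (1) CPU -> GPU
  cudaMemcpy(h_buf, d_buf, size, cudaMemcpyDeviceToHost);  // (2) GPU -> CPU

  cudaFree(d_buf);
  free(h_buf);
  return 0;
}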

@sjeaugey
Member

sjeaugey commented Jun 1, 2022

NCCL does not use cudaMemcpy aside from the initial setup (ncclCommInit*).
Most of the time, the memory shared by the CPU and GPU is in CPU memory; we either register it on the GPU using cudaHostRegister or allocate it with cudaHostAlloc. That way both the CPU and GPU can access that memory directly using loads/stores.
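
A minimal sketch of this pinned, mapped host-memory pattern (not NCCL's actual code; error checking omitted):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void gpu_store(volatile int *flag) {
  *flag = 42;   // the GPU stores directly into host memory
}

int main() {
  int *host_ptr, *dev_ptr;
  // Pinned, mapped host allocation; cudaHostRegister(ptr, size,
  // cudaHostRegisterMapped) would do the same for memory allocated elsewhere.
  cudaHostAlloc((void **)&host_ptr, sizeof(int), cudaHostAllocMapped);
  cudaHostGetDevicePointer((void **)&dev_ptr, host_ptr, 0);

  *host_ptr = 0;                 // CPU store to the shared memory
  gpu_store<<<1, 1>>>(dev_ptr);  // GPU store to the same memory
  cudaDeviceSynchronize();
  printf("flag = %d\n", *host_ptr);  // prints 42

  cudaFreeHost(host_ptr);
  return 0;
}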

If we have the GDRCopy module loaded, we can place some buffers in GPU memory, and the CPU will be able to access them directly using loads/stores.
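
A rough sketch of that path, following the GDRCopy sample API (not NCCL's code; error checking omitted):

#include <cuda_runtime.h>
#include <gdrapi.h>

int main() {
  const size_t size = 1 << 16;              // GPU pages are 64 KB
  char *d_buf;
  cudaMalloc((void **)&d_buf, size);

  gdr_t g = gdr_open();                     // talk to the gdrdrv kernel module
  gdr_mh_t mh;
  gdr_pin_buffer(g, (unsigned long)d_buf, size, 0, 0, &mh);

  void *cpu_map = NULL;
  gdr_map(g, mh, &cpu_map, size);           // CPU-visible mapping of GPU memory

  char pattern = 0x5a;
  gdr_copy_to_mapping(mh, cpu_map, &pattern, sizeof(pattern));  // CPU store into GPU memory

  gdr_unmap(g, mh, cpu_map, size);
  gdr_unpin_buffer(g, mh);
  gdr_close(g);
  cudaFree(d_buf);
  return 0;
}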

Regarding network communication, there is no need to move data from GPU to CPU if we have GPU Direct RDMA. The NIC can directly pull data from GPU memory and write data to the destination GPU memory as well. If GPU Direct RDMA is not present (or we don't want to use it), then the GPU will indeed write its data to CPU memory before it is sent.
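
As an illustration of that last point (a sketch, not NCCL's code): the NIC gets direct access to a buffer by registering it with the verbs layer, and with the nvidia-peermem (formerly nv_peer_mem) kernel module loaded, the same ibv_reg_mr() call also accepts a cudaMalloc'd pointer, which is what GPU Direct RDMA builds on.

#include <infiniband/verbs.h>
#include <stddef.h>

// 'pd' is an already-created protection domain; error handling omitted.
struct ibv_mr *register_buffer(struct ibv_pd *pd, void *buf, size_t size) {
  int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
               IBV_ACCESS_REMOTE_READ;
  // Works for host memory, and for device memory when GPU Direct RDMA is
  // available; otherwise the buffer must live in host (CPU) memory.
  return ibv_reg_mr(pd, buf, size, access);
}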

@guanbear

guanbear commented Jul 1, 2022

Hi @sjeaugey, is it possible to use the kernel's PCI Peer-to-Peer DMA support (https://www.kernel.org/doc/html/latest/driver-api/pci/p2pdma.html) if GPU Direct is not present?

Looking forward to your reply, thanks!

@sjeaugey
Member

sjeaugey commented Jul 1, 2022

GPU Direct relies on PCI peer-to-peer operations to work. I don't know whether the NVIDIA driver uses the functions mentioned in the page above, but it relies on the same hardware functionality (except when NVLink is present, in which case NVLink is preferred).

@MonroeD

MonroeD commented Nov 1, 2023

(Quoting @sjeaugey's reply above.)

I see in the code that the memory registered with IB is allocated with cudaHostAlloc (am I right?)

  // For each NCCL protocol (LL, LL128, Simple), register its buffer with the
  // network plugin. The pointer type passed to ncclNetRegMr depends on where
  // the buffer actually lives: NCCL_PTR_CUDA if it was mapped in device
  // memory, NCCL_PTR_HOST otherwise.
  for (int p=0; p<NCCL_NUM_PROTOCOLS; p++) {
    resources->buffers[p] = NCCL_NET_MAP_GET_POINTER(map, cpu, buffs[p]);
    if (resources->buffers[p]) {
      NCCLCHECK(ncclNetRegMr(resources->netSendComm, resources->buffers[p], resources->buffSizes[p],
                             NCCL_NET_MAP_DEV_MEM(map, buffs[p]) ? NCCL_PTR_CUDA : NCCL_PTR_HOST,
                             &resources->mhandles[p]));
    }
  }
