[QST] Can UCX accelerate data transfer through PCIe? #5372
-
What is your question?
-
Hello @Matrix-World

UCX, and our RAPIDS Shuffle Manager, can work with PCIe. Within a single machine, for example, UCX employs CUDA IPC in order to perform a peer-to-peer transfer between GPUs. If you are curious whether your GPUs support peer-to-peer connectivity, you can try building and running the p2pBandwidthLatencyTest sample program that comes with the CUDA runtime (i.e. under samples/1_Utilities). Also, nvidia-smi topo -p2p r and nvidia-smi topo -p2p w should tell you whether reads/writes are supported.

Note, if the GPUs in question are connected via network, UCX supports RoCE/Infiniband in combination with nv_peer_mem, which allows peer-to-peer across the network (RDMA).

If you install UCX from https://github.com/openucx/ucx/releases/tag/v1.9.0, you should see a value in … Depending on the host architecture, there may be other limitations here. So feel free to try things and let us know what you find.
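If you'd rather check peer-to-peer support programmatically rather than via nvidia-smi, here is an illustrative sketch using CuPy; the library choice is an assumption on my part, it simply exposes the same cudaDeviceCanAccessPeer runtime call:

```python
# Sketch: check peer-to-peer capability between every pair of GPUs.
# Assumes CuPy is installed; cudaDeviceCanAccessPeer is exposed via cupy.cuda.runtime.
import cupy as cp

ngpus = cp.cuda.runtime.getDeviceCount()
for src in range(ngpus):
    for dst in range(ngpus):
        if src == dst:
            continue
        ok = cp.cuda.runtime.deviceCanAccessPeer(src, dst)
        print(f"GPU {src} -> GPU {dst}: peer access {'supported' if ok else 'NOT supported'}")
```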
-
Hi, how can I test the speed of PCIe without using UCX? Or how can I monitor its speed?
-
Hi @abellina,
Looking forward to your reply, thanks.
-
The performance difference you will see all depends on:
1. what the slowest link in the transfer is, and
2. how much congestion there is on that link.
The architecture of how your system is set up plays into both 1 and 2. So first, what is the slowest link in the transfer? https://en.wikipedia.org/wiki/NVLink has a good breakdown of transfer speeds for NVLink vs PCIe. Just remember that those are theoretical transfer rates, and to even get close to them you have to set up the GPU properly: it should be using all 16 PCIe lanes, it should have them to itself (no sharing), you have to be doing relatively large transfers, and you need pinned memory on the host when transferring between CPU memory and GPU memory. In practice, as @abellina said, we see between 10 and 12 GiB/s transfer rates on PCIe 3.x at the most. This equates to about 80 to 100 gigabit network speeds. We have less hands-on experience with PCIe 4.0 right now; its theoretical bandwidth is doubled, so I would expect to see closer to 20 to 24 GiB/s maximum. Disks also play a role here, as in many cases we have to spill to disk if we cannot hold all of the shuffle data in memory (and the regular Spark shuffle always spills to disk).
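To make the GiB/s vs. network-speed comparison concrete, here is a small back-of-the-envelope calculation added for illustration; it only uses the figures quoted above:

```python
# Convert the measured PCIe 3.0 x16 rates quoted above (10-12 GiB/s) into
# equivalent network line rates, and scale for PCIe 4.0 (roughly double).
GIB = 1024 ** 3   # bytes per GiB
GBIT = 1e9        # bits per gigabit (network convention)

for gib_per_s in (10, 12, 20, 24):
    gbit_per_s = gib_per_s * GIB * 8 / GBIT
    print(f"{gib_per_s} GiB/s  ~=  {gbit_per_s:.0f} Gbit/s network speed")
```

This prints roughly 86 and 103 Gbit/s for the PCIe 3.x numbers, matching the "80 to 100 gigabit" comparison above, and about 172 to 206 Gbit/s for the PCIe 4.0 estimates.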
This all gets to be really complicated really quickly. It is not a simple problem, even if you are the only one transferring the data. But then we have to add in the congestion on all the buses. PCIe can be a shared bus: the controller chips and CPUs typically have a maximum number of lanes that they can support, and as you add more devices, at some point they start to share lanes and you can get congestion. Some systems with lots of devices may even set up a PCIe tree, where not all of the GPUs, NICs, etc. are close to each other and data may have to travel between different PCIe complexes (branches in the tree). This can create points in the system that get overloaded more quickly than others, because traffic between the different complexes goes over these links. Your network switch may get overloaded, and your disk may get overloaded with reads/writes from other processes too. All of these can play a role in the performance of a transfer.

So how do you figure all of this out? Generally, we have found that if you are running on a fairly standard modern server setup, the UCX shuffle plugin makes little if any difference in performance. The PCIe bus just does not get overloaded enough to slow down the transfers, and we can overlap compression on the CPU with data processing on the GPU.

This is all under active development, and we are working on getting the UCX plugin to be a clear winner in all cases, and simpler to set up. But as it is right now (releases 0.2 and 0.3), unless you have one of the issues mentioned above, the added complexity to set it up is probably not worth it.
-
@revans2
-
@revans2's excellent overview summarizes the state of shuffle acceleration as a whole.
In order to test it without UCX, and involving the GPUs, I would use the p2pBandwidthLatencyTest sample mentioned earlier.
For a machine without NVLink (only PCIe) we may see a different story. It really depends on your system, but on the one I tested, latency was better with peer-to-peer while max bandwidth was not achievable (according to this test) without NVLink connectivity. That said, there are many other factors that would change the equation. Given an idle PCIe bus but a busy CPU, a data transfer approach that requires the CPU may be impacted more than one that doesn't involve the host.
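If you want a quick host-to-device number without UCX (complementing p2pBandwidthLatencyTest), here is an illustrative sketch using CuPy; the library choice and the 1 GiB buffer size are assumptions on my part, not something from this thread:

```python
# Sketch: measure host-to-device copy bandwidth over PCIe with CuPy, comparing
# pageable vs. pinned host memory. Assumes CuPy is installed and GPU 0 is the
# device under test; the 1 GiB buffer size is an arbitrary choice.
import time
import numpy as np
import cupy as cp

NBYTES = 1 << 30              # 1 GiB per transfer
N = NBYTES // 4               # number of float32 elements

def h2d_gib_per_s(host_array):
    dst = cp.empty(N, dtype=cp.float32)
    cp.cuda.Device(0).synchronize()
    start = time.perf_counter()
    dst.set(host_array)       # host -> device copy
    cp.cuda.Device(0).synchronize()
    return NBYTES / (time.perf_counter() - start) / (1024 ** 3)

pageable = np.ones(N, dtype=np.float32)

pinned_mem = cp.cuda.alloc_pinned_memory(NBYTES)
pinned = np.frombuffer(pinned_mem, dtype=np.float32, count=N)
pinned[:] = 1.0

print(f"pageable host memory: {h2d_gib_per_s(pageable):.1f} GiB/s")
print(f"pinned host memory:   {h2d_gib_per_s(pinned):.1f} GiB/s")
```

On a dedicated PCIe 3.0 x16 slot the pinned-memory number should land roughly in the 10-12 GiB/s range quoted earlier in this thread.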
If peer-to-peer access is enabled, we should not involve the CPU in the data path. The CPU is required to initiate the transfers, i.e. memory has to be allocated ahead of time for the RDMA/DMA transfer to occur.
@revans2 covered this one really well. One thing I'll build on is that yes, NVLink is used, and NVLink accounts for 1/2 the transfers in that scenario, so it plays a big role in the results. When you go over the network, the best you can hope for is PCIe speeds (v3 or v4), and the DGX environment does allow for massive bandwidth over the network as well, by having only 2 GPUs share a NIC that can go at PCIe speeds, with a largely independent PCIe "mini bus" (PLX switch -> GPU + NIC + GPU).
-
UCX is required to use NVLink in this scenario. So, GPU+UCX means we used NVLink. I did cover some of this in my previous reply. Let us know if you still have doubts, @Matrix-World.
Same machines? Yes. We will use all CPUs and NICs available in both cases, but in the UCX case NVLink is utilized, and RoCE is used instead of TCP for the networking. We also have the advantage that, since transfers are peer-to-peer, for the networking components specifically our network transfers are not coming from the CPU socket, but rather from that "mini bus" that only has 2 GPUs and a NIC. It greatly helps when you have a PCIe topology that allows packets to not saturate the rest of the bus.
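If you want to see how your own GPUs and NICs sit on the PCIe tree (PLX switches, CPU sockets, etc.), nvidia-smi topo -m prints the connectivity matrix; a small illustrative wrapper, assuming nvidia-smi is on your PATH:

```python
# Sketch: print the GPU/NIC connectivity matrix to inspect the PCIe topology.
# The legend at the bottom of the output explains the link types
# (e.g. NV# = NVLink, PIX/PXB = PCIe switches, SYS = across CPU sockets).
import subprocess

print(subprocess.run(
    ["nvidia-smi", "topo", "-m"],
    capture_output=True, text=True, check=True,
).stdout)
```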
-
@abellina
-
Yes, so it depends on the workload. The GPU results in the charts use CPU-based LZ4 compression; the GPU+UCX results do not, so we are shuffling more data than the CPU does. Do you know how big the shuffle was? And how much GPU memory do you have? You may consider trying our lz4 GPU codec to see if you get better results, but it really depends on the data you are sending. It may be worthwhile since you already have something set up given your previous comment. Note, this could very well be slower, but it's worth trying. Also, if you don't mind, can you add your spark-shell/spark-submit configs here so we can see how you are running?
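As a hedged sketch of what trying the GPU lz4 codec could look like (the spark.rapids.shuffle.compression.codec key name is an assumption on my part; confirm the exact option name and values against the plugin docs for the release you are running):

```python
# Hypothetical sketch: request the RAPIDS GPU lz4 shuffle compression codec from PySpark.
# The config key below is assumed from the spark-rapids docs; verify it for your release
# before relying on it.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rapids-shuffle-codec-test")
    .config("spark.rapids.shuffle.compression.codec", "lz4")  # assumed key; "none" would disable it
    .getOrCreate()
)
```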
-
@abellina, below is my submission script; you can see the parameter settings. In fact, there is another thing that I find very strange: can NVLink only be used through UCX? Does NVIDIA use UCX in all its products with NVLink?

#!/bin/bash
-
No, UCX is not required to use NVLink. We use UCX because it is a library that finds the best transport between two endpoints. The transport uses NVLink in the case where that's available, but this is not a decision by UCX specifically. UCX can also choose between RoCE/Infiniband and TCP when it realizes that two peers are not in the same machine, and they are adding more support given demand.

Thanks for sharing your command. It looks like it is missing our shuffle manager, however. Please refer to these docs describing how to enable it.
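For reference, a rough illustration of the kind of settings those docs describe. The RapidsShuffleManager class name is versioned per Spark release, so the value below is only a placeholder; copy the exact string (and any UCX-related settings) from the linked documentation:

```python
# Hypothetical sketch of enabling the RAPIDS Shuffle Manager from PySpark.
# The shuffle manager class name is version-specific; the one below is a placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rapids-ucx-shuffle")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    # Placeholder class name -- replace with the versioned one from the docs:
    .config("spark.shuffle.manager", "com.nvidia.spark.rapids.spark301.RapidsShuffleManager")
    .config("spark.shuffle.service.enabled", "false")  # the external shuffle service must be off
    .getOrCreate()
)
```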
-
@abellina @revans2