[QST] Can UCX accelerate data transfer through PCIe? #5372
-
What is your question?
-
Hello @Matrix-World

UCX, and our RAPIDS Shuffle Manager, can work with PCIe. Within a single machine, for example, UCX employs CUDA IPC in order to perform a peer-to-peer transfer between GPUs. If you are curious whether your GPUs support peer-to-peer connectivity, you can try building and running the p2pBandwidthLatencyTest sample program that comes with the CUDA runtime (i.e. under samples/1_Utilities). Also, nvidia-smi topo -p2p r and nvidia-smi topo -p2p w should tell you whether reads/writes are supported.

Note, if the GPUs in question are connected via network, UCX supports RoCE/Infiniband in combination with nv_peer_mem, which allows peer-to-peer across the network (RDMA).

If you install UCX from https://github.com/openucx/ucx/releases/tag/v1.9.0, you should see a value in … Depending on the host architecture, there may be other limitations here. So feel free to try things and let us know what you find.
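If you'd rather check peer-to-peer support programmatically rather than via nvidia-smi, here is an illustrative sketch using CuPy; the library choice is an assumption on my part, it simply exposes the same cudaDeviceCanAccessPeer runtime call:

```python
# Sketch: check peer-to-peer capability between every pair of GPUs.
# Assumes CuPy is installed; cudaDeviceCanAccessPeer is exposed via cupy.cuda.runtime.
import cupy as cp

ngpus = cp.cuda.runtime.getDeviceCount()
for src in range(ngpus):
    for dst in range(ngpus):
        if src == dst:
            continue
        ok = cp.cuda.runtime.deviceCanAccessPeer(src, dst)
        print(f"GPU {src} -> GPU {dst}: peer access {'supported' if ok else 'NOT supported'}")
```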
-
Hi, how can I test the speed of PCIe without using UCX? Or how can I monitor its speed?
-
Hi @abellina,
Looking forward to your reply, thanks.
-
The performance difference you will see all depends on:
1. what the slowest link in the transfer is, and
2. how much congestion there is on that link.
The architecture of how your system is set up plays into both 1 and 2. So first, what is the slowest link in the transfer? https://en.wikipedia.org/wiki/NVLink has a good breakdown of transfer speeds for NVLink vs PCIe. Just remember that those are theoretical transfer rates, and to even get close to them you have to set up the GPU properly: it should be using all 16 PCIe lanes, it should have them to itself (no sharing), you have to be doing relatively large transfers, and you need pinned memory on the host when transferring between CPU memory and GPU memory. In practice, as @abellina said, we see between 10 and 12 GiB/s transfer rates on PCIe 3.x at the most. This equates to about 80 to 100 gigabit network speeds. We have less hands-on experience with PCIe 4.0 right now; its theoretical bandwidth is doubled, so I would expect to see closer to 20 to 24 GiB/s maximum. Disks also play a role here, as in many cases we have to spill to disk if we cannot hold all of the shuffle data in memory (and the regular Spark shuffle always spills to disk).
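To make the GiB/s vs. network-speed comparison concrete, here is a small back-of-the-envelope calculation added for illustration; it only uses the figures quoted above:

```python
# Convert the measured PCIe 3.0 x16 rates quoted above (10-12 GiB/s) into
# equivalent network line rates, and scale for PCIe 4.0 (roughly double).
GIB = 1024 ** 3   # bytes per GiB
GBIT = 1e9        # bits per gigabit (network convention)

for gib_per_s in (10, 12, 20, 24):
    gbit_per_s = gib_per_s * GIB * 8 / GBIT
    print(f"{gib_per_s} GiB/s  ~=  {gbit_per_s:.0f} Gbit/s network speed")
```

This prints roughly 86 and 103 Gbit/s for the PCIe 3.x numbers, matching the "80 to 100 gigabit" comparison above, and about 172 to 206 Gbit/s for the PCIe 4.0 estimates.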
This all gets to be really complicated really quickly. It is not a simple problem, even if you are the only one transferring the data. But then we have to add in the congestion on all the buses. PCIe can be a shared bus: the controller chips and CPUs typically have a maximum number of lanes that they can support, and as you add more devices, at some point they start to share lanes and you can get congestion. Some systems with lots of devices may even set up a PCIe tree, where not all of the GPUs, NICs, etc. are close to each other and data may have to travel between different PCIe complexes (branches in the tree). This can create points in the system that get overloaded more quickly than others, because traffic between the different complexes goes over these links. Your network switch may get overloaded, and your disk may get overloaded with reads/writes from other processes too. All of these can play a role in the performance of a transfer.

So how do you figure all of this out? Generally, we have found that if you are running on a fairly standard modern server setup, the UCX shuffle plugin makes little if any difference in performance. The PCIe bus just does not get overloaded enough to slow down the transfers, and we can overlap compression on the CPU with data processing on the GPU.

This is all under active development, and we are working on getting the UCX plugin to be a clear winner in all cases, and simpler to set up. But as it is right now (releases 0.2 and 0.3), unless you have one of the issues mentioned above, the added complexity to set it up is probably not worth it.
-
@revans2
-
@revans2's excellent overview summarizes the state of shuffle acceleration as a whole.
In order to test it without UCX, and involving the GPUs, I would use the p2pBandwidthLatencyTest sample mentioned earlier.
For a machine without NVLink (only PCIe) we may see a different story. It really depends on your system, but on the one I tested, latency was better with peer-to-peer while max bandwidth was not achievable (according to this test) without NVLink connectivity. That said, there are many other factors that would change the equation. Given an idle PCIe bus but a busy CPU, a data transfer approach that requires the CPU may be impacted more than one that doesn't involve the host.
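If you want a quick host-to-device number without UCX (complementing p2pBandwidthLatencyTest), here is an illustrative sketch using CuPy; the library choice and the 1 GiB buffer size are assumptions on my part, not something from this thread:

```python
# Sketch: measure host-to-device copy bandwidth over PCIe with CuPy, comparing
# pageable vs. pinned host memory. Assumes CuPy is installed and GPU 0 is the
# device under test; the 1 GiB buffer size is an arbitrary choice.
import time
import numpy as np
import cupy as cp

NBYTES = 1 << 30              # 1 GiB per transfer
N = NBYTES // 4               # number of float32 elements

def h2d_gib_per_s(host_array):
    dst = cp.empty(N, dtype=cp.float32)
    cp.cuda.Device(0).synchronize()
    start = time.perf_counter()
    dst.set(host_array)       # host -> device copy
    cp.cuda.Device(0).synchronize()
    return NBYTES / (time.perf_counter() - start) / (1024 ** 3)

pageable = np.ones(N, dtype=np.float32)

pinned_mem = cp.cuda.alloc_pinned_memory(NBYTES)
pinned = np.frombuffer(pinned_mem, dtype=np.float32, count=N)
pinned[:] = 1.0

print(f"pageable host memory: {h2d_gib_per_s(pageable):.1f} GiB/s")
print(f"pinned host memory:   {h2d_gib_per_s(pinned):.1f} GiB/s")
```

On a dedicated PCIe 3.0 x16 slot the pinned-memory number should land roughly in the 10-12 GiB/s range quoted earlier in this thread.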
If peer-to-peer access is enabled, we should not involve the CPU in the data path. The CPU is required to initiate the transfers, i.e. memory has to be allocated ahead of time for the RDMA/DMA transfer to occur.
@revans2 covered this one really well. One thing I'll build on is that yes, NVLink is used, and NVLink accounts for 1/2 the transfers in that scenario, so it plays a big role in the results. When you go over the network, the best you can hope for is PCIe speeds (v3 or v4), and the DGX environment does allow for massive bandwidth over the network as well, by having only 2 GPUs share a NIC that can go at PCIe speeds, with a largely independent PCIe "mini bus" (PLX switch -> GPU + NIC + GPU).
-
UCX is required to use NVLink in this scenario. So, GPU+UCX means we used NVLink. I did cover some of this in my previous reply. Let us know if you still have doubts, @Matrix-World.
Same machines? Yes. We will use all CPUs and NICs available in both cases, but in the UCX case NVLink is utilized, and RoCE is used instead of TCP for the networking. We also have the advantage that, since transfers are peer-to-peer, for the networking components specifically our network transfers are not coming from the CPU socket, but rather from that "mini bus" that only has 2 GPUs and a NIC. It greatly helps when you have a PCIe topology that allows packets to not saturate the rest of the bus.
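If you want to see how your own GPUs and NICs sit on the PCIe tree (PLX switches, CPU sockets, etc.), nvidia-smi topo -m prints the connectivity matrix; a small illustrative wrapper, assuming nvidia-smi is on your PATH:

```python
# Sketch: print the GPU/NIC connectivity matrix to inspect the PCIe topology.
# The legend at the bottom of the output explains the link types
# (e.g. NV# = NVLink, PIX/PXB = PCIe switches, SYS = across CPU sockets).
import subprocess

print(subprocess.run(
    ["nvidia-smi", "topo", "-m"],
    capture_output=True, text=True, check=True,
).stdout)
```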
-
@abellina
-
Yes, so it depends on the workload. The GPU results in the charts use CPU-based LZ4 compression; the GPU+UCX results do not, so we are shuffling more data than the CPU does. Do you know how big the shuffle was? And how much GPU memory do you have? You may consider trying our lz4 GPU codec to see if you get better results, but it really depends on the data you are sending. It may be worthwhile since you already have something set up given your previous comment. Note, this could very well be slower, but it's worth trying. Also, if you don't mind, can you add your spark-shell/spark-submit configs here so we can see how you are running?
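As a hedged sketch of what trying the GPU lz4 codec could look like (the spark.rapids.shuffle.compression.codec key name is an assumption on my part; confirm the exact option name and values against the plugin docs for the release you are running):

```python
# Hypothetical sketch: request the RAPIDS GPU lz4 shuffle compression codec from PySpark.
# The config key below is assumed from the spark-rapids docs; verify it for your release
# before relying on it.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rapids-shuffle-codec-test")
    .config("spark.rapids.shuffle.compression.codec", "lz4")  # assumed key; "none" would disable it
    .getOrCreate()
)
```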
-
@abellina, below is my submission script; you can see the parameter settings. In fact, there is another thing that I find very strange: can NVLink only be used through UCX? Does NVIDIA use UCX in all its products with NVLink?

#!/bin/bash
-
No, UCX is not required to use NVLink. We use UCX because it is a library that finds the best transport between two endpoints. The transport uses NVLink in the case where that's available, but this is not a decision by UCX specifically. UCX can also choose between RoCE/Infiniband and TCP when it realizes that two peers are not in the same machine, and they are adding more support given demand.

Thanks for sharing your command. It looks like it is missing our shuffle manager, however. Please refer to these docs describing how to enable it.
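For reference, a rough illustration of the kind of settings those docs describe. The RapidsShuffleManager class name is versioned per Spark release, so the value below is only a placeholder; copy the exact string (and any UCX-related settings) from the linked documentation:

```python
# Hypothetical sketch of enabling the RAPIDS Shuffle Manager from PySpark.
# The shuffle manager class name is version-specific; the one below is a placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rapids-ucx-shuffle")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    # Placeholder class name -- replace with the versioned one from the docs:
    .config("spark.shuffle.manager", "com.nvidia.spark.rapids.spark301.RapidsShuffleManager")
    .config("spark.shuffle.service.enabled", "false")  # the external shuffle service must be off
    .getOrCreate()
)
```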
-
@abellina @revans2