Add design of asynchronous techniques on heterogeneous devices #7814
Conversation
db2512a to ecdab2c
> Let's use CUDA as an example. There is a building block named `stream` in CUDA. Streams introduce task-based parallelism to CUDA codes. The sequence of operations will be executed in issue-order on the GPU if they are in the same stream.
>
> The operators in different streams are able to run concurrently as long as they are in multiple streams and hardware supports it. CUDA hardware has no notion of streams. The hardware has separate queues (engines) to perform memory copies and to execute kernels.
---
- To make operations run concurrently, the operations in one stream must not depend on the operations in other streams.
- "memory copies" ==> "data transfers"
---
> To make operations run concurrently, the operations in one stream must not depend on the operations in other streams.

I think I am raising a different issue. Please refer to #7814 (comment)
> The operators in different streams are able to run concurrently as long as they are in multiple streams and hardware supports it. CUDA hardware has no notion of streams. The hardware has separate queues (engines) to perform memory copies and to execute kernels.
>
> If we want to take advantage of CUDA devices, we must use at least N streams, where N equals the number of hardware queues, and separate operators into these streams. The N equals to three since CUDA can simultaneously execute CUDA kernels, H2D memcpy, D2H memcpy by the CUDA hardware.
---
It seems that CUDA streams do not have a hard limit. As long as the GPU's resources (memory and computation) are not exhausted, in theory you can create a new stream.
---
No, it does not. CUDA can create many streams without any limit. However, jobs in CUDA can execute simultaneously only under two conditions:

- The jobs are in different streams.
- The hardware supports it.

Since the CUDA hardware supports simultaneously executing THREE kinds of operations (kernel execution, D2H memcpy, and H2D memcpy), we need AT LEAST THREE streams to make full use of CUDA devices. Using more than three streams will not help much, since the hardware only supports simultaneously executing these THREE kinds of operations.
---
Threads in a block are launched on an SM (streaming multiprocessor). If the former CUDA kernel occupies only a few SMs and there are SMs left, another CUDA kernel can be executed in parallel. Please refer to https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/.
---
> And if we use more than three streams, it will not help much since the hardware only supports simultaneously executing THREE kinds of operations.

Please consider the following case. We have 7 kernels of 3 types (A, B, C). Their dependency relationship is as follows:

```
A0  B0  C0
    B1
    B2
        C1
        C2
```

If we have just 3 streams, it is possible that the kernels are put into the 3 streams in the following order:

```
A0 B0 C0 C1 C2
B1
B2
```

In this case C1 and C2 do not depend on C0 but still have to wait for C0's completion before being able to run.

I think the number of streams we use should equal the concurrency expressed in our program, not in the hardware.
> * Create N device contexts on one device. The N should be corresponding to the hardware property. For example, the CUDA devices should have three device contexts.
>
> * Every tensor should hold the one device context, where the current operator of the tensor is performed on.
---
I wonder whether it is appropriate for every tensor to hold one device context. A device context may hold many objects; taking `CUDADeviceContext` as an example, it currently has six private data members:

```cpp
CUDAPlace place_;
std::unique_ptr<Eigen::GpuDevice> eigen_device_;
std::unique_ptr<EigenCudaStreamDevice> eigen_stream_;
cudaStream_t stream_;
cudnnHandle_t cudnn_handle_;
cublasHandle_t cublas_handle_;
```

But for tensors, only `place_` and `stream_` are necessary.
---
Only the device context can be `Wait()`-ed on. We should not use low-level APIs like `stream`, because other devices, like OpenCL, also need to be supported. Another reason we use the device context is that cuDNN/cuBLAS/Eigen need to bind a stream before we use their other APIs: http://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#cudnnSetStream. The `cudnnHandle_t` is coupled with a stream.
```cpp
  kH2DMEMCPY
};

std::map<CUDAHardwareStream, DeviceContext*> gDevCtxs;
```
---
Does this global `gDevCtxs` need a device_id to support multiple devices?
---
Well, this code is only meant to demonstrate the basic idea of the solution; I did not go deeply into the details.
```cpp
enum CUDAHardwareStream {
  kCOMPUTATION,
  kD2HMEMCPY,
  kH2DMEMCPY
```
---
Is there a `kD2DMEMCPY`?
```cpp
public:
  ...

  void SwitchDevCtx(DeviceContext* new_ctx) {
```
---
I am hesitant to add `SwitchDevCtx` as a method of Tensor. Reasons:

- If we have `Tensor::SwitchDevCtx`, we may also need `SelectedRows::SwitchDevCtx`, etc.
- The operator needs to wait, not the tensor.

So maybe we should put the explicit wait in the operator's run method:

```cpp
void ReduceOp::Run(scope, place) {
  gDevCtxs[place, kCOMPUTATION].Wait();
  my_ctx = gDevCtxs[place, kD2DMEMCPY];
  ...
}
```
---
Only when data are in different CUDA streams and independent of each other can these operations potentially execute in parallel. So the basic problem is to analyze the data dependencies between two operations. Maybe we need an explicit scheduler module to do this.
---
@tonyyang-svail
I don't think "the operator needs to wait, not the tensor".

It is the previous operator of the tensor that needs to be waited for. So we need to:

- Record the previous operator of the tensor.
  - For example, `ReduceOp::Run` may not need to wait for `kCOMPUTATION` if the previous operators are not computation; `ReduceOp::Run` could need to wait for any kind of device context.
  - The input tensors of `ReduceOp` can be produced by various streams. For example, suppose two tensors A and B need to be reduced: A is a computational result and B is an H2D memcpy result. Both device contexts should be waited on.
- Note that not all operations on a Tensor are performed by `paddle::framework::Operator`.
  - There are memcpy and fill-zero operations in the framework module.
- Therefore we cannot solve this problem just by putting an explicit wait in the operator's run method.
---
> Only when data are in different CUDA streams and independent of each other can these operations potentially execute in parallel. So the basic problem is to analyze the data dependencies between two operations. Maybe we need an explicit scheduler module to do this.

@QiJune
An explicit scheduler module could resolve these problems; however:

- Not all operations on a Tensor are performed by `paddle::framework::Operator` and `Executor`.
  - A scheduler should schedule operators, but there is no unified abstraction of operations on Tensor to be scheduled.
  - If we want to add a scheduler, we should add an abstraction layer of operators first.
- There is no clear scheduling algorithm for Fluid.
  - Fluid is different from other frameworks. We do not use a DAG to represent neural networks. Tensors and variables can be overwritten, and there can be loops in our framework. We should specify a clear scheduling algorithm before we write it.
- An explicit scheduler may NOT be faster than switching streams on demand.
  - Switching streams on demand adds more conditionals in C++ (`if` statements). However, compared with the computation and the stream waits, the conditionals are negligible.
  - An explicit scheduler also needs to calculate dependencies ahead of time; that is not free.
---
@reyoung Maybe we can write some experimental code. The following is pseudocode:

```cpp
tensor1 = op1(dev_ctx1);
tensor2 = op2(tensor1, dev_ctx1);
tensor3 = op3(tensor2, dev_ctx1);
tensor4 = update_op(tensor1, dev_ctx2);
```

We expect that after op1 runs, update_op can run in parallel with op2 and op3. But `cudaStreamSynchronize` blocks until the stream has completed all operations; please refer to the official doc.

I am not sure whether the actual behavior will be update_op running only after op1/op2/op3 finish, because the dev_ctx1 stream has three operations on it. That is not what we want.
---
@QiJune thanks for the example. I am sure these four operators will be executed sequentially.

As far as parallel_do_grad is concerned, I think the following program is good enough, even without an explicit scheduler:

```cpp
parallel_do_grad
  w1_grad = fc_grad(.., stream0)
  all_reduce(w1_grad, stream1)  // it will wait for stream0
  sgd(w1_grad, w1, stream1)
  w2_grad = fc_grad(.., stream0)
  ...
```
---
For I/O-related operators, we need a transpiler to insert them into the ProgramDesc accurately to achieve maximum performance.

I think this PR should be made active again.
> The operators in different streams are able to run concurrently as long as they are in multiple streams and hardware supports it. CUDA hardware has no notion of streams. The hardware has separate queues (engines) to perform memory copies and to execute kernels.
>
> If we want to take advantage of CUDA devices, we must use at least N streams, where N equals the number of hardware queues, and separate operators into these streams. The N equals to three since CUDA can simultaneously execute CUDA kernels, H2D memcpy, D2H memcpy by the CUDA hardware.
---
> we must use at least N streams, where N equals the number of hardware queues

Why? Doesn't CUDA handle multiplexing a single stream onto the different hardware queues transparently for us? I agree we need to use N streams, but maybe N should not be bounded by the number of hardware queues (otherwise we would need code to look up the number of hardware queues for a given device, which complicates our code).
> The solution is straightforward based on the hardware properties we described in the problem section. We should:
>
> * Create N device contexts on one device. The N should be corresponding to the hardware property. For example, the CUDA devices should have three device contexts.
---
Shouldn't the number of device contexts depend only on the "concurrency" requirement of the PaddlePaddle program, rather than on the hardware? Related question: #7814 (comment)

Please also see: #7814 (comment)
> * Create N device contexts on one device. The N should be corresponding to the hardware property. For example, the CUDA devices should have three device contexts.
>
> * Every tensor should hold the one device context, where the current operator of the tensor is performed on.
---
Regarding "Every tensor should hold the one device context": I think it is possible for one tensor to be used by different streams, and I assume one context corresponds to one stream, so which context should it hold?
> The solution is straightforward based on the hardware properties we described in the problem section. We should:
>
> * Create N device contexts on one device. The N should be corresponding to the hardware property. For example, the CUDA devices should have three device contexts.
---
Is it true that one "device context" equals one stream? It is a little confusing that the "Problem" section only talks about streams, while the "Solution" section mainly talks about contexts.
> * Every tensor should hold the one device context, where the current operator of the tensor is performed on.
>
> * Wait for the execution complete on the previous device context, when switching the current device context of tensors.
---
If the stream of the given tensor is as follows:

```
tensor_related_op op_a op_b op_c op_d op_e
```

where `op_a` to `op_e` are not related to the tensor: you mentioned "Wait for the execution complete on the previous device context"; do we have to wait until `op_e` completes, or only until `tensor_related_op` completes?
Asynchronous techniques on heterogeneous devices are a key issue for performance tuning.

I have just tried to describe the problem and give a straightforward solution. Any comments/questions on this design are welcome.