[RUNTIME][OPENCL] OpenCL host pointer support to achieve zero copy #13413

srkreddy1238 · 2022-11-17T06:48:57Z

OpenCL lets the host access device memory through memory mapping.
Passing the "CL_MEM_ALLOC_HOST_PTR" flag when creating a memory object enables this.

We enable this feature with the compilation setting "USE_OPENCL_ENABLE_HOST_PTR"
and add a new API, "GetNativePtr", on OpenCLWorkspace.

This allows an application to directly use hardware-allocated memory while preparing the input.
On the user side, we allocate an NDArray of the same size as the graph input, access its native memory, and
finally call set_input_zero_copy to set the input.

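The pseudocode looks like this:

// Allocate an OpenCL NDArray with the same shape as the graph input.
auto narr = tvm::runtime::NDArray::Empty(shape, {kDLFloat, 32, 1}, {kDLOpenCL, 0});
OpenCLWorkspace* workspace = OpenCLWorkspace::Global();
// Get a host-accessible pointer to the array's device memory.
void* nptr = workspace->GetNativePtr(narr);

... access memory pointed to by nptr, up to the tensor size ...

// Bind the array as graph input i without copying.
tvm::runtime::PackedFunc set_input = mod.GetFunction("set_input_zero_copy");
set_input(i, narr);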
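
The "access memory pointed by nptr up to the tensor size" step can be sketched as follows. This is only an illustration under stated assumptions: `TensorBytes` and `FillInput` are hypothetical helper names that are not part of this PR or of TVM, the data type is assumed to be float32 as in the pseudocode, and `dst` is assumed to be a host-writable pointer such as the one returned by GetNativePtr.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical helper: bytes occupied by a dense tensor, i.e. the product of
// the dimensions times the bytes per element. `bits` and `lanes` mirror the
// data type fields used in the pseudocode ({kDLFloat, 32, 1}).
std::size_t TensorBytes(const std::vector<std::int64_t>& shape, int bits, int lanes) {
  std::size_t elems = 1;
  for (std::int64_t d : shape) elems *= static_cast<std::size_t>(d);
  return elems * static_cast<std::size_t>((bits * lanes + 7) / 8);
}

// Hypothetical helper: write float32 values straight into `dst`, stopping at
// the tensor size so nothing past the allocation is touched.
// Assumption: `dst` is host-writable and at least TensorBytes(shape, 32, 1) long.
std::size_t FillInput(void* dst, const std::vector<std::int64_t>& shape,
                      const std::function<float(std::size_t)>& gen) {
  assert(dst != nullptr);
  const std::size_t count = TensorBytes(shape, 32, 1) / sizeof(float);
  float* out = static_cast<float*>(dst);
  for (std::size_t i = 0; i < count; ++i) out[i] = gen(i);
  return count;  // number of elements written
}
```

In the flow above, `dst` would be `nptr` and `shape` the graph input shape. A plain `std::vector<float>` can stand in for the mapped buffer when trying the helpers out on the host.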

tvm-bot · 2022-11-17T06:49:00Z

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

cc @areusch, @echuraev, @elvin-n (See #10317 for details)
Built docs for commit 9030d62 can be found here.

(Generated by tvm-bot)

echuraev

Thank you for your PR! Quickly took a look at it. I have several questions:

You use CL_MEM_ALLOC_HOST_PTR only if we pass the USE_OPENCL_ENABLE_HOST_PTR parameter to cmake. What if we enable CL_MEM_ALLOC_HOST_PTR by default and don't introduce a new cmake option?
My second question continues the first, but it is probably the main and more important one. With the USE_OPENCL_ENABLE_HOST_PTR option, all memory objects will be created and mapped to host memory, am I right? In that case, does it introduce any overhead on intermediate buffers? I mean that the buffer will be allocated in host memory instead of global memory on the device.

srkreddy1238 · 2022-11-17T08:58:20Z

In general, OpenCL global memory and host-accessible memory both point to DDR (the common system-wide physical memory). We get the zero-copy advantage only if DDR is shared between the two cores.

Mapping multiple memory objects into a process address space won't cause any performance hit (unless there are writes from both sides on the mapped segment). In our case only the input memory object is written by the host; the others are untouched.

The CMake compilation option is there to avoid unexpected behavior from custom hardware and driver implementations.

tqchen · 2022-11-18T13:28:18Z

Thanks for the PR. In this particular case, it would be great to think about other means that do not necessarily make surgical changes at the DeviceAPI level or NDArray level. Exposing a GetHostPtr from the OpenCL runtime or a packed func could be a good starting point, before motivating a device API change.

srkreddy1238 · 2022-11-22T09:01:36Z

Thanks for the review.

TVM benchmarks generally evaluate the run call and ignore set_input and get_output. There is significant end-to-end performance overhead from input/output (copies, and using a different input buffer every time also affects the cache). This was very evident when I benchmarked a TVM model in the MLPerf Android app.

Buffer sharing is a well-known practice and is supported by most edge platforms across cores such as the camera ISP, GPU, CPU, etc. The motivation here is to encourage the runtime backends to support native pointer access. This lets TVM's performance numbers carry over to the final application with less overhead.

I am fine with a packed function for now, until there is more demand to expose native buffers to applications via NDArray.

tqchen · 2022-12-16T13:06:19Z

Thanks @srkreddy1238. Indeed, I am not questioning the usefulness of having the NativePtr.

It is just that, given how specific it is, it would benefit from starting as a PackedFunc or OpenCL-specific functionality first.

echuraev reviewed Nov 17, 2022

srkreddy1238 force-pushed the native_ptr branch 2 times, most recently from fe53040 to 9030d62 on November 17, 2022 17:22

srkreddy1238 force-pushed the native_ptr branch from 9030d62 to ba80b5e on December 16, 2022 08:44

srkreddy1238 force-pushed the native_ptr branch 3 times, most recently from 92d57fb to a547f0b on December 20, 2022 03:24

* Lint error.

901b189

srkreddy1238 force-pushed the native_ptr branch from a547f0b to 901b189 on December 22, 2022 06:33

srkreddy1238 requested a review from tqchen December 29, 2022 18:00

tqchen approved these changes Dec 30, 2022

tqchen merged commit cef3f0d into apache:main Dec 30, 2022

srkreddy1238 mentioned this pull request Feb 13, 2023

[DOCS][ADRENO] Improved Adreno documentation #13867

Merged

ysh329 mentioned this pull request Apr 17, 2023

[Release] v0.12.0 Release Candidate Notes #14645

Closed
