
[RUNTIME][OPENCL] OpenCL host pointer support to achieve zero copy #13413

Merged — 2 commits merged into apache:main on Dec 30, 2022

Conversation

srkreddy1238
Contributor

@srkreddy1238 srkreddy1238 commented Nov 17, 2022

OpenCL supports host access to device memory via memory mapping.
The OpenCL flag "CL_MEM_ALLOC_HOST_PTR" enables this while creating a memory object.

We enable this feature via the compilation setting "USE_OPENCL_ENABLE_HOST_PTR",
followed by a new API "GetNativePtr" on OpenCLWorkspace.

This allows an application to directly use the hardware-allocated memory while preparing the input.
From the user side, we allocate an NDArray of the same size as the graph input, access the native memory, and
finally call set_input_zero_copy to set the input.

Pseudo code looks like:

// Allocate an NDArray on the OpenCL device with the same shape as the graph input.
auto narr = tvm::runtime::NDArray::Empty(shape, {kDLFloat, 32, 1}, {kDLOpenCL, 0});
// Retrieve the host-mapped pointer backing the OpenCL memory object.
OpenCLWorkspace* workspace = OpenCLWorkspace::Global();
void* nptr = workspace->GetNativePtr(narr);

// ... fill the memory pointed to by nptr, up to the tensor size ...

// Hand the NDArray to the graph executor without an extra copy.
tvm::runtime::PackedFunc set_input = mod.GetFunction("set_input_zero_copy");
set_input(i, narr);

@tvm-bot
Collaborator

tvm-bot commented Nov 17, 2022

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

Contributor

@echuraev echuraev left a comment


Thank you for your PR! I quickly took a look at it and have several questions:

  1. You use CL_MEM_ALLOC_HOST_PTR only if we pass the USE_OPENCL_ENABLE_HOST_PTR parameter to cmake. What if we enable CL_MEM_ALLOC_HOST_PTR by default and don't introduce a new cmake option?
  2. My second question is a continuation of the previous one, but it is probably the main and more important one. With the option USE_OPENCL_ENABLE_HOST_PTR, all memory objects will be created and mapped to the host memory, am I right? In this case, does it introduce any overhead on intermediate buffers? I mean that the buffer will be allocated in host memory instead of global memory on the device.

@srkreddy1238
Contributor Author

In general, OpenCL global memory and host-accessible memory both point to DDR (the common system-wide physical memory). We get the zero-copy advantage only if the DDR is shared between the two cores.

Mapping multiple memory objects into a process address space won't cause any performance hit (unless there are writes from both sides on the mapped segment). In our case, only the input memory object is written by the host and the others are untouched.

The cmake compilation option is there to avoid any unexpected behavior due to custom hardware and driver implementations.
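The shared-memory behavior described above can be sketched with a generic aliasing example (plain Python, purely an analogy — none of these names are TVM or OpenCL APIs): the mapped host pointer and the device-side memory object are two views of the same physical allocation, so a host write is visible on the device side without any copy.

```python
# Generic analogy for host-pointer mapping (NOT TVM or OpenCL code):
# two views over one underlying buffer stand in for the mapped host
# pointer and the device memory object sharing the same DDR.
buf = bytearray(8)              # the single "physical" allocation
host_view = memoryview(buf)     # stands in for the mapped host pointer
device_view = memoryview(buf)   # stands in for the device-side view

host_view[0] = 42               # host prepares the input in place
assert device_view[0] == 42     # visible on the "device" with no copy

snapshot = bytes(buf)           # an explicit copy, by contrast,
host_view[0] = 0                # does not track later writes
assert snapshot[0] == 42
```

This also mirrors the point about contention: sharing the mapping stays cheap as long as only one side writes a given segment.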

@srkreddy1238 srkreddy1238 force-pushed the native_ptr branch 2 times, most recently from fe53040 to 9030d62 Compare November 17, 2022 17:22
@tqchen
Member

tqchen commented Nov 18, 2022

Thanks for the PR. In this particular case, it would be great to think about other means that do not necessarily make surgical changes at the DeviceAPI or NDArray level. Exposing a GetHostPtr from the OpenCL runtime or a packed func could be a good starting point, before motivating a device API change.

@srkreddy1238
Contributor Author

Thanks for the review.

TVM benchmarks generally evaluate the run call while ignoring set_input & get_output. There is a significant end-to-end performance overhead caused by input/output copies (and using a different input buffer every time affects the cache too). This was very evident when I benchmarked a TVM model in the MLPerf Android app.

Buffer sharing is a well-known practice and is supported by most edge platforms across cores like the Camera ISP, GPU, CPU, etc. The motivation here is to encourage the runtime backends to support native pointer access. This can retain the TVM performance numbers at the final application level with fewer overheads.

I am good with packed function also for now until there is more demand to expose native buffers to applications via NDArray.
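The input-copy overhead being discussed can be sketched generically (plain Python, a hypothetical illustration rather than the TVM API): the copy-based set_input path needs an application-side staging buffer plus a full copy into the runtime's buffer on every inference, while the zero-copy path writes the runtime-visible buffer once, in place.

```python
# Hypothetical sketch of the two input paths (NOT the TVM API).
SIZE = 1 << 20                    # example 1 MiB input tensor

# Copy-based path: two buffers of the input size plus one full copy.
staging = bytearray(SIZE)         # application prepares its input here
staging[0] = 7
runtime_buf = bytearray(SIZE)     # buffer the runtime actually reads
runtime_buf[:] = staging          # the per-inference copy zero copy removes

# Zero-copy path: one shared buffer, written directly.
shared_buf = bytearray(SIZE)      # stands in for the mapped host memory
shared_buf[0] = 7                 # input prepared in place; no copy

assert runtime_buf[0] == shared_buf[0] == 7
```

Reusing the same shared buffer on every inference also keeps the working set stable, which relates to the cache effect mentioned above.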

@tqchen
Member

tqchen commented Dec 16, 2022

Thanks @srkreddy1238. Indeed, I am not questioning the usefulness of having the NativePtr.

Given its specificity, it would just benefit from being a PackedFunc or OpenCL-specific functionality first.

@srkreddy1238 srkreddy1238 force-pushed the native_ptr branch 3 times, most recently from 92d57fb to a547f0b Compare December 20, 2022 03:24
@tqchen tqchen merged commit cef3f0d into apache:main Dec 30, 2022
fzi-peccia pushed a commit to fzi-peccia/tvm that referenced this pull request Mar 27, 2023