[RUNTIME][OPENCL] OpenCL host pointer support to achieve zero copy #13413
srkreddy1238 commented Nov 17, 2022 (edited)
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot
Thank you for your PR! I took a quick look at it and have a couple of questions:
- You use `CL_MEM_ALLOC_HOST_PTR` only if we pass the `USE_OPENCL_ENABLE_HOST_PTR` parameter to cmake. What if we enable `CL_MEM_ALLOC_HOST_PTR` by default and don't introduce a new cmake option?
- My second question continues the first, but it is probably the main and more important one. With the `USE_OPENCL_ENABLE_HOST_PTR` option, all memory objects will be created and mapped to host memory, am I right? In that case, does it introduce any overhead on intermediate buffers? I mean that the buffer will be allocated in host memory instead of global memory on the device.
In general, OpenCL global memory and host-accessible memory both point to DDR (the common, system-wide physical memory). We get the zero-copy advantage only if DDR is shared between the two cores. Mapping multiple memory objects into a process address space won't cause any performance hit (unless there are writes from both sides on the mapped segment). In our case only the input memory object is written by the host and the others are untouched. The cmake compilation option is there to avoid any unexpected behavior due to custom hardware and driver implementations.
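The compile-time gating discussed in this exchange can be sketched roughly as follows. This is only an illustration, not the code in this PR: the macro `EXAMPLE_ENABLE_HOST_PTR` and the helper `BufferFlags` are hypothetical names, and the two flag values are copied from the Khronos `CL/cl.h` header so that the snippet compiles without an OpenCL SDK.

```cpp
#include <cstdint>

// Bit values as defined in the Khronos CL/cl.h header, repeated here so the
// sketch compiles without an OpenCL SDK installed.
constexpr std::uint64_t kMemReadWrite = 1ULL << 0;     // CL_MEM_READ_WRITE
constexpr std::uint64_t kMemAllocHostPtr = 1ULL << 4;  // CL_MEM_ALLOC_HOST_PTR

// Hypothetical helper: the flags a runtime could pass to clCreateBuffer.
// EXAMPLE_ENABLE_HOST_PTR is an assumed macro standing in for whatever
// definition the cmake option produces in the real build.
constexpr std::uint64_t BufferFlags() {
#ifdef EXAMPLE_ENABLE_HOST_PTR
  // Ask the driver to allocate the buffer from host-accessible memory.
  return kMemReadWrite | kMemAllocHostPtr;
#else
  // Default: leave the placement of the buffer to the driver.
  return kMemReadWrite;
#endif
}
```

Compiling with `-DEXAMPLE_ENABLE_HOST_PTR` switches every allocation to the host-accessible path, which is why the option affects intermediate buffers as well as inputs. Keeping it behind a build option leaves the default allocation path unchanged on drivers where the flag behaves unexpectedly.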
fe53040 to 9030d62
Thanks for the PR. In this particular case, it would be great to think about other means that do not necessarily make surgical changes at the DeviceAPI level or the NDArray level. Exposing a GetHostPtr from the OpenCL runtime or a packed func could be a good starting point, before motivating a device API change.
Thanks for the review. TVM benchmarks generally evaluate ... Buffer sharing is a well-known practice and is supported by most edge platforms across cores such as the camera ISP, GPU, and CPU. The motivation here is to encourage the runtime backends to support native pointer access. This can retain the TVM performance numbers at the final application level with less overhead. I am also fine with a packed function for now, until there is more demand to expose native buffers to applications via NDArray.
OpenCL supports host access to device memory through memory mapping. The OpenCL flag "CL_MEM_ALLOC_HOST_PTR" enables this when a memory object is created.

We enable this feature via the compilation setting "USE_OPENCL_ENABLE_HOST_PTR", together with a new API "GetNativePtr" on OpenCLWorkspace.

This allows an application to directly use hardware-allocated memory while preparing the input. From the user side, we allocate an NDArray of the same size as the graph input, access the native memory, and finally call set_input_zero_copy to set the input.

Pseudo code looks like:

    auto narr = tvm::runtime::NDArray::Empty(shape, {kDLFloat, 32, 1}, {kDLOpenCL, 0});
    OpenCLWorkspace* workspace = OpenCLWorkspace::Global();
    void* nptr = workspace->GetNativePtr(narr);

    // ... access memory pointed to by nptr, up to the tensor size ...

    tvm::runtime::PackedFunc set_input = mod.GetFunction("set_input_zero_copy");
    set_input(i, narr);
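The step "access memory pointed to by nptr, up to the tensor size" depends on knowing how many bytes the tensor covers. A minimal sketch of that calculation is shown here, assuming a densely packed tensor. `ExampleDType`, `TensorBytes`, and `FillInput` are hypothetical names for illustration and are not part of TVM or of this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for DLPack's DLDataType, reduced to the two fields
// needed here. {kDLFloat, 32, 1} in the pseudo code means 32 bits, 1 lane.
struct ExampleDType {
  std::uint8_t bits;
  std::uint16_t lanes;
};

// Number of bytes covered by a densely packed tensor of the given shape.
// The element width is rounded up to whole bytes.
inline std::size_t TensorBytes(const std::vector<std::int64_t>& shape, ExampleDType dtype) {
  std::size_t elems = 1;
  for (std::int64_t dim : shape) {
    assert(dim >= 0);  // negative extents are not valid
    elems *= static_cast<std::size_t>(dim);
  }
  return elems * static_cast<std::size_t>((dtype.bits * dtype.lanes + 7) / 8);
}

// Writes `value` into a float32 buffer of `bytes` bytes without going past
// its end. In the zero-copy flow, `nptr` would be the mapped native pointer.
inline void FillInput(void* nptr, std::size_t bytes, float value) {
  float* data = static_cast<float*>(nptr);
  for (std::size_t i = 0; i < bytes / sizeof(float); ++i) {
    data[i] = value;
  }
}
```

For a float32 input of shape {1, 3, 224, 224}, `TensorBytes` gives 1 * 3 * 224 * 224 * 4 = 602112 bytes, which is the upper bound for any write through the native pointer.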
9030d62 to ba80b5e
Thanks @srkreddy1238. Indeed, I am not questioning the usefulness of having the NativePtr. It is just that, given how specific it is, it would benefit from being exposed through a PackedFunc or OpenCL-specific functionality first.
92d57fb to a547f0b
a547f0b to 901b189