A kernel with a 2D grid of work items is slower than the same kernel that uses a 1D grid but internally remaps items using // and % #941
Comments
@fcharras some comments:
I'm not really familiar with CUDA, but I always thought that in CUDA a 1D grid maps to a 2D grid in column-major order. E.g. a 1D grid of size 6 would map to a 2D grid of size (3,2) as:

0 3
1 4
2 5
I was not able to find the CUDA docs quickly, but this article states so (as far as I understand it). In the case of SYCL, a 1D grid maps to a 2D grid in row-major order. E.g. a grid of size 6 would map to a 2D grid of size (3,2) as:

0 1
2 3
4 5
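To make the two conventions concrete, here is a small plain-Python sketch (independent of numba-dpex or any GPU runtime; the (3, 2) shape is simply the example used above) that prints where each flat id lands under each convention:

```python
# Map a flat (1D) work-item id onto a (3, 2) grid under the two conventions
# discussed above; plain Python, independent of any GPU runtime.
n_rows, n_cols = 3, 2

for flat_id in range(n_rows * n_cols):
    # Row-major (the SYCL convention described above): the last dimension
    # varies fastest.
    row_major = (flat_id // n_cols, flat_id % n_cols)
    # Column-major (the CUDA convention described above): the first dimension
    # varies fastest.
    col_major = (flat_id % n_rows, flat_id // n_rows)
    print(flat_id, "row-major ->", row_major, "column-major ->", col_major)
```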
And of this I'm 100% sure: numba-dpex should follow SYCL semantics. But a year ago we observed that on some Intel HW numba-dpex was generating a column-major grid instead of a row-major one, which significantly affected performance.
This is possible. As I said above, we have observed such issues before. If you are able to create a minimal reproducer, it would help greatly.
There shouldn't be any. On the other hand, manually calculating 2D indexes from a 1D index shouldn't have a significant (if any) impact on performance. Memory-bound kernels are memory-bound, and compute-bound kernels are usually too heavy to be affected by index calculation.
Thank you for the comprehensive explanation. The documentation I used for CUDA is https://stackoverflow.com/a/15044884, which links to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy. Looking back at it, you're right, it seems to be column-major order; I read too fast. It made me assume row-major order for SYCL anyway, which according to what you say is correct, but I still see this performance loss. I will try column-major order instead and report back.
To be clear, here you mean that
I will post one if I eventually have it, but currently the observation comes from a complicated kernel, which makes it time-consuming to extract minimal examples, especially when I don't understand the task dispatch mechanism well, or when it's known to have rare unpredictable behavior :-/
Thanks for the insight. I had read on some Stack Overflow thread that it could matter, but that makes sense too.
(it is a bit confusing that internally
OK, so what I think is happening is that SYCL indeed maps with row-major order, but numba-dpex mimics CUDA and its column-major order, as can be seen in the snippet above. I finally could test the column-major order, and there's no more performance regression. Maybe the issue can be closed (or left open for discussion about which order should be used for 2D work group sizes, SYCL or CUDA?)
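For illustration only, here is a rough sketch of what that workaround amounts to, assuming the numba-dpex kernel API of that era (`@dpex.kernel`, `dpex.get_global_id`) and made-up names (`n_rows`, `n_cols`, `data`); it is not the actual kernel from the linked project:

```python
import numba_dpex as dpex

@dpex.kernel
def column_major_indexing_kernel(data):
    # Treat dimension 0 as the fast-varying (column) index, CUDA/OpenCL-style,
    # instead of assuming SYCL row-major semantics where dimension 1 is fastest.
    col = dpex.get_global_id(0)
    row = dpex.get_global_id(1)
    data[row, col] += 1

# Launch sketch (legacy bracket syntax, assumed): the global/local sizes are
# given with the column dimension first, to match the indexing above.
# column_major_indexing_kernel[(n_cols, n_rows), (group_cols, group_rows)](data)
```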
Here is the relevant context: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:opencl:kernel-conventions-sycl. dpex generates an OpenCL interoperability kernel for a
@fcharras On what target device are you facing this issue? Is it Level Zero? I am checking whether the above rules still apply to L0. I will clearly document this behavior in our user manual so that everyone has the correct context of what get_global_id means. I also have plans to change the user-level intrinsics
@diptorupd I think you can't get SYCL semantics from OpenCL semantics just by reversing the range order. OpenCL vectorizes on the first dimension of the range, while SYCL vectorizes on the last dimension. This doesn't change if you just reverse the range order.
This vectorization choice is made by the IGC compiler in both cases, and IGC doesn't know/care about the frontend; the only thing it sees is the SPIR-V binary. So if you are getting different vectorization choices, it means something is different in the source SPIR-V. Another possibility is that SYCL completely skips IGC vectorization and generates already-vectorized SPIR-V, but IIRC this is not the case. Has anyone actually looked into the SPIR-V differences between Intel OpenCL and SYCL?
I do not follow. Are you saying that we cannot support SYCL semantics while specifying the range in SYCL indexing order and submitting an OpenCL interop kernel? My point is that dpex always generates an OpenCL interoperability kernel at the SPIR-V level. The SPIR-V for indexing calls as generated by dpex and dpc++ will not be the same, as the dpc++ front end generates the indexing based on the SYCL spec, while dpex does so based on the OpenCL spec, given that we always compile an OpenCL program at the SPIR-V level and then create a SYCL interoperability kernel. The interoperability kernel is then used to create a SYCL KernelBundle and submitted as such. The confusion is coming because you expect … The spec section linked above states:

When specifying a range as the global or local size in a parallel_for that invokes an OpenCL interop kernel (through cl_kernel interop), the highest dimension of the range in SYCL will map to the lowest dimension within the OpenCL kernel. That statement applies to both an underlying enqueue operation such as clEnqueueNDRangeKernel in OpenCL, and also ID and size queries within the OpenCL kernel. For example, a 3D global range specified in SYCL as:
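A plain-Python illustration of the reversal rule described in that passage (the helper name and the example numbers are made up; only the reversal itself comes from the spec text quoted above):

```python
def sycl_range_to_opencl(sycl_range):
    # Per the convention quoted above, the highest SYCL dimension maps to the
    # lowest OpenCL dimension, i.e. the range is simply reversed.
    return tuple(reversed(sycl_range))

# A SYCL range<3>{r0, r1, r2}, e.g. (4, 8, 16), used to enqueue an OpenCL
# interop kernel is seen by OpenCL-side ID/size queries in reversed order.
assert sycl_range_to_opencl((4, 8, 16)) == (16, 8, 4)
```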
The real question is what semantics dpex follows for its front-end indexing functions. I am in favor of switching to SYCL semantics. Refer to the two-year-old issue #274. Maybe it is time to finally fix it?
I am closing the issue and moving the discussion to #964. Please add your comments under the discussion so that we can develop a design spec for addressing the issue.
Yes, it's Level Zero. I usually have the L0 runtime installed in my environment and the kernels run on it by default (
I've witnessed a performance hit in kernels that should run equivalent instructions, where one is run with 2D global and work group sizes, and the other is run with 1D global and work group sizes. In both cases the work groups are of equal size, and the kernel with the 1D sizes uses % and // to remap the flat id of a work item (from dpex.get_global/local_id(0) calls) to the pair of ids that the 2D version gets by calling dpex.get_global/local_id(0) and dpex.get_global/local_id(1), assuming row-major order in the 2D grid.

I don't have a minimal reproducer yet, but I'm opening this issue early because the performance hit I've seen is serious (a 30% performance loss), while a 2D grid of work items should rather benefit performance (because remapping with // and % can be expensive), or at worst not reduce it.
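As a rough sketch of the two formulations being compared (not the actual kernel discussed below; the array names, sizes, and the legacy bracket launch syntax shown in the comments are assumptions for illustration):

```python
import numba_dpex as dpex

# Variant A: a 2D grid, one work item per (row, col) pair, written assuming
# SYCL row-major semantics where dimension 1 is the fast-varying one.
@dpex.kernel
def kernel_2d(data):
    row = dpex.get_global_id(0)
    col = dpex.get_global_id(1)
    data[row, col] += 1

# Variant B: a 1D grid of n_rows * n_cols work items, with the (row, col)
# pair recovered from the flat id using // and %, also row-major.
@dpex.kernel
def kernel_1d(data, n_cols):
    flat_id = dpex.get_global_id(0)
    row = flat_id // n_cols
    col = flat_id % n_cols
    data[row, col] += 1

# Launch sketch (legacy bracket syntax, assumed):
# kernel_2d[(n_rows, n_cols), (group_rows, group_cols)](data)
# kernel_1d[(n_rows * n_cols,), (group_rows * group_cols,)](data, n_cols)
```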
See soda-inria/sklearn-numba-dpex#98 for more information. 90% of the execution time of the KMeans comes from the lloyd_single_step kernel there, and this kernel shows a 30% performance hit just from remapping work items to a 2D grid.

Questions and suggestions:

- … numba_dpex/SYCL?
- … in numba_dpex or in SYCL?