
A kernel with a 2D grid of work items is slower than the same kernel that uses a 1D grid but internally remaps items using // and % #941

Closed
fcharras opened this issue Feb 28, 2023 · 11 comments
Labels: user (User submitted issue)


@fcharras

fcharras commented Feb 28, 2023

I've witnessed a performance hit between kernels that should run equivalent instructions: one is run with 2D global and work-group sizes, the other with 1D global and work-group sizes. In both cases the work groups are of equal size, and the kernel with the 1D sizes uses % and // to remap the id of a work item (from dpex.get_global/local_id(0) calls) to the ids that one gets by calling dpex.get_global/local_id(0) and dpex.get_global/local_id(1), assuming row-major order in the 2D grid.
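For concreteness, here is a minimal sketch of the two variants being compared. The kernel bodies, array names, and sizes are illustrative (not the actual lloyd_single_step kernel), and the kernel[global_size, local_size] launch syntax assumes the numba_dpex API of that era, so adjust for your version:

import dpctl.tensor as dpt
import numba_dpex as dpex

@dpex.kernel
def kernel_2d(out, n_cols):
    # Native 2D grid: one work item per (row, col) pair.
    row = dpex.get_global_id(0)
    col = dpex.get_global_id(1)
    out[row * n_cols + col] = row

@dpex.kernel
def kernel_1d(out, n_cols):
    # Flat 1D grid, remapped to (row, col) with // and %, assuming row-major order.
    i = dpex.get_global_id(0)
    row = i // n_cols
    col = i % n_cols
    out[row * n_cols + col] = row

n_rows, n_cols = 1024, 128
out = dpt.zeros(n_rows * n_cols, dtype="int32")
kernel_2d[(n_rows, n_cols), (1, n_cols)](out, n_cols)  # 2D global / work-group sizes
kernel_1d[n_rows * n_cols, n_cols](out, n_cols)        # equivalent 1D launch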

I don't have a minimal reproducer yet, but I'm opening this issue early because the performance hit I've seen is serious (a 30% performance loss), while a 2D grid of work items should if anything benefit performance (because remapping with // and % can be expensive), or at worst not reduce it.

See soda-inria/sklearn-numba-dpex#98 for more information. 90% of the execution time of the KMeans comes from the lloyd_single_step kernel there, and this kernel shows a 30% performance hit just from remapping work items to a 2D grid.

Questions and suggestions:

  • On GPU, I would expect a 2D grid to index the equivalent 1D grid of work items in row-major order, as is supposed to be the case with CUDA kernels. Is there such an equivalence between 1D and 2D grids that one can expect with numba_dpex / SYCL?
  • My guess is that the indexing is not what I expect and that work items are dispatched to the wrong execution units, which might cause suboptimal read/write patterns or prevent SIMD instructions.
  • Is there any kernel overhead induced by using a 2D grid, either in numba_dpex or in SYCL?
@AlexanderKalistratov
Contributor

AlexanderKalistratov commented Mar 2, 2023

@fcharras some comments:

On GPU, I would expect a 2D grid to index the equivalent 1D grid of work items in row-major order, as is supposed to be the case with CUDA kernels. Is there such an equivalence between 1D and 2D grids that one can expect with numba_dpex / SYCL?

I'm not really familiar with CUDA, but I always thought that in CUDA a 1D grid maps to a 2D grid in column-major order. E.g. a 1D grid of size 6 would map to a 2D grid of size (3, 2) as:

0 → (0, 0); 1 → (1, 0); 2 → (2, 0)
3 → (0, 1); 4 → (1, 1); 5 → (2, 1)

I was not able to find the CUDA docs quickly, but this article states so (as far as I understand it):
https://erangad.medium.com/1d-2d-and-3d-thread-allocation-for-loops-in-cuda-e0f908537a52
Though I'm not 100% sure.

In the case of SYCL, a 1D grid maps to a 2D grid in row-major order. E.g. a grid of size 6 would map to a 2D grid of size (3, 2) as:

0 → (0, 0); 1 → (0, 1)
2 → (1, 0); 3 → (1, 1)
4 → (2, 0); 5 → (2, 1)

And of this I'm 100% sure.
SYCL vectorizes on the innermost dimension (i.e. the dimension of size 2 in this case), so the sub-group size would be 2.
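As a plain-Python illustration of the two orderings discussed above (using the same (3, 2) grid; no SYCL or CUDA needed):

n_rows, n_cols = 3, 2

def row_major(i):
    # SYCL-style: the last (innermost) dimension varies fastest.
    return i // n_cols, i % n_cols

def column_major(i):
    # The ordering described for CUDA above: the first dimension varies fastest.
    return i % n_rows, i // n_rows

print([row_major(i) for i in range(6)])
# [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
print([column_major(i) for i in range(6)])
# [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]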

numba-dpex should follow SYCL semantics. But a year ago we observed that on some Intel HW numba-dpex generated a column-major grid instead of a row-major one, which significantly affected performance.

My guess is that the indexing is not what I expect and that work items are dispatched to the wrong execution units, which might cause suboptimal read/write patterns or prevent SIMD instructions.

This is possible. As I said above, we have observed such issues before. If you are able to create a minimal reproducer, it would help greatly.

Is there any kernel overhead induced by using a 2D grid, either in numba_dpex or in SYCL?

There shouldn't be any.

On the other hand, manually calculating 2D indexes from a 1D index shouldn't have a significant (if any) impact on performance. Memory-bound kernels stay memory-bound, and compute-bound kernels are usually too heavy to be affected by index calculation.
In my limited experience I was not able to find a performance difference between manually calculating 2D indexes from a 1D index and just using a 2D work group. Though probably I was just lucky.

@fcharras
Author

fcharras commented Mar 3, 2023

Thank you for the comprehensive explanation.

The documentation I used for CUDA is https://stackoverflow.com/a/15044884, which links to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy. Looking back at it, you're right, it seems to be column-major order; I read too fast.

It made me assume row-major order for SYCL anyway, which according to what you say is correct, yet I still see this performance loss. I will try column-major order instead and report back.

Sub-group size would be 2

To be clear, here you mean that 2 is what we can suppose or hope the sub_group_size actually used during execution will be? If I understand correctly, we don't really have leverage over what sub-group size is actually used during execution. When using a 2D group size, does it actually impact the "real" sub-group size?
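(As an aside, the sub-group sizes a device supports can at least be inspected from Python; a minimal sketch, assuming dpctl exposes the sub_group_sizes device descriptor, which the dpctl versions I know of do. This only lists the supported sizes; the size actually picked for a given kernel remains up to the compiler/driver.)

import dpctl

device = dpctl.select_default_device()
print(device.name)
print(device.sub_group_sizes)  # list of supported sub-group sizes, e.g. [8, 16, 32] on many Intel GPUs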

minimal reproducer

I will post one if I eventually have it, but currently the observation comes from a complicated kernel, which makes it time-consuming to extract minimal examples, especially when I don't understand the task dispatch mechanism well or when it's known to have rare unpredicted behavior :-/

On the other hand, manually calculating 2D indexes from a 1D index shouldn't have a significant (if any) impact on performance.

Thanks for the insight. I had read on some Stack Overflow thread that it could matter, but that makes sense too.

@fcharras
Author

fcharras commented Mar 3, 2023

(It is a bit confusing that internally numba_dpex inverts the sizes parameters:

self._global_range = list(args.global_range)[::-1]
)

@fcharras
Author

fcharras commented Mar 3, 2023

OK, so what I think is happening is that SYCL indeed maps with row-major order, but numba-dpex mimics CUDA and its column-major order, as can be seen in the snippet above.

I was finally able to test column-major order, and there is no more performance regression.
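(For reference, a hypothetical sketch of what switching to column-major indexing can look like; the kernel body and names are illustrative, not the actual kernel, and the grid dimensions passed at launch have to be swapped accordingly:)

@dpex.kernel
def kernel_2d_swapped(out, n_cols):
    # Treat grid axis 0 as the fast-varying index over the contiguous data dimension,
    # and axis 1 as the row index, i.e. the roles are swapped relative to the
    # row-major assumption used before.
    col = dpex.get_global_id(0)
    row = dpex.get_global_id(1)
    out[row * n_cols + col] = row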

Maybe the issue can be closed (or left open for discussion about the order that should be used for 2D work group sizes, SYCL or CUDA?)

@diptorupd
Contributor

(It is a bit confusing that internally numba_dpex inverts the sizes parameters

OK, so what I think is happening is that SYCL indeed maps with row-major order, but numba-dpex mimics CUDA and its column-major order, as can be seen in the snippet above.

Here is the relevant context: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:opencl:kernel-conventions-sycl.

dpex generates an OpenCL interoperability kernel for a @kernel decorated function. The numba_dpex.Range semantics follow the SYCL standard. The inversion is done to handle the differences as per this table:

SYCL kernel query                                   OpenCL kernel query   Returned value

With an enqueued 3D SYCL global range of range<3> R{r0, r1, r2}:
nd_item::get_global_range(0) / item::get_range(0)   get_global_size(2)    r0
nd_item::get_global_range(1) / item::get_range(1)   get_global_size(1)    r1
nd_item::get_global_range(2) / item::get_range(2)   get_global_size(0)    r2
nd_item::get_global_id(0) / item::get_id(0)         get_global_id(2)      value in range 0..(r0-1)
nd_item::get_global_id(1) / item::get_id(1)         get_global_id(1)      value in range 0..(r1-1)
nd_item::get_global_id(2) / item::get_id(2)         get_global_id(0)      value in range 0..(r2-1)

With an enqueued 2D SYCL global range of range<2> R{r0, r1}:
nd_item::get_global_range(0) / item::get_range(0)   get_global_size(1)    r0
nd_item::get_global_range(1) / item::get_range(1)   get_global_size(0)    r1
nd_item::get_global_id(0) / item::get_id(0)         get_global_id(1)      value in range 0..(r0-1)
nd_item::get_global_id(1) / item::get_id(1)         get_global_id(0)      value in range 0..(r1-1)

With an enqueued 1D SYCL global range of range<1> R{r0}:
nd_item::get_global_range(0) / item::get_range(0)   get_global_size(0)    r0
nd_item::get_global_id(0) / item::get_id(0)         get_global_id(0)      value in range 0..(r0-1)
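A small plain-Python sketch of the dimension flip described by this table (just the rule from the spec, not dpex internals):

def opencl_dim(sycl_dim, n_dims):
    # A SYCL index-space dimension maps to the OpenCL dimension counted from the end.
    return n_dims - 1 - sycl_dim

# 3D range: SYCL get_global_id(0) corresponds to OpenCL get_global_id(2), etc.
assert [opencl_dim(d, 3) for d in range(3)] == [2, 1, 0]
# 2D range: SYCL dims (0, 1) correspond to OpenCL dims (1, 0).
assert [opencl_dim(d, 2) for d in range(2)] == [1, 0]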

@diptorupd
Contributor

diptorupd commented Mar 9, 2023

dpex generates an OpenCL interoperability kernel for a @kernel decorated function. The numba_dpex.Range semantics follow the SYCL standard. The inversion is done to handle the differences as per this table:

@fcharras On what target device are you facing this issue? Is it Level Zero? I am checking whether the above rules still apply to L0.

I will clearly document this behavior in our user manual so that everyone has the correct context for what get_global_id means. I also plan to change the user-level get_global_id intrinsic to follow the SYCL standard, with numba-dpex making the necessary changes internally if the kernel needs to be submitted as an OpenCL interop kernel.

@AlexanderKalistratov
Contributor

@diptorupd I think you can't get SYCL semantics from OCL semantics just by reversing the range order. OCL vectorizes on the first dimension of the range, while SYCL vectorizes on the last dimension. This doesn't change if you just reverse the range order.

@Hardcode84
Contributor

Hardcode84 commented Mar 9, 2023

@diptorupd I think you can't get SYCL semantics from OCL semantics just by reversing the range order. OCL vectorizes on the first dimension of the range, while SYCL vectorizes on the last dimension. This doesn't change if you just reverse the range order.

This vectorization choice is made by the IGC compiler in both cases, and IGC doesn't know or care about the frontend. The only thing it sees is the SPIR-V binary. So if you are getting different vectorization choices, it means something is different in the source SPIR-V. Another possibility is that SYCL completely skips IGC vectorization and generates already-vectorized SPIR-V, but IIRC this is not the case. Has anyone actually looked into the SPIR-V differences between Intel OpenCL and SYCL?

@diptorupd
Contributor

diptorupd commented Mar 10, 2023

I think you can't get SYCL semantics from OCL semantics just by reversing the range order. OCL vectorizes on the first dimension of the range, while SYCL vectorizes on the last dimension. This doesn't change if you just reverse the range order.

I do not follow. Are you saying that we cannot support SYCL semantics while specifying the range in SYCL indexing order and submitting an OpenCL interop kernel?

My point is that dpex always generates an OpenCL interoperability kernel at the SPIR-V level. The SPIR-V for the indexing calls generated by dpex and dpc++ will not be the same: the dpc++ front-end generates the indexing based on the SYCL spec, while dpex does so based on the OpenCL spec, given that we always compile an OpenCL program at the SPIR-V level and then create a SYCL interoperability kernel. The interoperability kernel is then used to create a SYCL KernelBundle and is submitted as such.

The confusion arises because you expect dpex.get_global_id to work like SYCL's nd_item::get_global_id when it is in fact the equivalent of OpenCL's get_global_id. Hence, we flip the range order as specified in the spec:

When specifying a range as the global or local size in a parallel_for that invokes an OpenCL interop kernel (through cl_kernel interop), the highest dimension of the range in SYCL will map to the lowest dimension within the OpenCL kernel. That statement applies to both an underlying enqueue operation such as clEnqueueNDRangeKernel in OpenCL, and also ID and size queries within the OpenCL kernel. For example, a 3D global range specified in SYCL as:

range<3> R { r0, r1, r2 };
maps to an clEnqueueNDRangeKernel global_work_size argument of:
size_t cl_interop_range[3] = { r2, r1, r0 };

The real question is what semantics dpex should follow for its front-end indexing functions. I am in favor of switching to SYCL semantics; refer to the two-year-old issue #274. Maybe it is time to finally fix it?

@diptorupd
Contributor

Maybe the issue can be closed (or left open for discussion about the order that should be used for 2D work group sizes, SYCL or CUDA?)

I am closing the issue and moving the discussion to #964. Please add your comments under the discussion, so that we can develop a design spec for addressing the issue.

@fcharras
Author

fcharras commented Mar 10, 2023

@fcharras On what target device are you facing this issue? Is it Level Zero? I am checking whether the above rules still apply to L0.

Yes, it's Level Zero. I usually have the L0 runtime installed in my environment and the kernels run on it by default (level_zero devices get a better score than opencl devices in dpctl's selection functions). I didn't check whether the issue also shows with opencl devices (will do). Having different dispatch behavior with level_zero and opencl would be troublesome.
