
A kernel with a 2D grid of work items is slower than the same kernel that uses a 1D grid but internally remaps items using // and % #941

Closed
fcharras opened this issue Feb 28, 2023 · 11 comments
Labels: user (User submitted issue)


@fcharras

fcharras commented Feb 28, 2023

I've witnessed a performance hit between kernels that should run equivalent instructions: one is run with 2D global and work-group sizes, the other with 1D global and work-group sizes. In both cases the work groups are of equal size, and the kernel with the 1D sizes uses % and // to remap the id of a work item (from dpex.get_global/local_id(0) calls) to the ids that one gets by calling dpex.get_global/local_id(0) and dpex.get_global/local_id(1), assuming row-major order in the 2D grid.
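For concreteness, here is a minimal sketch of the two variants being compared. The kernel bodies, array names, and sizes are illustrative (not the actual lloyd_single_step kernel), and the kernel[global_size, local_size] launch syntax assumes the numba_dpex API of that era, so adjust for your version:

import dpctl.tensor as dpt
import numba_dpex as dpex

@dpex.kernel
def kernel_2d(out, n_cols):
    # Native 2D grid: one work item per (row, col) pair.
    row = dpex.get_global_id(0)
    col = dpex.get_global_id(1)
    out[row * n_cols + col] = row

@dpex.kernel
def kernel_1d(out, n_cols):
    # Flat 1D grid, remapped to (row, col) with // and %, assuming row-major order.
    i = dpex.get_global_id(0)
    row = i // n_cols
    col = i % n_cols
    out[row * n_cols + col] = row

n_rows, n_cols = 1024, 128
out = dpt.zeros(n_rows * n_cols, dtype="int32")
kernel_2d[(n_rows, n_cols), (1, n_cols)](out, n_cols)  # 2D global / work-group sizes
kernel_1d[n_rows * n_cols, n_cols](out, n_cols)        # equivalent 1D launch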

I don't have a minimal reproducer yet, but I'm opening this issue early because the performance hit I've seen is serious (a 30% performance loss), while a 2D grid of work items should if anything benefit performance (because remapping with // and % can be expensive), or at worst not reduce it.

See soda-inria/sklearn-numba-dpex#98 for more information. 90% of the execution time of the KMeans comes from the lloyd_single_step kernel there, and this kernel shows a 30% performance hit just from remapping work items to a 2D grid.

Questions and suggestions:

  • On GPU, I would expect a 2D grid to index the equivalent 1D grid of work items in row-major order, as is supposed to be the case with CUDA kernels. Is there such an equivalence between 1D and 2D grids that one can expect with numba_dpex / SYCL?
  • My guess is that the indexing is not what I expect and that work items are dispatched to the wrong execution units, which might cause suboptimal read/write patterns or prevent SIMD instructions.
  • Is there any kernel overhead induced by using a 2D grid, either in numba_dpex or in SYCL?
@AlexanderKalistratov
Contributor

AlexanderKalistratov commented Mar 2, 2023

@fcharras some comments:

On GPU, I would expect a 2D grid to index the equivalent 1D grid of work items in row-major order, as is supposed to be the case with CUDA kernels. Is there such an equivalence between 1D and 2D grids that one can expect with numba_dpex / SYCL?

I'm not really familiar with CUDA, but I always thought that in CUDA a 1D grid maps to a 2D grid in column-major order. E.g. a 1D grid of size 6 would map to a 2D grid of size (3, 2) as:

0 → (0, 0); 1 → (1, 0); 2 → (2, 0)
3 → (0, 1); 4 → (1, 1); 5 → (2, 1)

I was not able to find the CUDA docs quickly, but this article states so (as far as I understand it):
https://erangad.medium.com/1d-2d-and-3d-thread-allocation-for-loops-in-cuda-e0f908537a52
Though I'm not 100% sure.

In the case of SYCL, a 1D grid maps to a 2D grid in row-major order. E.g. a grid of size 6 would map to a 2D grid of size (3, 2) as:

0 → (0, 0); 1 → (0, 1)
2 → (1, 0); 3 → (1, 1)
4 → (2, 0); 5 → (2, 1)

And of this I'm 100% sure.
SYCL vectorizes on the innermost dimension (i.e. the dimension of size 2 in this case), so the sub-group size would be 2.
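As a plain-Python illustration of the two orderings discussed above (using the same (3, 2) grid; no SYCL or CUDA needed):

n_rows, n_cols = 3, 2

def row_major(i):
    # SYCL-style: the last (innermost) dimension varies fastest.
    return i // n_cols, i % n_cols

def column_major(i):
    # The ordering described for CUDA above: the first dimension varies fastest.
    return i % n_rows, i // n_rows

print([row_major(i) for i in range(6)])
# [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
print([column_major(i) for i in range(6)])
# [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]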

numba-dpex should follow SYCL semantics. But a year ago we observed that on some Intel HW numba-dpex generated a column-major grid instead of a row-major one, which significantly affected performance.

My guess is that the indexing is not what I expect and that work items are dispatched to the wrong execution units, which might cause suboptimal read/write patterns or prevent SIMD instructions.

This is possible. As I said above, we have observed such issues before. If you are able to create a minimal reproducer, it would help greatly.

Is there any kernel overhead induced by using a 2D grid, either in numba_dpex or in SYCL?

There shouldn't be any.

On the other hand, manually calculating 2D indexes from a 1D index shouldn't have a significant (if any) impact on performance. Memory-bound kernels stay memory-bound, and compute-bound kernels are usually too heavy to be affected by index calculation.
In my limited experience I was not able to find a performance difference between manually calculating 2D indexes from a 1D index and just using a 2D work group. Though probably I was just lucky.

@fcharras
Author

fcharras commented Mar 3, 2023

Thank you for the comprehensive explanation.

The documentation I used for CUDA is https://stackoverflow.com/a/15044884, which links to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy. Looking back at it, you're right, it seems to be column-major order; I read too fast.

It made me assume row-major order for SYCL anyway, which according to what you say is correct, yet I still see this performance loss. I will try column-major order instead and report back.

Sub-group size would be 2

To be clear, here you mean that 2 is what we can suppose or hope the sub_group_size actually used during execution will be? If I understand correctly, we don't really have leverage over what sub-group size is actually used during execution. When using a 2D group size, does it actually impact the "real" sub-group size?
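(As an aside, the sub-group sizes a device supports can at least be inspected from Python; a minimal sketch, assuming dpctl exposes the sub_group_sizes device descriptor, which the dpctl versions I know of do. This only lists the supported sizes; the size actually picked for a given kernel remains up to the compiler/driver.)

import dpctl

device = dpctl.select_default_device()
print(device.name)
print(device.sub_group_sizes)  # list of supported sub-group sizes, e.g. [8, 16, 32] on many Intel GPUs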

minimal reproducer

I will post one if I eventually have it, but currently the observation comes from a complicated kernel, which makes it time-consuming to extract minimal examples, especially when I don't understand the task dispatch mechanism well or when it's known to have rare unpredicted behavior :-/

On the other hand, manually calculating 2D indexes from a 1D index shouldn't have a significant (if any) impact on performance.

Thanks for the insight. I had read on some Stack Overflow thread that it could matter, but that makes sense too.

@fcharras
Author

fcharras commented Mar 3, 2023

(It is a bit confusing that internally numba_dpex inverts the sizes parameters:

self._global_range = list(args.global_range)[::-1]
)

@fcharras
Author

fcharras commented Mar 3, 2023

OK, so what I think is happening is that SYCL indeed maps with row-major order, but numba-dpex mimics CUDA and its column-major order, as can be seen in the snippet above.

I was finally able to test column-major order, and there is no more performance regression.
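(For reference, a hypothetical sketch of what switching to column-major indexing can look like; the kernel body and names are illustrative, not the actual kernel, and the grid dimensions passed at launch have to be swapped accordingly:)

@dpex.kernel
def kernel_2d_swapped(out, n_cols):
    # Treat grid axis 0 as the fast-varying index over the contiguous data dimension,
    # and axis 1 as the row index, i.e. the roles are swapped relative to the
    # row-major assumption used before.
    col = dpex.get_global_id(0)
    row = dpex.get_global_id(1)
    out[row * n_cols + col] = row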

Maybe the issue can be closed (or left open for discussion about the order that should be used for 2D work group sizes, SYCL or CUDA?)

@diptorupd
Contributor

(It is a bit confusing that internally numba_dpex inverts the sizes parameters

OK, so what I think is happening is that SYCL indeed maps with row-major order, but numba-dpex mimics CUDA and its column-major order, as can be seen in the snippet above.

Here is the relevant context: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:opencl:kernel-conventions-sycl.

dpex generates an OpenCL interoperability kernel for a @kernel decorated function. The numba_dpex.Range semantics follow the SYCL standard. The inversion is done to handle the differences as per this table:

SYCL kernel query                                   OpenCL kernel query   Returned value

With an enqueued 3D SYCL global range of range<3> R{r0, r1, r2}:
nd_item::get_global_range(0) / item::get_range(0)   get_global_size(2)    r0
nd_item::get_global_range(1) / item::get_range(1)   get_global_size(1)    r1
nd_item::get_global_range(2) / item::get_range(2)   get_global_size(0)    r2
nd_item::get_global_id(0) / item::get_id(0)         get_global_id(2)      value in range 0..(r0-1)
nd_item::get_global_id(1) / item::get_id(1)         get_global_id(1)      value in range 0..(r1-1)
nd_item::get_global_id(2) / item::get_id(2)         get_global_id(0)      value in range 0..(r2-1)

With an enqueued 2D SYCL global range of range<2> R{r0, r1}:
nd_item::get_global_range(0) / item::get_range(0)   get_global_size(1)    r0
nd_item::get_global_range(1) / item::get_range(1)   get_global_size(0)    r1
nd_item::get_global_id(0) / item::get_id(0)         get_global_id(1)      value in range 0..(r0-1)
nd_item::get_global_id(1) / item::get_id(1)         get_global_id(0)      value in range 0..(r1-1)

With an enqueued 1D SYCL global range of range<1> R{r0}:
nd_item::get_global_range(0) / item::get_range(0)   get_global_size(0)    r0
nd_item::get_global_id(0) / item::get_id(0)         get_global_id(0)      value in range 0..(r0-1)
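A small plain-Python sketch of the dimension flip described by this table (just the rule from the spec, not dpex internals):

def opencl_dim(sycl_dim, n_dims):
    # A SYCL index-space dimension maps to the OpenCL dimension counted from the end.
    return n_dims - 1 - sycl_dim

# 3D range: SYCL get_global_id(0) corresponds to OpenCL get_global_id(2), etc.
assert [opencl_dim(d, 3) for d in range(3)] == [2, 1, 0]
# 2D range: SYCL dims (0, 1) correspond to OpenCL dims (1, 0).
assert [opencl_dim(d, 2) for d in range(2)] == [1, 0]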

@diptorupd
Contributor

diptorupd commented Mar 9, 2023

dpex generates an OpenCL interoperability kernel for a @kernel decorated function. The numba_dpex.Range semantics follow the SYCL standard. The inversion is done to handle the differences as per this table:

@fcharras On what target device are you facing this issue? Is it Level Zero? I am checking whether the above rules still apply to L0.

I will clearly document this behavior in our user manual so that everyone has the correct context for what get_global_id means. I also plan to change the user-level get_global_id intrinsic to follow the SYCL standard, with numba-dpex making the necessary changes internally if the kernel needs to be submitted as an OpenCL interop kernel.

@AlexanderKalistratov
Contributor

@diptorupd I think you can't get SYCL semantics from OCL semantics just by reversing the range order. OCL vectorizes on the first dimension of the range, while SYCL vectorizes on the last dimension. This doesn't change if you just reverse the range order.

@Hardcode84
Contributor

Hardcode84 commented Mar 9, 2023

@diptorupd I think you can't get SYCL semantics from OCL semantics just by reversing the range order. OCL vectorizes on the first dimension of the range, while SYCL vectorizes on the last dimension. This doesn't change if you just reverse the range order.

This vectorization choice is made by the IGC compiler in both cases, and IGC doesn't know or care about the frontend. The only thing it sees is the SPIR-V binary. So if you are getting different vectorization choices, it means something is different in the source SPIR-V. Another possibility is that SYCL completely skips IGC vectorization and generates already-vectorized SPIR-V, but IIRC this is not the case. Has anyone actually looked into the SPIR-V differences between Intel OpenCL and SYCL?

@diptorupd
Contributor

diptorupd commented Mar 10, 2023

I think you can't get SYCL semantics from OCL semantics just by reversing the range order. OCL vectorizes on the first dimension of the range, while SYCL vectorizes on the last dimension. This doesn't change if you just reverse the range order.

I do not follow. Are you saying that we cannot support SYCL semantics while specifying the range in SYCL indexing order and submitting an OpenCL interop kernel?

My point is that dpex always generates an OpenCL interoperability kernel at the SPIR-V level. The SPIR-V for the indexing calls generated by dpex and dpc++ will not be the same: the dpc++ front-end generates the indexing based on the SYCL spec, while dpex does so based on the OpenCL spec, given that we always compile an OpenCL program at the SPIR-V level and then create a SYCL interoperability kernel. The interoperability kernel is then used to create a SYCL KernelBundle and is submitted as such.

The confusion arises because you expect dpex.get_global_id to work like SYCL's nd_item::get_global_id when it is in fact the equivalent of OpenCL's get_global_id. Hence, we flip the range order as specified in the spec:

When specifying a range as the global or local size in a parallel_for that invokes an OpenCL interop kernel (through cl_kernel interop), the highest dimension of the range in SYCL will map to the lowest dimension within the OpenCL kernel. That statement applies to both an underlying enqueue operation such as clEnqueueNDRangeKernel in OpenCL, and also ID and size queries within the OpenCL kernel. For example, a 3D global range specified in SYCL as:

range<3> R { r0, r1, r2 };
maps to an clEnqueueNDRangeKernel global_work_size argument of:
size_t cl_interop_range[3] = { r2, r1, r0 };

The real question is what semantics dpex should follow for its front-end indexing functions. I am in favor of switching to SYCL semantics; refer to the two-year-old issue #274. Maybe it is time to finally fix it?

@diptorupd
Contributor

Maybe the issue can be closed (or left open for discussion about the order that should be used for 2D work group sizes, SYCL or CUDA?)

I am closing the issue and moving the discussion to #964. Please add your comments under the discussion, so that we can develop a design spec for addressing the issue.

@fcharras
Author

fcharras commented Mar 10, 2023

@fcharras On what target device are you facing this issue? Is it Level Zero? I am checking whether the above rules still apply to L0.

Yes, it's Level Zero. I usually have the L0 runtime installed in my environment and the kernels run on it by default (level_zero devices get a better score than opencl devices in dpctl's selection functions). I didn't check whether the issue also shows with opencl devices (will do). Having different dispatch behavior with level_zero and opencl would be troublesome.
