
Anti-pattern in dpex.DEFAULT_LOCAL_SIZE #766

Closed
fcharras opened this issue Aug 17, 2022 · 6 comments
fcharras commented Aug 17, 2022

As I understand it, choosing a local size for running a kernel should follow a few rules to ensure that the kernel execution maps well onto the underlying hardware:

  • preferably, it should be a multiple of the size of the pool of threads that execute in lockstep at the hardware level (what would be called the warp size on NVIDIA GPUs or the wavefront size on AMD GPUs);

  • and it should be at least equal to this value: if it is smaller, the remaining threads of a warp stay idle (causing underload and hurting performance). In general, part of the device stays idle whenever the group size is not a multiple of the warp size, as the small example below illustrates.
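A small worked example of that idle-lane argument, assuming a lockstep width of 64 (matching the "Preferred work group size multiple" reported by clinfo below):

import math

warp_size = 64     # assumed hardware lockstep width (warp / wavefront / sub-group)
group_size = 100   # a local size that is not a multiple of warp_size

warps_used = math.ceil(group_size / warp_size)    # 2 warps get scheduled
idle_lanes = warps_used * warp_size - group_size  # 28 of the 128 lanes stay idle
print(warps_used, idle_lanes)                     # 2 28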

Among other information, clinfo can display the maximum group size and the value the group size should preferably be a multiple of:

> clinfo
...
Device Name                                     Intel(R) UHD Graphics [0x9a60]
...
Device Type                                     GPU
...
Max work group size                             512
Preferred work group size multiple              64
...

On the Python side, those values are also exposed by pyopencl for OpenCL devices.
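For instance, a minimal pyopencl sketch of both queries (assuming pyopencl is installed and at least one OpenCL device is visible; the stub kernel is there only because the preferred multiple is a per-kernel query):

import pyopencl as cl

device = cl.get_platforms()[0].get_devices()[0]
print(device.max_work_group_size)  # e.g. 512 on the device above

# The preferred multiple is reported per kernel, so build a trivial stub kernel.
ctx = cl.Context([device])
program = cl.Program(ctx, "__kernel void stub(__global int *a) {}").build()
print(program.stub.get_work_group_info(
    cl.kernel_work_group_info.PREFERRED_WORK_GROUP_SIZE_MULTIPLE, device))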

dpex.DEFAULT_LOCAL_SIZE seems to enforce different rules:

  • it tries to set a value that is actually reasonable (it is 512 on my computer; is it hardcoded or does it change with the hardware?);
  • in any case, the local size must divide the global size. The global size is user-provided and could be any value; usually it is related to the size of the input (e.g. the number of rows in an array). If the default value above does not divide the global size, it falls back on the largest divisor of global_size that is smaller than that default.

Here are a few examples:

import numba_dpex as dpex
import dpctl
import numpy as np


def inspect_default_local_size(global_size):
    @dpex.kernel
    def kernel(array):
        idx = dpex.get_global_id(0)
        size = dpex.get_local_size(0)
        array[idx] = size

    array = dpctl.tensor.empty(sh=global_size, dtype=np.int32)
    kernel[global_size, dpex.DEFAULT_LOCAL_SIZE](array)
    return dpctl.tensor.asnumpy(array)[0]


print(inspect_default_local_size(8192))   # prints 512, OK
print(inspect_default_local_size(10000))  # prints 400 (suboptimal but not too bad)
print(inspect_default_local_size(9973))   # prints 1 because 9973 is prime, looks very suspect

If the local size picked by dpex.DEFAULT_LOCAL_SIZE is close enough to a multiple of the recommended value, there should not be a significant impact on performance, and the inconvenience might be counterbalanced by the fact that it saves implementing a boundary check in the kernel.

But if it is not (e.g. when global_size is a prime number, forcing the default local size to 1 (!)), the performance drop could be massive: only one thread per warp would effectively be used in this case.

I think the user should be responsible for choosing the global and local work-group sizes and for adapting the behavior of the kernel at the boundaries if necessary; working with a fixed local size and adapting the kernel seems like better practice than ignoring boundaries and adapting the local_size. It would read like:

import numba_dpex as dpex
import dpctl
import numpy as np
import math


@dpex.kernel
def kernel(array):
    idx = dpex.get_global_id(0)
    n = array.shape[0]
    if idx >= n:
        return
    array[idx] = idx


array_size = 9973
array = dpctl.tensor.empty(sh=array_size, dtype=np.int32)
local_size = 512
global_size = math.ceil(array_size / local_size) * local_size
kernel[global_size, local_size](array)

Exposing an automatic setting for the local size would be counterproductive, because it suggests the opposite practice to the user.

If anything, numba_dpex could expose the maximum possible local size and the value it is recommended to be a multiple of.
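For reference, the maximum work-group size is already queryable from Python through dpctl (a minimal sketch, assuming a default-selectable SYCL device; exposing the preferred multiple was still in progress at the time, see IntelPython/dpctl#894 mentioned below):

import dpctl

device = dpctl.SyclDevice()  # device returned by the default selector
print(device.max_work_group_size)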

ogrisel commented Sep 8, 2022

So a possible solution would be to write kernels that can deal with out-of-bounds global indices, always invoke them with global_size set to a multiple of the "Preferred work group size multiple", and then safely use the dpex.DEFAULT_LOCAL_SIZE token for the local_size field of dpex kernel invocations.

One way to achieve this would be to use dpctl to introspect the preferred_work_group_size_multiple attribute (see IntelPython/dpctl#894) of a stub kernel scheduled on the same device, as done in the test script of #769. Then the global_size could be expanded to the nearest multiple of that value.

It would be great if numba_dpex provided a high-level tool to achieve that pattern automatically, or with very little extra boilerplate code.
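A minimal sketch of the padding step in that pattern; how the preferred multiple is obtained (e.g. stub-kernel introspection through dpctl, as in #769) is deliberately left as an input here, since the exact dpctl attribute depends on the dpctl version:

import math

def padded_global_size(n_work_items, preferred_multiple):
    # Smallest multiple of preferred_multiple covering all work items; the kernel
    # is then expected to guard against the trailing out-of-bounds indices.
    return math.ceil(n_work_items / preferred_multiple) * preferred_multiple

print(padded_global_size(9973, 64))  # 9984 == 156 * 64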

oleksandr-pavlyk commented Sep 14, 2022

Please refer to the section about selection of work-group size in the GPU optimization guide.

Notice that Level Zero provides a function, zeKernelSuggestGroupSize, to determine the work-group size for a kernel.

The actual implementation of the function can be found in https://github.com/intel/compute-runtime/blob/master/level_zero/core/source/kernel/kernel_imp.cpp#L368-L406

I am sure an equivalent function exists for the OpenCL backend.

fcharras commented Sep 21, 2022

Per the GPU optimization guide:

So in general you should pick the maximum work-group size supported by the accelerator. The maximum work-group size can be queried by the call device::get_info<cl::sycl::info::device::max_work_group_size>()

but that is not what dpex.DEFAULT_LOCAL_SIZE seems to be doing anyway, and with the current design (which seems to assume that the kernel is not aware of input shapes and then bends the group size into a divisor of the global size, when the opposite practice seems more sensible) it cannot claim to do so. My initial take was that this issue could be flagged as a bug (based on what happens when global_size is a prime number) and that this feature should maybe be dropped.

As a replacement, having access to functions such as zeKernelSuggestGroupSize in dpctl would be neat.

Edit: there are also similar recommendations in the OpenCL programming guide, with somewhat different nuances.

diptorupd commented Oct 7, 2022

@fcharras @ogrisel @oleksandr-pavlyk Sorry to chime in late. For various reasons (including new dad duties), numba-dpex has been in a bit of stasis, but it is kicking back into life.

Firstly, I fully agree that the name dpex.DEFAULT_LOCAL_SIZE is a poor choice and confusing.

What does the dpex.DEFAULT_LOCAL_SIZE do?

TL;DR: leave the local range selection to the SYCL runtime.

I am trying to recall why we added it in the first place, because the local size parameter is entirely optional: the only effect of passing dpex.DEFAULT_LOCAL_SIZE in the kernel call is to set the local range internally to None and then call sycl::queue.parallel_for(range<3>{gr[0], gr[1], gr[2]}, ...);, where gr is the list that was passed to the kernel from Python. This tells the SYCL runtime to do whatever it thinks is appropriate.

If someone wants to set a specific local range, then a list corresponding to the local range has to be passed as the second argument to the [] operator of the kernel. What that does is call sycl::queue.parallel_for(nd_range<3>{{gr[0], gr[1], gr[2]}, {lr[0], lr[1], lr[2]}}, ...);
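For readers of the thread, a small Python-side sketch contrasting the two launch forms just described (using the pre-0.20 API from the earlier examples in this issue):

import numba_dpex as dpex
import dpctl
import numpy as np


@dpex.kernel
def fill_idx(array):
    array[dpex.get_global_id(0)] = dpex.get_global_id(0)


array = dpctl.tensor.empty(sh=1024, dtype=np.int32)

# Local range left to the SYCL runtime: dispatched as a plain range<3> parallel_for.
fill_idx[1024, dpex.DEFAULT_LOCAL_SIZE](array)

# Explicit local range: dispatched as an nd_range<3> parallel_for.
fill_idx[1024, 128](array)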

Drop the feature

I never liked the [] syntax; it is a legacy of us mimicking the numba.cuda design. I am in favour of dropping both the [] syntax and dpex.DEFAULT_LOCAL_SIZE. I have created a separate ticket, #790, to track the discussion around how users can provide their global and local range preferences during kernel launch.

dpex or dpctl provides a tool that can come up with a local range recommendation based on some heuristics

That would be useful, and I have been toying with the idea for a while. My plan was to explore some kind of auto-tuning to help define the local range, but a heuristics-driven cost model should be a decent starting point. Let me think a bit more and open a separate discussion thread.

fcharras commented Oct 7, 2022

Leave the local range selection to the SYCL runtime

I'm not convinced by this design, because in all cases it requires implicit decisions from the runtime that can only hurt performance, and it will mislead beginners about its ins and outs. The main obstacle is that the SYCL runtime is not aware of the user's intention regarding whether the kernel checks boundaries before calling __getitem__ on arrays. IMO the user should be required to set local_size and global_size all the time. The only help that can be provided is exposing a function that suggests a good local_size.

diptorupd commented Mar 11, 2023

numba-dpex 0.20 has deprecated the DEFAULT_LOCAL_SIZE flag and will remove it in the next release. Instead, we now require kernel launch arguments to be explicitly specified as numba_dpex.Range or numba_dpex.NdRange instances. The choice of which indexer class to use to launch a kernel is left to the user: depending on what is specified, a kernel is launched either as a range kernel or as an nd-range kernel.
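For illustration, a hedged sketch of the new launch style with these indexer classes (an approximation based on the 0.20+ API described above, not the official example; check the release notes of your numba_dpex version for the exact spelling):

import numba_dpex as dpex
import dpctl
import numpy as np


@dpex.kernel
def fill_idx(array):
    array[dpex.get_global_id(0)] = dpex.get_global_id(0)


array = dpctl.tensor.empty(sh=1024, dtype=np.int32)

# Range kernel: only a global range is given; the local range is left to the runtime.
fill_idx[dpex.Range(1024)](array)

# NdRange kernel: both the global and the local range are given explicitly.
fill_idx[dpex.NdRange(dpex.Range(1024), dpex.Range(128))](array)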
