
Anti-pattern in dpex.DEFAULT_LOCAL_SIZE #766

Closed
fcharras opened this issue Aug 17, 2022 · 6 comments
fcharras commented Aug 17, 2022

As I understand it, choosing a local size for running a kernel should follow a few rules to ensure that the kernel execution maps well onto the underlying hardware:

  • preferably, it should be a multiple of the size of the pool of threads that execute in lockstep at the hardware level (what would be called the warp size on NVIDIA GPUs or the wavefront size on AMD GPUs);

  • and it should be at least equal to this value: if it is smaller, the remaining threads of a warp stay idle (causing underload and hurting performance). In general, part of the device stays idle whenever the group size is not a multiple of the warp size, as the small example below illustrates.
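A small worked example of that idle-lane argument, assuming a lockstep width of 64 (matching the "Preferred work group size multiple" reported by clinfo below):

import math

warp_size = 64     # assumed hardware lockstep width (warp / wavefront / sub-group)
group_size = 100   # a local size that is not a multiple of warp_size

warps_used = math.ceil(group_size / warp_size)    # 2 warps get scheduled
idle_lanes = warps_used * warp_size - group_size  # 28 of the 128 lanes stay idle
print(warps_used, idle_lanes)                     # 2 28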

Among other information, clinfo can display the maximum group size and the value the group size should preferably be a multiple of:

> clinfo
...
Device Name                                     Intel(R) UHD Graphics [0x9a60]
...
Device Type                                     GPU
...
Max work group size                             512
Preferred work group size multiple              64
...

On the Python side, those values are also exposed by pyopencl for OpenCL devices.
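For instance, a minimal pyopencl sketch of both queries (assuming pyopencl is installed and at least one OpenCL device is visible; the stub kernel is there only because the preferred multiple is a per-kernel query):

import pyopencl as cl

device = cl.get_platforms()[0].get_devices()[0]
print(device.max_work_group_size)  # e.g. 512 on the device above

# The preferred multiple is reported per kernel, so build a trivial stub kernel.
ctx = cl.Context([device])
program = cl.Program(ctx, "__kernel void stub(__global int *a) {}").build()
print(program.stub.get_work_group_info(
    cl.kernel_work_group_info.PREFERRED_WORK_GROUP_SIZE_MULTIPLE, device))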

dpex.DEFAULT_LOCAL_SIZE seems to enforce different rules:

  • it tries to set a value that is actually reasonable (it is 512 on my computer; is it hardcoded or does it change with the hardware?);
  • in any case, the local size must divide the global size. The global size is user-provided and could be any value; usually it is related to the size of the input (e.g. the number of rows in an array). If the default value above does not divide the global size, it falls back on the largest divisor of global_size that is smaller than that default.

Here are a few examples:

import numba_dpex as dpex
import dpctl
import numpy as np


def inspect_default_local_size(global_size):
    @dpex.kernel
    def kernel(array):
        idx = dpex.get_global_id(0)
        size = dpex.get_local_size(0)
        array[idx] = size

    array = dpctl.tensor.empty(sh=global_size, dtype=np.int32)
    kernel[global_size, dpex.DEFAULT_LOCAL_SIZE](array)
    return dpctl.tensor.asnumpy(array)[0]


print(inspect_default_local_size(8192))   # prints 512, OK
print(inspect_default_local_size(10000))  # prints 400 (suboptimal but not too bad)
print(inspect_default_local_size(9973))   # prints 1 because 9973 is prime, looks very suspect

If the local size picked by dpex.DEFAULT_LOCAL_SIZE is close enough to a multiple of the recommended value, there should not be a significant impact on performance, and the inconvenience might be counterbalanced by the fact that it saves implementing a boundary check in the kernel.

But if it is not (e.g. when global_size is a prime number, forcing the default local size to 1 (!)), the performance drop could be massive: only one thread per warp would effectively be used in this case.

I think the user should be responsible for choosing the global and local work-group sizes and for adapting the behavior of the kernel at the boundaries if necessary; working with a fixed local size and adapting the kernel seems like better practice than ignoring boundaries and adapting the local_size. It would read like:

import numba_dpex as dpex
import dpctl
import numpy as np
import math


@dpex.kernel
def kernel(array):
    idx = dpex.get_global_id(0)
    n = array.shape[0]
    if idx >= n:
        return
    array[idx] = idx


array_size = 9973
array = dpctl.tensor.empty(sh=array_size, dtype=np.int32)
local_size = 512
global_size = math.ceil(array_size / local_size) * local_size
kernel[global_size, local_size](array)

Exposing an automatic setting for the local size would be counterproductive, because it suggests the opposite practice to the user.

If anything, numba_dpex could expose the maximum possible local size and the value it is recommended to be a multiple of.
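For reference, the maximum work-group size is already queryable from Python through dpctl (a minimal sketch, assuming a default-selectable SYCL device; exposing the preferred multiple was still in progress at the time, see IntelPython/dpctl#894 mentioned below):

import dpctl

device = dpctl.SyclDevice()  # device returned by the default selector
print(device.max_work_group_size)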

ogrisel commented Sep 8, 2022

So a possible solution would be to write kernels that can deal with out-of-bounds global indices, always invoke them with global_size set to a multiple of the "Preferred work group size multiple", and then safely use the dpex.DEFAULT_LOCAL_SIZE token for the local_size field of dpex kernel invocations.

One way to achieve this would be to use dpctl to introspect the preferred_work_group_size_multiple attribute (see IntelPython/dpctl#894) of a stub kernel scheduled on the same device, as done in the test script of #769. Then the global_size could be expanded to the nearest multiple of that value.

It would be great if numba_dpex provided a high-level tool to achieve that pattern automatically, or with very little extra boilerplate code.
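A minimal sketch of the padding step in that pattern; how the preferred multiple is obtained (e.g. stub-kernel introspection through dpctl, as in #769) is deliberately left as an input here, since the exact dpctl attribute depends on the dpctl version:

import math

def padded_global_size(n_work_items, preferred_multiple):
    # Smallest multiple of preferred_multiple covering all work items; the kernel
    # is then expected to guard against the trailing out-of-bounds indices.
    return math.ceil(n_work_items / preferred_multiple) * preferred_multiple

print(padded_global_size(9973, 64))  # 9984 == 156 * 64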

oleksandr-pavlyk commented Sep 14, 2022

Please refer to the section about selection of work-group size in the GPU optimization guide.

Notice that Level Zero provides a function, zeKernelSuggestGroupSize, to determine the work-group size for a kernel.

The actual implementation of the function can be found in https://github.com/intel/compute-runtime/blob/master/level_zero/core/source/kernel/kernel_imp.cpp#L368-L406

I am sure an equivalent function exists for the OpenCL backend.

fcharras commented Sep 21, 2022

Per the GPU optimization guide:

So in general you should pick the maximum work-group size supported by the accelerator. The maximum work-group size can be queried by the call device::get_info<cl::sycl::info::device::max_work_group_size>()

but that is not what dpex.DEFAULT_LOCAL_SIZE seems to be doing anyway, and with the current design (which seems to assume that the kernel is not aware of input shapes and then bends the group size into a divisor of the global size, when the opposite practice seems more sensible) it cannot claim to do so. My initial take was that this issue could be flagged as a bug (based on what happens when global_size is a prime number) and that this feature should maybe be dropped.

As a replacement, having access to functions such as zeKernelSuggestGroupSize in dpctl would be neat.

Edit: there are also similar recommendations in the OpenCL programming guide, with somewhat different nuances.

diptorupd commented Oct 7, 2022

@fcharras @ogrisel @oleksandr-pavlyk Sorry to chime in late. For various reasons (including new dad duties), numba-dpex has been in a bit of stasis, but it is kicking back into life.

Firstly, I fully agree that the name dpex.DEFAULT_LOCAL_SIZE is a poor choice and confusing.

What does the dpex.DEFAULT_LOCAL_SIZE do?

TL;DR: leave the local range selection to the SYCL runtime.

I am trying to recall why we added it in the first place, because the local size parameter is entirely optional: the only effect of passing dpex.DEFAULT_LOCAL_SIZE in the kernel call is to set the local range internally to None and then call sycl::queue.parallel_for(range<3>{gr[0], gr[1], gr[2]}, ...);, where gr is the list that was passed to the kernel from Python. This tells the SYCL runtime to do whatever it thinks is appropriate.

If someone wants to set a specific local range, then a list corresponding to the local range has to be passed as the second argument to the [] operator of the kernel. What that does is call sycl::queue.parallel_for(nd_range<3>{{gr[0], gr[1], gr[2]}, {lr[0], lr[1], lr[2]}}, ...);
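For readers of the thread, a small Python-side sketch contrasting the two launch forms just described (using the pre-0.20 API from the earlier examples in this issue):

import numba_dpex as dpex
import dpctl
import numpy as np


@dpex.kernel
def fill_idx(array):
    array[dpex.get_global_id(0)] = dpex.get_global_id(0)


array = dpctl.tensor.empty(sh=1024, dtype=np.int32)

# Local range left to the SYCL runtime: dispatched as a plain range<3> parallel_for.
fill_idx[1024, dpex.DEFAULT_LOCAL_SIZE](array)

# Explicit local range: dispatched as an nd_range<3> parallel_for.
fill_idx[1024, 128](array)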

Drop the feature

I never liked the [] syntax; it is a legacy of us mimicking the numba.cuda design. I am in favour of dropping both the [] syntax and dpex.DEFAULT_LOCAL_SIZE. I have created a separate ticket, #790, to track the discussion around how users can provide their global and local range preferences during kernel launch.

dpex or dpctl provides a tool that can come up with a local range recommendation based on some heuristics

That would be useful, and I have been toying with the idea for a while. My plan was to explore some kind of auto-tuning to help define the local range, but a heuristics-driven cost model should be a decent starting point. Let me think a bit more and open a separate discussion thread.

fcharras commented Oct 7, 2022

Leave the local range selection to the SYCL runtime

I'm not convinced by this design, because in all cases it requires implicit decisions from the runtime that can only hurt performance, and it will mislead beginners about its ins and outs. The main obstacle is that the SYCL runtime is not aware of the user's intention regarding whether the kernel checks boundaries before calling __getitem__ on arrays. IMO the user should be required to set local_size and global_size all the time. The only help that can be provided is exposing a function that suggests a good local_size.

diptorupd commented Mar 11, 2023

numba-dpex 0.20 has deprecated the DEFAULT_LOCAL_SIZE flag and will remove it in the next release. Instead, we now require kernel launch arguments to be explicitly specified as numba_dpex.Range or numba_dpex.NdRange instances. The choice of which indexer class to use to launch a kernel is left to the user: depending on what is specified, a kernel is launched either as a range kernel or as an nd-range kernel.
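For illustration, a hedged sketch of the new launch style with these indexer classes (an approximation based on the 0.20+ API described above, not the official example; check the release notes of your numba_dpex version for the exact spelling):

import numba_dpex as dpex
import dpctl
import numpy as np


@dpex.kernel
def fill_idx(array):
    array[dpex.get_global_id(0)] = dpex.get_global_id(0)


array = dpctl.tensor.empty(sh=1024, dtype=np.int32)

# Range kernel: only a global range is given; the local range is left to the runtime.
fill_idx[dpex.Range(1024)](array)

# NdRange kernel: both the global and the local range are given explicitly.
fill_idx[dpex.NdRange(dpex.Range(1024), dpex.Range(128))](array)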
