Anti-pattern in dpex.DEFAULT_LOCAL_SIZE #766
So a possible solution would be to write kernels that can deal with out-of-bound global sizes and then always invoke them with […]. One way to achieve this would be to use […]. It would be great if […].
Please refer to the section about selection of work-group size in the GPU optimization guide. Notice that Level Zero provides a function […]. The actual implementation of the function can be found in https://github.com/intel/compute-runtime/blob/master/level_zero/core/source/kernel/kernel_imp.cpp#L368-L406 I am sure an equivalent function exists for the OpenCL backend.
Per the GPU optimization guide:

[…]

but that is not what […] does. As a replacement, having access to functions such as […] would be helpful.

edit: there are also similar recommendations in the OpenCL programming guide, with some different nuances.
@fcharras @ogrisel @oleksandr-pavlyk Sorry to chime in late. For various reasons (including new dad duties), numba-dpex has been in a bit of stasis, but is kicking back into life.

Firstly, I am in full agreement that the name […] What does the […]

TL;DR: Leave the local range selection to the SYCL runtime.

I am trying to recall why we had added it in the first place, because the local size parameter is totally optional and the only effect of setting […]

If someone wants to set a specific local range, then a list corresponding to the local range has to be passed as the second argument to the […]

Drop the feature: I never liked the […]
That will be useful, and I have been toying with the idea for a while. My plans were to explore some kind of auto-tuning to help define the local range, but a heuristics-driven cost model should be a decent starting point. Let me think a bit more and open a separate discussion thread.
I'm not convinced by this design, because in all cases it requires implicit decisions from the runtime that can only hurt performance, and it will mislead beginners about the ins and outs of it. The main obstacle is that the SYCL runtime is not aware of the user's intention regarding having the kernel check boundaries before calling […].
numba-dpex 0.20 has deprecated the […]
As I understand it, choosing a local size for running a kernel must follow a few rules to ensure that the execution of the kernel fits well with the underlying hardware:
- preferably, it should be a multiple of the size of the pools of threads that execute in lock step at the hardware level (what would be called warp size for NVIDIA GPUs or wavefront size for AMD GPUs)
- and it should at least be equal to this value: if it is smaller, the remaining threads of a warp will remain idle (causing underload and hurting performance). In general, part of the device will remain idle if the group size is not a multiple of the warp size.
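The rules above can be sketched numerically. For instance, rounding the global size up to a full number of work-groups keeps every group at the chosen local size (a hypothetical helper for illustration, not part of the numba-dpex API; the kernel must then guard against out-of-bounds indices):

```python
import math

def padded_global_size(n_items, local_size):
    # Round n_items up to the next multiple of local_size so that
    # every work-group is full; the kernel body is then responsible
    # for checking i < n_items before touching memory.
    return math.ceil(n_items / local_size) * local_size
```

For example, `padded_global_size(1000, 64)` returns 1024: the last work-group contains 24 idle-but-allocated threads, which is far cheaper than shrinking the local size below the warp size.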
clinfo, among other information, can display the values the group size should be a multiple of:

[…]

Regarding `opencl` and `python`, those values are also exposed by pyopencl. `dpex.DEFAULT_LOCAL_SIZE` seems to enforce different rules: the default value is 512 on my computer (is it hardcoded or does it change with hardware requirements?), and if it does not divide the global size, it will fall back on the largest divisor of `global_size` that is smaller than the previous default value. Here are a few examples:
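A minimal sketch of that fallback rule as I understand it (a hypothetical pure-Python model, not the actual numba-dpex implementation):

```python
def fallback_local_size(global_size, preferred=512):
    # Model of the described behaviour: take the largest divisor of
    # global_size that does not exceed the preferred default value.
    for candidate in range(min(preferred, global_size), 0, -1):
        if global_size % candidate == 0:
            return candidate
```

Under this model, `fallback_local_size(1024)` gives 512, `fallback_local_size(1000)` gives 500, and a prime global size above the default, such as 1009, degenerates to a local size of 1.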
If the `dpex.DEFAULT_LOCAL_SIZE` is close enough to a multiple of the recommended value, there should not be a significant impact on performance, and the grief might be counterbalanced because it saves implementing boundary checks in the kernel. But if it is not (e.g. when `global_size` is a prime number, forcing the default local size to 1 (!)), the performance drop could be massive (only one thread per warp would be effectively used in this case).

I think the user should be responsible for choosing the global and local work-group sizes and adapting the behavior of the kernel at boundaries if necessary, and I think it is better practice to work with a fixed local size and adapt the kernel, rather than to ignore boundaries and adapt the local size.
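The practice advocated here (fixed local size plus a boundary check inside the kernel) can be modelled in plain Python. The helper and its names are purely illustrative; a real numba-dpex kernel would obtain `i` from `dpex.get_global_id(0)` instead of a loop:

```python
def launch_with_bounds_check(n_items, local_size, out):
    # Pad the global size up to a full number of work-groups, then let
    # each simulated "thread" guard against out-of-bounds indices, just
    # as the kernel body would on the device.
    n_groups = -(-n_items // local_size)  # ceil division
    for i in range(n_groups * local_size):
        if i < n_items:       # boundary check inside the kernel body
            out[i] = i * 2    # toy kernel work

out = [0] * 1000
launch_with_bounds_check(1000, 64, out)
```

With this pattern the local size can stay at a hardware-friendly fixed value (e.g. a multiple of the warp size) regardless of `n_items`.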
And exposing an automatic setting for the local size would be counterproductive, because it suggests the opposite practice to the user.
If anything, `numba_dpex` could expose the maximum possible local size and the value it is recommended to be a multiple of.