
Eager jitting does not work with numba_dpex #816

Closed

fcharras opened this issue Nov 8, 2022 · 12 comments

fcharras commented Nov 8, 2022

Following IntelPython/dpctl#894, I want to use numba_dpex with eager compilation, so I can access attributes such as preferred_work_group_size_multiple attached to the corresponding compiler.Kernel object that is available in the JitKernel.definitions dictionary.

This partially works: as long as the signature passed to dpex.kernel uses only numba.Array types, some compilation actually takes place and I can access the attributes.

However, when the kernel is actually run later on, a new signature is generated with USMNdArrayType types, a new compilation step happens, and a new Kernel is added to the definitions, with the same signature but using USMNdArrayType instead of numba.Array.

As a result, the eager compilation is mostly pointless: the eagerly compiled kernel is never used and an additional lazy compilation takes place instead.

A solution might be to relax npytypes_array_to_dpex_array to also accept USMNdArrayType arrays?
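
For illustration, a minimal sketch of the use case (assuming the dpex.kernel signature syntax and the JitKernel.definitions / preferred_work_group_size_multiple attributes described above; exact names may differ across numba_dpex versions):

import numba
import numba_dpex as dpex

# Eager compilation: passing an explicit signature to dpex.kernel.
@dpex.kernel(numba.void(numba.float32[:], numba.float32[:]))
def copy_kernel(src, dst):
    i = dpex.get_global_id(0)
    dst[i] = src[i]

# The definitions dictionary should already hold a compiler.Kernel, so
# device-tuning attributes can be read before the kernel is ever launched.
for sig, kernel in copy_kernel.definitions.items():
    print(sig, kernel.preferred_work_group_size_multiple)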

@diptorupd
Collaborator

@fcharras I am addressing the issue in my ongoing refactoring of the dpex kernel decorator internals. In my current set of changes, I am deprecating support for NumPy ndarray as a kernel argument. Instead, we will only support USM-based array types: dpctl.tensor.usm_ndarray, dpnp.ndarray, or any other type that provides a properly defined __sycl_usm_array_interface__ attribute. Thus, the compute-follows-data programming model is the prescribed option going forward.

Coming to the specialization issue: dpex does not yet define its own types for specialization and we depend on the Numba types such as void, int32, etc. to specialize a kernel. For this reason, there is no clean way to tell dpex that a kernel will only be called with a __sycl_usm_array_interface__-compliant array.

Till we have a cleaner way to improve type support for __sycl_usm_array_interface__-based arrays, I propose the following detour for kernel specialization:

  1. Type specialization of kernels continues to use the numba array notation.
  2. Although Numba will internally infer the type as numba.core.types.Array, dpex specializes the kernel on USMNdArrayType and, at the point of dispatch, checks whether the arguments are of USMNdArrayType.
  3. Two new keyword args will be required when specializing: device_type and backend.
import dpctl
from numba import void, int32
import numba_dpex as dpex

@dpex.kernel(
    func_or_sig=void(int32[:], int32[:], int32[:]), 
    device_type=dpctl.device_type.gpu, 
    backend=dpctl.backend.level_zero
)
def vecadd(a, b, c):
    i = dpex.get_global_id(0)
    c[i] = a[i] + b[i]

import numpy as np 

a = np.arange(100)
b = np.arange(100)
c = np.zeros_like(a)

vecadd[100](a,b,c) # Exception raised

import dpnp

a = dpnp.arange(100)
b = dpnp.arange(100)
c = dpnp.zeros_like(a)

vecadd[100](a,b,c) # Works provided the device and backend specializations match

Point 2 above is a hack to get around missing proper type support in dpex for USM arrays. I am going to start that work next. Till then, what do you think of the above approach?
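
For illustration, a minimal sketch of the dispatch-time check in point 2 (the function names are hypothetical, not dpex internals):

def has_sycl_usm_array_interface(obj):
    # A USM-backed array exposes __sycl_usm_array_interface__.
    return hasattr(obj, "__sycl_usm_array_interface__")

def check_dispatch_args(args):
    # Reject anything that is not USM-backed, even though the specialization
    # itself was written with the numba array notation.
    for arg in args:
        if not has_sycl_usm_array_interface(arg):
            raise TypeError(
                f"{type(arg)} does not expose __sycl_usm_array_interface__"
            )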

@fcharras
Author

For me that seems to work. One minor remark: why not pass a device argument that would be a SyclDevice rather than device_type + backend?

To make sure that I understand correctly, with those changes:

  • the JitKernel.definitions dictionary will never contain keys containing USMNdArrayType anymore, only numba.core.types.Array types (a good thing, I think)
  • In this dictionary, the value mapped to a key with numba.core.types.Array types will in fact be the result of two steps:
    • step 1: map the key to the corresponding signature with USMNdArrayType arrays
    • step 2: compile the kernel using the signature with USMNdArrayType arrays

I think it's good 👌

Also, as I understand it, USMNdArrayType has more potential for specialization, with possible parameterizations that do not make sense for numba types (such as the type of memory), but those parameters are not really effective at the moment. So relying only on numba types for now seems right.
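
For illustration, a minimal sketch of that two-step mapping (to_usm_signature and compile_usm_kernel are hypothetical stand-ins, not numba_dpex internals; the tuple-based USM signature representation is illustrative only):

from numba import void, int32
from numba.core import types

def to_usm_signature(numba_sig, usm_type="device"):
    # step 1: translate each numba.core.types.Array argument into a
    # description of the equivalent USMNdArrayType (usm_type defaulted)
    return tuple(
        ("usm_ndarray", arg.dtype, arg.ndim, arg.layout, usm_type)
        if isinstance(arg, types.Array)
        else arg
        for arg in numba_sig.args
    )

def compile_usm_kernel(usm_sig):
    # step 2: stand-in for compiling the kernel against the USM signature
    return ("compiled_kernel_for", usm_sig)

numba_sig = void(int32[:], int32[:], int32[:])
definitions = {numba_sig.args: compile_usm_kernel(to_usm_signature(numba_sig))}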

@diptorupd
Collaborator

@fcharras

One minor remark: why not pass a device argument that would be a SyclDevice rather than device_type + backend?

Ok. Let us use device instead.

the JitKernel.definitions dictionary will never contain keys containing USMNdArrayType anymore, only numba.core.types.Array types (a good thing, I think)
In this dictionary, the value mapped to a key with numba.core.types.Array types will in fact be the result of two steps:
step 1: map the key to the corresponding signature with USMNdArrayType arrays
step 2: compile the kernel using the signature with USMNdArrayType arrays

Yes, that is what I was proposing.

After some more thinking and analyzing the existing code, I have identified gaps in the above approach. As you stated, my proposed approach needs to map the numba.core.types.Array key to the corresponding signature with USMNdArrayType arrays. However, USMNdArrayType requires the usm_type to be specified, so we would either have to specialize for a specific default usm_type (say device), or else keep the specialization agnostic of the usm_type and allow any USMNdArrayType to be used in the signature, ignoring the usm_type.

Instead, how about the following alternative:

import dpctl
from numba import void, int32
from numba_dpex import usm_array
import numba_dpex as dpex

usmarrty = usm_array(dtype=int32, ndim=2, layout="C", usm_type="device", device="gpu")

@dpex.kernel(
    func_or_sig=void(usmarrty, usmarrty, usmarrty)
)
def vecadd(a, b, c):
    i = dpex.get_global_id(0)
    c[i] = a[i] + b[i]

import numpy as np 

a = np.arange(100)
b = np.arange(100)
c = np.zeros_like(a)

vecadd[100](a,b,c) # Exception raised

import dpnp

a = dpnp.arange(100)
b = dpnp.arange(100)
c = dpnp.zeros_like(a)

vecadd[100](a,b,c) # Works provided the device and backend specializations match

@fcharras
Author

I also think it's fine. I haven't encountered use cases requiring a usm_type other than usm_type=device, so I can't provide more feedback on how much usm_type would need to be exposed.

There might also be a middle-ground solution where both interfaces are supported: the first syntax would create two entries in the definitions dictionary, one key defined using the numba types and the other defined using numba_dpex.usm_array types where usm_type always defaults to device, with both keys mapping to the same kernel object. The second syntax would create only one entry in the dictionary.

When I explored this API, the features I was looking for were being able to retrieve the kernel object that was eagerly jitted using the signature I provided, and having the kernel reused later on if the inputs match the signature; both are possible with this middle ground.

@diptorupd
Collaborator

@fcharras I have pushed changes to main that address the eager compilation issue for kernels. A separate PR that addresses the same issue for the func decorator is in the works.

The main changes are:

  • Kernels can be specialized using the usm_ndarray type only; I have removed support for specializing using NumPy arrays.
  • Specialization using strings is not supported yet.
  • Instead of JIT compiling a new version, an exception is raised if a specialized kernel is not called with the right arguments.

The docs have not yet been updated, but you can refer to https://github.com/IntelPython/numba-dpex/blob/main/numba_dpex/examples/kernel/kernel_specialization.py.
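
For illustration, a hedged sketch of what specialization looks like with these changes, following the usm_array-style constructor proposed earlier in this thread (the exact constructor name and arguments in the merged API may differ; the linked example is authoritative):

import dpnp
import numpy as np
from numba import void, int32
import numba_dpex as dpex
from numba_dpex import usm_array

arrty = usm_array(dtype=int32, ndim=1, layout="C", usm_type="device", device="gpu")

@dpex.kernel(void(arrty, arrty, arrty))
def vecadd(a, b, c):
    i = dpex.get_global_id(0)
    c[i] = a[i] + b[i]

a = dpnp.arange(100, dtype=dpnp.int32)
b = dpnp.arange(100, dtype=dpnp.int32)
c = dpnp.zeros_like(a)
vecadd[100](a, b, c)  # works when the argument types match the specialization

a_np = np.arange(100, dtype=np.int32)
vecadd[100](a_np, a_np, np.zeros_like(a_np))  # raises instead of JIT compiling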

Will be good to get your feedback.

@fcharras
Author

fcharras commented Jan 23, 2023

👍 for the API and usage, this is excellent progress, TY!

But unfortunately, for me numba_dpex doesn't work well after the merge of the refactoring PR. There is a huge slowdown after this merge. It's not related to caching. I can't really say what's going on :-/ I'm trying to get a minimal reproducer now.

I'd be curious to know if the new code (which I haven't read yet) would make the following usage easier:

  • letting the user manually pass a kernel to a SyclQueue and manually call wait()? (suggested in "On making task configuration and task args available to the user without executing" #769)
  • AOT compilation

@fcharras
Author

I also see segfaults, makes me think that something is wrong with local memory.

@diptorupd
Collaborator

I also see segfaults, makes me think that something is wrong with local memory.

The problem most likely is that, without realizing the impact, I changed the way the global and local range args are passed to __getitem__. We are working on reverting my changes and fixing the issue.

@diptorupd
Collaborator

I'd be curious to know if the new code (haven't read yet) would make easier the following usage:

* letting the user manually pass a kernel to a `SyclQueue`

Possibly. I would like to look at your use case to be able to answer definitively.

and manually call wait()? (suggested in "On making task configuration and task args available to the user without executing" #769)

Not yet. The problem is that NumPy ndarrays can still be passed to a kernel, in which case we copy from NumPy to USM prior to the call and copy back after the wait. I have deprecated NumPy support for now. Let me enquire with another set of dpex users whether they will be impacted. If there is no impact, then I will not wait and will just remove the NumPy copies (the work was already started in #866). Once that is done, we can start returning the SyclEvent object from the __call__ function of JitKernel.

* AOT compilation

Yes, AOT is fixed now. See these examples: https://github.com/IntelPython/numba-dpex/blob/main/numba_dpex/examples/kernel/kernel_specialization.py.

@fcharras
Author

fcharras commented Jan 24, 2023

AOT

By AOT I also meant being able to distribute pre-compiled code in Python packages (see the relevant numba section). Even better if it could be distributed without a dependency on numba_dpex.

I'm not sure this is something we would really want for our project, and I wouldn't see it as a short-term priority, but if it's something you have in mind for the future it would be interesting to know.

wait()

If you decide to keep support for NumPy ndarrays in the future (something we indeed don't use for sklearn-numba-dpex), do you think a keyword to dpex.kernel could be considered that disables it and instead makes exposing the SyclEvent easier?

@diptorupd
Collaborator

If you decide to keep support for NumPy ndarrays in the future (something we indeed don't use for sklearn-numba-dpex), do you think a keyword to dpex.kernel could be considered that disables it and instead makes exposing the SyclEvent easier?

NumPy support is deprecated and I will drop it as soon as 0.20 is out the door. As for allowing kernels to execute asynchronously, let me track it in a separate issue. Once NumPy arrays are no longer allowed as kernel arguments, we should be able to return dpctl.SyclEvent objects from a kernel instead of always calling wait.
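
For illustration, a hypothetical sketch of what that could look like from the caller's side once the kernel call returns the event instead of waiting internally (this is not current numba_dpex API):

import dpnp
import numba_dpex as dpex

@dpex.kernel
def scale(a, factor):
    i = dpex.get_global_id(0)
    a[i] = a[i] * factor

a = dpnp.arange(100, dtype=dpnp.float32)
event = scale[100](a, 2.0)  # hypothetically returns a dpctl.SyclEvent
# ... other host-side work could overlap with the kernel here ...
event.wait()                # synchronize explicitly when the result is needed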

@diptorupd
Collaborator

Closing this issue, as eager compilation of kernels now works as expected in main.
