From c30ea8950a75c59ec8059ecae05ce7c537449c00 Mon Sep 17 00:00:00 2001
From: Diptorup Deb
Date: Tue, 20 Feb 2024 11:09:49 -0600
Subject: [PATCH] Cleanups to overview section.

---
 docs/source/overview.rst | 196 ++++++++++-----------------------------
 1 file changed, 50 insertions(+), 146 deletions(-)

diff --git a/docs/source/overview.rst b/docs/source/overview.rst
index 043614b4ae..093afe7d65 100644
--- a/docs/source/overview.rst
+++ b/docs/source/overview.rst
@@ -6,12 +6,15 @@ Overview
 Data Parallel Extension for Numba* (`numba-dpex`_) is a free and open-source
 LLVM-based code generator for portable accelerator programming in Python.
 
-numba_dpex defines a new kernel programming domain-specific language (DSL)
-in pure Python called `KAPI` that is modeled after the C++ embedded DSL
-`SYCL*`_.
+numba_dpex defines a new kernel programming domain-specific language (DSL) in
+pure Python called `KAPI` that is modeled after the C++ embedded DSL `SYCL*`_. A
+KAPI function can be JIT compiled by numba-dpex to generate a "data-parallel"
+kernel function that executes in parallel on a supported device. Currently,
+compilation of KAPI is possible for x86 CPU devices (using OpenCL CPU drivers),
+Intel Gen9 integrated GPUs, Intel UHD integrated GPUs, and Intel discrete GPUs.
 
-The following example illustrates a relatively simple pairwise distance matrix
-computation example written in KAPI.
+The following example uses KAPI to code a pairwise distance
+computation.
 
 .. code-block:: python
 
@@ -35,158 +38,59 @@
     data = np.random.ranf((10000, 3)).astype(np.float32)
-    distance = np.empty(shape=(data.shape[0], data.shape[0]), dtype=np.float32)
+    dist = np.empty(shape=(data.shape[0], data.shape[0]), dtype=np.float32)
     exec_range = kapi.Range(data.shape[0], data.shape[0])
-    kapi.call_kernel(pairwise_distance_kernel, exec_range, data, distance)
-
-Skipping over much of the language details, at a high-level the
-``pairwise_distance_kernel`` can be viewed as a "data-parallel" function that
-gets executed individually by a set of "work items". That is, each work item
-runs the same function for a subset of the elements of the input ``data`` and
-``distance`` arrays. For programmers familiar with the CUDA or OpenCL languages,
-it is the same programming model referred to as Single Program Multiple Data
-(SPMD). As Python has no concept of a work item the KAPI function runs
-sequentially resulting in a very slow execution time. Experienced Python
-programmers will most probably write a much faster version of the function using
-NumPy*.
-
-However, using a JIT compiler numba-dpex can compile a function written in the
-KAPI language to a CPython native extension function that executes according to
-the SPMD programming model, speeding up the execution time by orders of
-magnitude. Currently, compilation of KAPI is possible for x86 CPU devices,
-Intel Gen9 integrated GPUs, Intel UHD integrated GPUs, and Intel discrete GPUs.
-
-
-``numba-dpex`` is an open-source project and can be installed as part of `Intel
-AI Analytics Toolkit`_ or the `Intel Distribution for Python*`_. The package is
-also available on Anaconda cloud and as a Docker image on GitHub. Please refer
-the :doc:`getting_started` page to learn more.
-
-Main Features
--------------
-
-Portable Kernel Programming
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The ``numba-dpex`` kernel programming API has a design similar to Numba's
-``cuda.jit`` sub-module. The API is modeled after the `SYCL*`_ language and uses
-the `DPC++`_ SYCL runtime. Currently, compilation of kernels is supported for
-SPIR-V-based OpenCL and `oneAPI Level Zero`_ devices CPU and GPU devices. In the
-future, compilation support for other types of hardware that are supported by
-DPC++ will be added.
-
-The following example illustrates a vector addition kernel written with
-``numba-dpex`` kernel API.
+    kapi.call_kernel(pairwise_distance_kernel, exec_range, data, dist)
+
+The ``pairwise_distance_kernel`` function is conceptually a "data-parallel"
+function that gets executed individually by a set of "work items". That is, each
+work item runs the same function for a subset of the elements of the input
+**data** and **distance** arrays. For programmers familiar with the CUDA or
+OpenCL languages, it is the programming model referred to as Single Program
+Multiple Data (SPMD). Although a KAPI function conceptually follows the SPMD
+model, Python has no concept of a work item, so a KAPI function runs
+sequentially in Python and needs to be JIT compiled for parallel execution.
+
+JIT compiling a KAPI function only requires adding the ``dpex.kernel`` decorator
+to the function and calling the function via the ``dpex.call_kernel`` method.
+It should be noted that a JIT compiled KAPI function does not support passing in
+NumPy arrays. A KAPI function can only be called using either ``dpnp.ndarray``
+or ``dpctl.tensor.usm_ndarray`` array objects. The restriction is due to a
+compiled KAPI function requiring memory that was allocated on the device where
+the kernel should execute. Refer to the :doc:`programming_model` page and the
+kernel programming user guide for further details. The modifications to the
+``pairwise_distance_kernel`` function needed for JIT compilation are shown in
+the next example.
+
+.. code-block:: python
 
-    import dpnp
+    from numba_dpex import kernel_api as kapi
     import numba_dpex as dpex
-
-
-    @dpex.kernel
-    def vecadd_kernel(a, b, c):
-        i = dpex.get_global_id(0)
-        c[i] = a[i] + b[i]
-
-
-    a = dpnp.ones(1024, device="gpu")
-    b = dpnp.ones(1024, device="gpu")
-    c = dpnp.empty_like(a)
-
-    vecadd_kernel[dpex.Range(1024)](a, b, c)
-    print(c)
-
-In the above example, three arrays are allocated on a default ``gpu`` device
-using the ``dpnp`` library. The arrays are then passed as input arguments to the
-kernel function. The compilation target and the subsequent execution of the
-kernel is determined by the input arguments and follow the
-"compute-follows-data" programming model as specified in the `Python* Array API
-Standard`_. To change the execution target to a CPU, the device keyword needs to
-be changed to ``cpu`` when allocating the ``dpnp`` arrays. It is also possible
-to leave the ``device`` keyword undefined and let the ``dpnp`` library select a
-default device based on environment flag settings. Refer the
-:doc:`user_guide/kernel_programming/index` for further details.
-
-``dpjit`` decorator
-~~~~~~~~~~~~~~~~~~~
-
-The ``numba-dpex`` package provides a new decorator ``dpjit`` that extends
-Numba's ``njit`` decorator. The new decorator is equivalent to
-``numba.njit(parallel=True)``, but additionally supports compiling ``dpnp``
-functions, ``prange`` loops, and array expressions that use ``dpnp.ndarray``
-objects.
-
-Unlike Numba's NumPy parallelization that only supports CPUs, ``dpnp``
-expressions are first converted to data-parallel kernels and can then be
-`offloaded` to different types of devices. As ``dpnp`` implements the same API
-as NumPy*, an existing ``numba.njit`` decorated function that uses
-``numpy.ndarray`` may be refactored to use ``dpnp.ndarray`` and decorated with
-``dpjit``. Such a refactoring can allow the parallel regions to be offloaded
-to a supported GPU device, providing users an additional option to execute their
-code parallelly.
-
-The vector addition example depicted using the kernel API can also be
-expressed in several different ways using ``dpjit``.
-
-.. code-block:: python
-
+    import math
     import dpnp
-    import numba_dpex as dpex
-
-
-    @dpex.dpjit
-    def vecadd_v1(a, b):
-        return a + b
-
-
-    @dpex.dpjit
-    def vecadd_v2(a, b):
-        return dpnp.add(a, b)
-
-
-    @dpex.dpjit
-    def vecadd_v3(a, b):
-        c = dpnp.empty_like(a)
-        for i in prange(a.shape[0]):
-            c[i] = a[i] + b[i]
-        return c
-
-As with the kernel API example, a ``dpjit`` function if invoked with ``dpnp``
-input arguments follows the compute-follows-data programming model. Refer
-:doc:`user_manual/dpnp_offload/index` for further details.
-
-
-.. Project Goal
-.. ------------
-
-.. If C++ is not your language, you can skip writing data-parallel kernels in SYCL
-.. and directly write them in Python.
-
-.. Our package ``numba-dpex`` extends the Numba compiler to allow kernel creation
-.. directly in Python via a custom compute API
 
+    @dpex.kernel
+    def pairwise_distance_kernel(item: kapi.Item, data, distance):
+        i = item.get_id(0)
+        j = item.get_id(1)
 
-.. Contributing
-.. ------------
+        data_dims = data.shape[1]
 
-.. Refer the `contributing guide
-.. `_ for
-.. information on coding style and standards used in ``numba-dpex``.
+        d = data.dtype.type(0.0)
+        for k in range(data_dims):
+            tmp = data[i, k] - data[j, k]
+            d += tmp * tmp
 
-.. License
-.. -------
+        distance[j, i] = math.sqrt(d)
 
-.. ``numba-dpex`` is Licensed under Apache License 2.0 that can be found in `LICENSE
-.. `_. All usage and
-.. contributions to the project are subject to the terms and conditions of this
-.. license.
+    data = dpnp.random.ranf((10000, 3)).astype(dpnp.float32)
+    dist = dpnp.empty(shape=(data.shape[0], data.shape[0]), dtype=dpnp.float32)
+    exec_range = kapi.Range(data.shape[0], data.shape[0])
+    dpex.call_kernel(pairwise_distance_kernel, exec_range, data, dist)
 
-.. Along with the kernel programming API an auto-offload feature is also provided.
-.. The feature enables automatic generation of kernels from data-parallel NumPy
-.. library calls and array expressions, Numba ``prange`` loops, and `other
-.. "data-parallel by construction" expressions
-.. `_ that Numba is
-.. able to parallelize. Following two examples demonstrate the two ways in which
-.. kernels may be written using numba-dpex.
+``numba-dpex`` is an open-source project and can be installed as part of `Intel
+AI Analytics Toolkit`_ or the `Intel Distribution for Python*`_. The package is
+also available on Anaconda Cloud, PyPI, and as a Docker image on GitHub.
+Refer to the :doc:`getting_started` page for further details.
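Editor's note: the overview text added by this patch says that, without JIT
compilation, a KAPI function runs sequentially in Python, with each "work item"
visited one at a time. The sketch below illustrates that execution model in
pure standard-library Python. It is not part of the patch or of the numba-dpex
API: ``call_kernel_sequential`` and the plain-list inputs are hypothetical
stand-ins for ``kapi.call_kernel`` and the array arguments.

```python
import math


def pairwise_distance_kernel(item, data, distance):
    # Kernel body: one work item computes one (i, j) entry of the
    # distance matrix, mirroring the KAPI example in the patch.
    i, j = item
    d = 0.0
    for k in range(len(data[i])):
        tmp = data[i][k] - data[j][k]
        d += tmp * tmp
    distance[j][i] = math.sqrt(d)


def call_kernel_sequential(kernel, exec_range, *args):
    # Hypothetical stand-in for kapi.call_kernel: walk the 2D index
    # space sequentially, invoking the kernel once per work item.
    n, m = exec_range
    for i in range(n):
        for j in range(m):
            kernel((i, j), *args)


# Two 3D points at Euclidean distance 5 from each other.
data = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]]
distance = [[0.0, 0.0], [0.0, 0.0]]
call_kernel_sequential(pairwise_distance_kernel, (2, 2), data, distance)
print(distance)  # [[0.0, 5.0], [5.0, 0.0]]
```

Replacing the sequential driver with a JIT-compiled launch over all work items
is exactly what ``dpex.call_kernel`` does for a ``dpex.kernel``-decorated
function, as described in the patched overview.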