Commit c30ea89
Cleanups to overview section.
Diptorup Deb committed Feb 20, 2024
1 parent 5e09f86 commit c30ea89
Showing 1 changed file with 50 additions and 146 deletions.

docs/source/overview.rst

Overview
========

Data Parallel Extension for Numba* (`numba-dpex`_) is a free and open-source
LLVM-based code generator for portable accelerator programming in Python.
numba_dpex defines a new kernel programming domain-specific language (DSL) in
pure Python called `KAPI` that is modeled after the C++ embedded DSL `SYCL*`_. A
KAPI function can be JIT compiled by numba-dpex to generate a "data-parallel"
kernel function that executes in parallel on a supported device. Currently,
compilation of KAPI is possible for x86 CPU devices (using OpenCL CPU drivers),
Intel Gen9 integrated GPUs, Intel UHD integrated GPUs, and Intel discrete GPUs.
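
To check which devices are available for kernel execution on a given system,
one option is to query the SYCL runtime through the ``dpctl`` package that
``numba-dpex`` builds on. A minimal sketch, assuming ``dpctl`` is installed:

.. code-block:: python

    import dpctl

    # List every SYCL device visible on the system; a compiled KAPI
    # kernel can be launched on any supported device that appears here.
    for device in dpctl.get_devices():
        print(device.name)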

The following example uses KAPI to code a pairwise distance computation.

.. code-block:: python
    from numba_dpex import kernel_api as kapi
    import math
    import numpy as np


    def pairwise_distance_kernel(item: kapi.Item, data, distance):
        # Each work item computes one (i, j) entry of the distance matrix.
        i = item.get_id(0)
        j = item.get_id(1)

        data_dims = data.shape[1]

        d = data.dtype.type(0.0)
        for k in range(data_dims):
            tmp = data[i, k] - data[j, k]
            d += tmp * tmp

        distance[j, i] = math.sqrt(d)


    data = np.random.ranf((10000, 3)).astype(np.float32)
    dist = np.empty(shape=(data.shape[0], data.shape[0]), dtype=np.float32)
    exec_range = kapi.Range(data.shape[0], data.shape[0])
    kapi.call_kernel(pairwise_distance_kernel, exec_range, data, dist)
The ``pairwise_distance_kernel`` function is conceptually a "data-parallel"
function that gets executed individually by a set of "work items". That is,
each work item runs the same function for a subset of the elements of the
input ``data`` and ``distance`` arrays. For programmers familiar with the CUDA
or OpenCL languages, it is the programming model referred to as Single Program
Multiple Data (SPMD). Although a KAPI function conceptually follows the SPMD
model, Python itself has no concept of a work item, so a KAPI function runs
sequentially in the interpreter and needs to be JIT compiled for parallel
execution.
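
To make the work-item notion concrete, the following pure-Python sketch shows
one way to picture the sequential execution of a KAPI kernel over a
two-dimensional range. The names ``_Item`` and ``emulate_call_kernel`` are
hypothetical illustrations, not part of the numba-dpex API:

.. code-block:: python

    class _Item:
        """Hypothetical stand-in for one work item of a 2D launch range."""

        def __init__(self, i, j):
            self._ids = (i, j)

        def get_id(self, dim):
            return self._ids[dim]


    def emulate_call_kernel(kernel, n_rows, n_cols, *args):
        # In pure Python there is no parallelism: every "work item" is
        # just one iteration of a sequential loop over the launch range.
        for i in range(n_rows):
            for j in range(n_cols):
                kernel(_Item(i, j), *args)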

JIT compiling a KAPI function only requires adding the ``dpex.kernel``
decorator to the function and launching it via the ``dpex.call_kernel``
function. It should be noted that a JIT compiled KAPI function does not
support passing in NumPy arrays. A KAPI function can only be called using
either ``dpnp.ndarray`` or ``dpctl.tensor.usm_ndarray`` array objects. The
restriction is due to a compiled KAPI function requiring memory that was
allocated on the device where the kernel executes. Refer to the
:doc:`programming_model` page and the kernel programming user guide for
further details. The modifications to the ``pairwise_distance_kernel``
function needed for JIT compilation are shown in the next example.

.. code-block:: python
    import math

    import dpnp
    from numba_dpex import kernel_api as kapi
    import numba_dpex as dpex

    @dpex.kernel
    def pairwise_distance_kernel(item: kapi.Item, data, distance):
        i = item.get_id(0)
        j = item.get_id(1)

        data_dims = data.shape[1]

        d = data.dtype.type(0.0)
        for k in range(data_dims):
            tmp = data[i, k] - data[j, k]
            d += tmp * tmp

        distance[j, i] = math.sqrt(d)


    data = dpnp.random.ranf((10000, 3)).astype(dpnp.float32)
    dist = dpnp.empty(shape=(data.shape[0], data.shape[0]), dtype=dpnp.float32)
    exec_range = kapi.Range(data.shape[0], data.shape[0])
    dpex.call_kernel(pairwise_distance_kernel, exec_range, data, dist)
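
The device on which the compiled kernel executes is inferred from the input
arrays following the "compute-follows-data" programming model: the kernel runs
wherever its ``dpnp`` array arguments were allocated. A minimal sketch of
controlling the execution device through allocation, assuming both a ``cpu``
and a ``gpu`` SYCL device are present on the system:

.. code-block:: python

    import dpnp

    # "Compute follows data": allocating the inputs on an explicit device
    # selects where a JIT compiled KAPI kernel launched with them will run.
    a_gpu = dpnp.ones(1024, device="gpu")  # kernels on these run on the GPU
    a_cpu = dpnp.ones(1024, device="cpu")  # kernels on these run on the CPU
    a_any = dpnp.ones(1024)  # dpnp picks a default device
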
``numba-dpex`` is an open-source project and can be installed as part of the
`Intel AI Analytics Toolkit`_ or the `Intel Distribution for Python*`_. The
package is also available on Anaconda Cloud, PyPI, and as a Docker image on
GitHub. Refer to the :doc:`getting_started` page for further details.
