From c30ea8950a75c59ec8059ecae05ce7c537449c00 Mon Sep 17 00:00:00 2001
From: Diptorup Deb
Date: Tue, 20 Feb 2024 11:09:49 -0600
Subject: [PATCH] Cleanups to overview section.

---
 docs/source/overview.rst | 196 ++++++++++-----------------------------
 1 file changed, 50 insertions(+), 146 deletions(-)

diff --git a/docs/source/overview.rst b/docs/source/overview.rst
index 043614b4ae..093afe7d65 100644
--- a/docs/source/overview.rst
+++ b/docs/source/overview.rst
@@ -6,12 +6,15 @@ Overview
 Data Parallel Extension for Numba* (`numba-dpex`_) is a free and open-source
 LLVM-based code generator for portable accelerator programming in Python.
 
-numba_dpex defines a new kernel programming domain-specific language (DSL)
-in pure Python called `KAPI` that is modeled after the C++ embedded DSL
-`SYCL*`_.
+numba_dpex defines a new kernel programming domain-specific language (DSL) in
+pure Python called `KAPI` that is modeled after the C++ embedded DSL `SYCL*`_. A
+KAPI function can be JIT compiled by numba-dpex to generate a "data-parallel"
+kernel function that executes in parallel on a supported device. Currently,
+compilation of KAPI is possible for x86 CPU devices (using OpenCL CPU drivers),
+Intel Gen9 integrated GPUs, Intel UHD integrated GPUs, and Intel discrete GPUs.
 
-The following example illustrates a relatively simple pairwise distance matrix
-computation example written in KAPI.
+The following example uses KAPI to code a pairwise distance
+computation.
 
 .. code-block:: python
 
@@ -35,158 +38,59 @@
     data = np.random.ranf((10000, 3)).astype(np.float32)
-    distance = np.empty(shape=(data.shape[0], data.shape[0]), dtype=np.float32)
+    dist = np.empty(shape=(data.shape[0], data.shape[0]), dtype=np.float32)
     exec_range = kapi.Range(data.shape[0], data.shape[0])
-    kapi.call_kernel(pairwise_distance_kernel, exec_range, data, distance)
-
-Skipping over much of the language details, at a high-level the
-``pairwise_distance_kernel`` can be viewed as a "data-parallel" function that
-gets executed individually by a set of "work items". That is, each work item
-runs the same function for a subset of the elements of the input ``data`` and
-``distance`` arrays. For programmers familiar with the CUDA or OpenCL languages,
-it is the same programming model referred to as Single Program Multiple Data
-(SPMD). As Python has no concept of a work item the KAPI function runs
-sequentially resulting in a very slow execution time. Experienced Python
-programmers will most probably write a much faster version of the function using
-NumPy*.
-
-However, using a JIT compiler numba-dpex can compile a function written in the
-KAPI language to a CPython native extension function that executes according to
-the SPMD programming model, speeding up the execution time by orders of
-magnitude. Currently, compilation of KAPI is possible for x86 CPU devices,
-Intel Gen9 integrated GPUs, Intel UHD integrated GPUs, and Intel discrete GPUs.
-
-
-``numba-dpex`` is an open-source project and can be installed as part of `Intel
-AI Analytics Toolkit`_ or the `Intel Distribution for Python*`_. The package is
-also available on Anaconda cloud and as a Docker image on GitHub. Please refer
-the :doc:`getting_started` page to learn more.
-
-Main Features
--------------
-
-Portable Kernel Programming
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The ``numba-dpex`` kernel programming API has a design similar to Numba's
-``cuda.jit`` sub-module. The API is modeled after the `SYCL*`_ language and uses
-the `DPC++`_ SYCL runtime. Currently, compilation of kernels is supported for
-SPIR-V-based OpenCL and `oneAPI Level Zero`_ devices CPU and GPU devices. In the
-future, compilation support for other types of hardware that are supported by
-DPC++ will be added.
-
-The following example illustrates a vector addition kernel written with
-``numba-dpex`` kernel API.
+    kapi.call_kernel(pairwise_distance_kernel, exec_range, data, dist)
+
+The ``pairwise_distance_kernel`` function is conceptually a "data-parallel"
+function that gets executed individually by a set of "work items". That is, each
+work item runs the same function for a subset of the elements of the input
+**data** and **distance** arrays. For programmers familiar with the CUDA or
+OpenCL languages, it is the programming model referred to as Single Program
+Multiple Data (SPMD). Although a KAPI function conceptually follows the SPMD
+model, Python has no concept of a work item, so a KAPI function runs
+sequentially in Python and needs to be JIT compiled for parallel execution.
+
+JIT compiling a KAPI function only requires adding the ``dpex.kernel`` decorator
+to the function and calling the function via the ``dpex.call_kernel`` method.
+It should be noted that a JIT compiled KAPI function does not support passing in
+NumPy arrays. A KAPI function can only be called using either ``dpnp.ndarray``
+or ``dpctl.tensor.usm_ndarray`` array objects. The restriction is due to a
+compiled KAPI function requiring memory that was allocated on the device where
+the kernel should execute. Refer to the :doc:`programming_model` page and the
+kernel programming user guide for further details. The modifications to the
+``pairwise_distance_kernel`` function needed for JIT compilation are shown in
+the next example.
+
+.. code-block:: python
 
-    import dpnp
+    from numba_dpex import kernel_api as kapi
     import numba_dpex as dpex
-
-
-    @dpex.kernel
-    def vecadd_kernel(a, b, c):
-        i = dpex.get_global_id(0)
-        c[i] = a[i] + b[i]
-
-
-    a = dpnp.ones(1024, device="gpu")
-    b = dpnp.ones(1024, device="gpu")
-    c = dpnp.empty_like(a)
-
-    vecadd_kernel[dpex.Range(1024)](a, b, c)
-    print(c)
-
-In the above example, three arrays are allocated on a default ``gpu`` device
-using the ``dpnp`` library. The arrays are then passed as input arguments to the
-kernel function. The compilation target and the subsequent execution of the
-kernel is determined by the input arguments and follow the
-"compute-follows-data" programming model as specified in the `Python* Array API
-Standard`_. To change the execution target to a CPU, the device keyword needs to
-be changed to ``cpu`` when allocating the ``dpnp`` arrays. It is also possible
-to leave the ``device`` keyword undefined and let the ``dpnp`` library select a
-default device based on environment flag settings. Refer the
-:doc:`user_guide/kernel_programming/index` for further details.
-
-``dpjit`` decorator
-~~~~~~~~~~~~~~~~~~~
-
-The ``numba-dpex`` package provides a new decorator ``dpjit`` that extends
-Numba's ``njit`` decorator. The new decorator is equivalent to
-``numba.njit(parallel=True)``, but additionally supports compiling ``dpnp``
-functions, ``prange`` loops, and array expressions that use ``dpnp.ndarray``
-objects.
-
-Unlike Numba's NumPy parallelization that only supports CPUs, ``dpnp``
-expressions are first converted to data-parallel kernels and can then be
-`offloaded` to different types of devices. As ``dpnp`` implements the same API
-as NumPy*, an existing ``numba.njit`` decorated function that uses
-``numpy.ndarray`` may be refactored to use ``dpnp.ndarray`` and decorated with
-``dpjit``. Such a refactoring can allow the parallel regions to be offloaded
-to a supported GPU device, providing users an additional option to execute their
-code parallelly.
-
-The vector addition example depicted using the kernel API can also be
-expressed in several different ways using ``dpjit``.
-
-.. code-block:: python
-
+    import math
     import dpnp
-    import numba_dpex as dpex
-
-
-    @dpex.dpjit
-    def vecadd_v1(a, b):
-        return a + b
-
-
-    @dpex.dpjit
-    def vecadd_v2(a, b):
-        return dpnp.add(a, b)
-
-
-    @dpex.dpjit
-    def vecadd_v3(a, b):
-        c = dpnp.empty_like(a)
-        for i in prange(a.shape[0]):
-            c[i] = a[i] + b[i]
-        return c
-
-As with the kernel API example, a ``dpjit`` function if invoked with ``dpnp``
-input arguments follows the compute-follows-data programming model. Refer
-:doc:`user_manual/dpnp_offload/index` for further details.
-
-
-.. Project Goal
-.. ------------
-
-.. If C++ is not your language, you can skip writing data-parallel kernels in SYCL
-.. and directly write them in Python.
-
-.. Our package ``numba-dpex`` extends the Numba compiler to allow kernel creation
-.. directly in Python via a custom compute API
 
+    @dpex.kernel
+    def pairwise_distance_kernel(item: kapi.Item, data, distance):
+        i = item.get_id(0)
+        j = item.get_id(1)
 
-.. Contributing
-.. ------------
+        data_dims = data.shape[1]
 
-.. Refer the `contributing guide
-.. `_ for
-.. information on coding style and standards used in ``numba-dpex``.
+        d = data.dtype.type(0.0)
+        for k in range(data_dims):
+            tmp = data[i, k] - data[j, k]
+            d += tmp * tmp
 
-.. License
-.. -------
+        distance[j, i] = math.sqrt(d)
 
-.. ``numba-dpex`` is Licensed under Apache License 2.0 that can be found in `LICENSE
-.. `_. All usage and
-.. contributions to the project are subject to the terms and conditions of this
-.. license.
+    data = dpnp.random.ranf((10000, 3)).astype(dpnp.float32)
+    dist = dpnp.empty(shape=(data.shape[0], data.shape[0]), dtype=dpnp.float32)
+    exec_range = kapi.Range(data.shape[0], data.shape[0])
+    dpex.call_kernel(pairwise_distance_kernel, exec_range, data, dist)
 
-.. Along with the kernel programming API an auto-offload feature is also provided.
-.. The feature enables automatic generation of kernels from data-parallel NumPy
-.. library calls and array expressions, Numba ``prange`` loops, and `other
-.. "data-parallel by construction" expressions
-.. `_ that Numba is
-.. able to parallelize. Following two examples demonstrate the two ways in which
-.. kernels may be written using numba-dpex.
+``numba-dpex`` is an open-source project and can be installed as part of `Intel
+AI Analytics Toolkit`_ or the `Intel Distribution for Python*`_. The package is
+also available on Anaconda Cloud, PyPI, and as a Docker image on GitHub.
+Refer to the :doc:`getting_started` page for further details.
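Editor's note: the overview text added by this patch says that, without JIT
compilation, a KAPI function runs sequentially in Python, with each "work item"
visited one at a time. The sketch below illustrates that execution model in
pure standard-library Python. It is not part of the patch or of the numba-dpex
API: ``call_kernel_sequential`` and the plain-list inputs are hypothetical
stand-ins for ``kapi.call_kernel`` and the array arguments.

```python
import math


def pairwise_distance_kernel(item, data, distance):
    # Kernel body: one work item computes one (i, j) entry of the
    # distance matrix, mirroring the KAPI example in the patch.
    i, j = item
    d = 0.0
    for k in range(len(data[i])):
        tmp = data[i][k] - data[j][k]
        d += tmp * tmp
    distance[j][i] = math.sqrt(d)


def call_kernel_sequential(kernel, exec_range, *args):
    # Hypothetical stand-in for kapi.call_kernel: walk the 2D index
    # space sequentially, invoking the kernel once per work item.
    n, m = exec_range
    for i in range(n):
        for j in range(m):
            kernel((i, j), *args)


# Two 3D points at Euclidean distance 5 from each other.
data = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]]
distance = [[0.0, 0.0], [0.0, 0.0]]
call_kernel_sequential(pairwise_distance_kernel, (2, 2), data, distance)
print(distance)  # [[0.0, 5.0], [5.0, 0.0]]
```

Replacing the sequential driver with a JIT-compiled launch over all work items
is exactly what ``dpex.call_kernel`` does for a ``dpex.kernel``-decorated
function, as described in the patched overview.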