<section id="code-generation-based-on-a-device">
<span id="core-features"></span><h1>Code-generation based on a device<a class="headerlink" href="#code-generation-based-on-a-device" title="Permalink to this headline"></a></h1>
<p>In <code class="docutils literal notranslate"><span class="pre">numba-dppy</span></code>, kernels are written in a device-agnostic fashion, which makes it
easy to write portable code. A kernel is compiled for the device on which it is
enqueued for execution. The device is specified using a
<code class="docutils literal notranslate"><span class="pre">dpctl.device_context</span></code> context manager. In the following example, two versions
of the <code class="docutils literal notranslate"><span class="pre">sum</span></code> kernel are compiled, one for a GPU and another for a CPU, depending on
the context in which the kernel is invoked. Currently, <code class="docutils literal notranslate"><span class="pre">numba-dppy</span></code> supports
OpenCL CPU and GPU devices and Level Zero GPU devices. In the future, compilation
support may be extended to other types of SYCL devices that are supported by
DPC++’s runtime.</p>
<div><div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">numba_dppy</span> <span class="k">as</span> <span class="nn">dppy</span>
<span class="kn">import</span> <span class="nn">dpctl</span>

<span class="c1"># Element-wise addition; each work item handles one index.</span>
<span class="nd">@dppy</span><span class="o">.</span><span class="n">kernel</span>
<span class="k">def</span> <span class="nf">sum</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">):</span>
    <span class="n">i</span> <span class="o">=</span> <span class="n">dppy</span><span class="o">.</span><span class="n">get_global_id</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">c</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>

<span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">(</span><span class="mi">20</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">(</span><span class="mi">20</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones_like</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>

<span class="c1"># Compiled for and executed on a Level Zero GPU device.</span>
<span class="k">with</span> <span class="n">dpctl</span><span class="o">.</span><span class="n">device_context</span><span class="p">(</span><span class="s2">&quot;level_zero:gpu&quot;</span><span class="p">):</span>
    <span class="nb">sum</span><span class="p">[</span><span class="mi">20</span><span class="p">,</span> <span class="n">dppy</span><span class="o">.</span><span class="n">DEFAULT_LOCAL_SIZE</span><span class="p">](</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span>

<span class="c1"># The same kernel, compiled for and executed on an OpenCL CPU device.</span>
<span class="k">with</span> <span class="n">dpctl</span><span class="o">.</span><span class="n">device_context</span><span class="p">(</span><span class="s2">&quot;opencl:cpu&quot;</span><span class="p">):</span>
    <span class="nb">sum</span><span class="p">[</span><span class="mi">20</span><span class="p">,</span> <span class="n">dppy</span><span class="o">.</span><span class="n">DEFAULT_LOCAL_SIZE</span><span class="p">](</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span>
</pre></div>
</div></div>
<section id="automatic-offload-of-numpy-expressions">
<h1>Automatic offload of NumPy expressions<a class="headerlink" href="#automatic-offload-of-numpy-expressions" title="Permalink to this headline"></a></h1>
<p>A key distinction between <code class="docutils literal notranslate"><span class="pre">numba-dppy</span></code> and the other GPU backends in Numba is
the ability to automatically offload specific data-parallel sections of a
Numba <code class="docutils literal notranslate"><span class="pre">jit</span></code> function.</p>
<div class="admonition-todo admonition" id="id1">
<p class="admonition-title">Todo</p>
<p>Details and examples to be added.</p>
</div>
<section id="controllable-fallback">
<h2>Controllable Fallback<a class="headerlink" href="#controllable-fallback" title="Permalink to this headline"></a></h2>
<p>By default, if a section of code cannot be offloaded to the GPU, it is automatically
executed on the CPU and a warning is printed. This behavior applies only to <code class="docutils literal notranslate"><span class="pre">jit</span></code>
functions, auto-offloading of NumPy calls, array expressions, and <code class="docutils literal notranslate"><span class="pre">prange</span></code> loops.
To disable this fallback and force the code to run on the GPU, set the environment variable
<code class="docutils literal notranslate"><span class="pre">NUMBA_DPPY_FALLBACK_OPTION</span></code> to false (e.g. <code class="docutils literal notranslate"><span class="pre">export</span> <span class="pre">NUMBA_DPPY_FALLBACK_OPTION=0</span></code>). In this
case the code does not automatically fall back to the CPU, and errors, if any, are raised.</p>
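<p>The same setting can also be applied from Python instead of the shell. The following is a minimal sketch, assuming that <code class="docutils literal notranslate"><span class="pre">numba-dppy</span></code> reads <code class="docutils literal notranslate"><span class="pre">NUMBA_DPPY_FALLBACK_OPTION</span></code> from the process environment when it is first imported. That timing is an assumption rather than something stated above, which is why the variable is set before the import.</p>

```python
import os

# Equivalent to `export NUMBA_DPPY_FALLBACK_OPTION=0` in the shell.
# Assumption: the variable is read when numba_dppy is first imported,
# so it has to be set before that import happens.
os.environ["NUMBA_DPPY_FALLBACK_OPTION"] = "0"

# import numba_dppy  # import the package only after the variable is set
```

<p>Setting the variable in the shell stays the simplest option when the whole run should behave the same way.</p>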
<section id="offload-diagnostics">
<h2>Offload Diagnostics<a class="headerlink" href="#offload-diagnostics" title="Permalink to this headline"></a></h2>
<p>Setting the debug environment variable <code class="docutils literal notranslate"><span class="pre">NUMBA_DPPY_OFFLOAD_DIAGNOSTICS</span></code>
(e.g. <code class="docutils literal notranslate"><span class="pre">export</span> <span class="pre">NUMBA_DPPY_OFFLOAD_DIAGNOSTICS=1</span></code>) enables the emission of parallel and
offload diagnostics based on the parallel transforms that were produced. The level of detail
depends on the integer value, between 1 and 4, assigned to the environment variable
(higher is more detailed).
The “Auto-offloading” section of the output reports the device (by device name)
to which each parfor or kernel was offloaded.</p>
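<p>To compare diagnostic levels without changing the current shell, a script can be run in a child interpreter with the variable set only for that process. The sketch below uses only the Python standard library; the helper names <code class="docutils literal notranslate"><span class="pre">diagnostics_env</span></code> and <code class="docutils literal notranslate"><span class="pre">run_with_diagnostics</span></code> are hypothetical and are not part of <code class="docutils literal notranslate"><span class="pre">numba-dppy</span></code>.</p>

```python
import os
import subprocess
import sys


def diagnostics_env(level, base_env=None):
    """Return a copy of the environment with offload diagnostics enabled.

    Hypothetical helper: only the variable name and the 1-4 range come
    from the text above.
    """
    if level not in range(1, 5):
        raise ValueError("level must be an integer between 1 and 4")
    env = dict(os.environ if base_env is None else base_env)
    env["NUMBA_DPPY_OFFLOAD_DIAGNOSTICS"] = str(int(level))
    return env


def run_with_diagnostics(script_path, level=1):
    """Run a script in a child interpreter with diagnostics switched on."""
    return subprocess.run(
        [sys.executable, script_path],
        env=diagnostics_env(level),
        check=False,
    )
```

<p>For example, <code class="docutils literal notranslate"><span class="pre">run_with_diagnostics("my_kernels.py",</span> <span class="pre">level=2)</span></code> would run a hypothetical script <code class="docutils literal notranslate"><span class="pre">my_kernels.py</span></code> at the second level of detail and leave the parent environment untouched.</p>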
