Dispatcher/caching rewrite to address performance regression #912

Merged: 2 commits merged into main from kmeans_perf_improvements on Feb 17, 2023

Conversation

@adarshyoga (Collaborator) commented Feb 15, 2023

This PR partially addresses the performance regressions described in #886. It contains the following 4 key changes.

(1) The put() method in LRUCache contained an unnecessary call to get(). This PR removes that call and replaces it with explicit logic that updates the linked list tracking the LRU ordering (see the sketch after this list).

(2) A SHA-256 hash was being computed on every call that builds a cache key, which in turn happens for each dynamic call to a kernel. The hash only needs to be computed once. This PR performs the hash computation once per static instance of a kernel rather than once per dynamic call.

(3) The types of the kernel arguments are used as part of the cache key. The argument types were being pre-processed to strip out USM metadata, and this pre-processing was happening twice per call: once when building the key for the kernel module cache and again for the kernel bundle cache. This PR changes the logic so the pre-processing is done once per call.

(4) The function that builds the cache key now takes a variable number of arguments and returns a tuple; the rest of the key-building logic has been moved to the dispatcher and func modules. The motivation for using variable arguments is to support the different caches, currently the kernel module cache and the kernel bundle cache, which use different numbers of key components. (Side note: the build_key function is ideally suited to live as a static method inside the AbstractCache class.) Minimal sketches of the reworked pieces follow.
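
To illustrate change (1), here is a minimal, hypothetical sketch (not the actual numba-dpex LRUCache) of a put() that updates the doubly linked recency list directly instead of routing through get():

    class _Node:
        """Doubly linked list node holding one cached entry."""
        __slots__ = ("key", "value", "prev", "next")

        def __init__(self, key, value):
            self.key, self.value = key, value
            self.prev = self.next = None

    class SimpleLRUCache:
        """Illustrative only; not the numba-dpex LRUCache."""

        def __init__(self, capacity=128):
            self._capacity = capacity
            self._table = {}      # key -> node
            self._head = None     # most recently used
            self._tail = None     # least recently used

        def _unlink(self, node):
            if node.prev:
                node.prev.next = node.next
            if node.next:
                node.next.prev = node.prev
            if node is self._head:
                self._head = node.next
            if node is self._tail:
                self._tail = node.prev
            node.prev = node.next = None

        def _push_front(self, node):
            node.next = self._head
            if self._head:
                self._head.prev = node
            self._head = node
            if self._tail is None:
                self._tail = node

        def put(self, key, value):
            # Update the recency list directly instead of calling get(),
            # which would perform a redundant lookup and move-to-front.
            node = self._table.get(key)
            if node is not None:
                node.value = value
                self._unlink(node)
                self._push_front(node)
                return
            if len(self._table) >= self._capacity:
                evicted = self._tail
                self._unlink(evicted)
                del self._table[evicted.key]
            node = _Node(key, value)
            self._table[key] = node
            self._push_front(node)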
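
And a hedged sketch of what changes (2) and (4) amount to: the expensive SHA-256 hash is computed once per static kernel instance, and the key builder becomes a thin variadic helper that packs its arguments into a tuple. The function names echo the description above, but the bodies are illustrative assumptions, not the PR's actual code:

    import hashlib
    import inspect

    def create_func_hash(pyfunc):
        # Assumed helper: hash the kernel source once, when the static
        # kernel object is created, instead of on every dynamic call.
        source = inspect.getsource(pyfunc)
        return hashlib.sha256(source.encode("utf-8")).hexdigest()

    def build_key(*args):
        # Pack a variable number of key components into a hashable tuple.
        # Different caches (kernel module cache, kernel bundle cache) can
        # pass a different number of components.
        return tuple(args)

    # Per dynamic call, only cheap tuple construction remains, e.g.:
    # key = build_key(stripped_argtypes, codegen_magic_tuple, func_hash)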

Effects of the optimizations:
The effect of these changes was evaluated using the kmeans implementation from here.
On Intel ATS GPUs the execution time before the changes is 1.7 seconds; after these changes it drops to 1.4 seconds (see the logs below). With numba-dpex 0.19.0 the execution time is 1.1 seconds, so these changes partially address the regression introduced after 0.19.0.

Run Log with numba-dpex 0.19.0:

python benchmark/kmeans.py
Running Kmeans numba_dpex lloyd GPU ... done in 1.1 s

Run Log with numba-dpex main:

python benchmark/kmeans.py
Running Kmeans numba_dpex lloyd GPU ... done in 1.7 s

Run Log with this PR:

python benchmark/kmeans.py
Running Kmeans numba_dpex lloyd GPU ... done in 1.4 s

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • If this PR is a work in progress, are you filing the PR as a draft?

@diptorupd (Collaborator)

Thanks @adarshyoga for triaging it.

The func_hash is only really needed if we run into a situation where the number of cached versions of a kernel exceeds NUMBA_DPEX_CACHE_SIZE. Note that the cache limit is per kernel. If the limit is exceeded, then we will start to pickle the kernel to disk.

Can you try making the func_hash optional based on a flag? That can save us some cycles.
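
A possible shape for that flag, purely as a sketch; the parameter names and default are assumptions, not an agreed API:

    def build_key(*args, include_func_hash=False, func_hash=None):
        # Assumed variant: only append the expensive func_hash when asked,
        # e.g. when the per-kernel cache may exceed NUMBA_DPEX_CACHE_SIZE
        # and entries need to be pickled to disk.
        key = tuple(args)
        if include_func_hash and func_hash is not None:
            key += (func_hash,)
        return key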

)
if not key:
    stripped_argtypes = self._strip_usm_metadata(argtypes)
    codegen_magic_tuple = kernel.target_context.codegen().magic_tuple()
Review comment (Collaborator) on the diff above:

I compared with 0.19.0; even this extra hashing of the codegen_magic_tuple may be adding overhead. Can you try removing it and just adding the kernel to the key like we had in 0.19?
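
Roughly what that suggestion could look like, as an illustrative sketch rather than the 0.19.0 code:

    def build_key_like_0_19(stripped_argtypes, kernel):
        # Assumed sketch: include the kernel object in the key directly
        # and rely on ordinary tuple hashing, instead of computing an
        # extra hash over codegen_magic_tuple on every call.
        return (stripped_argtypes, kernel)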

…aching.build_key(). This avoids computing hash on every call. (2) moved argtypes list building logic to func.py and dispatcher. Again, avoids list building on every call; (3) Rewrote build_key to take variable args and return tuple. (4) Removed unnecessary call to LRUCache.get() inside LRUCache.put()
…ng functions to a separate cache utils. (2) added docstrings. (3) Replaced get() with explicit logic to update list in LRUCache
@chudur-budur (Collaborator) commented Feb 16, 2023

LGTM! I manually restarted the stuck jobs on TeamCity; will merge as soon as we get a pass on those CIs.

@diptorupd diptorupd merged commit ae994cd into IntelPython:main Feb 17, 2023
github-actions bot added a commit that referenced this pull request on Feb 17, 2023: Dispatcher/caching rewrite to address performance regression (ae994cd)
@adarshyoga adarshyoga deleted the kmeans_perf_improvements branch February 17, 2023 16:29