Dispatcher/caching rewrite to address performance regression #912

Merged: 2 commits merged into main from kmeans_perf_improvements on Feb 17, 2023

Conversation

@adarshyoga (Collaborator) commented Feb 15, 2023

This PR partially addresses the performance regressions described in #886. It contains the following 4 key changes.

(1) The put() method in LRUCache contained an unnecessary call to get(). This PR removes that call and replaces it with explicit logic that updates the linked list tracking the LRU ordering (see the sketch after this list).

(2) A SHA-256 hash was being computed on every call that builds a cache key, which in turn happens for each dynamic call to a kernel. The hash only needs to be computed once. This PR performs the hash computation once per static instance of a kernel rather than once per dynamic call.

(3) The types of the kernel arguments are used as part of the cache key. The argument types were being pre-processed to strip out USM metadata, and this pre-processing was happening twice per call: once when building the key for the kernel module cache and again for the kernel bundle cache. This PR changes the logic so the pre-processing is done once per call.

(4) The function that builds the cache key now takes a variable number of arguments and returns a tuple; the rest of the key-building logic has been moved to the dispatcher and func modules. The motivation for using variable arguments is to support the different caches, currently the kernel module cache and the kernel bundle cache, which use different numbers of key components. (Side note: the build_key function is ideally suited to live as a static method inside the AbstractCache class.) Minimal sketches of the reworked pieces follow.
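
To illustrate change (1), here is a minimal, hypothetical sketch (not the actual numba-dpex LRUCache) of a put() that updates the doubly linked recency list directly instead of routing through get():

    class _Node:
        """Doubly linked list node holding one cached entry."""
        __slots__ = ("key", "value", "prev", "next")

        def __init__(self, key, value):
            self.key, self.value = key, value
            self.prev = self.next = None

    class SimpleLRUCache:
        """Illustrative only; not the numba-dpex LRUCache."""

        def __init__(self, capacity=128):
            self._capacity = capacity
            self._table = {}      # key -> node
            self._head = None     # most recently used
            self._tail = None     # least recently used

        def _unlink(self, node):
            if node.prev:
                node.prev.next = node.next
            if node.next:
                node.next.prev = node.prev
            if node is self._head:
                self._head = node.next
            if node is self._tail:
                self._tail = node.prev
            node.prev = node.next = None

        def _push_front(self, node):
            node.next = self._head
            if self._head:
                self._head.prev = node
            self._head = node
            if self._tail is None:
                self._tail = node

        def put(self, key, value):
            # Update the recency list directly instead of calling get(),
            # which would perform a redundant lookup and move-to-front.
            node = self._table.get(key)
            if node is not None:
                node.value = value
                self._unlink(node)
                self._push_front(node)
                return
            if len(self._table) >= self._capacity:
                evicted = self._tail
                self._unlink(evicted)
                del self._table[evicted.key]
            node = _Node(key, value)
            self._table[key] = node
            self._push_front(node)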
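
And a hedged sketch of what changes (2) and (4) amount to: the expensive SHA-256 hash is computed once per static kernel instance, and the key builder becomes a thin variadic helper that packs its arguments into a tuple. The function names echo the description above, but the bodies are illustrative assumptions, not the PR's actual code:

    import hashlib
    import inspect

    def create_func_hash(pyfunc):
        # Assumed helper: hash the kernel source once, when the static
        # kernel object is created, instead of on every dynamic call.
        source = inspect.getsource(pyfunc)
        return hashlib.sha256(source.encode("utf-8")).hexdigest()

    def build_key(*args):
        # Pack a variable number of key components into a hashable tuple.
        # Different caches (kernel module cache, kernel bundle cache) can
        # pass a different number of components.
        return tuple(args)

    # Per dynamic call, only cheap tuple construction remains, e.g.:
    # key = build_key(stripped_argtypes, codegen_magic_tuple, func_hash)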

Effects of the optimizations:
The effect of these changes was evaluated using the kmeans implementation from here.
On Intel ATS GPUs the execution time before the changes is 1.7 seconds; after these changes it drops to 1.4 seconds (see the logs below). With numba-dpex 0.19.0 the execution time is 1.1 seconds, so these changes partially address the regression introduced after 0.19.0.

Run Log with numba-dpex 0.19.0:

python benchmark/kmeans.py
Running Kmeans numba_dpex lloyd GPU ... done in 1.1 s

Run Log with numba-dpex main:

python benchmark/kmeans.py
Running Kmeans numba_dpex lloyd GPU ... done in 1.7 s

Run Log with this PR:

python benchmark/kmeans.py
Running Kmeans numba_dpex lloyd GPU ... done in 1.4 s

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • If this PR is a work in progress, are you filing the PR as a draft?

@diptorupd (Collaborator)

Thanks @adarshyoga for triaging it.

The func_hash is only really needed if we run into a situation where the number of cached versions of a kernel exceeds NUMBA_DPEX_CACHE_SIZE. Note that the cache limit is per kernel. If the limit is exceeded, then we will start to pickle the kernel to disk.

Can you try making the func_hash optional based on a flag? That can save us some cycles.
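
A possible shape for that flag, purely as a sketch; the parameter names and default are assumptions, not an agreed API:

    def build_key(*args, include_func_hash=False, func_hash=None):
        # Assumed variant: only append the expensive func_hash when asked,
        # e.g. when the per-kernel cache may exceed NUMBA_DPEX_CACHE_SIZE
        # and entries need to be pickled to disk.
        key = tuple(args)
        if include_func_hash and func_hash is not None:
            key += (func_hash,)
        return key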

)
if not key:
    stripped_argtypes = self._strip_usm_metadata(argtypes)
    codegen_magic_tuple = kernel.target_context.codegen().magic_tuple()
Review comment (Collaborator) on the diff above:

I compared with 0.19.0; even this extra hashing of the codegen_magic_tuple may be adding overhead. Can you try removing it and just adding the kernel to the key like we had in 0.19?
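
Roughly what that suggestion could look like, as an illustrative sketch rather than the 0.19.0 code:

    def build_key_like_0_19(stripped_argtypes, kernel):
        # Assumed sketch: include the kernel object in the key directly
        # and rely on ordinary tuple hashing, instead of computing an
        # extra hash over codegen_magic_tuple on every call.
        return (stripped_argtypes, kernel)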

…aching.build_key(). This avoids computing hash on every call. (2) moved argtypes list building logic to func.py and dispatcher. Again, avoids list building on every call; (3) Rewrote build_key to take variable args and return tuple. (4) Removed unnecessary call to LRUCache.get() inside LRUCache.put()
…ng functions to a separate cache utils. (2) added docstrings. (3) Replaced get() with explicit logic to update list in LRUCache
@chudur-budur (Collaborator) commented Feb 16, 2023

LGTM! I manually restarted the stuck jobs on TeamCity; will merge as soon as we get a pass on those CIs.

@diptorupd diptorupd merged commit ae994cd into IntelPython:main Feb 17, 2023
github-actions bot added a commit that referenced this pull request on Feb 17, 2023: Dispatcher/caching rewrite to address performance regression (ae994cd)
@adarshyoga adarshyoga deleted the kmeans_perf_improvements branch February 17, 2023 16:29