
Solving performance regression issue by caching the kernel_bundle #896

Merged (2 commits) on Feb 13, 2023

Conversation

chudur-budur
Collaborator

This PR might solve issue #886.

In version 0.19.0, create_program_from_spirv() was called only once; see line 483 in compiler.py.

In the current main branch, it is invoked every time the __call__() function of JitKernel is executed (see here). This PR therefore caches the kernel_bundle to avoid repeated calls to create_program_from_spirv().
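The caching idea described above can be sketched roughly as follows. This is illustrative Python only; the class and attribute names here (JitKernelSketch, _build_bundle, build_count) are hypothetical stand-ins, not the actual numba-dpex implementation:

```python
# Illustrative sketch (hypothetical names, not the actual numba-dpex
# classes): cache the result of an expensive per-queue build step so it
# runs once per execution queue instead of on every kernel call.
class JitKernelSketch:
    def __init__(self, spirv_module):
        self._spirv = spirv_module
        self._kernel_bundle_cache = {}  # maps exec_queue -> built bundle
        self.build_count = 0  # instrumentation for this sketch only

    def _build_bundle(self, exec_queue):
        # Stand-in for the expensive create_program_from_spirv() call.
        self.build_count += 1
        return ("bundle", exec_queue, self._spirv)

    def __call__(self, exec_queue):
        # Look up a previously built bundle for this queue; build and
        # store it only on a cache miss.
        bundle = self._kernel_bundle_cache.get(exec_queue)
        if bundle is None:
            bundle = self._build_bundle(exec_queue)
            self._kernel_bundle_cache[exec_queue] = bundle
        return bundle
```

With this shape, repeated calls on the same queue reuse the cached bundle, so the expensive build happens once per queue rather than once per call.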

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • If this PR is a work in progress, are you filing the PR as a draft?

@chudur-budur
Collaborator Author

@adarshyoga Could you please test this PR to see if you are getting any slow down?

@adarshyoga
Collaborator

Are you caching both llvm bit code and sycl program generated from spirv?

Contributor

@mingjie-intel left a comment


Please let Adarsh test before merging. Thanks.

.pre-commit-config.yaml (review comment outdated; resolved)
@chudur-budur
Collaborator Author

Are you caching both llvm bit code and sycl program generated from spirv?

Yes, @adarshyoga.

@adarshyoga
Collaborator

The performance regression in blackscholes no longer occurs with this fix. The times below from blackscholes are close to what we were getting with 0.19.0.


ERF: Numba@jit-loop-par | Size: 524288 | MOPS: 490767591.93 | TIME: 0.002137
Time of first run 0.0021376459999373765
ERF: Numba@jit-loop-par | Size: 1048576 | MOPS: 1192850017.17 | TIME: 0.001758
Time of first run 0.0022624360008194344
ERF: Numba@jit-loop-par | Size: 2097152 | MOPS: 2397245139.53 | TIME: 0.001750
Time of first run 0.002746018000834738
ERF: Numba@jit-loop-par | Size: 4194304 | MOPS: 4262913460.23 | TIME: 0.001968
Time of first run 0.003016001999640139
ERF: Numba@jit-loop-par | Size: 8388608 | MOPS: 6976956836.03 | TIME: 0.002405
Time of first run 0.003975543000706239
ERF: Numba@jit-loop-par | Size: 16777216 | MOPS: 10806267130.05 | TIME: 0.003105
Time of first run 0.006068818000130705
ERF: Numba@jit-loop-par | Size: 33554432 | MOPS: 14378060479.55 | TIME: 0.004667
Time of first run 0.010165161999793781
ERF: Numba@jit-loop-par | Size: 67108864 | MOPS: 16862131587.41 | TIME: 0.007960
Time of first run 0.017497161999926902
ERF: Numba@jit-loop-par | Size: 134217728 | MOPS: 19211915654.84 | TIME: 0.013972
Time of first run 0.03390368500004115
ERF: Numba@jit-loop-par | Size: 268435456 | MOPS: 20912204492.02 | TIME: 0.025673

expected = a * c

device = dpctl.SyclDevice(filter_str)
Collaborator


We want to get away from this pattern, as it leads to lots of skipped tests when the filter string is not supported. Instead of using the filter string parameter, just set the device to dpctl.select_default_device(). In the future, we will remove all filter string params and control the default device using SYCL environment variables.


kernel = dpex.kernel(check_bool_kernel)

kernel[dpex.Range(a.size)](da, True)
Collaborator


This test can also be done using a scalar and setting the range to 1, in effect launching a SYCL single_task.

Collaborator


Also, you can directly use dpctl constructors and do not need to copy data from NumPy.

@diptorupd
Collaborator

@chudur-budur Looks good! Needs some minor changes to the test case. Can you also squash your changes into just two commits: one with the caching changes and the other with the test case changes?

Commits:

  • Store exec_queue along with kernel_bundle
  • Properly initialize self._kernel_bundle_cache with NullCache()
  • Save kernel_bundle with exec_queue as a key
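The commit messages above mention initializing the cache with a NullCache(). A minimal sketch of that null-object pattern (the names NullCache and DictCache are assumed for illustration, not taken from the numba-dpex source):

```python
# Hypothetical sketch of the null-object cache pattern: NullCache has
# the same interface as a real cache but never stores anything, so the
# kernel code can call get()/put() unconditionally with no None checks
# or "is caching enabled?" branches.
class NullCache:
    def get(self, key):
        return None  # always a miss

    def put(self, key, value):
        pass  # intentionally discard


class DictCache:
    """A trivial real cache with the same interface."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value
```

Initializing self._kernel_bundle_cache with a NullCache by default means disabling caching is just a matter of which object is injected; the calling code stays identical either way.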
@chudur-budur
Collaborator Author

@chudur-budur Looks good! Need some minor changes to the test case. Can you also squash your changes into just two commits: one with the caching changes and the other with the test case changes.

Thanks! I think it's good to go.

@diptorupd
Collaborator

@fcharras Can you please evaluate the PR to see if you see performance improvements in your benchmarks?

@fcharras

fcharras commented Feb 13, 2023

After #898 my benchmarks work again and I can properly compare performance before and after #804.

This PR definitely fixes a major slowdown that would have made the JIT nearly unusable for me; it's a 👍 for merge.

After this PR, performance before and after #804 is of the same order of magnitude, and the JIT is usable again.

But the Python overhead is not as negligible as it used to be before #804, so I would not say it completely fixes #886. As suggested there, maybe the caching logic can be made a bit more efficient (e.g. attaching the codegen to the JitKernel instances), or any other bottleneck that a profiler can highlight could be addressed.

The added overhead I see in our KMeans is about 1 s per run; for small workloads it's noticeable.

(Edit: NB: a KMeans run is several hundred kernel calls, so the overhead is about 0.005 s per kernel call.)

@diptorupd diptorupd merged commit 8d0c8ea into IntelPython:main Feb 13, 2023
@diptorupd diptorupd deleted the github-886 branch February 13, 2023 14:32
github-actions bot added a commit that referenced this pull request Feb 13, 2023
Solving performance regression issue by caching the kernel_bundle 8d0c8ea