Add NCUObserver #253

Merged 12 commits into master on Jun 4, 2024

Conversation

csbnw (Collaborator) commented on Apr 19, 2024:

Add NCUObserver to measure performance metrics, like you would normally do using the command-line tool ncu.

This observer is an instance of the new PrologueObserver, which runs before any of the regular benchmarking. For the NCUObserver this is necessary because profiling will in most cases affect the performance of the kernel, and would therefore skew the measurements of the other observers.

This new code depends on nvmetrics, a small library that uses NVIDIA CUPTI and nvperf. Kernel Tuner works fine even when this library is not found; the NCUObserver will simply print a warning and measure nothing.
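A minimal usage sketch, modeled on the vector_add example referenced later in this thread. The two metric names are illustrative and may not be valid on every GPU architecture; check ncu --query-metrics for your device:

    import numpy as np
    from kernel_tuner import tune_kernel
    from kernel_tuner.observers.ncu import NCUObserver

    kernel_string = """
    __global__ void vector_add(float *c, const float *a, const float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            c[i] = a[i] + b[i];
        }
    }
    """

    size = 10_000_000
    a = np.random.randn(size).astype(np.float32)
    b = np.random.randn(size).astype(np.float32)
    c = np.zeros_like(a)
    args = [c, a, b, np.int32(size)]

    tune_params = {"block_size_x": [32, 64, 128, 256, 512]}

    # Metric names are passed as plain strings, exactly as ncu reports them.
    ncuobserver = NCUObserver(metrics=["dram__bytes.sum",
                                       "smsp__inst_executed.sum"])

    results, env = tune_kernel("vector_add", kernel_string, size, args, tune_params,
                               observers=[ncuobserver], iterations=7)

The measured metric values should then show up in the per-configuration results, alongside the regular timing observations.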

csbnw self-assigned this on Apr 19, 2024
Resolved review threads: kernel_tuner/core.py, kernel_tuner/observers/observer.py (two threads), kernel_tuner/observers/ncu.py (two threads, outdated)

"""Create a new ``NCUObserver``.

:param metrics: The metrics to observe. This should be a list of strings.
You can use ``ncu --query-metrics`` to get a list of valid metrics.
Member:

Great! Here is an idea: would it be much work to add a way to get the list of supported metrics from nvmetrics? Something like NCUObserver.get_available_metrics() could be useful to check which metrics are supported by the current GPU. Not a critical issue, just a thought.

csbnw (Collaborator, author):

A simple way to implement this is to let NCUObserver call the ncu command-line profiler with --query-metrics. It would arguably be better if nvmetrics could do this, as it links to CUPTI etc. anyway. I will try to figure out whether this is possible.

I have also thought about making this observer more user-friendly by providing presets of metrics. It is very tricky, though, as many metrics have mandatory suffixes like .sum or .average. We don't know what kinds of metrics users are interested in, and the available metrics may differ per architecture. I think this observer should mainly be used by expert users who have profiled their code before (using ncu or the GUI) and found some metrics that they would like to track across a larger parameter space.
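For illustration only, a rough sketch of what such a preset could look like; the preset names and metric strings here are assumptions, and their availability would still have to be checked per architecture with ncu --query-metrics:

    from kernel_tuner.observers.ncu import NCUObserver

    # Purely illustrative presets; the metric names are assumptions and
    # may not exist on every GPU architecture.
    METRIC_PRESETS = {
        "memory": ["dram__bytes_read.sum", "dram__bytes_write.sum"],
        "compute": ["smsp__inst_executed.sum", "sm__cycles_active.avg"],
    }

    ncuobserver = NCUObserver(metrics=METRIC_PRESETS["memory"])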

Collaborator:

Aren't there at least some metrics that are available on all GPUs that could be listed? I mean, to get people started using the NCUObserver at first; if they want more, they can still query ncu --query-metrics.

csbnw (Collaborator, author):

I have added the functionality to query metrics to nvmetrics, including a Python interface. You can now do:

nvmetrics.queryMetrics(nvmetrics.NVPW_METRIC_TYPE_COUNTER)
nvmetrics.queryMetrics(nvmetrics.NVPW_METRIC_TYPE_RATIO)
nvmetrics.queryMetrics(nvmetrics.NVPW_METRIC_TYPE_THROUGHPUT)

These different metric types should be passed with a suffix. These suffixes are called NVPW_RollupOp and NVPW_Submetric, but not all combinations are valid; see this struct:

    typedef struct NVPW_MetricEvalRequest
    {
        /// the metric index as in 'NVPW_MetricsEvaluator_GetMetricNames'
        size_t metricIndex;
        /// one of 'NVPW_MetricType'
        uint8_t metricType;
        /// one of 'NVPW_RollupOp', required for Counter and Throughput, doesn't apply to Ratio
        uint8_t rollupOp;
        /// one of 'NVPW_Submetric', required for Ratio and Throughput, optional for Counter
        uint16_t submetric;
    } NVPW_MetricEvalRequest;
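For illustration, a rough sketch of how the queried base names could be combined with rollup suffixes into the metric strings the observer expects. This assumes queryMetrics returns a list of base-name strings; the suffix list is an example, not an exhaustive or guaranteed-valid set:

    import nvmetrics

    # Base counter names as reported by the library (no rollup suffix yet).
    counters = nvmetrics.queryMetrics(nvmetrics.NVPW_METRIC_TYPE_COUNTER)

    # Counters require a rollup op (see rollupOp in the struct above).
    rollup_ops = [".sum", ".avg", ".min", ".max"]

    candidate_names = [name + op for name in counters for op in rollup_ops]
    print(candidate_names[:10])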

csbnw (Collaborator, author):

@stijnh, how would you suggest proceeding here?
We could, for instance, use the same metrics that the example uses as the default.

(Several follow-up comments between @benvanwerkhoven and @csbnw were marked as outdated.)

benvanwerkhoven (Collaborator):

I think it would be cleaner and more consistent with the current code to filter out prologue observers from the list of observers used during the default benchmark loop, just like we do with continuous observers on this line:
https://github.com/KernelTuner/kernel_tuner/blob/add-ncu-observer/kernel_tuner/core.py#L348

We could move that filtering to the init, though, to do it only once, since it won't change during the run.
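A minimal sketch of what that split could look like, assuming the observer base classes live in kernel_tuner.observers.observer and are named PrologueObserver and ContinuousObserver (this is not the actual core.py code):

    from kernel_tuner.observers.observer import ContinuousObserver, PrologueObserver

    def split_observers(observers):
        # Partition the observers once, e.g. in the init, instead of on every benchmark.
        prologue = [obs for obs in observers if isinstance(obs, PrologueObserver)]
        continuous = [obs for obs in observers if isinstance(obs, ContinuousObserver)]
        # Everything else participates in the default benchmark loop.
        regular = [obs for obs in observers
                   if not isinstance(obs, (PrologueObserver, ContinuousObserver))]
        return prologue, continuous, regular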

benvanwerkhoven (Collaborator):

I'll make some changes to see what this would look like.

benvanwerkhoven (Collaborator):

We had a small issue with the CI, but that's fixed now.

Resolved review thread (outdated): kernel_tuner/observers/ncu.py
sonarcloud bot commented on Apr 26, 2024:

Quality Gate passed

Issues: 1 new issue, 0 accepted issues
Measures: 0 security hotspots, no coverage data, 0.0% duplication on new code

See analysis details on SonarCloud

csbnw (Collaborator, author) commented on Apr 26, 2024:

“I'll make some changes to see what this would look like”

These changes indeed make the code more consistent with the existing code, but they also make it harder to see, just from the diff for this branch, what new functionality was introduced. To be honest, though, what matters most to me is that the NCUObserver works. (It does, of course!) I am not sure whether you have more refactoring in mind or need anything from my side before this PR is ready to merge.

benvanwerkhoven (Collaborator):

I think I'm happy with how the code turned out now. The prologue observers should probably be mentioned in the documentation somewhere, but other than that this pull request is ready to be merged, in my opinion.

Comment on lines +118 to +119
The NCUObserver can be used to automatically extract performance counters during tuning using Nvidia's NsightCompute profiler.
The NCUObserver relies on an intermediate library, which can be found here: https://github.com/nlesc-recruit/nvmetrics
csbnw (Collaborator, author):

Suggested change:
- The NCUObserver can be used to automatically extract performance counters during tuning using Nvidia's NsightCompute profiler.
- The NCUObserver relies on an intermediate library, which can be found here: https://github.com/nlesc-recruit/nvmetrics
+ The NCUObserver can be used to automatically extract performance counters during tuning using Nvidia's CUDA Profiling Tools Interface (CUPTI) library, to offer insights into performance characteristics like memory access patterns and instruction execution statistics. It behaves much like the Nvidia Nsight Compute CLI (ncu) application, hence the name NCUObserver. This observer relies on an intermediate library, which can be found here: https://github.com/nlesc-recruit/nvmetrics

csbnw (Collaborator, author):

I wonder whether we should change the name NCUObserver into something like MetricsObserver. Eventually, we may want to also support measuring metrics on non-NVIDIA GPUs, and the observer could then pick the right backend based on whichever runner is used.

Now that I mention runners: I have never tested this observer with anything other than pycuda. The observer needs CUDA to be at least somewhat initialized. When testing nvmetrics with pycuda everything worked fine, but not with cupy, for instance. Maybe we still need to test this (and add the tests?), or at least mention it somewhere.

Collaborator:

In terms of naming, I think I would prefer calling a more generalized NCUObserver a "ProfilingObserver", because we already use the term 'metrics' for the user-defined derived metrics. For now, I think we can keep the name NCUObserver until we have another profiling observer; when we do, it would make sense to think about a more generic type.

I will test with cupy and cuda-python.

Collaborator:

Works with CuPy but not yet with cuda-python.

This is the error I get with cuda-python:

Using: NVIDIA GeForce RTX 3050 Ti Laptop GPU
Error while benchmarking: vector_add
Error while compiling or benchmarking, see source files:  kernel_tuner/examples/cuda/temp_oq2ba5hb.c
Traceback (most recent call last):
  File " kernel_tuner/examples/cuda/vector_add_observers_ncu.py", line 57, in <module>
    tune()
  File " kernel_tuner/examples/cuda/vector_add_observers_ncu.py", line 51, in tune
    results, env = tune_kernel("vector_add", kernel_string, size, args, tune_params, observers=[ncuobserver], metrics=metrics, iterations=7, lang='nvcuda')
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File " kernel_tuner/kernel_tuner/interface.py", line 678, in tune_kernel
    results = strategy.tune(searchspace, runner, tuning_options)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File " kernel_tuner/kernel_tuner/strategies/brute_force.py", line 10, in tune
    return runner.run(searchspace.sorted_list(), tuning_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File " kernel_tuner/kernel_tuner/runners/sequential.py", line 87, in run
    self.dev.compile_and_benchmark(self.kernel_source, self.gpu_args, params, self.kernel_options, tuning_options)
  File " kernel_tuner/kernel_tuner/core.py", line 614, in compile_and_benchmark
    raise e
  File " kernel_tuner/kernel_tuner/core.py", line 603, in compile_and_benchmark
    self.benchmark(func, gpu_args, instance, verbose, to.objective, skip_nvml_setting=False)
  File " kernel_tuner/kernel_tuner/core.py", line 472, in benchmark
    raise e
  File " kernel_tuner/kernel_tuner/core.py", line 437, in benchmark
    self.benchmark_prologue(func, gpu_args, instance.threads, instance.grid, result)
  File " kernel_tuner/kernel_tuner/core.py", line 355, in benchmark_prologue
    obs.before_start()
  File " kernel_tuner/kernel_tuner/observers/ncu.py", line 35, in before_start
    nvmetrics.measureMetricsStart(self.metrics, self.device)
RuntimeError: /home/ben/documents/kernel_tuner/nvmetrics/lib/nv_metrics.cpp:147: error: function cuDeviceGet(&cuDevice, deviceNum) failed with error

Collaborator:

Not sure how this can happen. The initialization of the nvcuda backend is very similar to pycuda's, and by the time we call before_start() it should have been initialized for a while already.

Collaborator:

In the nvcuda backend, I tried to create a new context and destroy it atexit like pycuda is doing, but that doesn't change anything.

csbnw (Collaborator, author):

I think it's fixed in the nvmetrics version that I just pushed (revision c7c42130). It now calls cuInit, which doesn't seem to hurt even if pycuda already initialized CUDA before. To get a valid CUDA context, cuDevicePrimaryCtxRetain is used.
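For reference, the initialization pattern described above looks roughly like this; the sketch uses the cuda-python driver bindings purely for illustration, while the actual fix lives in nvmetrics' C++ code:

    from cuda import cuda

    # cuInit is safe to call even if another library (e.g. pycuda) has already
    # initialized CUDA.
    err, = cuda.cuInit(0)

    # Retain the primary context to obtain a valid CUDA context for profiling.
    err, device = cuda.cuDeviceGet(0)
    err, context = cuda.cuDevicePrimaryCtxRetain(device)

    # ... profiling against this context ...

    # Release the primary context when done.
    err, = cuda.cuDevicePrimaryCtxRelease(device)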

Collaborator:

Awesome! I just pulled and rebuilt the nvmetrics library and tested it with every CUDA backend. It all works now!

csbnw (Collaborator, author):

That's great! I plan to make a 1.0 release of nvmetrics to go along with this PR.

benvanwerkhoven merged commit e83870b into master on Jun 4, 2024. 3 checks passed.