Add NCUObserver #253

Merged 12 commits into master on Jun 4, 2024

Conversation

csbnw (Collaborator) commented on Apr 19, 2024:

Add NCUObserver to measure performance metrics, like you would normally do using the command-line tool ncu.

This observer is an instance of the new PrologueObserver, which runs before any of the regular benchmarking. For the NCUObserver this is necessary because profiling will in most cases affect the performance of the kernel, and would therefore skew the measurements of the other observers.

This new code depends on nvmetrics, a small library that uses NVIDIA CUPTI and nvperf. Kernel Tuner works fine even when this library is not found; the NCUObserver will simply print a warning and measure nothing.
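A minimal usage sketch, modeled on the vector_add example referenced later in this thread. The two metric names are illustrative and may not be valid on every GPU architecture; check ncu --query-metrics for your device:

    import numpy as np
    from kernel_tuner import tune_kernel
    from kernel_tuner.observers.ncu import NCUObserver

    kernel_string = """
    __global__ void vector_add(float *c, const float *a, const float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            c[i] = a[i] + b[i];
        }
    }
    """

    size = 10_000_000
    a = np.random.randn(size).astype(np.float32)
    b = np.random.randn(size).astype(np.float32)
    c = np.zeros_like(a)
    args = [c, a, b, np.int32(size)]

    tune_params = {"block_size_x": [32, 64, 128, 256, 512]}

    # Metric names are passed as plain strings, exactly as ncu reports them.
    ncuobserver = NCUObserver(metrics=["dram__bytes.sum",
                                       "smsp__inst_executed.sum"])

    results, env = tune_kernel("vector_add", kernel_string, size, args, tune_params,
                               observers=[ncuobserver], iterations=7)

The measured metric values should then show up in the per-configuration results, alongside the regular timing observations.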

csbnw self-assigned this on Apr 19, 2024
Resolved review threads: kernel_tuner/core.py, kernel_tuner/observers/observer.py (two threads), kernel_tuner/observers/ncu.py (two threads, outdated)

"""Create a new ``NCUObserver``.

:param metrics: The metrics to observe. This should be a list of strings.
You can use ``ncu --query-metrics`` to get a list of valid metrics.
Member:

Great! Here is an idea: would it be much work to add a way to get the list of supported metrics from nvmetrics? Something like NCUObserver.get_available_metrics() could be useful to check which metrics are supported by the current GPU. Not a critical issue, just a thought.

csbnw (Collaborator, author):

A simple way to implement this is to let NCUObserver call the ncu command-line profiler with --query-metrics. It would arguably be better if nvmetrics could do this, as it links to CUPTI etc. anyway. I will try to figure out whether this is possible.

I have also thought about making this observer more user-friendly by providing presets of metrics. It is very tricky, though, as many metrics have mandatory suffixes like .sum or .average. We don't know what kinds of metrics users are interested in, and the available metrics may differ per architecture. I think this observer should mainly be used by expert users who have profiled their code before (using ncu or the GUI) and found some metrics that they would like to track across a larger parameter space.
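For illustration only, a rough sketch of what such a preset could look like; the preset names and metric strings here are assumptions, and their availability would still have to be checked per architecture with ncu --query-metrics:

    from kernel_tuner.observers.ncu import NCUObserver

    # Purely illustrative presets; the metric names are assumptions and
    # may not exist on every GPU architecture.
    METRIC_PRESETS = {
        "memory": ["dram__bytes_read.sum", "dram__bytes_write.sum"],
        "compute": ["smsp__inst_executed.sum", "sm__cycles_active.avg"],
    }

    ncuobserver = NCUObserver(metrics=METRIC_PRESETS["memory"])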

Collaborator:

Aren't there at least some metrics that are available on all GPUs that could be listed? I mean, to get people started using the NCUObserver at first; if they want more, they can still query ncu --query-metrics.

csbnw (Collaborator, author):

I have added the functionality to query metrics to nvmetrics, including a Python interface. You can now do:

nvmetrics.queryMetrics(nvmetrics.NVPW_METRIC_TYPE_COUNTER)
nvmetrics.queryMetrics(nvmetrics.NVPW_METRIC_TYPE_RATIO)
nvmetrics.queryMetrics(nvmetrics.NVPW_METRIC_TYPE_THROUGHPUT)

These different metric types should be passed with a suffix. These suffixes are called NVPW_RollupOp and NVPW_Submetric, but not all combinations are valid; see this struct:

    typedef struct NVPW_MetricEvalRequest
    {
        /// the metric index as in 'NVPW_MetricsEvaluator_GetMetricNames'
        size_t metricIndex;
        /// one of 'NVPW_MetricType'
        uint8_t metricType;
        /// one of 'NVPW_RollupOp', required for Counter and Throughput, doesn't apply to Ratio
        uint8_t rollupOp;
        /// one of 'NVPW_Submetric', required for Ratio and Throughput, optional for Counter
        uint16_t submetric;
    } NVPW_MetricEvalRequest;
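For illustration, a rough sketch of how the queried base names could be combined with rollup suffixes into the metric strings the observer expects. This assumes queryMetrics returns a list of base-name strings; the suffix list is an example, not an exhaustive or guaranteed-valid set:

    import nvmetrics

    # Base counter names as reported by the library (no rollup suffix yet).
    counters = nvmetrics.queryMetrics(nvmetrics.NVPW_METRIC_TYPE_COUNTER)

    # Counters require a rollup op (see rollupOp in the struct above).
    rollup_ops = [".sum", ".avg", ".min", ".max"]

    candidate_names = [name + op for name in counters for op in rollup_ops]
    print(candidate_names[:10])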

csbnw (Collaborator, author):

@stijnh, how would you suggest proceeding here?
We could, for instance, use the same metrics that the example uses as the default.

(Several follow-up comments between @benvanwerkhoven and @csbnw were marked as outdated.)

benvanwerkhoven (Collaborator):

I think it would be cleaner and more consistent with the current code to filter out prologue observers from the list of observers used during the default benchmark loop, just like we do with continuous observers on this line:
https://github.com/KernelTuner/kernel_tuner/blob/add-ncu-observer/kernel_tuner/core.py#L348

We could move that filtering to the init, though, to do it only once, since it won't change during the run.
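A minimal sketch of what that split could look like, assuming the observer base classes live in kernel_tuner.observers.observer and are named PrologueObserver and ContinuousObserver (this is not the actual core.py code):

    from kernel_tuner.observers.observer import ContinuousObserver, PrologueObserver

    def split_observers(observers):
        # Partition the observers once, e.g. in the init, instead of on every benchmark.
        prologue = [obs for obs in observers if isinstance(obs, PrologueObserver)]
        continuous = [obs for obs in observers if isinstance(obs, ContinuousObserver)]
        # Everything else participates in the default benchmark loop.
        regular = [obs for obs in observers
                   if not isinstance(obs, (PrologueObserver, ContinuousObserver))]
        return prologue, continuous, regular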

benvanwerkhoven (Collaborator):

I'll make some changes to see what this would look like.

benvanwerkhoven (Collaborator):

We had a small issue with the CI, but that's fixed now.

Resolved review thread (outdated): kernel_tuner/observers/ncu.py
sonarcloud bot commented on Apr 26, 2024:

Quality Gate passed

Issues: 1 new issue, 0 accepted issues
Measures: 0 security hotspots, no coverage data, 0.0% duplication on new code

See analysis details on SonarCloud

csbnw (Collaborator, author) commented on Apr 26, 2024:

“I'll make some changes to see what this would look like”

These changes indeed make the code more consistent with the existing code, but they also make it harder to see, just from the diff for this branch, what new functionality was introduced. To be honest, though, what matters most to me is that the NCUObserver works. (It does, of course!) I am not sure whether you have more refactoring in mind or need anything from my side before this PR is ready to merge.

benvanwerkhoven (Collaborator):

I think I'm happy with how the code turned out now. The prologue observers should probably be mentioned in the documentation somewhere, but other than that this pull request is ready to be merged, in my opinion.

Comment on lines +118 to +119
The NCUObserver can be used to automatically extract performance counters during tuning using Nvidia's NsightCompute profiler.
The NCUObserver relies on an intermediate library, which can be found here: https://github.com/nlesc-recruit/nvmetrics
csbnw (Collaborator, author):

Suggested change:
- The NCUObserver can be used to automatically extract performance counters during tuning using Nvidia's NsightCompute profiler.
- The NCUObserver relies on an intermediate library, which can be found here: https://github.com/nlesc-recruit/nvmetrics
+ The NCUObserver can be used to automatically extract performance counters during tuning using Nvidia's CUDA Profiling Tools Interface (CUPTI) library, to offer insights into performance characteristics like memory access patterns and instruction execution statistics. It behaves much like the Nvidia Nsight Compute CLI (ncu) application, hence the name NCUObserver. This observer relies on an intermediate library, which can be found here: https://github.com/nlesc-recruit/nvmetrics

csbnw (Collaborator, author):

I wonder whether we should change the name NCUObserver into something like MetricsObserver. Eventually, we may want to also support measuring metrics on non-NVIDIA GPUs, and the observer could then pick the right backend based on whichever runner is used.

Now that I mention runners: I have never tested this observer with anything other than pycuda. The observer needs CUDA to be at least somewhat initialized. When testing nvmetrics with pycuda everything worked fine, but not with cupy, for instance. Maybe we still need to test this (and add the tests?), or at least mention it somewhere.

Collaborator:

In terms of naming, I think I would prefer calling a more generalized NCUObserver a "ProfilingObserver", because we already use the term 'metrics' for the user-defined derived metrics. For now, I think we can keep the name NCUObserver until we have another profiling observer; when we do, it would make sense to think about a more generic type.

I will test with cupy and cuda-python.

Collaborator:

Works with CuPy but not yet with cuda-python.

This is the error I get with cuda-python:

Using: NVIDIA GeForce RTX 3050 Ti Laptop GPU
Error while benchmarking: vector_add
Error while compiling or benchmarking, see source files:  kernel_tuner/examples/cuda/temp_oq2ba5hb.c
Traceback (most recent call last):
  File " kernel_tuner/examples/cuda/vector_add_observers_ncu.py", line 57, in <module>
    tune()
  File " kernel_tuner/examples/cuda/vector_add_observers_ncu.py", line 51, in tune
    results, env = tune_kernel("vector_add", kernel_string, size, args, tune_params, observers=[ncuobserver], metrics=metrics, iterations=7, lang='nvcuda')
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File " kernel_tuner/kernel_tuner/interface.py", line 678, in tune_kernel
    results = strategy.tune(searchspace, runner, tuning_options)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File " kernel_tuner/kernel_tuner/strategies/brute_force.py", line 10, in tune
    return runner.run(searchspace.sorted_list(), tuning_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File " kernel_tuner/kernel_tuner/runners/sequential.py", line 87, in run
    self.dev.compile_and_benchmark(self.kernel_source, self.gpu_args, params, self.kernel_options, tuning_options)
  File " kernel_tuner/kernel_tuner/core.py", line 614, in compile_and_benchmark
    raise e
  File " kernel_tuner/kernel_tuner/core.py", line 603, in compile_and_benchmark
    self.benchmark(func, gpu_args, instance, verbose, to.objective, skip_nvml_setting=False)
  File " kernel_tuner/kernel_tuner/core.py", line 472, in benchmark
    raise e
  File " kernel_tuner/kernel_tuner/core.py", line 437, in benchmark
    self.benchmark_prologue(func, gpu_args, instance.threads, instance.grid, result)
  File " kernel_tuner/kernel_tuner/core.py", line 355, in benchmark_prologue
    obs.before_start()
  File " kernel_tuner/kernel_tuner/observers/ncu.py", line 35, in before_start
    nvmetrics.measureMetricsStart(self.metrics, self.device)
RuntimeError: /home/ben/documents/kernel_tuner/nvmetrics/lib/nv_metrics.cpp:147: error: function cuDeviceGet(&cuDevice, deviceNum) failed with error

Collaborator:

Not sure how this can happen. The initialization of the nvcuda backend is very similar to pycuda's, and by the time we call before_start() it should have been initialized for a while already.

Collaborator:

In the nvcuda backend, I tried to create a new context and destroy it atexit like pycuda is doing, but that doesn't change anything.

csbnw (Collaborator, author):

I think it's fixed in the nvmetrics version that I just pushed (revision c7c42130). It now calls cuInit, which doesn't seem to hurt even if pycuda already initialized CUDA before. To get a valid CUDA context, cuDevicePrimaryCtxRetain is used.
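For reference, the initialization pattern described above looks roughly like this; the sketch uses the cuda-python driver bindings purely for illustration, while the actual fix lives in nvmetrics' C++ code:

    from cuda import cuda

    # cuInit is safe to call even if another library (e.g. pycuda) has already
    # initialized CUDA.
    err, = cuda.cuInit(0)

    # Retain the primary context to obtain a valid CUDA context for profiling.
    err, device = cuda.cuDeviceGet(0)
    err, context = cuda.cuDevicePrimaryCtxRetain(device)

    # ... profiling against this context ...

    # Release the primary context when done.
    err, = cuda.cuDevicePrimaryCtxRelease(device)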

Collaborator:

Awesome! I just pulled and rebuilt the nvmetrics library and tested it with every CUDA backend. It all works now!

csbnw (Collaborator, author):

That's great! I plan to make a 1.0 release of nvmetrics to go along with this PR.

benvanwerkhoven merged commit e83870b into master on Jun 4, 2024. 3 checks passed.