Add NCUObserver #253
Conversation
"""Create a new ``NCUObserver``.

:param metrics: The metrics to observe. This should be a list of strings.
    You can use ``ncu --query-metrics`` to get a list of valid metrics.
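For illustration, a short sketch of how such a metrics list might be assembled. The counter names below are examples of typical `ncu` metric names and the `.sum` rollup suffix discussed later in this thread; availability depends on the GPU, so verify them with `ncu --query-metrics` first:

```python
# Example base counters; these are illustrative and should be checked
# against `ncu --query-metrics` on the target GPU.
base_counters = ["dram__bytes_read", "dram__bytes_write"]

# Many counters require a rollup suffix such as .sum or .avg.
metrics = [f"{counter}.sum" for counter in base_counters]

print(metrics)  # ['dram__bytes_read.sum', 'dram__bytes_write.sum']
# This list would then be passed as NCUObserver(metrics=metrics).
```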
Great! Here is an idea: would it be much work to add a way to get the list of supported metrics from nvmetrics? Something like NCUObserver.get_available_metrics() could be useful to check which metrics are supported by the current GPU. Not a critical issue, just a thought.
A simple way to implement this is to let NCUObserver call the ncu command-line profiler with --query-metrics. It would arguably be better if nvmetrics could do this, as it links to CUPTI etc. anyway. I will try to figure out whether this is possible.
I have also thought about making this observer more user-friendly by providing presets of metrics. It is very tricky, though, as many metrics have mandatory suffixes like .sum or .average. We don't know what kind of metrics users are interested in. Moreover, the available metrics may differ per architecture. I think this observer should mainly be used by expert users who have profiled their code before (using ncu or the GUI) and found some metrics that they would like to know for a larger parameter space.
Aren't there at least some metrics that are available on all GPUs that could be listed? I mean to get people started using the NCUObserver at first, and then if they want more they can still query ncu --query-metrics.
I have added the functionality to query metrics to nvmetrics, including a Python interface. You can now do:
nvmetrics.queryMetrics(nvmetrics.NVPW_METRIC_TYPE_COUNTER)
nvmetrics.queryMetrics(nvmetrics.NVPW_METRIC_TYPE_RATIO)
nvmetrics.queryMetrics(nvmetrics.NVPW_METRIC_TYPE_THROUGHPUT)
These different metric types should be passed with a suffix; the suffixes are called NVPW_RollupOp and NVPW_Submetric, but not all combinations are valid; see this struct:
typedef struct NVPW_MetricEvalRequest
{
/// the metric index as in 'NVPW_MetricsEvaluator_GetMetricNames'
size_t metricIndex;
/// one of 'NVPW_MetricType'
uint8_t metricType;
/// one of 'NVPW_RollupOp', required for Counter and Throughput, doesn't apply to Ratio
uint8_t rollupOp;
/// one of 'NVPW_Submetric', required for Ratio and Throughput, optional for Counter
uint16_t submetric;
} NVPW_MetricEvalRequest;
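The combination rules in those struct comments can be sketched as a small validity check. This is not part of nvmetrics; the type names and the helper below are illustrative only:

```python
# Combination rules as described in the NVPW_MetricEvalRequest comments:
# rollupOp is required for Counter and Throughput but doesn't apply to Ratio;
# submetric is required for Ratio and Throughput and optional for Counter.
RULES = {
    "COUNTER":    {"rollup": "required", "submetric": "optional"},
    "RATIO":      {"rollup": "not_applicable", "submetric": "required"},
    "THROUGHPUT": {"rollup": "required", "submetric": "required"},
}

def combination_valid(metric_type, has_rollup, has_submetric):
    """Check whether a rollup/submetric combination is allowed for a metric type."""
    rule = RULES[metric_type]
    if rule["rollup"] == "required" and not has_rollup:
        return False
    if rule["rollup"] == "not_applicable" and has_rollup:
        return False
    if rule["submetric"] == "required" and not has_submetric:
        return False
    return True
```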
@stijnh, how would you suggest proceeding here?
We could for instance use the same metrics that the example uses as default.
I think it would be cleaner and more consistent with the current code to filter out prologue observers from the list of observers to use during the default benchmark loop, just like we do with continuous observers on this line. We could move that to the init to do it only once, since it won't change during the run.
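A minimal sketch of that filtering approach. The class names follow this discussion, but the exact class hierarchy and attribute layout in Kernel Tuner may differ:

```python
# Illustrative observer classes; the real ones live in kernel_tuner.observers.
class BenchmarkObserver:
    pass

class ContinuousObserver(BenchmarkObserver):
    pass

class PrologueObserver(BenchmarkObserver):
    pass

def split_observers(observers):
    """Separate prologue observers from those used in the main benchmark loop,
    mirroring how continuous observers are already filtered out."""
    prologue = [obs for obs in observers if isinstance(obs, PrologueObserver)]
    benchmark = [obs for obs in observers if not isinstance(obs, PrologueObserver)]
    return prologue, benchmark
```

Doing this once in the init, as suggested, avoids re-filtering on every benchmark call.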
I'll make some changes to see what this would look like.
We had a small issue with the CI, but that's fixed now.
These changes indeed make the code more consistent with the existing code, but it also makes it harder to see what new functionality was introduced just by looking at the diff for this branch. But to be honest, what matters most to me is that the
I think I'm happy with how the code turned out now. I guess the prologue observers should be mentioned in the documentation somewhere, but other than that this pull is quite ready to be merged, in my opinion.
The NCUObserver can be used to automatically extract performance counters during tuning using Nvidia's NsightCompute profiler.
The NCUObserver relies on an intermediate library, which can be found here: https://github.com/nlesc-recruit/nvmetrics

Suggested change:
The NCUObserver can be used to automatically extract performance counters during tuning using Nvidia's CUDA Profiling Tools Interface (CUPTI) library, to offer insights into performance characteristics like memory access patterns and instruction execution statistics. It behaves much like the Nvidia Nsight Compute CLI (ncu) application, hence the name NCUObserver. This observer relies on an intermediate library, which can be found here: https://github.com/nlesc-recruit/nvmetrics
I wonder whether we should change the name NCUObserver into something like MetricsObserver. Eventually, we may want to also support measuring metrics on non-NVIDIA GPUs, and the observer could then pick the right backend based on whatever runner is used.
Now that I mention runners, I never tested this observer with anything other than PyCUDA. The observer needs CUDA to be at least somewhat initialized. When testing nvmetrics with PyCUDA everything worked fine, but not using CuPy, for instance. Maybe we still need to test this (and add the tests?) or at least mention it somewhere.
In terms of naming, I would prefer to call a more generalized NCUObserver a "ProfilingObserver", because we already use the term 'metrics' for the user-defined derived metrics. For now, I think we can keep the name NCUObserver until we have another "ProfilingObserver"; when we do, it would make sense to think about a more generic type.
I will test with cupy and cuda-python.
Works with CuPy but not yet with cuda-python.
This is the error I get with cuda-python:
Using: NVIDIA GeForce RTX 3050 Ti Laptop GPU
Error while benchmarking: vector_add
Error while compiling or benchmarking, see source files: kernel_tuner/examples/cuda/temp_oq2ba5hb.c
Traceback (most recent call last):
File " kernel_tuner/examples/cuda/vector_add_observers_ncu.py", line 57, in <module>
tune()
File " kernel_tuner/examples/cuda/vector_add_observers_ncu.py", line 51, in tune
results, env = tune_kernel("vector_add", kernel_string, size, args, tune_params, observers=[ncuobserver], metrics=metrics, iterations=7, lang='nvcuda')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File " kernel_tuner/kernel_tuner/interface.py", line 678, in tune_kernel
results = strategy.tune(searchspace, runner, tuning_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File " kernel_tuner/kernel_tuner/strategies/brute_force.py", line 10, in tune
return runner.run(searchspace.sorted_list(), tuning_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File " kernel_tuner/kernel_tuner/runners/sequential.py", line 87, in run
self.dev.compile_and_benchmark(self.kernel_source, self.gpu_args, params, self.kernel_options, tuning_options)
File " kernel_tuner/kernel_tuner/core.py", line 614, in compile_and_benchmark
raise e
File " kernel_tuner/kernel_tuner/core.py", line 603, in compile_and_benchmark
self.benchmark(func, gpu_args, instance, verbose, to.objective, skip_nvml_setting=False)
File " kernel_tuner/kernel_tuner/core.py", line 472, in benchmark
raise e
File " kernel_tuner/kernel_tuner/core.py", line 437, in benchmark
self.benchmark_prologue(func, gpu_args, instance.threads, instance.grid, result)
File " kernel_tuner/kernel_tuner/core.py", line 355, in benchmark_prologue
obs.before_start()
File " kernel_tuner/kernel_tuner/observers/ncu.py", line 35, in before_start
nvmetrics.measureMetricsStart(self.metrics, self.device)
RuntimeError: /home/ben/documents/kernel_tuner/nvmetrics/lib/nv_metrics.cpp:147: error: function cuDeviceGet(&cuDevice, deviceNum) failed with error
Not sure how this can happen. The initialization of the nvcuda backend is very similar to pycuda, and by the time we call before_start() it should have been initialized for a while already.
In the nvcuda backend, I tried to create a new context and destroy it atexit like pycuda is doing, but that doesn't change anything.
I think it's fixed in the nvmetrics version that I just pushed (revision c7c42130). It now calls cuInit, which doesn't seem to hurt even if PyCUDA already initialized CUDA before. To get a valid CUDA context, cuDevicePrimaryCtxRetain is used.
Awesome! I just pulled and rebuilt the nvmetrics library and tested it with every CUDA backend. It all works now!
That's great! I plan to make a 1.0 release of nvmetrics to go along with this MR.
Add NCUObserver to measure performance metrics, like you would normally do using the command-line tool ncu.
This observer is an instance of the new PrologueObserver. The PrologueObserver runs before any of the normal benchmarking. In the case of NCUObserver, this is because profiling will in most cases affect the performance of the kernel, and therefore any other measurements by the other observers.
This new code has a dependency on nvmetrics, a small library that uses NVIDIA CUPTI and nvperf. Kernel Tuner will work fine even when this library is not found; the NCUObserver will simply print a warning and measure nothing.
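As a rough sketch of the prologue flow described above. The method names (before_start, after_finish, get_results) and the stand-in observer are assumptions based on this discussion, not the exact Kernel Tuner API:

```python
class PrologueObserver:
    """Observer whose measurement runs once, before the timed benchmark loop,
    so that profiling overhead cannot contaminate the timing results."""
    def before_start(self):
        pass
    def after_finish(self):
        pass
    def get_results(self):
        return {}

class FakeNCUObserver(PrologueObserver):
    """Illustrative stand-in: the real observer would start and stop CUPTI
    measurements via nvmetrics around the profiled kernel launch."""
    def __init__(self, metrics):
        self.metrics = metrics
        self.results = {}
    def before_start(self):
        self._profiling = True
    def after_finish(self):
        # Placeholder values; the real observer reads hardware counters here.
        self.results = {m: 0.0 for m in self.metrics}
        self._profiling = False
    def get_results(self):
        return self.results

def run_with_prologue(kernel, prologue_observers):
    """Run the kernel once under profiling, outside the normal timing loop."""
    for obs in prologue_observers:
        obs.before_start()
    kernel()
    for obs in prologue_observers:
        obs.after_finish()
    # The normal (unprofiled) benchmark loop would follow here.
```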