NVML_ERROR_NOT_SUPPORTED exception #1722
Comments
Thanks for opening this. Which specific devices are you referring to? Is it an older NVIDIA GPU? An AMD GPU? Something else? EDIT: This seems to be a somewhat known issue: https://forums.developer.nvidia.com/t/bug-nvml-incorrectly-detects-certain-gpus-as-unsupported/30165. We can produce a better workaround.
nvidia smi:
This is a virtual GPU. It seems that some features like temperature monitoring might not be supported for these virtual devices. See for instance page 118 of https://docs.nvidia.com/grid/latest/pdf/grid-vgpu-user-guide.pdf.
@msaroufim If you'd approve an upstream bug fix, I'd be happy to help.
@msaroufim any update on this?
Hi @lromor I'm not sure what the right fix is yet. It does seem like this is a problem introduced by NVIDIA.
Hi @msaroufim, I've opened an issue here: https://forums.developer.nvidia.com/t/nvml-issue-with-virtual-a100/220718?u=lromor
In case anyone runs into a similar issue and would like a quick fix, I patched the code with:
diff --git a/ts/metrics/system_metrics.py b/ts/metrics/system_metrics.py
index c7aaf6a..9915c9e 100644
--- a/ts/metrics/system_metrics.py
+++ b/ts/metrics/system_metrics.py
@@ -7,6 +7,7 @@ from builtins import str
 import psutil
 from ts.metrics.dimension import Dimension
 from ts.metrics.metric import Metric
+import pynvml
 system_metrics = []
 dimension = [Dimension('Level', 'Host')]
@@ -69,7 +70,11 @@ def gpu_utilization(num_of_gpu):
         system_metrics.append(Metric('GPUMemoryUtilization', value['mem_used_percent'], 'percent', dimension_gpu))
         system_metrics.append(Metric('GPUMemoryUsed', value['mem_used'], 'MB', dimension_gpu))
-    statuses = list_gpus.device_statuses()
+    try:
+        statuses = list_gpus.device_statuses()
+    except pynvml.nvml.NVMLError_NotSupported:
+        statuses = []
+
     for idx, status in enumerate(statuses):
         dimension_gpu = [Dimension('Level', 'Host'), Dimension("device_id", idx)]
         system_metrics.append(Metric('GPUUtilization', status['utilization'], 'percent', dimension_gpu))
I think this is the right solution. Wanna make a PR for it? May just need to add a logging warning as well.
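For anyone picking this up, the patch above with the suggested logging warning added might look roughly like the following. This is only a sketch, not the merged fix: `NotSupportedError` and `safe_device_statuses` are hypothetical names used so the snippet runs without a GPU. In TorchServe the exception would be `pynvml.nvml.NVMLError_NotSupported` and the query would be `list_gpus.device_statuses`, as in the patch.

```python
import logging


class NotSupportedError(Exception):
    """Hypothetical stand-in for pynvml.nvml.NVMLError_NotSupported."""


def safe_device_statuses(query, unsupported_exc=NotSupportedError):
    """Call `query` and return its result, or [] if the device rejects the query.

    `query` is any zero-argument callable returning a list of status dicts
    (assumed here to play the role of list_gpus.device_statuses).
    """
    try:
        return query()
    except unsupported_exc:
        # Warn instead of failing so that startup can continue without
        # per-device utilization metrics.
        logging.warning("GPU device monitoring is not supported; skipping utilization metrics")
        return []
```

With this shape, the loop over `statuses` in `gpu_utilization` simply sees an empty list on unsupported devices, so the memory metrics collected earlier are still reported.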
🐛 Describe the bug
NVML sometimes does not support monitoring queries for specific devices. Currently the resulting exception causes the startup phase to fail.
Error logs
Installation instructions
pytorch/torchserve:latest-gpu
Model Packaging
N/A
config.properties
No response
Versions
Repro instructions
run:
Possible Solution
Catch these exceptions instead of letting them abort startup.