NVML_ERROR_NOT_SUPPORTED exception #1722

lromor · 2022-07-04T13:16:02Z

🐛 Describe the bug

NVML sometimes does not support monitoring queries for specific devices. Currently this causes the startup phase to fail.

Error logs

2022-07-04T12:33:15,023 [ERROR] Thread-20 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
  File "ts/metrics/metric_collector.py", line 27, in <module>
    system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
  File "/usr/local/lib/python3.6/dist-packages/ts/metrics/system_metrics.py", line 91, in collect_all
    value(num_of_gpu)
  File "/usr/local/lib/python3.6/dist-packages/ts/metrics/system_metrics.py", line 72, in gpu_utilization
    statuses = list_gpus.device_statuses()
  File "/usr/local/lib/python3.6/dist-packages/nvgpu/list_gpus.py", line 67, in device_statuses
    return [device_status(device_index) for device_index in range(device_count)]
  File "/usr/local/lib/python3.6/dist-packages/nvgpu/list_gpus.py", line 67, in <listcomp>
    return [device_status(device_index) for device_index in range(device_count)]
  File "/usr/local/lib/python3.6/dist-packages/nvgpu/list_gpus.py", line 26, in device_status
    temperature = nv.nvmlDeviceGetTemperature(handle, nv.NVML_TEMPERATURE_GPU)
  File "/usr/local/lib/python3.6/dist-packages/pynvml/nvml.py", line 1956, in nvmlDeviceGetTemperature
    _nvmlCheckReturn(ret)
  File "/usr/local/lib/python3.6/dist-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported

Installation instructions

pytorch/torchserve:latest-gpu

Model Packaging

N/A

config.properties

No response

Versions

------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch: 

torchserve==0.6.0
torch-model-archiver==0.6.0

Python version: 3.6 (64-bit runtime)
Python executable: /usr/bin/python3

Versions of relevant python libraries:
future==0.18.2
numpy==1.19.5
nvgpu==0.9.0
psutil==5.9.1
requests==2.27.1
torch-model-archiver==0.6.0
torch-workflow-archiver==0.2.4
torchserve==0.6.0
wheel==0.30.0
**Warning: torch not present ..
**Warning: torchtext not present ..
**Warning: torchvision not present ..
**Warning: torchaudio not present ..

Java Version:


OS: N/A
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: N/A
CMake version: N/A

Repro instructions

run:

torchserve --start --foreground --model-store model-store/

Possible Solution

Catch these NVML exceptions in the metrics collector so that an unsupported query does not make startup fail.


msaroufim · 2022-07-04T16:46:45Z

Thanks for opening this. Which specific devices are you referring to? Is it an older NVIDIA GPU, an AMD GPU, or something else? EDIT: This seems to be a somewhat known issue: https://forums.developer.nvidia.com/t/bug-nvml-incorrectly-detects-certain-gpus-as-unsupported/30165. We can produce a better workaround.

lromor · 2022-07-05T10:42:51Z

nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100DX-40C     On   | 00000000:00:05.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This is a virtual GPU. It seems that some features like temperature monitoring might not be supported for these virtual devices. See for instance page 118 of https://docs.nvidia.com/grid/latest/pdf/grid-vgpu-user-guide.pdf.
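A rough way to check which queries a given device rejects is to run each one separately and record `None` when it is not supported. This is only a sketch: `read_optional` and `probe_device` are hypothetical helper names, and it assumes the same `pynvml` package that appears in the traceback above (imported lazily so the helper can be defined on machines without it).

```python
def read_optional(query, *args, unsupported=Exception):
    """Return query(*args), or None if the query raises `unsupported`."""
    try:
        return query(*args)
    except unsupported:
        return None


def probe_device(index=0):
    """Hypothetical helper: query one device and mark unsupported fields as None."""
    import pynvml as nv  # lazy import; assumes pynvml and an NVIDIA driver are installed

    not_supported = nv.nvml.NVMLError_NotSupported  # same class as in the traceback
    nv.nvmlInit()
    try:
        handle = nv.nvmlDeviceGetHandleByIndex(index)
        return {
            "temperature": read_optional(
                nv.nvmlDeviceGetTemperature, handle, nv.NVML_TEMPERATURE_GPU,
                unsupported=not_supported,
            ),
            "utilization": read_optional(
                nv.nvmlDeviceGetUtilizationRates, handle,
                unsupported=not_supported,
            ),
        }
    finally:
        nv.nvmlShutdown()
```

On a device like the one above I would expect `probe_device()` to report `None` for the temperature, though I have only reasoned about this from the traceback.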

lromor · 2022-07-05T10:45:01Z

@msaroufim If you approve of an upstream bug fix, I'd be happy to help.

lromor · 2022-07-11T08:54:36Z

@msaroufim any update on this?

msaroufim · 2022-07-13T23:19:42Z

Hi @lromor, I'm not sure what the right fix is yet. It does seem like this is a problem introduced by NVIDIA (pynvml.nvml.NVMLError_NotSupported: Not Supported), so I believe your best bet is commenting on https://forums.developer.nvidia.com/t/bug-nvml-incorrectly-detects-certain-gpus-as-unsupported/30165, which will give someone on the team some buffer to take a look.

lromor · 2022-07-18T07:59:56Z

Hi @msaroufim, I've opened an issue here: https://forums.developer.nvidia.com/t/nvml-issue-with-virtual-a100/220718?u=lromor

lromor · 2022-08-18T13:44:53Z

In case anyone runs into a similar issue and would like a quick fix, I patched the code with:

diff --git a/ts/metrics/system_metrics.py b/ts/metrics/system_metrics.py
index c7aaf6a..9915c9e 100644
--- a/ts/metrics/system_metrics.py
+++ b/ts/metrics/system_metrics.py
@@ -7,6 +7,7 @@ from builtins import str
 import psutil
 from ts.metrics.dimension import Dimension
 from ts.metrics.metric import Metric
+import pynvml
 
 system_metrics = []
 dimension = [Dimension('Level', 'Host')]
@@ -69,7 +70,11 @@ def gpu_utilization(num_of_gpu):
         system_metrics.append(Metric('GPUMemoryUtilization', value['mem_used_percent'], 'percent', dimension_gpu))
         system_metrics.append(Metric('GPUMemoryUsed', value['mem_used'], 'MB', dimension_gpu))
 
-    statuses = list_gpus.device_statuses()
+    try:
+        statuses = list_gpus.device_statuses()
+    except pynvml.nvml.NVMLError_NotSupported:
+        statuses = []
+
     for idx, status in enumerate(statuses):
         dimension_gpu = [Dimension('Level', 'Host'), Dimension("device_id", idx)]
         system_metrics.append(Metric('GPUUtilization', status['utilization'], 'percent', dimension_gpu))

msaroufim · 2022-08-18T15:16:34Z

I think this is the right solution. Wanna make a PR for it? May just need to add a logging warning as well
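For reference, a minimal sketch of what the patch could look like with a logging warning added. The helper name `collect_statuses` and its parameters are hypothetical, not part of TorchServe; the callable and the exception type are passed in so the snippet runs without `nvgpu` or `pynvml` installed.

```python
import logging

logger = logging.getLogger(__name__)


def collect_statuses(fetch_statuses, not_supported_error):
    """Return the device statuses, or an empty list if NVML rejects the query.

    fetch_statuses: zero-argument callable returning a list of status dicts.
    not_supported_error: exception class raised for unsupported NVML queries.
    """
    try:
        return fetch_statuses()
    except not_supported_error as err:
        # Warn instead of failing, so metric collection continues without GPU statuses.
        logger.warning(
            "NVML query not supported on this device, skipping GPU status metrics: %s",
            err,
        )
        return []
```

In `system_metrics.py` this would presumably be called as `collect_statuses(list_gpus.device_statuses, pynvml.nvml.NVMLError_NotSupported)`, mirroring the `try`/`except` in the diff above.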

msaroufim added the bug (Something isn't working) and p1 (mid priority) labels on Jul 4, 2022

lromor mentioned this issue Aug 23, 2022

managing nvml exception #1809

Merged

5 tasks

agunapal closed this as completed in #1809 Aug 26, 2022

SimengLiu-nv mentioned this issue Feb 13, 2023

When I run the trex with process_engine.py, it said pynvml.nvml.NVMLError_NotSupported: Not Supported NVIDIA/TensorRT#2669

Closed
