🚀 The feature
Add a flag to disable system metrics collection.
Or add a config file to control system metrics collection.
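As a rough illustration of what we are asking for, here is a minimal sketch of an opt-out guard around the collector entry point. The TS_DISABLE_SYSTEM_METRICS variable and the maybe_collect wrapper are hypothetical; only the collect_all call is taken from the traceback further down.

```python
# Hypothetical sketch of the requested opt-out; not existing TorchServe code.
# The environment-variable name and wrapper are illustrative only; the
# collect_all call mirrors the one shown in the traceback below.
import os
import sys

from ts.metrics import system_metrics


def maybe_collect(num_of_gpu):
    # If disabled, skip the psutil / nvidia-smi based collection entirely.
    if os.environ.get("TS_DISABLE_SYSTEM_METRICS", "false").lower() == "true":
        return
    system_metrics.collect_all(sys.modules["ts.metrics.system_metrics"], num_of_gpu)
```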
Motivation, pitch
We noticed that there is no option to control system metrics collection (if there is one, please let us know).
Redundancy
In our case, system metrics collection is redundant because we already collect system metrics with other tools such as the DCGM exporter.
Metrics collection is not working on MIG GPUs
Also, we are using MIG GPUs, which have different UUIDs and accessibility (#1237 may be a similar issue), so metrics collection fails on MIG GPUs:
File "ts/metrics/metric_collector.py", line 27, in<module>
system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
File "/home/venv/lib/python3.8/site-packages/ts/metrics/system_metrics.py", line 119, in collect_all
value(num_of_gpu)
File "/home/venv/lib/python3.8/site-packages/ts/metrics/system_metrics.py", line 71, in gpu_utilization
info = nvgpu.gpu_info()
File "/home/venv/lib/python3.8/site-packages/nvgpu/__init__.py", line 8, in gpu_info
gpu_infos = [re.match('GPU ([0-9]+): ([^(]+) \(UUID: ([^)]+)\)', gpu).groups() forgpuin gpus]
File "/home/venv/lib/python3.8/site-packages/nvgpu/__init__.py", line 8, in<listcomp>
gpu_infos = [re.match('GPU ([0-9]+): ([^(]+) \(UUID: ([^)]+)\)', gpu).groups() forgpuin gpus]
AttributeError: 'NoneType' object has no attribute 'groups'
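For reference, this is what seems to go wrong: on MIG-enabled machines, nvidia-smi -L also prints indented MIG device lines, which the regex in nvgpu does not match, so re.match returns None. A small self-contained reproduction follows; the MIG line is illustrative of the format we see, not an exact transcript.

```python
import re

# Regex copied from nvgpu/__init__.py as shown in the traceback above.
pattern = r'GPU ([0-9]+): ([^(]+) \(UUID: ([^)]+)\)'

# Example nvidia-smi -L output; the MIG line approximates the format we
# observe on MIG-enabled A100s and is not an exact transcript.
lines = [
    "GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-8f6d0a1b-0000-0000-0000-000000000000)",
    "  MIG 3g.20gb     Device  0: (UUID: MIG-1c2d3e4f-0000-0000-0000-000000000000)",
]

for line in lines:
    m = re.match(pattern, line)
    # nvgpu calls .groups() unconditionally, so the None here becomes the
    # AttributeError shown in the traceback.
    print(repr(line), "->", m.groups() if m else None)
```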
Other issues we've experienced
We experienced that when multiple TorchServe containers (~30) start simultaneously on a single GPU machine, the nvidia-smi command hangs. We analyzed the issue, and the nvidia-smi call in the metrics collector seems related: TorchServe uses nvgpu to collect GPU info, and nvgpu calls nvidia-smi. So starting many containers at once causes a lot of concurrent nvidia-smi calls, and eventually the nvidia-smi command hangs (a rough reproduction sketch follows below).
Please consider the request, thank you.
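A rough way to approximate the load pattern on a single GPU machine; the container count of 30 mirrors our setup, and the script only simulates what that many metrics collectors do at startup.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Approximation of ~30 TorchServe containers starting at once: each metrics
# collector shells out to nvidia-smi via nvgpu, so the host sees that many
# concurrent nvidia-smi invocations. Requires nvidia-smi on PATH; a hang
# shows up as subprocess.TimeoutExpired.
N_CONTAINERS = 30  # illustrative; roughly matches what we run

def query_gpus(_):
    return subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, timeout=60
    ).returncode

with ThreadPoolExecutor(max_workers=N_CONTAINERS) as pool:
    codes = list(pool.map(query_gpus, range(N_CONTAINERS)))

print("non-zero exit codes:", sum(1 for c in codes if c != 0))
```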
I've heard this ask a few times now, so I'm marking it as high priority. I don't think this is particularly tricky to implement either; we just might run into some issues where parts of the codebase expect metric logs to exist.