🚀 The feature
Add a flag to disable system metrics collection.
Or add a config file to control system metrics collection.
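As a rough illustration of what we are asking for, here is a minimal sketch of an opt-out guard around the collector entry point. The TS_DISABLE_SYSTEM_METRICS variable and the maybe_collect wrapper are hypothetical; only the collect_all call is taken from the traceback further down.

```python
# Hypothetical sketch of the requested opt-out; not existing TorchServe code.
# The environment-variable name and wrapper are illustrative only; the
# collect_all call mirrors the one shown in the traceback below.
import os
import sys

from ts.metrics import system_metrics


def maybe_collect(num_of_gpu):
    # If disabled, skip the psutil / nvidia-smi based collection entirely.
    if os.environ.get("TS_DISABLE_SYSTEM_METRICS", "false").lower() == "true":
        return
    system_metrics.collect_all(sys.modules["ts.metrics.system_metrics"], num_of_gpu)
```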
Motivation, pitch
We noticed that there is no option to control system metrics collection (if there is one, please let us know).
Redundancy
In our case, system metrics collection is redundant because we already collect system metrics with other tools such as the DCGM exporter.
Metrics collection is not working on MIG GPUs
Also, we are using MIG GPUs, which have different UUIDs and accessibility (#1237 may be a similar issue), so metrics collection fails on MIG GPUs:
File "ts/metrics/metric_collector.py", line 27, in<module>
system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
File "/home/venv/lib/python3.8/site-packages/ts/metrics/system_metrics.py", line 119, in collect_all
value(num_of_gpu)
File "/home/venv/lib/python3.8/site-packages/ts/metrics/system_metrics.py", line 71, in gpu_utilization
info = nvgpu.gpu_info()
File "/home/venv/lib/python3.8/site-packages/nvgpu/__init__.py", line 8, in gpu_info
gpu_infos = [re.match('GPU ([0-9]+): ([^(]+) \(UUID: ([^)]+)\)', gpu).groups() forgpuin gpus]
File "/home/venv/lib/python3.8/site-packages/nvgpu/__init__.py", line 8, in<listcomp>
gpu_infos = [re.match('GPU ([0-9]+): ([^(]+) \(UUID: ([^)]+)\)', gpu).groups() forgpuin gpus]
AttributeError: 'NoneType' object has no attribute 'groups'
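For reference, this is what seems to go wrong: on MIG-enabled machines, nvidia-smi -L also prints indented MIG device lines, which the regex in nvgpu does not match, so re.match returns None. A small self-contained reproduction follows; the MIG line is illustrative of the format we see, not an exact transcript.

```python
import re

# Regex copied from nvgpu/__init__.py as shown in the traceback above.
pattern = r'GPU ([0-9]+): ([^(]+) \(UUID: ([^)]+)\)'

# Example nvidia-smi -L output; the MIG line approximates the format we
# observe on MIG-enabled A100s and is not an exact transcript.
lines = [
    "GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-8f6d0a1b-0000-0000-0000-000000000000)",
    "  MIG 3g.20gb     Device  0: (UUID: MIG-1c2d3e4f-0000-0000-0000-000000000000)",
]

for line in lines:
    m = re.match(pattern, line)
    # nvgpu calls .groups() unconditionally, so the None here becomes the
    # AttributeError shown in the traceback.
    print(repr(line), "->", m.groups() if m else None)
```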
Other issues we've experienced
We experienced that when multiple TorchServe containers (~30) start simultaneously on a single GPU machine, the nvidia-smi command hangs. We analyzed the issue, and the nvidia-smi call in the metrics collector seems related: TorchServe uses nvgpu to collect GPU info, and nvgpu calls nvidia-smi. So starting many containers at once causes a lot of concurrent nvidia-smi calls, and eventually the nvidia-smi command hangs (a rough reproduction sketch follows below).
Please consider the request, thank you.
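A rough way to approximate the load pattern on a single GPU machine; the container count of 30 mirrors our setup, and the script only simulates what that many metrics collectors do at startup.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Approximation of ~30 TorchServe containers starting at once: each metrics
# collector shells out to nvidia-smi via nvgpu, so the host sees that many
# concurrent nvidia-smi invocations. Requires nvidia-smi on PATH; a hang
# shows up as subprocess.TimeoutExpired.
N_CONTAINERS = 30  # illustrative; roughly matches what we run

def query_gpus(_):
    return subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, timeout=60
    ).returncode

with ThreadPoolExecutor(max_workers=N_CONTAINERS) as pool:
    codes = list(pool.map(query_gpus, range(N_CONTAINERS)))

print("non-zero exit codes:", sum(1 for c in codes if c != 0))
```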
I've heard this ask a few times now, so I'm marking it as high priority. I don't think this is particularly tricky to implement either; we just might run into some issues where parts of the codebase expect metric logs to exist.