Add a flag to disable system metrics collection #2052

Closed
henrysecond1 opened this issue Jan 3, 2023 · 1 comment · Fixed by #2104
Labels: enhancement (New feature or request) · p0 (high priority) · triaged (Issue has been reviewed and triaged)

Comments


henrysecond1 commented Jan 3, 2023

🚀 The feature

Add a flag to disable system metrics collection.

Or add a config file to control system metrics collection.
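Purely as an illustration of what such a switch could look like (not a description of TorchServe's actual implementation), here is a minimal sketch; the environment variable `TS_DISABLE_SYSTEM_METRICS` and the wrapper function are hypothetical names:

```python
# Hypothetical sketch only: gate system metrics collection behind an opt-out flag.
# TS_DISABLE_SYSTEM_METRICS and maybe_collect_system_metrics are made-up names.
import os
import sys

from ts.metrics import system_metrics


def maybe_collect_system_metrics(arguments):
    if os.environ.get("TS_DISABLE_SYSTEM_METRICS", "false").lower() == "true":
        return  # skip CPU/GPU system metrics entirely
    # Same call that ts/metrics/metric_collector.py makes today (see traceback below).
    system_metrics.collect_all(sys.modules["ts.metrics.system_metrics"], arguments.gpu)
```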

Motivation, pitch

We noticed that there is no option to control system metrics collection (if there is one, please let us know).

Redundancy

In our case, system metrics collection is redundant because we already collect system metrics with other tools, such as the DCGM exporter.

Metrics collection is not working on MIG GPUs

Also, we are using MIG GPUs, which have different UUIDs and accessibility (#1237 may be a similar issue), so metrics collection does not work on them:

File "ts/metrics/metric_collector.py", line 27, in <module> 
     system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu) 
   File "/home/venv/lib/python3.8/site-packages/ts/metrics/system_metrics.py", line 119, in collect_all 
     value(num_of_gpu) 
   File "/home/venv/lib/python3.8/site-packages/ts/metrics/system_metrics.py", line 71, in gpu_utilization 
     info = nvgpu.gpu_info() 
   File "/home/venv/lib/python3.8/site-packages/nvgpu/__init__.py", line 8, in gpu_info 
     gpu_infos = [re.match('GPU ([0-9]+): ([^(]+) \(UUID: ([^)]+)\)', gpu).groups() for gpu in gpus] 
   File "/home/venv/lib/python3.8/site-packages/nvgpu/__init__.py", line 8, in <listcomp> 
     gpu_infos = [re.match('GPU ([0-9]+): ([^(]+) \(UUID: ([^)]+)\)', gpu).groups() for gpu in gpus] 
 AttributeError: 'NoneType' object has no attribute 'groups'
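The regex in nvgpu only matches lines of the form `GPU <index>: <name> (UUID: ...)`. On a MIG-enabled card, `nvidia-smi -L` also emits indented MIG device lines that do not start with `GPU <index>:`, so `re.match` returns `None` and the unconditional `.groups()` call raises the `AttributeError` above. A minimal reproduction (the sample output lines are illustrative, not copied from a real machine):

```python
import re

# Illustrative nvidia-smi -L style output for a MIG-enabled GPU (sample values).
gpus = [
    "GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-xxxxxxxx)",
    "  MIG 1g.5gb      Device  0: (UUID: MIG-xxxxxxxx)",
]

pattern = r"GPU ([0-9]+): ([^(]+) \(UUID: ([^)]+)\)"
for gpu in gpus:
    m = re.match(pattern, gpu)
    print(repr(gpu), "->", m.groups() if m else None)
# The MIG line does not match, so m is None; calling .groups() on it
# unconditionally, as nvgpu/__init__.py does, raises:
#   AttributeError: 'NoneType' object has no attribute 'groups'
```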

Other issues we've experienced

We have also seen that when multiple TorchServe containers (~30) start simultaneously on a single GPU machine, the nvidia-smi command hangs. We analyzed the issue, and the nvidia-smi calls made by the metrics collector appear to be related.

TorchServe uses nvgpu to collect GPU info, and nvgpu in turn calls nvidia-smi.

So starting many containers at once triggers a burst of nvidia-smi calls, and eventually the nvidia-smi command hangs (a rough reproduction sketch follows).
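For context only, a rough sketch of that fan-out as described above; the container count is illustrative, and `nvidia-smi -L` is used here simply because its output matches the line format in the traceback:

```python
# Sketch: simulate many TorchServe containers starting their metric collectors
# at once, each of which shells out to nvidia-smi via nvgpu.
# The count (30) mirrors the scenario described above and is illustrative.
import subprocess

procs = [
    subprocess.Popen(
        ["nvidia-smi", "-L"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    for _ in range(30)
]
for p in procs:
    p.wait()  # with enough concurrent callers, nvidia-smi can stall here
```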

Please consider this request. Thank you!

Alternatives

No response

Additional context

No response

msaroufim added the enhancement (New feature or request), triaged (Issue has been reviewed and triaged), and p0 (high priority) labels on Jan 3, 2023

msaroufim commented Jan 3, 2023

I've heard this ask a few times now, so I'm marking it as high priority. I don't think this is particularly tricky to implement either; we just might run into some issues where parts of the codebase expect metric logs to exist.
