
testMetricManager is failing in ci-gpu #2136

Closed
mreso opened this issue Feb 15, 2023 · 8 comments · Fixed by #2138
Assignees: mreso
Labels: bug, ci, dependencies

Comments

@mreso
Collaborator

mreso commented Feb 15, 2023

🐛 Describe the bug

Merging of PRs is blocked due to the failure of testMetricManager in ci-gpu.

E.g. https://github.com/pytorch/serve/actions/runs/4181567022/jobs/7258520134

Error logs

2023-02-15T21:08:09,773 [INFO ] W-9022-err_batch_1.0 TS_METRICS - QueueTime.ms:0|#Level:Host|#hostname:ip-172-31-61-110,timestamp:1676495289
2023-02-15T21:08:09,773 [INFO ] W-9022-err_batch_1.0 TS_METRICS - WorkerThreadTime.ms:0|#Level:Host|#hostname:ip-172-31-61-110,timestamp:1676495289

Error: 3-02-15T21:08:09,774 [ERROR] epollEventLoopGroup-22-1 org.pytorch.serve.TestUtils$TestHandler - Unknown exception
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

TorchServeSuite > TorchServe > org.pytorch.serve.ModelServerTest > testErrorBatch PASSED

TorchServeSuite > TorchServe > org.pytorch.serve.ModelServerTest > testMetricManager STANDARD_OUT
Error: 3-02-15T21:08:10,390 [ERROR] Thread-4 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
  File "ts/metrics/metric_collector.py", line 27, in <module>
    system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
  File "/home/ubuntu/actions-runner/_work/serve/serve/ts/metrics/system_metrics.py", line 119, in collect_all
    value(num_of_gpu)
  File "/home/ubuntu/actions-runner/_work/serve/serve/ts/metrics/system_metrics.py", line 90, in gpu_utilization
    statuses = list_gpus.device_statuses()
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.16/x64/lib/python3.8/site-packages/nvgpu/list_gpus.py", line 67, in device_statuses
    return [device_status(device_index) for device_index in range(device_count)]
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.16/x64/lib/python3.8/site-packages/nvgpu/list_gpus.py", line 67, in <listcomp>
    return [device_status(device_index) for device_index in range(device_count)]
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.16/x64/lib/python3.8/site-packages/nvgpu/list_gpus.py", line 18, in device_status
    device_name = device_name.decode('UTF-8')
AttributeError: 'str' object has no attribute 'decode'

TorchServeSuite > TorchServe > org.pytorch.serve.ModelServerTest > testMetricManager FAILED
java.lang.AssertionError at ModelServerTest.java:1327
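
The traceback points at nvgpu's device_status: it calls .decode('UTF-8') on the device name, which assumes pynvml returns bytes, while newer pynvml releases appear to return str. A minimal sketch of a version-tolerant name lookup (illustrative only, not TorchServe's or nvgpu's actual code):

```python
# Sketch only: read device names in a way that works whether pynvml
# returns bytes (older releases) or str (newer releases).
import pynvml


def gpu_device_names():
    pynvml.nvmlInit()
    try:
        names = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            # Decode only if we actually got bytes back.
            names.append(name.decode("utf-8") if isinstance(name, bytes) else name)
        return names
    finally:
        pynvml.nvmlShutdown()
```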

Installation instructions

see ci failure

Model Packaging

see ci failure

config.properties

No response

Versions

see ci failure

Repro instructions

see ci failure

Possible Solution

No response

@mreso mreso self-assigned this Feb 15, 2023
@agunapal
Collaborator

I don't see any new release from nvgpu. Interesting how this is failing.

@mreso
Collaborator Author

mreso commented Feb 15, 2023

Yeah, I was wondering too. Did we change the environment somehow in the last few days? Different kind of instance?


@msaroufim
Member

I thought pynvml was supposed to be pretty conservative with BC-breaking changes; regardless, we can pin this dependency and revisit. I was planning on getting rid of the direct dependency on nvgpu altogether for the next release.
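
If the pin goes through a requirements file, it could look roughly like this (version bound is an assumption for illustration, not necessarily what the fix PR uses):

```
# Hypothetical pin for illustration: hold pynvml below the release assumed
# to have changed nvmlDeviceGetName to return str instead of bytes.
pynvml<11.5
```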

@msaroufim msaroufim added the bug, dependencies, and ci labels Feb 15, 2023
@mreso
Collaborator Author

mreso commented Feb 15, 2023

Sounds good, will create a PR

@agunapal
Collaborator

Great. Thanks to your PR, I noticed that CI-GPU is still running on CUDA 10.2 (PyTorch 1.12). I created a PR for the same.

@ozancaglayan
Contributor

ozancaglayan commented Feb 16, 2023

I don't know if it's relevant, but nvgpu, although quite old, has a 0.9.0 release while the dependency here is pinned to 0.8.0. Is there a particular reason for this?

PS: It's not fixed with 0.9.0 either, but I agree on dropping the dependency on nvgpu, as it pulls in a lot of dependencies itself.
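
Dropping nvgpu would mean talking to pynvml directly; a rough sketch of what GPU metrics collection could look like without nvgpu (function and field names here are illustrative, not TorchServe's actual system_metrics code):

```python
# Sketch only: collect per-GPU utilization and memory via pynvml,
# bypassing nvgpu entirely.
import pynvml


def collect_gpu_metrics(num_of_gpu):
    if num_of_gpu <= 0:
        return []
    pynvml.nvmlInit()
    try:
        metrics = []
        for i in range(num_of_gpu):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            metrics.append(
                {
                    "gpu_index": i,
                    "gpu_utilization_percent": util.gpu,
                    "memory_used_bytes": mem.used,
                    "memory_total_bytes": mem.total,
                }
            )
        return metrics
    finally:
        pynvml.nvmlShutdown()
```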

@mreso
Collaborator Author

mreso commented Feb 17, 2023

@ozancaglayan Good question. It's pinned to 0.8.0 for Windows only, and it got pinned after the 0.9.0 release here, so it might be a Windows-only issue with the newer release. But that's pure speculation.
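
For reference, a Windows-only pin in a requirements file is usually expressed with a PEP 508 environment marker along these lines (illustrative, not necessarily the exact entry in the repo):

```
# Illustrative: pin nvgpu to 0.8.0 on Windows only, leave it unpinned elsewhere.
nvgpu==0.8.0; sys_platform == "win32"
nvgpu; sys_platform != "win32"
```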
