
testMetricManager is failing in ci-gpu #2136

Closed
mreso opened this issue Feb 15, 2023 · 8 comments · Fixed by #2138
Assignees: mreso
Labels: bug, ci, dependencies

Comments

@mreso
Collaborator

mreso commented Feb 15, 2023

🐛 Describe the bug

Merging of PRs is blocked due to the failure of testMetricManager in ci-gpu.

E.g. https://github.com/pytorch/serve/actions/runs/4181567022/jobs/7258520134

Error logs

2023-02-15T21:08:09,773 [INFO ] W-9022-err_batch_1.0 TS_METRICS - QueueTime.ms:0|#Level:Host|#hostname:ip-172-31-61-110,timestamp:1676495289
2023-02-15T21:08:09,773 [INFO ] W-9022-err_batch_1.0 TS_METRICS - WorkerThreadTime.ms:0|#Level:Host|#hostname:ip-172-31-61-110,timestamp:1676495289

Error: 3-02-15T21:08:09,774 [ERROR] epollEventLoopGroup-22-1 org.pytorch.serve.TestUtils$TestHandler - Unknown exception
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

TorchServeSuite > TorchServe > org.pytorch.serve.ModelServerTest > testErrorBatch PASSED

TorchServeSuite > TorchServe > org.pytorch.serve.ModelServerTest > testMetricManager STANDARD_OUT
Error: 3-02-15T21:08:10,390 [ERROR] Thread-4 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
  File "ts/metrics/metric_collector.py", line 27, in <module>
    system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
  File "/home/ubuntu/actions-runner/_work/serve/serve/ts/metrics/system_metrics.py", line 119, in collect_all
    value(num_of_gpu)
  File "/home/ubuntu/actions-runner/_work/serve/serve/ts/metrics/system_metrics.py", line 90, in gpu_utilization
    statuses = list_gpus.device_statuses()
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.16/x64/lib/python3.8/site-packages/nvgpu/list_gpus.py", line 67, in device_statuses
    return [device_status(device_index) for device_index in range(device_count)]
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.16/x64/lib/python3.8/site-packages/nvgpu/list_gpus.py", line 67, in <listcomp>
    return [device_status(device_index) for device_index in range(device_count)]
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.16/x64/lib/python3.8/site-packages/nvgpu/list_gpus.py", line 18, in device_status
    device_name = device_name.decode('UTF-8')
AttributeError: 'str' object has no attribute 'decode'

TorchServeSuite > TorchServe > org.pytorch.serve.ModelServerTest > testMetricManager FAILED
java.lang.AssertionError at ModelServerTest.java:1327
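
The traceback points at nvgpu's device_status: it calls .decode('UTF-8') on the device name, which assumes pynvml returns bytes, while newer pynvml releases appear to return str. A minimal sketch of a version-tolerant name lookup (illustrative only, not TorchServe's or nvgpu's actual code):

```python
# Sketch only: read device names in a way that works whether pynvml
# returns bytes (older releases) or str (newer releases).
import pynvml


def gpu_device_names():
    pynvml.nvmlInit()
    try:
        names = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            # Decode only if we actually got bytes back.
            names.append(name.decode("utf-8") if isinstance(name, bytes) else name)
        return names
    finally:
        pynvml.nvmlShutdown()
```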

Installation instructions

see ci failure

Model Packaging

see ci failure

config.properties

No response

Versions

see ci failure

Repro instructions

see ci failure

Possible Solution

No response

@mreso mreso self-assigned this Feb 15, 2023
@agunapal
Collaborator

I don't see any new release from nvgpu. Interesting how this is failing.

@mreso
Collaborator Author

mreso commented Feb 15, 2023

Yeah, I was wondering too. Did we change the environment somehow in the last few days? Different kind of instance?


@msaroufim
Member

I thought pynvml was supposed to be pretty conservative with BC-breaking changes; regardless, we can pin this dependency and revisit. I was planning on getting rid of the direct dependency on nvgpu altogether for the next release.
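
If the pin goes through a requirements file, it could look roughly like this (version bound is an assumption for illustration, not necessarily what the fix PR uses):

```
# Hypothetical pin for illustration: hold pynvml below the release assumed
# to have changed nvmlDeviceGetName to return str instead of bytes.
pynvml<11.5
```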

@msaroufim msaroufim added the bug, dependencies, and ci labels Feb 15, 2023
@mreso
Collaborator Author

mreso commented Feb 15, 2023

Sounds good, will create a PR

@agunapal
Collaborator

Great. Thanks to your PR, I noticed that CI-GPU is still running on CUDA 10.2 (PyTorch 1.12). I created a PR for the same.

@ozancaglayan
Contributor

ozancaglayan commented Feb 16, 2023

I don't know if it's relevant, but nvgpu, although quite old, has a 0.9.0 release while the dependency here is pinned to 0.8.0. Is there a particular reason for this?

PS: It's not fixed with 0.9.0 either, but I agree on dropping the dependency on nvgpu, as it pulls in a lot of dependencies itself.
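
Dropping nvgpu would mean talking to pynvml directly; a rough sketch of what GPU metrics collection could look like without nvgpu (function and field names here are illustrative, not TorchServe's actual system_metrics code):

```python
# Sketch only: collect per-GPU utilization and memory via pynvml,
# bypassing nvgpu entirely.
import pynvml


def collect_gpu_metrics(num_of_gpu):
    if num_of_gpu <= 0:
        return []
    pynvml.nvmlInit()
    try:
        metrics = []
        for i in range(num_of_gpu):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            metrics.append(
                {
                    "gpu_index": i,
                    "gpu_utilization_percent": util.gpu,
                    "memory_used_bytes": mem.used,
                    "memory_total_bytes": mem.total,
                }
            )
        return metrics
    finally:
        pynvml.nvmlShutdown()
```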

@mreso
Collaborator Author

mreso commented Feb 17, 2023

@ozancaglayan Good question. It's pinned to 0.8.0 for Windows only, and it got pinned after the 0.9.0 release here, so it might be a Windows-only issue with the newer release. But that's pure speculation.
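
For reference, a Windows-only pin in a requirements file is usually expressed with a PEP 508 environment marker along these lines (illustrative, not necessarily the exact entry in the repo):

```
# Illustrative: pin nvgpu to 0.8.0 on Windows only, leave it unpinned elsewhere.
nvgpu==0.8.0; sys_platform == "win32"
nvgpu; sys_platform != "win32"
```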
