testMetricManager is failing in ci-gpu #2136
Comments
I don't see any new release of nvgpu. Interesting that this is failing.
Yeah, I was wondering too. Did we change the environment somehow in the last few days? A different kind of instance?
But the info seems to come from pynvml, which was updated recently: https://pypi.org/project/pynvml/
I thought pynvml was supposed to be pretty conservative about BC-breaking changes. Regardless, we can pin this dependency and revisit. I was planning on getting rid of the direct dependency on nvgpu entirely for the next release.
Sounds good, will create a PR.
Great. Thanks to your PR, I noticed that CI-GPU is still running on CUDA 10.2 (PyTorch 1.12). Created a PR for the same.
I don't know if this is relevant, but nvgpu, although quite old, has a 0.9.0 version, while the dependency here is pinned to 0.8.0. Is there a particular reason for this? PS: It's not fixed with 0.9.0 either, but I agree on dropping the dependency on nvgpu, as it pulls in lots of dependencies as well.
@ozancaglayan Good question. It's fixed to 0.8.0 for Windows only and got pinned after the 0.9.0 release here, so it might be some Windows-only issue with the newer release. But that's pure speculation.
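For the idea of dropping the direct nvgpu dependency, here is a rough sketch of how per-GPU utilization could be read from pynvml directly. It is only an illustration, not the change that was made in TorchServe: the function name `collect_gpu_statuses`, its `nvml` parameter, and the returned dictionary keys are all hypothetical. The `nvml*` calls are the ones pynvml exposes from NVML.

```python
def collect_gpu_statuses(nvml=None):
    """Return utilization and memory usage for every visible GPU.

    `nvml` is a hypothetical injection point so the function can be
    exercised without a GPU; by default the real pynvml module is used.
    """
    if nvml is None:
        import pynvml as nvml  # assumed to be installed on the GPU host

    nvml.nvmlInit()
    try:
        statuses = []
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            utilization = nvml.nvmlDeviceGetUtilizationRates(handle)
            memory = nvml.nvmlDeviceGetMemoryInfo(handle)
            statuses.append(
                {
                    "index": index,
                    # Percent of time the GPU was busy over the last sample period.
                    "gpu_utilization_pct": utilization.gpu,
                    # Memory values are reported in bytes.
                    "memory_used_pct": 100.0 * memory.used / memory.total,
                }
            )
        return statuses
    finally:
        # Always release the NVML handle, even if a query fails.
        nvml.nvmlShutdown()
```

On a machine with an NVIDIA driver and pynvml installed, calling `collect_gpu_statuses()` with no arguments would query the real devices. Because no device name is requested, this sketch does not depend on whether names come back as bytes or str.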
🐛 Describe the bug
Merging of PRs is blocked due to failure of testMetricManager in ci-gpu.
E.g. https://github.com/pytorch/serve/actions/runs/4181567022/jobs/7258520134
Error logs
Error: 3-02-15T21:08:09,774 [ERROR] epollEventLoopGroup-22-1 org.pytorch.serve.TestUtils$TestHandler - Unknown exception
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
TorchServeSuite > TorchServe > org.pytorch.serve.ModelServerTest > testErrorBatch PASSED
TorchServeSuite > TorchServe > org.pytorch.serve.ModelServerTest > testMetricManager STANDARD_OUT
Error: 3-02-15T21:08:10,390 [ERROR] Thread-4 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
File "ts/metrics/metric_collector.py", line 27, in
system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
File "/home/ubuntu/actions-runner/_work/serve/serve/ts/metrics/system_metrics.py", line 119, in collect_all
value(num_of_gpu)
File "/home/ubuntu/actions-runner/_work/serve/serve/ts/metrics/system_metrics.py", line 90, in gpu_utilization
statuses = list_gpus.device_statuses()
File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.16/x64/lib/python3.8/site-packages/nvgpu/list_gpus.py", line 67, in device_statuses
return [device_status(device_index) for device_index in range(device_count)]
File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.16/x64/lib/python3.8/site-packages/nvgpu/list_gpus.py", line 67, in
return [device_status(device_index) for device_index in range(device_count)]
File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.16/x64/lib/python3.8/site-packages/nvgpu/list_gpus.py", line 18, in device_status
device_name = device_name.decode('UTF-8')
AttributeError: 'str' object has no attribute 'decode'
TorchServeSuite > TorchServe > org.pytorch.serve.ModelServerTest > testMetricManager FAILED
java.lang.AssertionError at ModelServerTest.java:1327
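The traceback suggests that `nvgpu/list_gpus.py` calls `.decode('UTF-8')` on a device name that is already a `str`, presumably because the underlying binding now returns text instead of bytes. A minimal sketch of a decode step that tolerates both types is shown here. The helper name `decode_device_name` is hypothetical and is not part of nvgpu or pynvml.

```python
def decode_device_name(raw_name):
    """Return the device name as str, whether it arrives as bytes or str."""
    if isinstance(raw_name, bytes):
        # Older behaviour: the name arrives as bytes and must be decoded.
        return raw_name.decode("utf-8")
    # Newer behaviour: the name is already text, so decoding would fail
    # with AttributeError: 'str' object has no attribute 'decode'.
    return raw_name
```

Pinning the dependency, as discussed in the comments, avoids the problem without touching the calling code. A type check like this is the alternative when the calling code can be changed.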
Installation instructions
see ci failure
Model Packaging
see ci failure
config.properties
No response
Versions
see ci failure
Repro instructions
see ci failure
Possible Solution
No response