
Model level health check. #1531

Closed
yuanshaochen opened this issue Mar 25, 2022 · 2 comments
Labels: bug (Something isn't working)

yuanshaochen commented Mar 25, 2022

Is your feature request related to a problem? Please describe.

Our model's serving worker sometimes stops suddenly and then keeps failing to restart, for reasons we have not identified. I want to use a health check to detect this and handle it automatically, but I could only find a health check API for the TorchServe service itself, not for the model. Is there any way to do a model-level health check? (I have tried using the Management API /models/<model_name> as a health check, but it returns 200 even after the model's worker has stopped.)

```
2022-03-25T01:27:36,950 [INFO ] epollEventLoopGroup-3-11 ACCESS_LOG - /10.0.242.40:55676 "GET /models/clickbait_headline_en HTTP/1.1" 200 0
2022-03-25T01:27:35,583 [INFO ] epollEventLoopGroup-3-10 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:clickbait-headline-deployment-679d99bd88-vlgc6,timestamp:null
2022-03-25T01:27:35,583 [INFO ] epollEventLoopGroup-3-10 ACCESS_LOG - /10.0.242.40:55538 "GET /models/clickbait_headline_en HTTP/1.1" 200 0
2022-03-25T01:27:35,562 [DEBUG] W-9000-clickbait_headline_en_0.0.1 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python3, /usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000]
2022-03-25T01:27:34,605 [INFO ] W-9000-clickbait_headline_en_0.0.1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-clickbait_headline_en_0.0.1-stdout
2022-03-25T01:27:34,605 [INFO ] W-9000-clickbait_headline_en_0.0.1-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-clickbait_headline_en_0.0.1-stderr
2022-03-25T01:27:34,561 [INFO ] epollEventLoopGroup-5-3 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STOPPED
2022-03-25T01:27:34,560 [INFO ] W-9000-clickbait_headline_en_0.0.1 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2022-03-25T01:27:34,560 [WARN ] W-9000-clickbait_headline_en_0.0.1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-clickbait_headline_en_0.0.1-stdout
2022-03-25T01:27:34,560 [WARN ] W-9000-clickbait_headline_en_0.0.1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-clickbait_headline_en_0.0.1-stderr
2022-03-25T01:27:34,559 [DEBUG] W-9000-clickbait_headline_en_0.0.1 org.pytorch.serve.wlm.WorkerThread - W-9000-clickbait_headline_en_0.0.1 State change WORKER_MODEL_LOADED -> WORKER_STOPPED
2022-03-25T01:27:34,559 [INFO ] W-9000-clickbait_headline_en_0.0.1 ACCESS_LOG - /10.0.102.170:49450 "gRPC org.pytorch.serve.grpc.inference.InferenceAPIsService/Predictions HTTP/2.0" 13 372211
2022-03-25T01:27:34,559 [INFO ] W-9000-clickbait_headline_en_0.0.1-stdout MODEL_METRICS - PredictionTime.Milliseconds:66.73|#ModelName:clickbait_headline_en,Level:Model|#hostname:clickbait-headline-deployment-679d99bd88-vlgc6,requestID:306ef8c2-f95b-4445-9d78-84a72bb0fc26,timestamp:1648171654
at java.lang.Thread.run(Thread.java:829) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:46) ~[model-server.jar:?]
at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
```
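For context, the kind of model-level check I am after would parse the body of the Describe Model response instead of relying on the HTTP status code, roughly like the sketch below. This is only a sketch: the default management port 8081 and the "workers"/"status"/"READY" fields are my reading of the management API docs and may differ by TorchServe version; the model name is taken from the logs above.

```python
#!/usr/bin/env python3
"""Sketch of a model-level health check against the TorchServe management API.

Assumptions (not verified here): management API on port 8081, and a describe
response shaped like [{"workers": [{"status": "READY", ...}, ...], ...}].
"""
import json
import sys
import urllib.request

MANAGEMENT_URL = "http://localhost:8081"   # default management port (assumption)
MODEL_NAME = "clickbait_headline_en"       # model name taken from the logs above


def model_is_healthy(model_name: str) -> bool:
    url = f"{MANAGEMENT_URL}/models/{model_name}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        # A 200 here only means the model is registered, so inspect the body.
        payload = json.load(resp)
    for version in payload:                       # one entry per model version
        for worker in version.get("workers", []):
            if worker.get("status") == "READY":
                return True
    return False


if __name__ == "__main__":
    ok = model_is_healthy(MODEL_NAME)
    print("healthy" if ok else "unhealthy")
    sys.exit(0 if ok else 1)   # non-zero exit so a probe or cron job can react
```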

HamidShojanazeri added the bug label on Mar 28, 2022
HamidShojanazeri (Collaborator) commented Mar 28, 2022

@yuanshaochen Thanks for opening this ticket. The main issue is that the health check only covers the frontend: as long as a model is not unregistered, it returns 200. We are thinking about an improved metrics system that would let us monitor these kinds of situations through metrics.

As a workaround for now, we recently added a customized describe API that lets you collect custom metadata from the backend/handler as part of the model description. This can help you get more information from the backend to see whether the model is actually serving or failing to run inference. You can read more about it here.
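For illustration, a handler using that API might look roughly like the sketch below. The describe_handle() method name and the customized=true query parameter are assumptions based on the docs for that feature, so check them against the TorchServe version you are running.

```python
# handler.py -- rough sketch of a custom handler exposing health metadata
# through the customized describe API mentioned above. The method name
# describe_handle() is taken from the docs for this feature; verify it
# against your TorchServe version.
from ts.torch_handler.base_handler import BaseHandler


class ClickbaitHeadlineHandler(BaseHandler):

    def describe_handle(self):
        # Metadata returned here is attached to the Describe Model response
        # when customized metadata is requested.
        return {
            "initialized": self.initialized,           # set in BaseHandler.initialize()
            "model_loaded": self.model is not None,
            "device": str(getattr(self, "device", "unknown")),
        }
```

Querying GET /models/<model_name>?customized=true on the management port should then surface this metadata alongside the regular worker information (again, the exact query parameter is an assumption based on the docs).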

yuanshaochen (Author) commented
Thank you for the workaround. It effectively adds a health check at the handler level. (The health check request does not need to go through the model; it can just return success if the handler is ready, which also means the model is loaded.) I will try it with our models.
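For anyone else reading this, the handler-level check described above could look roughly like the following: the handler answers a designated health request itself, without running the model. The {"health": "ping"} payload shape is purely illustrative, not a TorchServe convention.

```python
# Sketch of a handler-level health check: a designated health payload is
# answered by the handler directly, without invoking the model.
import json

from ts.torch_handler.base_handler import BaseHandler


def _decode(request):
    """Return the request payload as a dict when possible."""
    body = request.get("body") or request.get("data")
    if isinstance(body, (bytes, bytearray)):
        try:
            return json.loads(body)
        except ValueError:
            return None
    return body if isinstance(body, dict) else None


class ClickbaitHeadlineHandler(BaseHandler):

    def handle(self, data, context):
        first = _decode(data[0]) if data else None
        if first and first.get("health") == "ping":
            # Handler is up, which also means initialize() loaded the model;
            # return one response per request in the batch.
            return [json.dumps({"status": "healthy"})] * len(data)
        # Anything else goes through the normal preprocess/inference/postprocess path.
        return super().handle(data, context)
```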
