
Model level health check. #1531

Closed
yuanshaochen opened this issue Mar 25, 2022 · 2 comments
Labels: bug (Something isn't working)

yuanshaochen commented Mar 25, 2022

Is your feature request related to a problem? Please describe.

Our model's serving worker sometimes stops suddenly and then keeps failing to restart, for reasons we have not identified. I want to use a health check to detect this and handle it automatically, but I could only find a health check API for the TorchServe service itself, not for the model. Is there any way to do a model-level health check? (I have tried using the Management API /models/<model_name> as a health check, but it returns 200 even after the model's worker has stopped.)

```
2022-03-25T01:27:36,950 [INFO ] epollEventLoopGroup-3-11 ACCESS_LOG - /10.0.242.40:55676 "GET /models/clickbait_headline_en HTTP/1.1" 200 0
2022-03-25T01:27:35,583 [INFO ] epollEventLoopGroup-3-10 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:clickbait-headline-deployment-679d99bd88-vlgc6,timestamp:null
2022-03-25T01:27:35,583 [INFO ] epollEventLoopGroup-3-10 ACCESS_LOG - /10.0.242.40:55538 "GET /models/clickbait_headline_en HTTP/1.1" 200 0
2022-03-25T01:27:35,562 [DEBUG] W-9000-clickbait_headline_en_0.0.1 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python3, /usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000]
2022-03-25T01:27:34,605 [INFO ] W-9000-clickbait_headline_en_0.0.1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-clickbait_headline_en_0.0.1-stdout
2022-03-25T01:27:34,605 [INFO ] W-9000-clickbait_headline_en_0.0.1-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-clickbait_headline_en_0.0.1-stderr
2022-03-25T01:27:34,561 [INFO ] epollEventLoopGroup-5-3 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STOPPED
2022-03-25T01:27:34,560 [INFO ] W-9000-clickbait_headline_en_0.0.1 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2022-03-25T01:27:34,560 [WARN ] W-9000-clickbait_headline_en_0.0.1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-clickbait_headline_en_0.0.1-stdout
2022-03-25T01:27:34,560 [WARN ] W-9000-clickbait_headline_en_0.0.1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-clickbait_headline_en_0.0.1-stderr
2022-03-25T01:27:34,559 [DEBUG] W-9000-clickbait_headline_en_0.0.1 org.pytorch.serve.wlm.WorkerThread - W-9000-clickbait_headline_en_0.0.1 State change WORKER_MODEL_LOADED -> WORKER_STOPPED
2022-03-25T01:27:34,559 [INFO ] W-9000-clickbait_headline_en_0.0.1 ACCESS_LOG - /10.0.102.170:49450 "gRPC org.pytorch.serve.grpc.inference.InferenceAPIsService/Predictions HTTP/2.0" 13 372211
2022-03-25T01:27:34,559 [INFO ] W-9000-clickbait_headline_en_0.0.1-stdout MODEL_METRICS - PredictionTime.Milliseconds:66.73|#ModelName:clickbait_headline_en,Level:Model|#hostname:clickbait-headline-deployment-679d99bd88-vlgc6,requestID:306ef8c2-f95b-4445-9d78-84a72bb0fc26,timestamp:1648171654
at java.lang.Thread.run(Thread.java:829) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:46) ~[model-server.jar:?]
at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
```
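For context, the kind of model-level check I am after would parse the body of the Describe Model response instead of relying on the HTTP status code, roughly like the sketch below. This is only a sketch: the default management port 8081 and the "workers"/"status"/"READY" fields are my reading of the management API docs and may differ by TorchServe version; the model name is taken from the logs above.

```python
#!/usr/bin/env python3
"""Sketch of a model-level health check against the TorchServe management API.

Assumptions (not verified here): management API on port 8081, and a describe
response shaped like [{"workers": [{"status": "READY", ...}, ...], ...}].
"""
import json
import sys
import urllib.request

MANAGEMENT_URL = "http://localhost:8081"   # default management port (assumption)
MODEL_NAME = "clickbait_headline_en"       # model name taken from the logs above


def model_is_healthy(model_name: str) -> bool:
    url = f"{MANAGEMENT_URL}/models/{model_name}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        # A 200 here only means the model is registered, so inspect the body.
        payload = json.load(resp)
    for version in payload:                       # one entry per model version
        for worker in version.get("workers", []):
            if worker.get("status") == "READY":
                return True
    return False


if __name__ == "__main__":
    ok = model_is_healthy(MODEL_NAME)
    print("healthy" if ok else "unhealthy")
    sys.exit(0 if ok else 1)   # non-zero exit so a probe or cron job can react
```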

HamidShojanazeri added the bug label on Mar 28, 2022
HamidShojanazeri (Collaborator) commented Mar 28, 2022

@yuanshaochen Thanks for opening this ticket. The main issue is that the health check only covers the frontend: as long as a model is not unregistered, it returns 200. We are thinking about an improved metrics system that would let us monitor these kinds of situations through metrics.

As a workaround for now, we recently added a customized describe API that lets you collect custom metadata from the backend/handler as part of the model description. This can help you get more information from the backend to see whether the model is actually serving or failing to run inference. You can read more about it here.
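For illustration, a handler using that API might look roughly like the sketch below. The describe_handle() method name and the customized=true query parameter are assumptions based on the docs for that feature, so check them against the TorchServe version you are running.

```python
# handler.py -- rough sketch of a custom handler exposing health metadata
# through the customized describe API mentioned above. The method name
# describe_handle() is taken from the docs for this feature; verify it
# against your TorchServe version.
from ts.torch_handler.base_handler import BaseHandler


class ClickbaitHeadlineHandler(BaseHandler):

    def describe_handle(self):
        # Metadata returned here is attached to the Describe Model response
        # when customized metadata is requested.
        return {
            "initialized": self.initialized,           # set in BaseHandler.initialize()
            "model_loaded": self.model is not None,
            "device": str(getattr(self, "device", "unknown")),
        }
```

Querying GET /models/<model_name>?customized=true on the management port should then surface this metadata alongside the regular worker information (again, the exact query parameter is an assumption based on the docs).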

yuanshaochen (Author) commented
Thank you for the workaround. It effectively adds a health check at the handler level. (The health check request does not need to go through the model; it can just return success if the handler is ready, which also means the model is loaded.) I will try it with our models.
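For anyone else reading this, the handler-level check described above could look roughly like the following: the handler answers a designated health request itself, without running the model. The {"health": "ping"} payload shape is purely illustrative, not a TorchServe convention.

```python
# Sketch of a handler-level health check: a designated health payload is
# answered by the handler directly, without invoking the model.
import json

from ts.torch_handler.base_handler import BaseHandler


def _decode(request):
    """Return the request payload as a dict when possible."""
    body = request.get("body") or request.get("data")
    if isinstance(body, (bytes, bytearray)):
        try:
            return json.loads(body)
        except ValueError:
            return None
    return body if isinstance(body, dict) else None


class ClickbaitHeadlineHandler(BaseHandler):

    def handle(self, data, context):
        first = _decode(data[0]) if data else None
        if first and first.get("health") == "ping":
            # Handler is up, which also means initialize() loaded the model;
            # return one response per request in the batch.
            return [json.dumps({"status": "healthy"})] * len(data)
        # Anything else goes through the normal preprocess/inference/postprocess path.
        return super().handle(data, context)
```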
