Is your feature request related to a problem? Please describe.
Our model worker sometimes stops suddenly and then keeps failing to restart, for reasons we have not identified. I want to use a health check to detect this and handle it automatically, but I could only find a health check API for the TorchServe service, not for the model itself. Is there any way to do a model-level health check? (I tried using the Management API /models/<model_name> as a health check, but it returns 200 even when the model worker has already stopped.)
@yuanshaochen Thanks for opening this ticket. The main issue is that the health check only checks the frontend, and as long as a model is not unregistered it returns 200. We are thinking about an improved metrics system that would let us monitor this kind of situation from metrics.
As a workaround for now, we recently added a customized describe API that lets you collect customized metadata from the backend handler. This can help you get more information from the backend to see whether the model is actually serving or failing to run inference. You can read more about it here.
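Since the describe endpoint returns 200 whenever the model is registered, a probe has to look at the response body instead of the status code. Below is a minimal sketch of that idea. It assumes the describe response is a JSON list of model entries, each with a "workers" list whose items carry a "status" field, and that a healthy worker reports "READY"; please verify these field names against your TorchServe version. The function name model_is_healthy and the min_ready parameter are hypothetical.

```python
import json


def model_is_healthy(describe_body: str, min_ready: int = 1) -> bool:
    """Return True if the describe response lists enough READY workers.

    Assumption: the body looks like
    [{"modelName": ..., "workers": [{"id": ..., "status": "READY"}, ...]}]
    """
    try:
        payload = json.loads(describe_body)
    except json.JSONDecodeError:
        # A body that is not valid JSON is treated as unhealthy.
        return False

    # Accept either a list of model entries or a single entry.
    entries = payload if isinstance(payload, list) else [payload]

    ready = 0
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        for worker in entry.get("workers") or []:
            if isinstance(worker, dict) and worker.get("status") == "READY":
                ready += 1
    return ready >= min_ready
```

A probe script could fetch /models/<model_name> from the Management API, pass the response text to this function, and report failure (for example, a non-zero exit code) when it returns False.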
Thank you for the workaround. It effectively adds a health check at the handler level. (This health check request does not need to go through the model; it just returns success if the handler is ready, which also means the model is loaded.) I will try it with our models.
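The handler-level check described above could look roughly like the sketch below. It is a standalone stand-in so that it runs without TorchServe installed; in a real deployment the class would presumably subclass the TorchServe base handler and override its describe hook. The method name describe_handle, the JSON-string return value, and the "healthy" and "loaded_at" keys are assumptions to check against the customized describe documentation, and HealthAwareHandler is a hypothetical name.

```python
import json
import time


class HealthAwareHandler:
    """Stand-in for a custom handler that reports its own readiness."""

    def __init__(self):
        self.initialized = False
        self.model = None
        self.loaded_at = None

    def initialize(self, context):
        # Placeholder for real model loading (the context is unused here).
        self.model = object()
        self.loaded_at = time.time()
        self.initialized = True

    def describe_handle(self):
        # Assumed hook for customized describe: the request does not run
        # inference, it only reports whether the handler finished loading.
        status = {
            "healthy": self.initialized and self.model is not None,
            "loaded_at": self.loaded_at,
        }
        return json.dumps(status)
```

With a handler like this, the health probe would request the model description with customized metadata enabled and treat a missing or false "healthy" value as a failure.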