Health check failures do not clean up properly #50

Open
markcoatsworth opened this issue May 4, 2023 · 0 comments

Describe the bug
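When a model job fails a health check, the model instance moves immediately into FailedState. The shutdown() call that follows is then rejected with an InvalidStateError, so the gateway never sends the request to shut the job down. We should send the shutdown request first, before the instance transitions to FailedState. The Celery logs are as follows: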

[2023-05-01 20:59:58,426: WARNING/ForkPoolWorker-3] The model is healthy
[2023-05-03 18:42:23,284: WARNING/ForkPoolWorker-3] Model health verification error:: HTTPConnectionPool(host='172.17.8.109', port=43537): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb5c7984b80>: Failed to establish a new connection: [Errno 111] Connection refused'))
[2023-05-03 18:42:23,285: WARNING/ForkPoolWorker-3] [2023-05-03 18:42:23,285] ERROR in models: Health check for active model OPT-6.7B failed
[2023-05-03 18:42:23,285: ERROR/ForkPoolWorker-3] Health check for active model OPT-6.7B failed
[2023-05-03 18:42:23,429: ERROR/ForkPoolWorker-3] Task tasks.verify_model_instance_health[94ca39cf-8249-46de-ae48-605aae58ec48] raised unexpected: InvalidStateError("Invalid operation for model instance state: <class 'models.FailedState'>")
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 451, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/app/gateway_service.py", line 55, in call
    return self.run(*args, **kwargs)
  File "/app/tasks.py", line 11, in verify_model_instance_health
    model_instance.shutdown()
  File "/app/models.py", line 302, in shutdown
    self._state.shutdown()
  File "/app/models.py", line 200, in shutdown
    raise InvalidStateError(self)
errors.InvalidStateError: Invalid operation for model instance state: <class 'models.FailedState'>
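
One possible shape for the fix is sketched below. It is a minimal, self-contained illustration of the ordering proposed above (send the shutdown request while the instance is still active, then move to FailedState), not the project's actual code. Only FailedState, InvalidStateError, shutdown() and verify_model_instance_health appear in the traceback; State, ActiveState, StoppedState, ModelInstance, fail_health_check, send_shutdown_request, cleanup_error and the is_healthy callable are hypothetical names introduced for illustration.

```python
class InvalidStateError(Exception):
    """Raised when an operation is not valid for the current state."""

    def __init__(self, state):
        super().__init__(
            f"Invalid operation for model instance state: {type(state)}"
        )


class State:
    """Hypothetical base class: every operation is rejected by default."""

    def __init__(self, instance):
        self.instance = instance

    def shutdown(self):
        raise InvalidStateError(self)

    def fail_health_check(self):
        raise InvalidStateError(self)


class FailedState(State):
    """Terminal state; shutdown() is rejected here, as in the traceback."""


class StoppedState(State):
    """Hypothetical terminal state reached after a normal shutdown."""


class ActiveState(State):
    def shutdown(self):
        self.instance.send_shutdown_request()
        self.instance._state = StoppedState(self.instance)

    def fail_health_check(self):
        # Proposed ordering: ask the backend to stop the job first,
        # and only then mark the instance as failed.
        try:
            self.instance.send_shutdown_request()
        except Exception as exc:
            # The job may already be unreachable; remember the error
            # but still complete the transition.
            self.instance.cleanup_error = exc
        self.instance._state = FailedState(self.instance)


class ModelInstance:
    def __init__(self, name, send_request):
        self.name = name
        self.cleanup_error = None
        # Hypothetical callable standing in for the gateway's shutdown call.
        self._send_request = send_request
        self._state = ActiveState(self)

    def send_shutdown_request(self):
        self._send_request(self.name)

    def shutdown(self):
        self._state.shutdown()

    def fail_health_check(self):
        self._state.fail_health_check()


def verify_model_instance_health(model_instance, is_healthy):
    # On failure, let the current state handle cleanup and the transition
    # together instead of calling shutdown() on an already failed instance.
    if not is_healthy():
        model_instance.fail_health_check()
```

With this arrangement the health check task no longer calls shutdown() after the instance has failed, so the InvalidStateError in the logs above would not be raised on that path, and the shutdown request is still sent exactly once.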
